ReplayGain 1.0 specification

From Hydrogenaudio Knowledgebase
Revision as of 02:51, 28 December 2010 by Notat (talk | contribs) (ID3 tag standard status)

Not all CDs sound equally loud. The perceived loudness of mp3s is even more variable. Whilst different musical moods require that some tracks should sound louder than others, the loudness of a given CD has more to do with the year of issue or the whim of the producer than the intended emotional effect. If we add to this chaos the inconsistent quality of mp3 encoding, it's no wonder that a random play through your music collection can have you leaping for the volume control every other track.

There is a remarkably simple solution to this annoyance, and that is to store the required replay gain for each track within the track. This concept is called "MetaData" – data about data. It's already possible to store the title, artist, and CD track number within an mp3 file using the ID3 standard.

However, mp3 encoders and players share no consistent standard for defining the appropriate replay gain, and there has been no automatic way to set the volume adjustment for each track – until now.

The Replay Gain proposal sets out a simple way of calculating and representing the ideal replay gain for every track and album.

Calculation

Equal Loudness Filter

The human ear does not perceive sounds of all frequencies as having equal loudness. For example, a full scale sine wave at 1 kHz sounds much louder than a full scale sine wave at 100 Hz, even though the two have identical energy. To account for this, the signal is filtered by an inverted approximation to the equal loudness curves (sometimes referred to as Fletcher-Munson curves).

Equal loudness curves

Figure 1: Equal loudness contours

Figure 1 shows the Equal Loudness Contours, as measured by Robinson and Dadson, 1956. The original measurements were carried out by Fletcher and Munson in 1933, and the curve often carries their name.

The lines represent the sound pressure required for a test tone of any frequency to sound as loud as a test tone of 1 kHz. If every frequency sounded equally loud, then this graph would just be a series of horizontal lines. As it isn't, a filter is required to simulate this characteristic.


Required equal loudness filter

Figure 2: Loudness contours inverse response

Where the lines curve upwards, this means that we are less sensitive to sounds of that frequency. Hence, the filter must attenuate (reduce) sounds of that frequency. The ideal filter will be the inverse of the above graphs. As we don't know the playback level the listener will choose, and don't want to use a different filter for sounds of differing loudness, a representative average of the above curves is chosen as the target filter. The desired filter response is shown in Figure 2.


Design of the equal loudness filter

Figure 3: Target response (blue) and "yulewalk" filter response (magenta)
Figure 4: Target response (blue), high-pass response (green) and composite response (red)

MATLAB offers several functions to design FIR and IIR filters to match arbitrary amplitude responses. Feeding the target response into yulewalk.m, and requesting a 2x10 coefficient IIR filter, gives the response shown in Figure 3.

At higher frequencies, this filter is an excellent approximation to our target. However, at lower frequencies, it doesn't even come close. Increasing the number of coefficients does not cause the yulewalk function to perform significantly better.

One solution is to cascade the yulewalk filter with a 2nd order Butterworth high pass filter, with a high pass frequency of 150 Hz. The resulting combined response (Figure 4) is close to our target response, and is used by Replay Level.
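The high-pass stage of this cascade can be sketched as follows. This is only an illustration: it assumes a 44.1 kHz sample rate and the textbook bilinear-transform design of a 2nd order Butterworth section, and the function names are hypothetical. The yulewalk coefficients are not reproduced here because the text does not list them.

```python
import cmath
import math

def butter_highpass_2nd(fc_hz, fs_hz):
    """Coefficients (b, a) of a 2nd order Butterworth high-pass filter,
    designed with the bilinear transform (cutoff pre-warped to fc_hz)."""
    k = math.tan(math.pi * fc_hz / fs_hz)
    norm = 1.0 / (1.0 + math.sqrt(2.0) * k + k * k)
    b = (norm, -2.0 * norm, norm)
    a = (1.0,
         2.0 * (k * k - 1.0) * norm,
         (1.0 - math.sqrt(2.0) * k + k * k) * norm)
    return b, a

def gain_at(b, a, f_hz, fs_hz):
    """Magnitude of the filter's frequency response at f_hz."""
    z1 = cmath.exp(-2j * math.pi * f_hz / fs_hz)  # z^-1 on the unit circle
    num = b[0] + b[1] * z1 + b[2] * z1 * z1
    den = a[0] + a[1] * z1 + a[2] * z1 * z1
    return abs(num / den)

def biquad(b, a, samples):
    """Filter a sequence of samples (direct form I)."""
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for x in samples:
        y = b[0] * x + b[1] * x1 + b[2] * x2 - a[1] * y1 - a[2] * y2
        x2, x1 = x1, x
        y2, y1 = y1, y
        out.append(y)
    return out

# 150 Hz cutoff as in the text; 44.1 kHz sample rate is an assumption.
b, a = butter_highpass_2nd(150.0, 44100.0)
```

In a full implementation the output of `biquad` would be fed from (or into) the yulewalk-designed IIR filter, since cascading two filters multiplies their responses.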


RMS Energy Calculation

Next, the energy during each moment of the signal is determined by calculating the Root Mean Square of the waveform every 50ms. The block length of 50ms was chosen after studying the effect of values between 25ms and 1s. 25ms was too short to accurately reflect the perceived loudness of some sounds. Beyond 50ms there was little change (after statistical processing). For this reason, 50ms was chosen.

It's easy to calculate the RMS energy over an entire audio file. Unfortunately, this value doesn't give a good indication of the perceived loudness of a signal. It's closer than that given by the peak amplitude, but it's still not good enough. For this reason, we have to calculate the RMS energy on a moment by moment basis, then do something useful with all that data.

General concept

The signal is chopped into 50ms long blocks. Then, for each block:

  1. Every sample value is squared (multiplied by itself).
  2. The mean average is taken.
  3. The square root of the average is calculated.

If you read those steps backwards, it should be clear why the process is called Root Mean Square (RMS) averaging.
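The three steps can be sketched in a few lines. The function name and the handling of a trailing partial block are assumptions made for this illustration, not part of the proposal.

```python
import math

def block_rms(samples, sample_rate, block_seconds=0.05):
    """Return the RMS value of each consecutive 50 ms block.

    A trailing partial block is dropped; that is a choice made for
    this sketch, not something the text specifies.
    """
    block_len = int(round(sample_rate * block_seconds))
    values = []
    for start in range(0, len(samples) - block_len + 1, block_len):
        block = samples[start:start + block_len]
        mean_square = sum(x * x for x in block) / block_len  # steps 1 and 2
        values.append(math.sqrt(mean_square))                # step 3
    return values

# One second of a full scale 1 kHz sine wave sampled at 44.1 kHz.
fs = 44100
sine = [math.sin(2 * math.pi * 1000 * n / fs) for n in range(fs)]
rms_values = block_rms(sine, fs)
```

For this test signal every block holds a whole number of cycles, so each RMS value comes out at about 0.707 (the peak divided by the square root of two).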

Stereo files

The only difficulty lies in what to do with stereo files. We could sum them to mono before calculating the RMS energy, but then any out-of-phase components (having the opposite signal on each channel) would cancel out to zero (i.e. silence). That's not how we perceive them, so it's not a good solution.

The alternative is to calculate two RMS values (one for each channel) and then add them. Unfortunately, a linear addition still doesn't give the same effect as our ears.

We get the right answer if we add the means of the channel-signals before calculating the square root. In mixing pan-pot terms, we're using "equal power" rather than "equal voltage". If we also assume that any mono (single channel) signal will always be replayed over two loudspeakers, we can treat a mono signal as a pair of identical stereo signals. Hence a mono signal gives (a+a)/2 (i.e. a), while a stereo signal gives (a+b)/2, where a and b are the mean squared values for each channel. After this, we carry out the square root and conversion to dB.
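As a rough sketch of this channel handling, the function below averages the mean squared values of the two channels and converts the result to dB. The names and the small floor added before the logarithm are assumptions for illustration.

```python
import math

def mean_square(block):
    """Mean of the squared sample values in one block."""
    return sum(x * x for x in block) / len(block)

def block_level_db(left, right=None):
    """Level of one block in dB, using (a+b)/2 of the channel mean squares.

    A mono block (right is None) is treated as two identical channels.
    """
    a = mean_square(left)
    b = a if right is None else mean_square(right)
    combined = (a + b) / 2.0
    # Taking the square root and then 20*log10 is the same as 10*log10
    # of the mean square. The 1e-10 floor is a choice made for this
    # sketch so that digital silence does not produce log10(0).
    return 10.0 * math.log10(combined + 1e-10)

# Out-of-phase stereo: a mono sum would cancel to silence, but the
# level computed here equals that of either channel alone.
left = [0.5, -0.5] * 100
right = [-0.5, 0.5] * 100
```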

Statistical Processing

Figure 5: Histogram of speech
Figure 6: Histogram of classical music
Figure 7: Histogram of pop music

Where the average energy level of a signal varies with time, the louder moments contribute most to our perception of overall loudness. For example, in human speech, over half the time is silence, but this does not affect the perceived loudness of the talker at all! For this reason, the RMS values are sorted into numerical order, and the value 5% down the list is chosen to represent the overall perceived loudness of the signal.

Having calculated RMS signal levels every 50ms through the file, a single value must be calculated to represent the perceived loudness of the entire file. The histograms in Figures 5, 6 and 7 show how many times each RMS value occurred in each file.

A good method to determine the overall perceived loudness is to sort the RMS energy values into numerical order, and then pick a value near the top of the list. For highly compressed pop music (e.g. Figure 7, where there are many values near the top), the choice makes little difference. For speech and classical music, the choice makes a huge difference. The value which most accurately matches human perception of perceived loudness is around 95%, so this value is used by Replay Gain.
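A minimal sketch of this selection step is shown below. The function name, the integer index rounding, and the sample data are assumptions made for illustration.

```python
def overall_level(block_levels_db, percent=95):
    """Pick the value `percent` of the way up the sorted block levels."""
    if not block_levels_db:
        raise ValueError("no blocks to analyse")
    ordered = sorted(block_levels_db)
    # Integer arithmetic avoids floating-point surprises in the index;
    # the exact rounding rule is a choice made for this sketch.
    index = min(len(ordered) * percent // 100, len(ordered) - 1)
    return ordered[index]

# Hypothetical speech-like data: 60 near-silent blocks, 40 louder ones.
levels = [-70.0] * 60 + [-25.0 + 0.1 * i for i in range(40)]
loudness = overall_level(levels)
```

With this data the plain average would be dragged far down by the silent blocks, while the 95% value stays among the loud blocks, which is the behaviour the text describes for speech.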


Calibration with reference level

A suitable average replay level is 83 dB SPL. A calibration relating the energy of a digital signal to the real world replay level has been defined by SMPTE.[1] Using this calibration, we subtract the current signal from the desired (calibrated) level to give the difference. We store this difference in the audio file.

Finding a standard

Having calculated a representative RMS energy value for the audio file, we now need to reference this to a real world sound pressure level. The audio industry doesn't have any standard for listening level, but the movie industry has worked to an 83 dB standard for years.[2]

What the standard actually states is that a single channel pink noise signal, with an RMS energy level of -20 dB relative to a full scale sinusoid should be reproduced at 83 dB SPL (measured using a C-weighted, slow averaging SPL meter). In simple terms, this means that everyone can set their volume control to the same (known, calibrated) gain.

An ideal world...

If the mastering engineer sets the levels on a CD using that calibrated volume control setting, that CD will sound best at that volume. If all CDs were mastered in such a way, they'd all sound best at that volume. If you (as a listener) didn't want to listen at that particular volume setting, you could always turn it down, but all CDs would still sound equally "turned down" at your preferred setting. You wouldn't have to change the volume setting between discs.

Reality check! We know CDs aren't made like this. There is NO audio standard replay level. So, here's the clever bit - here's the whole point of this website...

Fixing a non-ideal world

We know the level should average around 83 dB SPL, and we know a -20 dB pink noise signal will give 83 dB SPL in a calibrated system. So, we send the pink noise signal through the ReplayGain program, and store the result (let's call it ref_Vrms). For every CD we process, the difference between the calculated value for that CD and ref_Vrms tells you how much you need to scale the signal in order to make it average 83 dB.

The actual process is quicker to do than to say!
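In code, the whole calibration reduces to one subtraction. The levels below are invented purely to show the arithmetic; `ref_vrms_db` stands for the stored result of analysing the pink noise reference, expressed in dB.

```python
def replay_gain_adjustment_db(ref_level_db, measured_level_db):
    """Gain in dB that brings the measured level to the reference level."""
    return ref_level_db - measured_level_db

# Hypothetical analysis results, in dB on the analysis program's own scale.
ref_vrms_db = -26.0       # -20 dB pink noise run through the same analysis
pop_cd_db = -12.0         # a heavily compressed pop disc
classical_cd_db = -30.0   # a quiet classical disc

pop_gain = replay_gain_adjustment_db(ref_vrms_db, pop_cd_db)
classical_gain = replay_gain_adjustment_db(ref_vrms_db, classical_cd_db)
```

Here the pop disc would be turned down by 14 dB and the classical disc turned up by 4 dB.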

One complication

The system calibration uses a single channel of pink noise (reproduced through a single loudspeaker). You then play music through both loudspeakers. So, though we use 1 channel of pink noise to calibrate the system gain, the ideal level of the music is actually the loudness when both speakers are in use. So, in ReplayGain, we calibrate to 2 channels of pink noise, because that's how loud we'd like the music to sound. In reality, we just have a monophonic pink noise wavefile, and ReplayGain automatically assumes you're playing it through both speakers, as it would any monophonic file.

Storing the Replay Gain

Under some listening conditions, it's useful to have every track sound equally loud. However, sometimes we want to leave the intentional loudness differences between tracks in place, whilst still correcting for unmusical and annoying changes in loudness between discs.

The Replay Gain proposal suggests that two different gain adjustments should be stored in the file header, as follows.

Track Replay Gain adjustment

This will make all tracks sound equally loud. If the ReplayGain is calculated on a track-by-track basis (i.e. an individual ReplayGain calculation is carried out for each track), this will be the result.

Album Replay Gain adjustment

The problem with the Track setting is that tracks which should be quiet will be brought up to the level of all the rest. For casual listening, or in a noisy background, this can be a good thing. For serious listening, it would be a nuisance. You don't want a solo flute track blasting at the same loudness as Iron Maiden!

To solve this problem, the Album setting represents the ideal listening gain for each track. ReplayGain can have a good guess at this too, by reading the entire CD, and calculating a single gain adjustment for the whole disc. This works because quiet tracks then stay quieter than the rest, since the gain won't be changed for each track. It still solves the basic problem (annoying, unwanted level differences between discs) because quiet or loud discs are still adjusted overall - so the pop CD that's 20 dB louder than the classical CD will be brought into line.

Replay gain data format

The calibration level of 83 dB can be added to the difference from the previous calculation, to yield the actual Replay Gain. NOTE: we store the differential, NOT the actual Replay Gain.

What to store

Three values must be stored.

  1. Peak signal amplitude
  2. Track = Replay Gain adjustment required to make all tracks equal loudness
  3. Album = Replay Gain adjustment required to give ideal listening loudness

If calculated on a track-by-track basis, ReplayGain yields (2). If calculated on a disc-by-disc basis, ReplayGain will usually yield (3).

To allow for future expansion: If more than three values are stored, players should ignore those they do not recognise, but process those that they do. If additional Replay Gain adjustments other than Track and Album are stored, they should come after Track and Album. The Peak Amplitude must always occupy the first 4 bytes of the Replay Gain header frame. The three values listed above (or at least fields to hold the three values, should the values themselves be unknown) are required in all Replay Gain headers.

Range

The replay gain adjustment must be between -51.0 dB and +51.0 dB. Values outside this range must be limited to be within the range, though they are certainly in error, and should probably be re-calculated, or stored as "not set". For example, trying to cause a silent 24-bit file to play at 83 dB will yield a replay gain adjustment of +57 dB.

In practice, adjustment values from -23 dB to +17 dB are the likely extremes, and values from -18 dB to +2 dB are more usual.

Bit format

Each Replay Gain value should be stored in a Replay Gain Adjustment field consisting of two bytes (16 bits). Here are two example Replay Gain Adjustment fields:

Track gain adjustment

0 0 1 0 1 1 1 0 0 1 1 1 1 1 0 1
\___/ \___/ | \_______________/
  |     |   |         |        
name    |  sign       |        
code    |  bit        |        
        |             |        
   originator         |        
      code            |        
                 Replay Gain   
                  Adjustment   

Album gain adjustment

0 1 0 0 1 0 0 0 0 0 0 1 0 1 0 0
\___/ \___/ | \_______________/
  |     |   |         |
name    |  sign       |
code    |  bit        |
        |             |
   originator         |
      code            |
                 Replay Gain
                  Adjustment

In the above examples, the Track Gain Adjustment is -12.5 dB, and was calculated automatically. The Album Gain Adjustment is +2.0 dB, and was set by the user.

Name code
000 = not set
001 = Track Gain Adjustment
010 = Album Gain Adjustment
other = reserved for future use

If space has been reserved for the Replay Gain in the file header, but no replay gain calculation has been carried out, then all bits (including the Name code) may be zero.

For each Replay Gain Adjustment field, if the name code = 000 (not set), then players should ignore the rest of that individual field.

For each Replay Gain Adjustment field, if the name code is an unrecognised value (i.e. not 001-Track or 010-Album), then players should ignore the rest of that individual field.

If no valid Replay Gain Adjustment fields are found (i.e. all name codes are either 000 or unknown), then the player should proceed as if the file contained no Replay Gain Adjustment information (see player requirements).

Originator code
000 = Replay Gain unspecified
001 = Replay Gain pre-set by artist/producer/mastering engineer
010 = Replay Gain set by user
011 = Replay Gain determined automatically, as described in Calculation (above)
other = reserved for future use

For each Replay Gain Adjustment field, if the name code is valid, but the Originator code is 000 (Replay Gain unspecified), then the player should ignore that Replay Gain adjustment field.

For each Replay Gain Adjustment field, if the name code is valid, but the Originator code is unknown, then the player should still use the information within that Replay Gain Adjustment field. This is because, even if we are unsure as to how the adjustment was determined, any valid Replay Gain adjustment is more useful than none at all.

If no valid Replay Gain Adjustment fields are found (i.e. all originator codes are 000), then the player should proceed as if the file contained no Replay Gain Adjustment information (see player requirements).

Sign bit
0 = positive gain (boost)
1 = negative gain (attenuation)

Replay Gain Adjustment

The value, multiplied by ten, stripped of its sign (since the + or - is stored in the "sign" bit), is represented in 9 bits. e.g. -3.1 dB becomes 31 = 000011111.
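The bit layout can be illustrated with a pair of helper functions. This is a sketch with hypothetical function names; it clamps out-of-range gains to the permitted range before encoding, and it never emits a set sign bit with a zero magnitude.

```python
def encode_adjustment(name_code, originator_code, gain_db):
    """Pack one Replay Gain Adjustment field into a 16-bit integer."""
    gain_db = max(-51.0, min(51.0, gain_db))    # limit to permitted range
    magnitude = int(round(abs(gain_db) * 10))   # tenths of a dB, 9 bits
    sign = 1 if gain_db < 0 and magnitude != 0 else 0
    return (name_code << 13) | (originator_code << 10) | (sign << 9) | magnitude

def decode_adjustment(field):
    """Unpack a 16-bit field into (name_code, originator_code, gain_db)."""
    name_code = (field >> 13) & 0b111
    originator_code = (field >> 10) & 0b111
    sign = (field >> 9) & 0b1
    gain_db = (field & 0x1FF) / 10.0
    return name_code, originator_code, -gain_db if sign else gain_db

# The two example fields from the diagrams above.
track_field = encode_adjustment(0b001, 0b011, -12.5)
album_field = encode_adjustment(0b010, 0b010, 2.0)
```

The results, `0b0010111001111101` and `0b0100100000010100`, match the bit patterns in the two diagrams.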

Default Value

$00 $00 (0000000000000000) should be used where no Replay Gain has been calculated or set. This value will be interpreted by players in the same manner as a file without a Replay Gain field in the header (see player requirements).

The values of xxxyyy0000000000 (where xxx is any name code, and yyy is any originator code) are all valid, but indicate that the Replay Gain is to be left at 83 dB (0 dB Replay Gain Adjustment). These are not default values, and should only be used where appropriate (e.g. where the user, producer, or Replay Gain calculation has indicated that the correct Replay Gain is 83 dB).

Illegal Values

The values xxxyyy1000000000 are all illegal. If encountered, players should treat them in the same manner as $00 $00 (the default value).
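Taken together, the rules for name codes, originator codes, default values, and illegal values suggest a player-side check along the following lines. The function and constant names are hypothetical, and returning `None` for "ignore this field" is a choice made for this sketch.

```python
NAME_TRACK = 0b001
NAME_ALBUM = 0b010

def usable_gain_db(field):
    """Return the gain in dB if a player should use this field, else None."""
    name_code = (field >> 13) & 0b111
    originator_code = (field >> 10) & 0b111
    sign = (field >> 9) & 0b1
    magnitude = field & 0x1FF
    if name_code not in (NAME_TRACK, NAME_ALBUM):
        return None   # not set, or a code reserved for future use
    if originator_code == 0b000:
        return None   # Replay Gain unspecified
    if sign == 1 and magnitude == 0:
        return None   # illegal value, treated like the default
    # Unknown (reserved) originator codes are deliberately accepted.
    gain_db = magnitude / 10.0
    return -gain_db if sign else gain_db
```

Note that a field with a valid name and originator code and a zero magnitude yields 0.0 rather than `None`, reflecting the distinction between "no adjustment needed" and "not set".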

The value $xx $ff is not illegal, but it would give a false synch value within an mp3 file. The problems this may cause should be investigated, and a solution (e.g. unsynchronisation) sought. Maybe this is a use for negative zero?

Peak amplitude data format

Scanning the file for the peak amplitude can be a time-consuming process. Therefore, it's helpful if this single value is stored within the file header. This can be used to check if the required replay gain adjustment will cause the file to clip.

Data Format

The maximum peak amplitude (a single value) should be stored as a 32-bit floating point number, where 1=digital full scale.

Uncompressed Files

Simply store the maximum absolute sample value held in the file (on any channel). The single sample value should be converted to a 32-bit float, such that digital full scale is equivalent to a value of 1.

Compressed files

Compressed audio does not exist as a waveform until it is decoded. Unfortunately, psychoacoustic coding of a heavily limited file can lead to sample values larger than digital full scale upon decoding. However, it is likely that such values will be brought back within range after scaling by the replay level. Even so, it is necessary to store the peak value of a compressed file as a 32-bit floating-point representation, where +/-1 represent digital full scale, and values outside this range would usually clip.

Implementation

For uncompressed files, the maximum values must be found and stored. For compressed files, the files must be decoded using a fully compliant decoder that allows peak overflows (i.e. has headroom), and the maximum value stored.
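A sketch of the peak measurement follows. It assumes that 16-bit full scale is taken as 32768 and that the float is written big-endian; the text specifies neither, and the function names are hypothetical.

```python
import struct

def peak_amplitude(samples, full_scale):
    """Largest absolute sample value, scaled so full scale equals 1.0."""
    return max(abs(s) for s in samples) / float(full_scale)

def pack_peak(peak):
    """Store the peak as a 32-bit float (big-endian is an assumption)."""
    return struct.pack(">f", peak)

# Hypothetical 16-bit samples from an uncompressed file.
pcm16 = [0, 12000, -16384, 8000]
peak = peak_amplitude(pcm16, 32768)

# Hypothetical decoder output from a compressed file, already in
# floating point, where a peak overshoots digital full scale.
decoded = [0.2, -1.25, 0.9]
decoded_peak = peak_amplitude(decoded, 1.0)
```

The second case shows why a float is used: a decoded peak of 1.25 can be stored as-is, which a fixed-point value limited to full scale could not represent.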

Replay Gain File Format

Three values must be stored.

  1. Peak signal amplitude
  2. Track = Replay Gain adjustment required to make all tracks equal loudness
  3. Album = Replay Gain adjustment required to give ideal listening loudness

Each audio file format represents a unique situation. All audio files would benefit from the inclusion of Replay Gain information. In the following list, the links take you to a suggested format for storing the 3 values within the file. Where there is no link, I'm awaiting suggestions!

  • .ape
  • .mp3 - ID3v2, LAME VBR proposed tag specification
  • .mpc
  • .ogg
  • .wav

ID3v2

The ID3v2 standard[3] defines a "tag" which is situated before the data in an mp3 file. The original ID3 (v1) tags resided at the end of the file, and contained a few fields of information. The ID3v2 tags can contain virtually limitless amounts of information, and new "frames" within the tags may be defined.

In the language of the ID3v2 standard document, the Replay Gain tag is thus:[4]

<Header for 'Replay Gain Adjustment', ID: "RGAD">
Peak Amplitude				$xx $xx $xx $xx
Track Replay Gain Adjustment		$xx $xx
Album Replay Gain Adjustment		$xx $xx

Header consists of:

Frame ID		$52 $47 $41 $44	= "RGAD"
Size			$00 $00 $00 $08
Flags			$40 $00		(%01000000 %00000000)

In the RGAD frame, the above specified flags value states that the frame should be preserved if the ID3v2 tag is altered, but discarded if the audio data is altered.
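For illustration, the frame described above could be assembled as follows. The function name is hypothetical, and big-endian byte order for the float and the two adjustment fields is an assumption (it follows the convention ID3v2 uses for its own multi-byte numbers).

```python
import struct

def build_rgad_frame(peak, track_field, album_field):
    """Assemble the 10-byte frame header followed by the 8-byte body."""
    # Body: 32-bit float peak, then two 16-bit adjustment fields.
    body = struct.pack(">fHH", peak, track_field, album_field)
    # Header: frame ID, 32-bit size, and the flags $40 $00.
    header = b"RGAD" + struct.pack(">I", len(body)) + bytes([0x40, 0x00])
    return header + body

# Peak of 0.5 with the example Track and Album fields shown earlier
# (0010111001111101 and 0100100000010100).
frame = build_rgad_frame(0.5, 0x2E7D, 0x4814)
```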

Player requirements

Figure 8: Possible Replay Gain control panel

The player processes the audio in three steps: scaling by the Replay Gain, applying the pre-amp, and clipping prevention. In practice, scaling and pre-amp can be carried out in a single step, where each sample is multiplied by a fixed amount. The clipping prevention need only be carried out if, after the first two adjustments, the peak signal amplitude is above digital full scale.

The three steps are appropriate to software players operating on the digital signal in order to scale it. However, it is possible to send the digital signal to the DAC without level correction, and to place an attenuator in the analogue signal path. The attenuator can then be driven by the Replay Gain value. Thus maximum signal to noise ratio is maintained in the digital signal and DAC process.

Scale audio to match Replay Gain

The Player reads the Replay Gain value, and scales the audio data as appropriate.

Reading the Replay Gain

First, the player needs to determine if the user requires Track style level normalization (all tracks same loudness), or Album style level normalization (all tracks "ideal" loudness). This option should be selectable in the Replay Gain control panel, and should default to Track.

Then the player reads the appropriate Replay Gain adjustment value from the file header, and converts it back to its original dB value. See the Replay Gain Data Format for more details.

The player also needs to read (or calculate) the Peak amplitude. This is required for Clipping prevention.

Scaling by the Replay Gain adjustment

Changing the level of an audio signal simply means multiplying each sample value by a constant value. This constant is given by:

scale=10.^(replay_gain/20);

Or, in words: ten to the power of (the replay gain divided by 20).

After any such operation, it's a good idea to dither the result. If this calculation and the pre-amp are implemented separately, then dither should only be added to the final result, just before the result is truncated back to 16 bits, or 24, or 8, as limited by the soundcard - not the file (i.e. after Replay Gain adjustment, an 8-bit file should be sent to a 16-bit soundcard at 16-bits).

If the Replay Gain information is absent...

Simply disabling Replay Gain control for tracks without Replay Gain information would cause these tracks to be louder than the others, so bringing back the original problem!

If neither (Track or Album) Gain adjustment is set, or if the track does not contain Replay Gain information, then the player should use an average of the previous 10 Replay Gains. This represents the typical loudness of tracks in the user's music collection, and is a much better estimate of the likely Replay Gain than 0 dB, or no adjustment at all.

If the file only contains one of the Replay Gain adjustments (e.g. Album) but the user has requested the other (Track), then the player should use the one that is available (in this case, Album).
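These fallback rules might be sketched as below. The class name is hypothetical, and two details are assumptions of this sketch rather than of the proposal: only gains actually read from files enter the history, and 0 dB is returned when there is no history yet.

```python
from collections import deque

class GainSelector:
    """Choose the gain for each track, with the fallbacks described above."""

    def __init__(self, mode="track"):
        self.mode = mode                 # "track" or "album"
        self.recent = deque(maxlen=10)   # last 10 gains read from files

    def choose(self, track_gain_db=None, album_gain_db=None):
        preferred, other = track_gain_db, album_gain_db
        if self.mode == "album":
            preferred, other = other, preferred
        # Use the requested adjustment, or the other one if it is missing.
        gain = preferred if preferred is not None else other
        if gain is not None:
            self.recent.append(gain)
            return gain
        if self.recent:
            return sum(self.recent) / len(self.recent)
        return 0.0  # nothing to average yet; a choice made for this sketch

selector = GainSelector()
first = selector.choose(track_gain_db=-8.0, album_gain_db=-6.0)
second = selector.choose(album_gain_db=-4.0)   # Track missing: use Album
third = selector.choose()                      # nothing stored: average
```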

Pre-amp

Most users who only play pop music will find that the level has been reduced too far for them. A selectable boost of 6 dB should be included by default, otherwise normal users may be disappointed by low output level. Knowledgeable users, or those playing classical music, will disable this. Some may even choose to decrease the level. For user friendliness, this part should be referred to as the "pre-amp".

Whilst the SMPTE calibration level we're using suggests that the average level of an audio track should be 20 dB below full scale (to leave room for peaks), some pop music is dynamically compressed to peak at 0 dB and average around -3 dB. This means that, when the Replay Gain is correctly set, the level of such tracks will be reduced by 17 dB! If users are listening to a mixture of highly compressed and not compressed tracks, then Replay Gain will make the listening experience more pleasurable, by bringing the level of the compressed tracks down into line with that of the others. However, if users are only listening to highly compressed music, then they are likely to complain that all their files are now too quiet.

To solve this problem, a Pre-amp should be incorporated into the player. This is basically just an adjustment to the scale factor we calculated in the previous section. It should default to a +6 dB boost. This means that casual users will find little change to the loudness of their compressed pop music (except that the occasional "problem" quiet track will now be clipped, compressed or not as loud as the rest), while power users and audiophiles can reduce the Pre-amp gain to enjoy all their music.

If the Pre-amp gain is set too high, peaks will be compressed (see Clipping Prevention [below]). However, this is exactly what radio stations do all the time, and many listeners like this sound.

Implementation

If enabled, simply read the user selected pre-amp gain, and scale the audio signal by the appropriate amount. For example, a +6 dB gain requires a scale of 10.^(6/20), which is approximately 2. The Replay Gain and Pre-amp scale factors can be multiplied together[5] for simplicity and ease of processing.
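A short sketch of the combined scaling, assuming floating-point samples where 1.0 is digital full scale (function names are hypothetical):

```python
def scale_factor(replay_gain_db, preamp_db=6.0):
    """Linear multiplier for the combined Replay Gain and pre-amp.

    Adding the two gains in dB is equivalent to multiplying their
    individual linear scale factors.
    """
    return 10.0 ** ((replay_gain_db + preamp_db) / 20.0)

def apply_gain(samples, replay_gain_db, preamp_db=6.0):
    """Scale every sample by the combined factor in a single pass."""
    factor = scale_factor(replay_gain_db, preamp_db)
    return [s * factor for s in samples]

preamp_only = scale_factor(0.0, 6.0)   # approximately 2
```

Dither and conversion back to the output bit depth would follow this step and are not shown.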

Clipping Prevention

The player should, by default, apply hard limiting (NOT CLIPPING) to any signal peaks that would go over full scale after the above two operations. This should be user defeatable, so that audiophile users can choose to decrease the overall level to avoid clipping, rather than limiting the signal.

Why might the signal clip?

There are 3 reasons:

  1. In coded audio (e.g. mp3 files) a file that was hard-limited to digital full scale before encoding will often be pushed over the limit by the psychoacoustic compression. A decoder with headroom can recover the over full scale signal by reducing the gain.
  2. Replay Gain will make loud dynamically compressed tracks quieter, and quiet dynamically uncompressed tracks louder. The average levels will then be similar, but the quiet tracks will actually have louder peaks. If the user pushes the pre-amp gain to maximum (which would take highly compressed pop music back to its original level), then the peaks of the (originally) quieter tracks will be pushed well over full scale.
  3. If a track has a very wide dynamic range, then even without turning up the pre-amp, the replay gain itself may instruct the player to turn the track up such that it would clip, simply because the average energy is so low, but the peak amplitude is very high. If anyone does find a recording which causes this with the pre-amp gain set at 0, please let me know!

What can we do about it?

The simple option is to let it clip! However, this isn't a good idea, as it'll sound awful. There are two solutions:

In situation 2 above, the user clearly wants all the music to sound very loud. To give them their wish, any signal which would peak above digital full scale should be hard limited at just below digital full scale. This is also useful at lower pre-amp gains, where it allows the average level of classical music to be raised to that of pop music, without distorting. This could be useful for making tapes for the car. The exact type of limiting/compression is up to the player, but something like the Hard Limiter found in Cool Edit Pro (Syntrillium) would be appropriate (for pop music at least).

The audiophile user will not want any compression or limiting on the signal. In this case the only option is to reduce the pre-amp gain (so that the scaling of the digital signal is lower than that suggested by the replay level). In order to maintain the consistency of level between tracks, the pre-amp gain should remain at this reduced level for subsequent tracks.

Implementation

If the Peak Level is stored in the header of the file, it is trivial to calculate if (following the Replay Gain adjustment and Pre-amp gain) the signal will clip at some point. If it won't, then no further action is necessary. If it will, then either the hard limiter should be enabled, or the pre-amp gain should be reduced accordingly before playing the track.
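The check, and the pre-amp reduction used when limiting is disabled, could look like the sketch below. It assumes the stored peak uses 1.0 for digital full scale, as in the peak amplitude format; the function names are hypothetical.

```python
import math

def will_clip(peak, replay_gain_db, preamp_db):
    """True if the scaled peak would exceed digital full scale (1.0)."""
    scaled_peak = peak * 10.0 ** ((replay_gain_db + preamp_db) / 20.0)
    return scaled_peak > 1.0

def preamp_without_clipping(peak, replay_gain_db, preamp_db):
    """Largest pre-amp gain (dB), not above the requested one, that
    keeps the scaled peak at or below digital full scale."""
    if peak <= 0.0:
        return preamp_db   # silence cannot clip
    # Gain in dB still available before the peak reaches full scale.
    headroom_db = -20.0 * math.log10(peak) - replay_gain_db
    return min(preamp_db, headroom_db)

# Hypothetical track: peak at 0.9 of full scale, Replay Gain of +2 dB.
reduced = preamp_without_clipping(0.9, 2.0, 6.0)
```

For this hypothetical track the requested +6 dB pre-amp would clip, so the sketch lowers it to roughly -1.1 dB. As the text notes, a player taking this route should keep the reduced setting for subsequent tracks.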

Hardware implementation

The above three steps are appropriate to software players operating on the digital signal in order to scale it. However, it is possible to send the digital signal to the DAC without level correction, and to place an attenuator in the analogue signal path. The attenuator can then be driven by the Replay Gain value. The clipping problem can be addressed by providing adequate headroom in the analogue circuitry. Bit transparency and maximum signal to noise ratio are maintained in the digital signal and DAC process.

Acknowledgements

The Replay Gain proposal was developed by David Robinson and was originally published 10 July 2001. Additional updates were published by David Robinson through 10 October 2001.

The following acknowledgement was included with the original proposal: "The algorithm to calculate an ideal replay gain has grown from my research into human hearing, with many additional ideas drawn from the work of E. Zwicker, and Brian Moore. I am currently completing my PhD at the University of Essex, and have been funded by the EPSRC."

Additionally, David Robinson credited Glen Sawyer (Snelg) and Jim Casaburi (Walrus) for software contributions and Bob Katz and Matt Ashland for ideas.

Notes

  1. SMPTE RP 200:2002 Relative and Absolute Sound Pressure Levels for Motion-Picture Multichannel Sound Systems – Applicable for Analog Photographic Film Audio, Digital Photographic Film Audio and D-Cinema
  2. This number (83 dB SPL) wasn't picked at random. It represents a comfortable average listening level, determined by professionals from years of listening. That reference level of -20 dB pink noise isn't random either. It causes the calibrated average level to be 20 dB less than the peak level. In other words, it leaves 20 dB of headroom for louder than average signals. So, if CDs were mastered this way, the average level would be around -20 dBFS, leaving lots of room for the dramatic peaks which make music exciting.
  3. The ID3v2 format is explained at www.id3.org. The most useful document is the ID3v2 v2.3.0 standard. Whilst this document has been superseded by v2.4.0, the earlier document is complete (rather than an update), and in indexed HTML form. As such, it represents a better technical introduction to ID3v2.
  4. This tag specification is not part of any version of the ID3 specification but is acknowledged as an "in the wild" tag by the ID3 standards organization.
  5. Scale factors in Decibel units are added to produce the same effect as multiplying scale factors in linear units.
