ReplayGain 1.0 specification

From Hydrogenaudio Knowledgebase
Revision as of 02:08, 13 February 2011 by Notat (Talk | contribs)

Not all recordings sound equally loud. Different musical moods require that some tracks sound louder than others, and the loudness of a given album has more to do with the year of issue or the whim of the producer than the intended emotional effect. Because of this, a random play through your music collection can have you leaping for the volume control every other track.

There is a remarkably simple solution to this annoyance, and that is to store the required replay gain for each track within the track. This concept is called "metadata" – data about data. It's already possible to store the title, artist, and CD track number within an mp3 file using the ID3 standard.

The ReplayGain specification is a standard that defines a replay gain on which MP3 encoders and players can agree, together with an automatic way to set the volume adjustment for each track. It sets out a simple way of calculating and representing the ideal replay gain for every track and album.

Loudness measurement

Loudness filter

Figure 1: Target response (blue), high-pass response (green) and composite response (red)

The human ear does not perceive sounds of all frequencies as having equal loudness. For example, a full scale sine wave at 1 kHz sounds much louder than a full scale sine wave at 100 Hz, even though the two have identical energy. To account for this, the signal is filtered by an inverted approximation to the equal loudness curves (sometimes referred to as Fletcher-Munson curves). The desired filter response derived from the equal loudness curves is shown in figure 1 (blue).

At higher frequencies a 10th order IIR filter designed by MATLAB's "yulewalk" function is an excellent approximation to our target. This is cascaded with a 2nd order Butterworth high pass filter, with a high pass frequency of 150 Hz. The resulting combined response (Figure 1 [red]) is close to our target response, and is used by ReplayGain.

Figure 2: IIR filter topology used by "yulewalk" and Butterworth filter components

The filter topology used for the components of the loudness filter is shown in figure 2. The filter coefficients for 48 and 44.1 kHz sample rates are given for the Butterworth and "yulewalk" components in tables 1 and 2 respectively. When using other sample rates, coefficients must be transformed to maintain the same filter response.

Table 1a: Butterworth filter coefficients (Fs=48 kHz)

                             b(0)    0.98621192462708
a(1)     1.97223372919527    b(1)   -1.97242384925416
a(2)    -0.97261396931306    b(2)    0.98621192462708

Table 1b: Butterworth filter coefficients (Fs=44.1 kHz)

                             b(0)    0.98500175787242
a(1)     1.96977855582618    b(1)   -1.97000351574484
a(2)    -0.97022847566350    b(2)    0.98500175787242

Table 2a: "Yulewalk" filter coefficients (Fs=48 kHz)

                             b(0)    0.03857599435200
a(1)     3.84664617118067    b(1)   -0.02160367184185
a(2)    -7.81501653005538    b(2)   -0.00123395316851
a(3)    11.34170355132042    b(3)   -0.00009291677959
a(4)   -13.05504219327545    b(4)   -0.01655260341619
a(5)    12.28759895145294    b(5)    0.02161526843274
a(6)    -9.48293806319790    b(6)   -0.02074045215285
a(7)     5.87257861775999    b(7)    0.00594298065125
a(8)    -2.75465861874613    b(8)    0.00306428023191
a(9)     0.86984376593551    b(9)    0.00012025322027
a(10)   -0.13919314567432    b(10)   0.00288463683916

Table 2b: "Yulewalk" filter coefficients (Fs=44.1 kHz)

                             b(0)    0.05418656406430
a(1)     3.47845948550071    b(1)   -0.02911007808948
a(2)    -6.36317777566148    b(2)   -0.00848709379851
a(3)     8.54751527471874    b(3)   -0.00851165645469
a(4)    -9.47693607801280    b(4)   -0.00834990904936
a(5)     8.81498681370155    b(5)    0.02245293253339
a(6)    -6.85401540936998    b(6)   -0.02596338512915
a(7)     4.39470996079559    b(7)    0.01624864962975
a(8)    -2.19611684890774    b(8)   -0.00240879051584
a(9)     0.75104302451432    b(9)    0.00674613682247
a(10)   -0.13149317958808    b(10)  -0.00187763777362

Input samples from the audio file to be analysed must be run in cascade manner through both of these filter components before being analysed further.
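
As a rough illustration of one stage of the cascade, the sketch below runs samples through the 2nd order Butterworth component using the 44.1 kHz coefficients from Table 1b. It assumes that the topology in Figure 2 adds the feedback terms, i.e. y[n] = sum of b(k)·x[n-k] plus sum of a(k)·y[n-k]. That sign convention is inferred from the fact that the tabulated a(k) values only give a stable filter when they are added, so treat it as an assumption. The function name `iir_filter` is illustrative and not part of the specification.

```python
# Butterworth high-pass coefficients for Fs = 44.1 kHz (Table 1b).
BUTTER_B = [0.98500175787242, -1.97000351574484, 0.98500175787242]
BUTTER_A = [1.96977855582618, -0.97022847566350]  # a(1), a(2)


def iir_filter(samples, b, a):
    """Direct-form IIR: y[n] = sum(b[k]*x[n-k]) + sum(a[k-1]*y[n-k]).

    Assumption: the feedback terms are ADDED, matching the sign of the
    tabulated a(k) values (the figure itself is not reproduced here).
    """
    x_hist = [0.0] * len(b)   # x[n], x[n-1], ...
    y_hist = [0.0] * len(a)   # y[n-1], y[n-2], ...
    out = []
    for x in samples:
        x_hist = [x] + x_hist[:-1]
        y = sum(bk * xk for bk, xk in zip(b, x_hist))
        y += sum(ak * yk for ak, yk in zip(a, y_hist))
        y_hist = [y] + y_hist[:-1]
        out.append(y)
    return out


# A constant (DC) input is rejected by the high-pass stage, while a
# signal alternating at the Nyquist frequency passes almost unchanged.
dc_out = iir_filter([1.0] * 5000, BUTTER_B, BUTTER_A)
nyquist_out = iir_filter([(-1.0) ** n for n in range(5000)],
                         BUTTER_B, BUTTER_A)
```

The "yulewalk" component could be applied with the same function and the Table 2 coefficients, feeding the output of one stage into the other.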

RMS level calculation

Next, the energy during each moment of the signal is determined by calculating the Root Mean Square of the filtered signal every 50ms.[1]

The signal is chopped into 50ms long blocks. Then, for each block:[2]

  1. Every sample value is squared (multiplied by itself).
  2. The mean average is taken.
  3. The square root of the average is calculated.

For stereo signals, in step 2, the mean average of all squared samples from both channels over the 50ms measurement interval is taken.[3]
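
The block-wise calculation can be sketched as follows. This is a minimal illustration, assuming the samples have already been passed through the loudness filter and are held as Python lists of floats; the name `block_rms` and the choice to drop a trailing partial block are assumptions of this sketch, not requirements of the specification.

```python
import math


def block_rms(left, right=None, sample_rate=44100, block_ms=50):
    """Return one RMS value per 50 ms block of (already filtered) samples.

    For stereo input the squared samples of both channels are pooled
    before the mean is taken. A trailing partial block is dropped,
    which is a simplification of this sketch.
    """
    block_len = sample_rate * block_ms // 1000
    channels = [left] if right is None else [left, right]
    values = []
    for start in range(0, len(left) - block_len + 1, block_len):
        squares = [
            s * s
            for ch in channels
            for s in ch[start:start + block_len]
        ]
        values.append(math.sqrt(sum(squares) / len(squares)))
    return values
```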

Statistical processing

Where the average energy level of a signal varies with time, the louder moments contribute most to our perception of overall loudness. For example, in human speech, over half the time is silence.

A good method to determine the overall perceived loudness is to sort the RMS values into numerical order, and then pick a value near the top of the list. For highly compressed pop music (e.g. Figure 5(c), where there are many values near the top), the choice makes little difference. For speech and classical music (Figures 5(a) and 5(b) respectively), the choice makes a huge difference. The value that most accurately matches human perception of loudness is the one 95% of the way up the sorted list, so this value is used by ReplayGain.
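
A minimal sketch of this selection step is shown below. The function name `loudness_from_rms` and the exact index rounding are assumptions; the text above only specifies that the value 95% of the way up the sorted list is used.

```python
def loudness_from_rms(rms_values, percent=95):
    """Sort the block RMS values and return the one 95% of the way up."""
    if not rms_values:
        raise ValueError("no RMS blocks to analyse")
    ordered = sorted(rms_values)
    # Integer arithmetic avoids floating-point surprises in the index.
    # The rounding rule itself is an assumption of this sketch.
    index = min(len(ordered) - 1, len(ordered) * percent // 100)
    return ordered[index]
```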

Reference level

The audio industry doesn't have a standard for playback system calibration, but in the movie industry a calibration standard has been defined by the Society of Motion Picture and Television Engineers (SMPTE).[4] The standard states that a single channel pink noise signal with an RMS level of -20 dB relative to a full-scale sinusoid[5] should be reproduced at 83 dB SPL.[6]

ReplayGain adapts the SMPTE calibration concept for music playback. Under RG, audio is played so that its loudness, as measured using the procedures described above, matches the loudness of a pink noise signal with an RMS level of -14 dB relative to a full-scale sinusoid,[7] also measured using the procedures described above.

In ReplayGain implementations, the reference level is described in terms of the SMPTE SPL playback level. By SMPTE definition, the 83 dB SPL reference corresponds to a -20 dB signal, i.e. 20 dB of system headroom. The -14 dB reference used by RG is 6 dB higher, so it corresponds to an 89 dB SPL playback level on a SMPTE calibrated system. RG is therefore said to operate with an 89 dB reference level.

SMPTE cinema calibration calls for a single channel of pink noise reproduced through a single loudspeaker. In music applications the ideal level of the music is actually the loudness when both speakers are in use. So, in ReplayGain, we calibrate to 2 channels of pink noise, because that's how loud we'd like the music to sound. In reality, we just have a monophonic pink noise wavefile, and ReplayGain automatically assumes you're playing it through both speakers, as it would any monophonic file.

Gain calculation

RG achieves loudness compensated playback by applying gain (or attenuation) dependent on the measured loudness of the recording relative to the established reference level. The gain is calculated as follows:

RG=L_{n14}-L

Where:

RG is the ReplayGain adjustment in decibels,
L_{n14} is the measured loudness of the -14 dB pink noise reference in decibels and
L is the measured loudness of the audio file.

The gain is positive if the loudness of the audio file is lower than the reference. The gain is negative (representing an attenuation) if the loudness of the audio file is higher than that of the pink noise reference. The gain is stored as metadata with the audio file as described below and is used by players to adjust the output volume of tracks as they are played.
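
As a small worked example of the formula, the sketch below computes the adjustment from two loudness figures in decibels. The numbers are purely illustrative and the function name `replay_gain_db` is hypothetical.

```python
def replay_gain_db(reference_loudness_db, file_loudness_db):
    """RG = L_n14 - L, with both loudness values in decibels."""
    return reference_loudness_db - file_loudness_db


# Illustrative figures only: a file measured 6.5 dB louder than the
# pink noise reference needs 6.5 dB of attenuation.
gain = replay_gain_db(-20.0, -13.5)
```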

Metadata

Four values must be stored.

  1. Peak track signal amplitude
  2. Peak album signal amplitude
  3. Track replay gain
  4. Album replay gain

If calculated on a track-by-track basis, loudness measurement yields track replay gain. If calculated on a disc-by-disc basis, loudness measurements yield album replay gain.

Replay gain

Under some listening conditions, it's useful to have every track sound equally loud. However, generally we want to leave the intentional loudness differences between tracks in place, yet still correct for unmusical and annoying changes in loudness between discs. ReplayGain suggests that two different gain adjustments should be stored as metadata with each sound file.

Track replay gain will make all tracks sound equally loud. If the replay gain is calculated on a track-by-track basis (i.e. an individual ReplayGain calculation is carried out for each track), this will be the result.

The problem with the track-by-track approach is that tracks which should be quiet in the context of the album on which they reside will be brought up to the level of all the rest. For casual listening, or in a noisy background, this can be a good thing. For serious listening, it does not respect the artistic intent: a tender ballad ends up blasting at the same loudness as a hard rock track on the same album.

Album replay gain represents the ideal listening gain for an entire album. ReplayGain reads the entire CD, and calculates a single gain adjustment for the whole disc. This works because quiet tracks then stay quieter than the rest, since the gain won't be changed for each track. It still solves the basic problem (annoying, unwanted level differences between discs) because quiet or loud discs are still adjusted overall - so the pop CD that's 20 dB louder than the classical CD will be brought into line.

Peak amplitude

Scanning the file for the peak amplitude can be a time-consuming process. Therefore, it's helpful if this single value is stored within the file header. This can be used to predict whether the required replay gain adjustment will cause clipping during playback.

The maximum peak amplitude value must be stored as a 32-bit floating point number, where 1.0 represents digital full scale. As with replay gain values, separate peak amplitude values are stored per track and per album.

For uncompressed files, simply store the maximum absolute sample value held in the file, taken across all channels and across both positive and negative excursions. The single sample value should be converted to a 32-bit float, such that digital full scale is equivalent to a value of 1.0.
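
For uncompressed integer PCM, the peak search might look like the sketch below. It assumes 16-bit signed samples with a full-scale value of 32768; both that default and the name `peak_amplitude` are assumptions made for illustration.

```python
def peak_amplitude(channels, full_scale=32768):
    """Largest absolute sample across all channels, scaled so that
    digital full scale is 1.0.

    `channels` is a list of integer sample sequences; the default
    full-scale value assumes 16-bit signed PCM.
    """
    peak = max((abs(s) for ch in channels for s in ch), default=0)
    return peak / full_scale
```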

Compressed audio does not exist as a sequence of samples until it is decoded. Unfortunately, psychoacoustic coding of a heavily limited file can lead to sample values larger than digital full scale upon decoding. It is necessary to store the peak value of a compressed file as a 32-bit floating-point representation, where 1.0 represents digital full scale, and higher values outside this range would normally result in clipping. The compressed files must be decoded using a fully compliant decoder that allows peak overflows (i.e. has headroom).

Metadata format

Each audio file format represents a unique situation. All audio files would benefit from the inclusion of Replay Gain information.

  • .3gp, .3g2 (MPEG-4 mobile)
  • .aac (Advanced Audio Coding raw format) - No metadata support (use .mp4 instead)
  • .aiff, .aif, .aifc (Audio Interchange File Format) - ID3v2 (in "ID3" IFF chunk)
  • .alac (Apple Lossless Audio Coder)
  • .ape, .apl (Monkey's Audio) - APEv2
  • .avi (Windows AV)
  • .bwf (Broadcast Wave Format) - ID3v2 (in RIFF chunk)
  • .flac (Free Lossless Audio Codec) - Vorbis comments
  • .mka (Matroska)
  • .mov, .qt (QuickTime)
  • .mp3 (MPEG audio layer 3) - ID3v2, LAME VBR proposed tag specification
  • .mp4, also .m4a, .m4b, .m4p, .m4r (MPEG-4 Part 14) - ID3v2 (in "ID32" box)
  • .mpg, .mpeg, .ps (MPEG-2)
  • .mpc (Musepack) - APEv2
  • .ogg (Ogg Vorbis) - Vorbis comments
  • .tta (True Audio) - ID3v2, APEv2
  • .wma (Windows Media audio) - Advanced Systems Format
  • .wav (Windows PCM) - No metadata support (use .bwf instead)
  • .wv (WavPack) - ID3v1, APEv2

ID3v2

The ID3v2 standard[8] defines a tag which is situated before the data in an mp3 file.[9] ID3 is used primarily with MP3 audio files but means of adapting the system to other file types have been developed.

The ID3v2 tag can contain virtually limitless amounts of information. The information is divided into frames. The preferred means of storing ReplayGain metadata is to use TXXX key/value pair frames. Two other legacy schemes for storing ReplayGain metadata exist: RGAD and RVA2. Players may choose to look for these formats if the TXXX format described here is not found in the ID3v2 tag. New players may write these older formats if they wish to be backwards compatible with older players.

ReplayGain uses four TXXX frames. The header of a TXXX frame is coded as follows:

Frame ID       $54 58 58 58 ("TXXX") 
Size           $xx xx xx xx (size of frame excluding this header)
Flags          $40 $00      (discard frame if audio data is altered)

Frame data is coded as follows:

Text encoding  $00          (ISO-8859-1 encoding)
Description    <key string> $00
Value          <value string>
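
The layout above can be sketched in code as follows. This is an illustration only: it assumes the ID3v2.3 frame header, where the size field is a plain big-endian 32-bit integer (ID3v2.4 uses a synchsafe integer instead), and both the function name `build_txxx_frame` and the key `EXAMPLE_KEY` are hypothetical placeholders.

```python
import struct


def build_txxx_frame(key, value):
    """Encode one TXXX frame as laid out above.

    Assumes the ID3v2.3 layout, where the size field is a plain
    big-endian 32-bit integer (ID3v2.4 uses a synchsafe integer).
    """
    data = (
        b"\x00"                             # text encoding: ISO-8859-1
        + key.encode("latin-1") + b"\x00"   # description, null-terminated
        + value.encode("latin-1")           # value string
    )
    # Header: frame ID, size of the data that follows, two flag bytes.
    header = b"TXXX" + struct.pack(">I", len(data)) + b"\x40\x00"
    return header + data


frame = build_txxx_frame("EXAMPLE_KEY", "-6.50 dB")
```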

The four frames associated with ReplayGain metadata use the following key/value pairs

Table 3: Metadata keys and value formatting

Metadata     Key                     Value
Track gain   REPLAYGAIN_TRACK_GAIN   [-]a.bb dB
Track peak   REPLAYGAIN_TRACK_PEAK   c.dddddd
Album gain   REPLAYGAIN_ALBUM_GAIN   [-]a.bb dB
Album peak   REPLAYGAIN_ALBUM_PEAK   c.dddddd

Gains are specified textually in decibels. Negative gains (attenuation) are prefixed with a '-'. Positive gains have no prefix. The integral portion of the gain (a) may be one or two numeric (0-9) digits. If there is no integral portion the field is '0'. The decimal portion of the gain (bb) is two numeric digits. Gains are suffixed with a space followed by 'dB'.

Peak levels are specified textually as a positive decimal. Peak level is a dimensionless quantity with 1.000000 representing full scale. The integer field (c) is typically 1 or 0. No suffix is included on this field. Six numeric digits in the decimal field (dddddd) are adequate to accurately represent peak values for 16-bit audio data.
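
A writer could produce these value strings along the following lines. The helper names `format_gain` and `format_peak` are hypothetical; the sketch simply applies the formatting rules stated above.

```python
def format_gain(gain_db):
    """Render a gain as '[-]a.bb dB' (two decimals, space, 'dB')."""
    return f"{gain_db:.2f} dB"


def format_peak(peak):
    """Render a peak as 'c.dddddd' (six decimals, no suffix)."""
    return f"{peak:.6f}"
```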

A robust player should be prepared to parse the following variations in either replay gain or peak level metadata:

  • Positive gains with leading '+'
  • More or fewer significant digits than specified
  • Leading zeros or spaces in integer fields
  • Missing or malformed 'dB' suffix (e.g. no space between numeric digits and suffix, alternate capitalization)
  • Alternate capitalization of keys

Other formatting errors indicate more severe problems and should result in the player ignoring the data as if the frame did not exist.
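
One possible way to parse values tolerantly is sketched below. The regular expressions and the function names `parse_gain` and `parse_peak` are assumptions of this sketch. They accept the variations listed above and return `None` for anything else, so the caller can treat the frame as absent.

```python
import re

# Optional sign, digits with optional decimal part, optional 'dB' suffix
# in any capitalization, with or without a separating space.
_GAIN_RE = re.compile(r"^\s*([+-]?\d*\.?\d+)\s*(?:dB)?\s*$", re.IGNORECASE)
_PEAK_RE = re.compile(r"^\s*\+?(\d*\.?\d+)\s*$")


def parse_gain(text):
    """Return the gain in dB as a float, or None if the value is malformed."""
    match = _GAIN_RE.match(text)
    return float(match.group(1)) if match else None


def parse_peak(text):
    """Return the peak as a float (1.0 = full scale), or None if malformed."""
    match = _PEAK_RE.match(text)
    return float(match.group(1)) if match else None
```

Key lookup would need similar care, for example by comparing keys case-insensitively.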

Vorbis comments

A Vorbis comment[10] uses an ASCII key=value format. The four ReplayGain metadata items are stored as separate comments. The keys and formatting for values are the same as specified for ID3v2. Keys and values are separated by '=' (equals character) as required by Vorbis comments.

APEv2

Editor's note: insert description of ReplayGain metadata format for APEv2.

Player requirements

Figure 8: Possible Replay Gain control panel

A player applies three steps to each track: scaling by the Replay Gain, a pre-amp adjustment, and clipping prevention. In practice, scaling and pre-amp can be carried out in a single step, where each sample is multiplied by a fixed amount. Clipping prevention need only be carried out if, after the first two adjustments, the peak signal amplitude is above digital full scale.

The three steps are appropriate to software players operating on the digital signal in order to scale it. However, it is possible to send the digital signal to the DAC without level correction, and to place an attenuator in the analogue signal path. The attenuator can then be driven by the Replay Gain value. Thus maximum signal to noise ratio is maintained in the digital signal and DAC process.

Scale audio to match Replay Gain

The Player reads the Replay Gain value, and scales the audio data as appropriate.

Reading the Replay Gain

First, the player needs to determine if the user requires Track style level normalization (all tracks same loudness), or Album style level normalization (all tracks "ideal" loudness). This option should be selectable in the Replay Gain control panel, and should default to Track.

Then the player reads the appropriate Replay Gain adjustment value from the file's metadata, and converts it back to its original dB value. See the Metadata format section above for more details.

The player also needs to read (or calculate) the Peak amplitude. This is required for Clipping prevention.

Scaling by the Replay Gain adjustment

Changing the level of an audio signal simply means multiplying each sample value by a constant value. This constant is given by:

scale=10.^(replay_gain/20);

Or, in words: ten to the power of (the replay gain divided by 20).
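
The same calculation, applied to a block of samples, might look like this in Python. It is a minimal sketch with hypothetical function names, and it leaves out dithering and truncation to the output word length.

```python
def gain_to_scale(replay_gain_db):
    """Convert a gain in dB to a linear multiplier: 10^(gain/20)."""
    return 10.0 ** (replay_gain_db / 20.0)


def apply_gain(samples, replay_gain_db):
    """Multiply every sample by the constant derived from the gain."""
    scale = gain_to_scale(replay_gain_db)
    return [s * scale for s in samples]
```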

After any such operation, it's a good idea to dither the result. If this calculation and the pre-amp are implemented separately, then dither should only be added to the final result, just before the result is truncated back to 16 bits, or 24, or 8, as limited by the soundcard - not the file (i.e. after Replay Gain adjustment, an 8-bit file should be sent to a 16-bit soundcard at 16-bits).

If the Replay Gain information is absent...

Simply disabling Replay Gain control for tracks without Replay Gain information would cause these tracks to be louder than the others, so bringing back the original problem!

If neither (Track or Album) Gain adjustment is set, or if the track does not contain Replay Gain information, then the player should use an average of the previous 10 Replay Gains. This represents the typical loudness of tracks in the user's music collection, and is a much better estimate of the likely Replay Gain than 0 dB, or no adjustment at all.

If the file only contains one of the Replay Gain adjustments (e.g. Album) but the user has requested the other (Track), then the player should use the one that is available (in this case, Album).
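
The fallback rules above could be implemented roughly as follows. This is a sketch under stated assumptions: the class name `GainSelector` is hypothetical, estimated gains are not fed back into the history, and 0 dB is used when no history exists yet (the text does not cover that case).

```python
from collections import deque


class GainSelector:
    """Pick the gain to apply to each track, with the fallbacks above."""

    def __init__(self, mode="track", history=10):
        self.mode = mode                      # "track" or "album"
        self.recent = deque(maxlen=history)   # last gains read from files

    def select(self, track_gain=None, album_gain=None):
        if self.mode == "track":
            preferred, other = track_gain, album_gain
        else:
            preferred, other = album_gain, track_gain
        # Use the requested gain, else whichever one is available.
        gain = preferred if preferred is not None else other
        if gain is None:
            # No metadata: average of the previous gains. Returning 0 dB
            # with an empty history is an assumption of this sketch.
            if not self.recent:
                return 0.0
            return sum(self.recent) / len(self.recent)
        self.recent.append(gain)
        return gain
```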

Pre-amp

Most users who only play pop music will find that the level has been reduced too far for them. A selectable boost of 6 dB should be included by default, otherwise normal users may be disappointed by low output level. Knowledgeable users, or those playing classical music, will disable this. Some may even choose to decrease the level. For user friendliness, this part should be referred to as the "pre-amp".

Although the SMPTE calibration level we're using suggests that the average level of an audio track should be 20 dB below full scale (to leave room for peaks), some pop music is dynamically compressed to peak at 0 dB and average around -3 dB. This means that, when the Replay Gain is correctly set, the level of such tracks will be reduced by 17 dB! If users are listening to a mixture of highly compressed and not compressed tracks, then Replay Gain will make the listening experience more pleasurable, by bringing the level of the compressed tracks down into line with that of the others. However, if users are only listening to highly compressed music, then they are likely to complain that all their files are now too quiet.

To solve this problem, a pre-amp should be incorporated into the player. This is basically just an adjustment to the scale factor calculated above. It should default to a +6 dB boost. This means that casual users will find little change to the loudness of their compressed pop music (except that the occasional "problem" quiet track will now be clipped, compressed or not as loud as the rest), while power users and audiophiles can reduce the pre-amp gain to enjoy all their music.

If the pre-amp gain is set too high, peaks will be compressed (see Clipping Prevention [below]). However, this is exactly what radio stations do all the time, and many listeners like this sound.

Implementation

If enabled, simply read the user selected pre-amp gain, and scale the audio signal by the appropriate amount. For example, a +6 dB gain requires a scale of 10.^(6/20), which is approximately 2. The Replay Gain and Pre-amp scale factors can be multiplied together[11] for simplicity and ease of processing.
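
The sketch below illustrates why the two adjustments can be merged: adding the gains in decibels before converting gives the same multiplier as converting each one and multiplying. The function name `combined_scale` is hypothetical, and the +6 dB default follows the recommendation above.

```python
def combined_scale(replay_gain_db, preamp_db=6.0):
    """Single scale factor for Replay Gain plus pre-amp."""
    return 10.0 ** ((replay_gain_db + preamp_db) / 20.0)


# Same result as multiplying the two linear scale factors.
separate = (10.0 ** (-8.0 / 20.0)) * (10.0 ** (6.0 / 20.0))
merged = combined_scale(-8.0, 6.0)
```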

Clipping Prevention

The player should, by default, apply hard limiting (NOT CLIPPING) to any signal peaks that would go over full scale after the above two operations. This should be user-defeatable, so that audiophile users can choose to decrease the overall level to avoid clipping, rather than limiting the signal.

Why might the signal clip?

There are 3 reasons:

  1. In coded audio (e.g. mp3 files) a file that was hard-limited to digital full scale before encoding will often be pushed over the limit by the psychoacoustic compression. A decoder with headroom can recover the over full scale signal by reducing the gain.
  2. Replay Gain will make loud dynamically compressed tracks quieter, and quiet dynamically uncompressed tracks louder. The average levels will then be similar, but the quiet tracks will actually have louder peaks. If the user pushes the pre-amp gain to maximum (which would take highly compressed pop music back to its original level), then the peaks of the (originally) quieter tracks will be pushed well over full scale.
  3. If a track has a very wide dynamic range, then even without turning up the pre-amp, the replay gain itself may instruct the player to turn the track up such that it would clip, simply because the average energy is so low, but the peak amplitude is very high. If anyone does find a recording which causes this with the pre-amp gain set at 0, please let me know!

What can we do about it?

The simple option is to let it clip! However, this isn't a good idea, as it'll sound awful. There are two solutions:

In situation 2 above, the user clearly wants all the music to sound very loud. To give them their wish, any signal which would peak above digital full scale should be hard limited at just below digital full scale. This is also useful at lower pre-amp gains, where it allows the average level of classical music to be raised to that of pop music, without distorting. This could be useful for making tapes for the car. The exact type of limiting/compression is up to the player, but something like the Hard Limiter found in Cool Edit Pro (Syntrillium) would be appropriate (for pop music at least).

The audiophile user will not want any compression or limiting on the signal. In this case the only option is to reduce the pre-amp gain (so that the scaling of the digital signal is lower than that suggested by the replay level). In order to maintain the consistency of level between tracks, the pre-amp gain should remain at this reduced level for subsequent tracks.

Implementation

If the Peak Level is stored in the header of the file, it is trivial to calculate if (following the Replay Gain adjustment and Pre-amp gain) the signal will clip at some point. If it won't, then no further action is necessary. If it will, then either the hard limiter should be enabled, or the pre-amp gain should be reduced accordingly before playing the track.
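
A minimal sketch of this check follows. It returns how many decibels the pre-amp would have to be reduced to keep the scaled peak at digital full scale, or zero if the track will not clip. The function name `clipping_adjustment_db` is hypothetical, and the caller decides whether to reduce the pre-amp or enable the limiter instead.

```python
import math


def clipping_adjustment_db(peak, replay_gain_db, preamp_db):
    """dB reduction needed so the scaled peak does not exceed 1.0.

    `peak` is the stored peak amplitude (1.0 = digital full scale).
    Returns 0.0 when the track will not clip.
    """
    scaled_peak = peak * 10.0 ** ((replay_gain_db + preamp_db) / 20.0)
    if scaled_peak <= 1.0:
        return 0.0
    return 20.0 * math.log10(scaled_peak)
```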

Hardware implementation

The above three steps are appropriate to software players operating on the digital signal in order to scale it. However, it is possible to send the digital signal to the DAC without level correction, and to place an attenuator in the analogue signal path. The attenuator can then be driven by the Replay Gain value. The clipping problem can be addressed by providing adequate headroom in the analogue circuitry. Bit transparency and maximum signal to noise ratio are maintained in the digital signal and DAC process.

Appendices

Acknowledgements

The ReplayGain proposal was developed by David Robinson and was originally published 10 July 2001. Additional updates were published by David Robinson through 10 October 2001.

The following acknowledgement was included with the original proposal, "The algorithm to calculate an ideal replay gain has grown from my research into human hearing, with many additional ideas drawn from the work of E. Zwicker, and Brian Moore. I am currently completing my PhD at the University of Essex, and have been funded by the EPSRC."

Additionally David Robinson credited Glen Sawyer (Snelg) and Jim Casaburi (Walrus) for software contributions and Bob Katz and Matt Ashland for ideas.

Notes

  1. The block length of 50ms was chosen after studying the effect of values between 25ms and 1s. 25ms was too short to accurately reflect the perceived loudness of some sounds. Beyond 50ms there was little change (after statistical processing). For this reason, 50ms was chosen.
  2. If you read those steps backwards, it should be clear why the process is called Root Mean Square (RMS) averaging.
  3. We could sum channels of a stereo signal to mono before calculating the RMS level, but then any out-of-phase components (having the opposite signal on each channel) would cancel out to zero (i.e. silence). That's not how we perceive them, so it's not a good solution.
  4. SMPTE RP 200:2002 - Relative and Absolute Sound Pressure Levels for Motion-Picture Multichannel Sound Systems – Applicable for Analog Photographic Film Audio, Digital Photographic Film Audio and D-Cinema
  5. "dB relative to a full-scale sinusoid" is preferred over "dBFS" as a unit of measure in this specification because there is some ambiguity whether the reference for dBFS is a full-scale square wave (peak reference) or a sine wave (RMS reference).
  6. Measured using a C-weighted, slow averaging SPL meter.
  7. The initial ReplayGain proposal used the same -20 dB reference used by SMPTE. The reference was raised to -14 dB early on in ReplayGain development. This reference is used in all current ReplayGain implementations.
  8. The ID3v2 format is explained at www.id3.org. The most useful document is the ID3v2 v2.3.0 standard. Although this document has been superseded by v2.4.0, the earlier document is complete (rather than an update), and in indexed HTML form. As such, it represents a better technical introduction to ID3v2.
  9. The original ID3 (v1) tags resided at the end of the file, and contained a few fields of information. The ID3v1 tag is not extensible and therefore cannot support ReplayGain metadata.
  10. Vorbis comment metadata format. ReplayGain metadata is documented on the Xiph Wiki.
  11. Scale factors in Decibel units are added to produce the same effect as multiplying scale factors in linear units.

References