ReplayGain 1.0 specification: Difference between revisions

From Hydrogenaudio Knowledgebase
(minor formatting)
(link to ReplayGain 2.0 specification)
 
(76 intermediate revisions by 10 users not shown)
Line 1: Line 1:
Not all recordings sound equally loud. Whilst different musical moods require that some tracks should sound louder than others, the loudness of a given album has more to do with the year of issue or the whim of the producer than the intended emotional effect. Because of this a random play through your music collection can have you leaping for the volume control every other track.
Although music is encoded to a digital format with a clearly defined maximum peak amplitude, and although most recordings are normalized to utilize this peak amplitude, not all recordings sound equally loud. This is because once this peak amplitude is reached, perceived loudness can be further increased through signal-processing techniques such as dynamic range compression and equalization.<ref>Source: Wikipedia - [http://en.wikipedia.org/wiki/Loudness_war Loudness war]</ref> Therefore, the loudness of a given album has more to do with the year of issue or the whim of the producer than the intended emotional effect. Because of this, a random play through a music collection can have one leaping for the volume control every other track.


There is a remarkably simple solution to this annoyance, and that is to store the required replay gain for each track within the track. This concept is called "MetaData" – data about data. It's already possible to store the title, artist, and CD track number within an mp3 file using the ID3 standard.
There is a solution to this annoyance: within each audio file, information can be stored about what volume change would be required to play each track or album at a standard loudness, and players can use this "replay gain" information to automatically nudge the volume up or down as required.


The Replay Gain specification is a standard which defines the appropriate replay gain which MP3 encoders and players agree on, and an automatic way to set the volume adjustment for each track.
The ReplayGain specification is a standard which defines an appropriate reference level, explains a way of calculating and representing the ideal replay gain for a given track or album, and provides guidance for players to make the required volume adjustment during playback. The standard also specifies a means to prevent clipping when the calculated replay gain exceeds the limits of digital audio, and it describes how the replay gain information is stored within audio files.
 
The Replay Gain proposal sets out a simple way of calculating and representing the ideal replay gain for every track and album.


==Loudness measurement==
==Loudness measurement==
Loudness is a subjective measure of the intensity of sound. The correlation of perceived loudness to sound pressure level is determined by the peculiarities of the auditory system. ReplayGain attempts to model those peculiarities with the following measurement procedure.


===Loudness filter===
===Loudness filter===
[[File:RG_Equal_loudness_all.gif‎|frame|Figure 1: Target response (blue), high-pass response (green) and composite response (red)]]
[[File:RG_Equal_loudness_all.gif‎|frame|Figure 1: Loudness filter target response (blue), high-pass response (green) and composite response (red)]]


The human ear does not perceive sounds of all frequencies as having equal loudness. For example, a full scale sine wave at 1 kHz sounds much louder than a full scale sine wave at 100 Hz, even though the two have identical energy. To account for this, the signal is filtered by an inverted approximation to the equal loudness curves (sometimes referred to as Fletcher-Munson curves). The desired filter response derived from the equal loudness curves is shown in figure 1 (blue).
The human ear does not perceive sounds of all frequencies as having equal loudness. For example, a full-scale sine wave at 1 kHz sounds much louder than a full-scale sine wave at 100 Hz, even though the two have identical energy. To account for this, the signal is filtered by an inverted approximation of the equal loudness curves (sometimes referred to as Fletcher&ndash;Munson curves) which describe the sensitivity of the ear as a function of frequency. The desired filter response derived from the equal loudness curves is shown in figure 1 (blue).


At higher frequencies a 10th order IIR filter designed by MATLAB's "yulewalk" function is an excellent approximation to our target. This is cascaded with a 2nd order Butterworth high pass filter, with a high pass frequency of 150 Hz. The resulting combined response (Figure 1 [red]) is close to our target response, and is used by Replay Gain.
At higher frequencies a 10th order IIR filter designed by MATLAB's "yulewalk" function is an excellent approximation to the target. This is cascaded with a 2nd order Butterworth high pass filter, with a high pass frequency of 150 Hz (Figure 1 [green]). The resulting combined response (Figure 1 [red]) is close to the target response, and is used by ReplayGain.


[[File:RG_IIR-filter.png|frame|Figure 2: IIR filter topology used by "yulewalk" and Butterworth filter components]]
[[File:RG_IIR-filter.png|frame|Figure 2: IIR filter topology used by "yulewalk" and Butterworth filter components]]
Line 107: Line 106:
<br style="clear:both" />
<br style="clear:both" />


===RMS energy calculation===
===RMS level calculation===
Next, the energy during each moment of the signal is determined by calculating the Root Mean Square of the waveform every 50ms. The block length of 50ms was chosen after studying the effect of values between 25ms and 1s. 25ms was too short to accurately reflect the perceived loudness of some sounds. Beyond 50ms there was little change (after statistical processing). For this reason, 50ms was chosen.
Next, the energy during each moment of the signal is determined by calculating the Root Mean Square (RMS) of the filtered signal every 50ms.<ref>The block length of 50ms was chosen after studying the effect of values between 25ms and 1s. 25ms was too short to accurately reflect the perceived loudness of some sounds. Beyond 50ms there was little change (after statistical processing). For this reason, 50ms was chosen.</ref>


It's easy to calculate the RMS energy over an entire audio file. Unfortunately, this value doesn't give a good indication of the perceived loudness of a signal. It's closer than that given by the peak amplitude, but it's still not good enough. For this reason, we have to calculate the RMS energy on a moment by moment basis, then do something useful with all that data.
The signal is chopped into 50ms long blocks. Then, for each block:<ref>If these steps are read backward, it should be clear why the process is called Root Mean Square averaging.</ref>
 
====General concept====
The signal is chopped into 50ms long blocks. Then, for each block:
# Every sample value is squared (multiplied by itself).
# Every sample value is squared (multiplied by itself).
# The mean average is taken.
# The mean average is taken.
# The square root of the average is calculated.
# The square root of the average is calculated.
If you read those steps backwards, it should be clear why the process is called Root Mean Square (RMS) averaging.


====Stereo files====
For stereo signals, in step 3, the mean average of all squared samples from both channels over the 50ms measurement interval is taken.<ref>One could sum channels of a stereo signal to mono before calculating the RMS level, but then any out-of-phase components (having the opposite signal on each channel) would cancel out to zero (i.e. silence). That's not how humans perceive them, so it's not a good solution.</ref>
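The three steps above, including the stereo rule, can be sketched in Python as follows. This is a minimal illustration, not the reference implementation: the function name <code>block_rms</code>, the list-of-lists channel layout and the choice to drop a trailing partial block are all assumptions made for the example, and the input is assumed to have already passed through the loudness filter.

```python
import math

def block_rms(channels, sample_rate, block_ms=50):
    """Return one RMS value per 50 ms block (hypothetical helper).

    channels: list of equal-length sample lists, one per channel.
    A trailing block shorter than block_ms is dropped (assumption).
    """
    block_len = int(sample_rate * block_ms / 1000)
    n = len(channels[0])
    values = []
    for start in range(0, n - block_len + 1, block_len):
        total = 0.0
        for ch in channels:
            for x in ch[start:start + block_len]:
                total += x * x                      # step 1: square each sample
        mean = total / (block_len * len(channels))  # step 2: mean over all channels
        values.append(math.sqrt(mean))              # step 3: square root
    return values
```

Because the squares from both channels are averaged together, an out-of-phase stereo pair yields the same RMS value as the equivalent mono signal rather than cancelling to silence.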
The only difficulty lies in what to do with stereo files. We could sum them to mono before calculating the RMS energy, but then any out-of-phase components (having the opposite signal on each channel) would cancel out to zero (i.e. silence). That's not how we perceive them, so it's not a good solution.


The alternative is to calculate two RMS values (once for each channel) and then add them. Unfortunately a Linear addition still doesn't give the same effect as our ears.
The result of this calculation is then converted to a decibel representation as follows:


We get the right answer if we add the means of the channel-signals before calculating the square root. In mixing pan-pot terms, we're using "equal power" rather than "equal voltage". If we also assume that any mono (single channel) signal will always be replayed over two loudspeakers, we can treat a mono signal as a pair of identical stereo signals. Hence a mono signal gives (a+a)/2 (i.e. a), while a stereo signal gives (a+b)/2, where a and b are the mean squared values for each channel. After this, we carry out the square root and conversion to dB.
:<math>L=20 \log_{10} \frac{2{L_{RMS}}}{L_{p-p}}</math>


===Statistical Processing===
Where:
[[File:RG_Statistical_speech.gif‎‎|frame|Figure 5: Histogram of speech]]


[[File:RG_Statistical_classic.gif‎‎|frame|Figure 6: Histogram of classical music]]
:<math>L_{RMS}</math> is the RMS value calculated above
:<math>L_{p-p}</math> is the maximum peak-to-peak range of the samples in the audio file
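As a hedged illustration of the conversion above, the formula can be written directly in Python. The function name <code>rms_to_db</code> and the handling of a silent block (returning negative infinity) are assumptions made for this sketch.

```python
import math

def rms_to_db(l_rms, l_pp):
    """Convert a block RMS value to decibels: L = 20*log10(2*L_RMS / L_p-p).

    l_rms: RMS value of the block
    l_pp:  maximum peak-to-peak range of the sample format
    """
    if l_rms <= 0.0:
        return float("-inf")  # silent block (assumed convention)
    return 20.0 * math.log10(2.0 * l_rms / l_pp)
```

For samples normalised to the range -1.0 to +1.0, the peak-to-peak range is 2, so the expression reduces to 20&nbsp;log<sub>10</sub> of the RMS value; a full-scale sine wave (RMS of 1/&radic;2) then measures roughly -3 dB.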


[[File:RG_Statistical_pop.gif‎‎|frame|Figure 7: Histogram of pop music]]
===Statistical processing===
Where the average energy level of a signal varies with time, the louder moments contribute most to perception of overall loudness. For example, in human speech, over half the time is silence, but the perceived loudness of speech is primarily determined by the levels between silences.


Where the average energy level of a signal varies with time, the louder moments contribute most to our perception of overall loudness. For example, in human speech, over half the time is silence, but this does not affect the perceived loudness of the talker at all! For this reason, the RMS values are sorted into numerical order, and the value 5% down the list is chosen to represent the overall perceived loudness of the signal.
A good method to determine the overall perceived loudness is to sort the RMS values into numerical order, and then pick a value near the top of the list. For highly compressed pop music (e.g. Figure 5(c), where there are many values near the top), the choice makes little difference. For speech and classical music (Figures 5(a) and 5(b) respectively), the choice makes a huge difference. The value which most accurately matches human perception of perceived loudness is 95%,<ref>Based on experiments performed by David Robinson, "I tried values from 70% to 95%. For highly compressed pop music, the choice makes little difference. For speech and classical music, the choice makes a huge difference. The value which most accurately matches human perception of perceived loudness is around 95%, so this value is used by Replay Level."</ref> so this value is used by ReplayGain.
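The selection step can be sketched as below. The specification text only says to sort the values and pick the one at 95%; the exact index convention used here (the nearest-rank method) and the function name <code>loudness_percentile</code> are assumptions, and real implementations may use a histogram instead of a full sort.

```python
def loudness_percentile(block_levels, percent=95):
    """Pick the value `percent` of the way up the sorted block levels.

    Uses the nearest-rank convention (assumption): the smallest value
    such that at least `percent` % of all blocks are at or below it.
    """
    if not block_levels:
        raise ValueError("no blocks to analyse")
    ordered = sorted(block_levels)
    rank = (len(ordered) * percent + 99) // 100  # integer ceil(n * percent / 100)
    return ordered[rank - 1]
```

With a speech-like input in which 60% of the blocks are near silence, the result is still taken from the loud blocks, which is the behaviour the paragraph above describes.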


Having calculated RMS signal levels every 50ms through the file, a single value must be calculated to represent the perceived loudness of the entire file. The histograms in Figures 5, 6 and 7 show how many times each RMS value occurred in each file.
<gallery caption="Figure 5: Loudness histograms">
File:RG_Statistical_speech.gif‎‎|(a) Speech
File:RG_Statistical_classic.gif‎‎|(b) Classical music
File:RG_Statistical_pop.gif‎‎|(c) Pop music
</gallery>


A good method to determine the overall perceived loudness is to sort the RMS energy values into numerical order, and then pick a value near the top of the list. For highly compressed pop music (e.g. Figure 7, where there are many values near the top), the choice makes little difference. For speech and classical music, the choice makes a huge difference. The value which most accurately matches human perception of perceived loudness is around 95%, so this value is used by Replay Gain.
==Reference level==
The audio industry does not have a standard for playback system calibration, but in the movie industry a calibration standard has been defined by the Society of Motion Picture and Television Engineers (SMPTE).<ref>SMPTE RP 200:2002 &ndash; Relative and Absolute Sound Pressure Levels for Motion-Picture Multichannel Sound Systems &ndash; Applicable for Analog Photographic Film Audio, Digital Photographic Film Audio and D-Cinema</ref> The standard states that a single channel pink noise signal with an RMS level of -20 dB relative to a full-scale sinusoid<ref>"dB relative to a full-scale sinusoid" is preferred over "dBFS" as a unit of measure in this specification because there is some ambiguity whether the reference for dBFS is a full-scale square wave (peak reference) or a sine wave (RMS reference).</ref> should be reproduced at 83 dB SPL.<ref>Measured using a C-weighted, slow averaging SPL meter.</ref>


<br style="clear:both" />
ReplayGain adapts the SMPTE calibration concept for music playback. Under ReplayGain, audio is played so that its loudness, as measured using the procedures described in [[#Loudness measurement|Loudness measurement]] above, matches the loudness of a pink noise signal with an RMS level of -14 dB relative to a full-scale sinusoid,<ref>The initial ReplayGain proposal used the same -20 dB reference used by SMPTE. The reference was raised to -14 dB early on in ReplayGain development. This reference is used in all current ReplayGain implementations.</ref> also measured using the procedures described above.


===Calibration with reference level===
In ReplayGain implementations, the reference level is described in terms of the SMPTE SPL playback level. By the SMPTE definition, the 83 dB SPL reference corresponds to a -20 dB reference signal, that is, 20 dB of system headroom. The -14 dB reference used by ReplayGain therefore corresponds to an 89 dB SPL playback level on a SMPTE calibrated system, and so ReplayGain is said to operate with an 89 dB reference level.
A suitable average replay level is 83 dB SPL. A calibration relating the energy of a digital signal to the real world replay level has been defined by SMPTE.<ref>SMPTE RP 200:2002  Relative and Absolute Sound Pressure Levels for Motion-Picture Multichannel Sound Systems – Applicable for Analog Photographic Film Audio, Digital Photographic Film Audio and D-Cinema</ref> Using this calibration, we subtract the current signal from the desired (calibrated) level to give the difference. We store this difference in the audio file.


====Finding a standard====
SMPTE cinema calibration calls for a single channel of pink noise reproduced through a single loudspeaker. In music applications, the ideal level of the music is actually the loudness when both speakers are in use. So, ReplayGain is calibrated to two channels of pink noise.<ref>In reality, a monophonic pink noise wave file is used, and ReplayGain automatically assumes the file is being played through both speakers, as would any monophonic file.</ref>
Having calculated a representative RMS energy value for the audio file, we now need to reference this to a real world sound pressure level. The audio industry doesn't have any standard for listening level, but the movie industry has worked to an 83 dB standard for years.<ref>This number (83 dB SPL) wasn't picked at random. It represents a comfortable average listening level, determined by professionals from years of listening. That reference level of -20 dB pink noise isn't random either. It causes the calibrated average level to be 20 dB less than the peak level. In other words, it leaves 20 dB of headroom for louder than average signals. So, if CDs were mastered this way, the average level would be around -20 dBFS, leaving lots of room for the dramatic peaks which make music exciting.</ref>


What the standard actually states is that a single channel pink noise signal, with an RMS energy level of -20 dB relative to a full scale sinusoid should be reproduced at 83 dB SPL (measured using a C-weighted, slow averaging SPL meter). In simple terms, this means that everyone can set their volume control to the same (known, calibrated) gain.
==Gain calculation==
ReplayGain achieves loudness compensated playback by applying gain (or attenuation) dependent on the measured loudness of the audio file relative to the established reference level. The gain is calculated as follows:
:<math>RG=L_{n14}-L</math>
Where all quantities are expressed in decibels:
:<math>RG</math> is the replay gain adjustment,
:<math>L_{n14}</math> is the measured loudness of the -14 dB pink noise reference and
:<math>L</math> is the measured loudness of the audio file.


====An ideal world...====
Replay gain is positive if the loudness of the audio file is lower than the pink noise reference. The gain is negative (representing an attenuation) if the loudness of the audio file is higher than that of the reference. The gain is stored as metadata with the audio file as described below and is used by players to adjust output volume of tracks as they are played as described in [[#Player requirements|Player requirements]] below.
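A small sketch of the gain calculation and its sign convention follows. The function names and the loudness figures used in the usage note are illustrative assumptions, not measured values; <code>db_to_linear</code> shows one common way a player might turn the decibel gain into a sample multiplier.

```python
def replay_gain(ref_loudness_db, file_loudness_db):
    """RG = L_n14 - L, with both loudness values in decibels."""
    return ref_loudness_db - file_loudness_db

def db_to_linear(gain_db):
    """Convert a decibel gain to a linear amplitude scale factor."""
    return 10.0 ** (gain_db / 20.0)
```

For instance, if the reference measured -20 dB and a loud track measured -12 dB, <code>replay_gain(-20.0, -12.0)</code> gives -8 dB (attenuation), while a quiet track measuring -26 dB would receive +6 dB.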
If the mastering engineer sets the levels on a CD using that calibrated volume control setting, that CD will sound best at that volume. If all CDs were mastered in such a way, they'd all sound best at that volume. If you (as a listener) didn't want to listen at that particular volume setting, you could always turn it down, but all CDs would still sound equally "turned down" at your preferred setting. You wouldn't have to change the volume setting between discs.


Reality check! We know CDs aren't made like this. There is NO audio standard replay level. So, here's the clever bit - here's the whole point of this website...
==Metadata==
For ReplayGain to do its work during playback, four values must be stored as metadata<ref>Metadata is "data about data." For example, the ID3 ''de facto'' standard provides a way to store artist, title, album title, track number, and other metadata in data blocks called "tags" immediately before or after the audio data in an MP3 file. Other metadata storage/tagging standards and conventions exist for other audio file formats.</ref> with or within the audio file:
# Peak track amplitude
# Peak album amplitude
# Track replay gain
# Album replay gain


====Fixing a non-ideal world====
If calculated for an individual track, the loudness measurement (as specified above) yields track replay gain. If calculated on an album basis, with all tracks concatenated to make one long audio file, the loudness measurement yields album replay gain.
We know the level should average around 83 dB SPL, and we know a -20 dB pink noise signal will give 83 dB SPL in a calibrated system. So, we send the pink noise signal through the ReplayGain program, and store the result (let's call it ref_Vrms). For every CD we process, the difference between the calculated value for that CD and ref_Vrms tells you how much you need to scale the signal in order to make it average 83 dB.


The actual process is quicker to do than to say!
===Replay gain===
Under some listening conditions, it's useful to have every track sound equally loud. The problem with a track-by-track approach is that tracks which should be quiet in the context of the album on which they reside will be brought up to the level of all the rest. For casual listening, or in a noisy background, this can be a good thing. For serious listening, it does not respect the intent of the artist or mastering engineer; a tender ballad track will be blasting at the same loudness as a hard rock track on the same album. It's generally ideal to leave the intentional loudness differences between tracks in place, yet still correct for unmusical and annoying loudness differences between albums. To accomplish this, ReplayGain suggests that two different gain adjustments should be stored as metadata with each sound file.


====One complication====
''Album replay gain'' represents the ideal listening gain for an entire album. ReplayGain reads the collection of tracks that comprise an album, and calculates a single replay gain for the whole set. This single gain can be used for playback of all tracks of the album. Intentionally quiet tracks then stay appropriately quieter than the rest. It still solves the basic problem (annoying, unwanted level differences between discs) because quiet or loud discs are still adjusted overall&mdash;so the pop CD that's 20 dB louder than the classical CD will be brought into line.
The system calibration uses a single channel of pink noise (reproduced through a single loudspeaker). You then play music through both loudspeakers. So, though we use 1 channel of pink noise to calibrate the system gain, the ideal level of the music is actually the loudness when both speakers are in use. So, in ReplayGain, we calibrate to 2 channels of pink noise, because that's how loud we'd like the music to sound. In reality, we just have a monophonic pink noise wavefile, and ReplayGain automatically assumes you're playing it through both speakers, as it would any monophonic file.


==Storing the Replay Gain==
===Peak amplitude===
Under some listening conditions, it's useful to have every track sound equally loud. However, sometimes we want to leave the intentional loudness differences between tracks in place, whilst still correcting for unmusical and annoying changes in loudness between discs.
Scanning a track or album for the peak amplitude can be a time-consuming process. Therefore, it's helpful if this single value is stored as metadata. This is used to predict whether the required replay gain adjustment will cause clipping during playback.


The Replay Gain suggests that two different gain adjustments should be stored in the file header, as follows.
The maximum peak amplitude value is stored as a floating point number, where 1.0 represents digital full scale. As with replay gain values, separate peak amplitude values are stored per track and per album.


===Track Replay Gain adjustment===
For uncompressed files, scanners simply store the maximum absolute sample value held in the file on any channel, for either positive or negative excursion. The single sample value should be converted to a floating-point representation, such that digital full scale is equivalent to a value of 1.0.


This will make all tracks sound equally loud. If the ReplayGain is calculated on a track-by-track basis (i.e. an individual ReplayGain calculation is carried out for each track), this will be the result.
Psychoacoustically coded audio, such as MP3, does not exist as a sequence of samples until it is decoded. Psychoacoustic coding of a heavily limited file can lead to sample values larger than digital full scale upon decoding. The coded files must be decoded using a fully compliant decoder that allows peak overflows (i.e. has headroom) and may result in peak amplitude values greater than 1.0.
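The peak measurement and the clipping prediction it enables can be sketched as follows. This assumes samples have already been converted to floating point with 1.0 as digital full scale; the function names are hypothetical, and what a player does once clipping is predicted is outside the scope of this example.

```python
def peak_amplitude(channels):
    """Maximum absolute sample value over all channels (1.0 = full scale)."""
    return max(abs(x) for ch in channels for x in ch)

def will_clip(peak, gain_db):
    """Predict whether applying gain_db would push the peak past full scale."""
    return peak * 10.0 ** (gain_db / 20.0) > 1.0
```

A stored peak of 0.8 with a +6 dB replay gain would be predicted to clip, whereas a stored peak of 0.5 with the same gain would not. For decoded lossy audio the stored peak may already exceed 1.0, in which case clipping is predicted even at 0 dB.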


===Album Replay Gain adjustment===
==Metadata format==
From the standpoint of metadata storage, each audio file format presents a unique situation. There are three favored schemes defined for storage of ReplayGain metadata: '''ID3v2''', '''Vorbis comments''' and '''APEv2'''. A survey of file formats is listed below with metadata schemes in order of preference for each:
* .aac (Advanced Audio Coding raw format) &ndash; No metadata support (use .mp4 instead)
* .aiff, .aif, .aifc (Apple Interchange File Format) &ndash; '''ID3v2''' (in "ID3" IFF chunk)
* .ape, .apl (Monkey's Audio) &ndash; '''APEv2'''
* .bwf (Broadcast Wave Format) &ndash; '''ID3v2''' (in RIFF chunk)
* .flac (Free Lossless Audio Codec) &ndash; '''Vorbis comments'''
* .mp3 (MPEG audio layer 3) &ndash; '''ID3v2''', LAME VBR proposed tag specification
* .mp4 also .m4a, .m4b, .m4p, .m4r (MPEG-4 Part 14) &ndash; '''ID3v2''' (in "ID32" box)
* .mpc (Musepack) &ndash; '''APEv2'''
* .ogg (Ogg Vorbis) &ndash; '''Vorbis comments'''
* .tta (True Audio) &ndash; '''ID3v2''', '''APEv2'''
* .wma (Windows Media Audio) &ndash; '''Vorbis comments''' in Extended Content Description Object
* .wav (Windows PCM) &ndash; No metadata support (use .bwf instead)
* .wv (WavPack) &ndash; '''APEv2'''


The problem with the Track setting is that tracks which should be quiet will be brought up to the level of all the rest. For casual listening, or in a noisy background, this can be a good thing. For serious listening, it would be a nuisance. You don't want a solo flute track blasting at the same loudness as Iron Maiden!
===ID3v2===
The ID3v2 standard<ref>The ID3v2 format is explained at [http://www.id3.org/ www.id3.org]. The most useful document is the [http://www.id3.org/id3v2.3.0.html ID3v2 v2.3.0 standard]. Although this document has been superseded by v2.4.0, the earlier document is complete (rather than an update), and in indexed HTML form. As such, it represents a better technical introduction to ID3v2.</ref> defines a ''tag'' which is situated before the data in an MP3 file.<ref>The original ID3 (v1) tags resided at the end of the file, and contained a few fields of information. The ID3v1 tag is not extensible and therefore cannot support ReplayGain metadata.</ref> ID3 is used primarily with MP3 audio files but means of adapting the system to other file types have been developed.


To solve this problem, the Album setting represents the ideal listening gain for each track. ReplayGain can have a good guess at this too, by reading the entire CD, and calculating a single gain adjustment for the whole disc. This works because quiet tracks then stay quieter than the rest, since the gain won't be changed for each track. It still solves the basic problem (annoying, unwanted level differences between discs) because quiet or loud discs are still adjusted overall - so the pop CD that's 20 dB louder than the classical CD will be brought into line.
The ID3v2 tag is divided into ''frames''. The preferred means of storing ReplayGain metadata is use of ''TXXX'' key/value pair frames. Two other legacy schemes for storing ReplayGain metadata exist: [[ReplayGain_legacy_metadata_formats#ID3v2_RGAD|RGAD]] and [[ReplayGain_legacy_metadata_formats#ID3v2_RVA2|RVA2]]. These formats are documented in the [[ReplayGain legacy metadata formats|appendix]]. Players may choose to look for these formats if metadata in the ''TXXX'' format is not found in the ID3v2 tag. New scanners may write these older formats in addition to the newer (TXXX) ones if they wish to remain backwards compatible with older players.


===Replay gain data format===
ReplayGain uses four TXXX frames. The header of a TXXX frame is coded as follows:
The calibration level of 83 dB can be added to the difference from the previous calculation, to yield the actual Replay Gain. NOTE: we store the differential, NOT the actual Replay Gain.


====What to store====
Frame ID      $54 58 58 58 ("TXXX")
Size          $xx xx xx xx (size of frame excluding this header)
Flags          $40 $00      (discard frame if audio data is altered)


Three values must be stored.
Frame data is coded as follows:
# Peak signal amplitude
# Track = Replay Gain adjustment required to make all tracks equal loudness
# Album = Replay Gain adjustment required to give ideal listening loudness
If calculated on a track-by-track basis, ReplayGain yields (2). If calculated on a disc-by-disc basis, ReplayGain will usually yield (3).


To allow for future expansion: If more than three values are stored, players should ignore those they do not recognise, but process those that they do. If additional Replay Gain adjustments other than Track and Album are stored, they should come after Track and Album. The Peak Amplitude must always occupy the first 4 bytes of the Replay Gain header frame. The three values listed above (or at least fields to hold the three values, should the values themselves be unknown) are required in all Replay Gain headers.
Text encoding  $00          (ISO-8859-1 encoding)
Description    <key string> $00
Value          <value string>
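The header and frame data layouts above can be combined into a short sketch that serialises one TXXX frame. It assumes ID3v2.3 conventions, in which the frame size is a plain 32-bit big-endian integer (ID3v2.4 uses a synchsafe integer instead); the function name <code>build_txxx_frame</code> is hypothetical, and writing the enclosing ID3v2 tag header is not shown.

```python
import struct

def build_txxx_frame(key, value):
    """Serialise one TXXX key/value frame (ID3v2.3 layout assumed)."""
    body = (
        b"\x00"                    # text encoding: ISO-8859-1
        + key.encode("latin-1")    # description (the key string)
        + b"\x00"                  # description terminator
        + value.encode("latin-1")  # value string
    )
    header = (
        b"TXXX"                          # frame ID $54 58 58 58
        + struct.pack(">I", len(body))   # size, excluding this 10-byte header
        + b"\x40\x00"                    # flags: discard if audio data is altered
    )
    return header + body
```

For example, <code>build_txxx_frame("REPLAYGAIN_TRACK_GAIN", "-6.50 dB")</code> produces a 10-byte header followed by 31 bytes of frame data.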


====Range====
The four frames associated with ReplayGain metadata use the following key/value pairs
The replay gain adjustment must be between -51.0 dB and +51.0 dB. Values outside this range must be limited to be within the range, though they are certainly in error, and should probably be re-calculated, or stored as "not set". For example, trying to cause a silent 24-bit file to play at 83 dB will yield a replay gain adjustment of +57 dB.


In practice, adjustment values from -23 dB to +17 dB are the likely extremes, and values from -18 dB to +2 dB are more usual.
{| class="wikitable"
|+Table 3: Metadata keys and value formatting
|-
!Metadata
!Key
!Value format
|-
|Track replay gain
|REPLAYGAIN_TRACK_GAIN
|[-]a.bb dB
|-
|Peak track amplitude
|REPLAYGAIN_TRACK_PEAK
|c.dddddd
|-
|Album replay gain
|REPLAYGAIN_ALBUM_GAIN
|[-]a.bb dB
|-
|Peak album amplitude
|REPLAYGAIN_ALBUM_PEAK
|c.dddddd
|}

====Bit format====
Each Replay Gain value should be stored in a Replay Gain Adjustment field consisting of two bytes (16 bits). Here are two example Replay Gain Adjustment fields:

Track gain adjustment

<tt>
0 0 1 0 1 1 1 0 0 1 1 1 1 1 0 1
\___/ \___/ | \_______________/
  |     |   |         |
name    |  sign       |
code    |  bit        |
        |             |
    originator        |
      code            |
                 Replay Gain
                 Adjustment
</tt>

Album gain adjustment
<tt>
0 1 0 0 1 0 0 0 0 0 0 1 0 1 0 0
\___/ \___/ | \_______________/
  |     |   |         |
name    |  sign       |
code    |  bit        |
        |             |
    originator        |
      code            |
                 Replay Gain
                 Adjustment
</tt>
 
In the above example, the Track Gain Adjustment is -12.5 dB, and was calculated automatically. The Album Gain Adjustment is +2.0 dB, and was set by the user.
 
=====Name code=====
:000 = not set
:001 = Track Gain Adjustment
:010 = Album Gain Adjustment
:other = reserved for future use
 
If space has been reserved for the Replay Gain in the file header, but no replay gain calculation has been carried out, then all bits (including the Name code) may be zero.
 
For each Replay Gain Adjustment field, if the name code = 000 (not set), then players should ignore the rest of that individual field.
 
For each Replay Gain Adjustment field, if the name code is an unrecognised value (i.e. not 001-Track or 010-Album), then players should ignore the rest of that individual field.
 
If no valid Replay Gain Adjustment fields are found (i.e. all name codes are either 000 or unknown), then the player should proceed as if the file contained no Replay Gain Adjustment information (see player requirements).
 
=====Originator code=====
:000 = Replay Gain unspecified
:001 = Replay Gain pre-set by artist/producer/mastering engineer
:010 = Replay Gain set by user
:011 = Replay Gain determined automatically, as described in Calculating (above)
:other = reserved for future use
For each Replay Gain Adjustment field, if the name code is valid, but the Originator code is 000 (Replay Gain unspecified), then the player should ignore that Replay Gain adjustment field.
 
For each Replay Gain Adjustment field, if the name code is valid, but the Originator code is unknown, then the player should still use the information within that Replay Gain Adjustment field. This is because, even if we are unsure as to how the adjustment was determined, any valid Replay Gain adjustment is more useful than none at all.
 
If no valid Replay Gain Adjustment fields are found (i.e. all originator codes are 000), then the player should proceed as if the file contained no Replay Gain Adjustment information (see player requirements).
 
=====Sign bit=====
:0 = positive gain (boost)
:1 = negative gain (attenuation)
 
=====Replay Gain Adjustment=====
The value, multiplied by ten, stripped of its sign (since the + or - is stored in the "sign" bit), is represented in 9 bits. e.g. -3.1 dB becomes 31 = 000011111.
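The 16-bit Replay Gain Adjustment field described above can be decoded with a few shifts and masks. This is a sketch under the rules stated in this section only: fields with an unset or unrecognised name code, or with originator code 000, are ignored (returned as <code>None</code>), while unknown non-zero originator codes are still accepted. The function and table names are hypothetical.

```python
NAME_CODES = {0b001: "track", 0b010: "album"}

def decode_adjustment(field):
    """Decode one 16-bit Replay Gain Adjustment field.

    Returns (name, originator_code, gain_db), or None if the field
    should be ignored by the player.
    """
    name = (field >> 13) & 0b111        # top 3 bits: name code
    originator = (field >> 10) & 0b111  # next 3 bits: originator code
    sign = (field >> 9) & 0b1           # 1 = negative gain (attenuation)
    magnitude = field & 0x1FF           # low 9 bits: |gain| * 10
    if name not in NAME_CODES or originator == 0:
        return None
    gain_db = magnitude / 10.0
    return NAME_CODES[name], originator, -gain_db if sign else gain_db
```

Applied to the two example fields in this section, the bit pattern 0010111001111101 decodes to a track gain of -12.5 dB with originator code 011 (determined automatically), and 0100100000010100 decodes to an album gain of +2.0 dB with originator code 010 (set by user).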
 
====Default Value====
$00 $00 (0000000000000000) should be used where no Replay Gain has been calculated or set. This value will be interpreted by players in the same manner as a file without a Replay Gain field in the header (see player requirements).


The values of xxxyyy0000000000 (where xxx is any name code, and yyy is any originator code) are all valid, but indicate that the Replay Gain is to be left at 83 dB (0 dB Replay Gain Adjustment). These are not default values, and should only be used where appropriate (e.g. where the user, producer, or Replay Gain calculation has indicated that the correct Replay Gain is 83 dB).
Gains are specified textually in decibels. Negative gains (attenuation) are prefixed with a '-'. Positive gains have no prefix. Integral portion of the gain (a) may be one or two numeric (0-9) digits. If there is no integral portion the field is '0'. The decimal portion of the gain (bb) is two numeric digits. Gains are suffixed with a space followed by 'dB'.


====Illegal Values====
Peak levels are specified textually as a positive decimal. Peak level is a dimensionless quantity with 1.000000 representing full scale. No suffix is included on peak values. The integer field (c) is typically 1 or 0. Six numeric digits in the decimal field (dddddd) is adequate to accurately represent peak values for 16-bit audio data.
The values xxxyyy1000000000 are all illegal. If encountered, players should treat them in the same manner as $00 $00 (the default value).


The value $xx $ff is not illegal, but it would give a false synch value within an mp3 file. The problems this may cause should be investigated, and a solution (e.g. unsynchronisation) sought. Maybe this is a use for negative zero?
A robust player should be prepared to parse the following variations in either replay gain or peak level metadata:
*Positive gains with leading '+'
*More or fewer significant digits than specified in any field
*Leading zeros or spaces in integer fields
*Missing or malformed 'dB' suffix (e.g. no space between numeric digits and suffix, alternate capitalization)
*Alternate capitalization of keys


===Peak amplitude data format===
Other formatting errors indicate more severe problems and should result in player ignoring data as if the frame did not exist.
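
A tolerant parser for the textual gain and peak values might look like the sketch below. The function names and the regular expressions are illustrative assumptions; they accept a leading '+', extra or missing digits, leading zeros or spaces, and a missing or oddly capitalised 'dB' suffix, and return <tt>None</tt> for anything else so that the caller can ignore the value.

```python
import re

# Hypothetical patterns: optional sign, digits with optional fraction, optional "dB".
_GAIN_RE = re.compile(r"^\s*([+-]?(?:\d+\.?\d*|\.\d+))\s*(?:dB)?\s*$", re.IGNORECASE)
_PEAK_RE = re.compile(r"^\s*\+?(\d+\.?\d*|\.\d+)\s*$")

def parse_gain(text):
    """Return the gain in dB as a float, or None if the text is malformed."""
    match = _GAIN_RE.match(text)
    return float(match.group(1)) if match else None

def parse_peak(text):
    """Return the peak level (1.0 = full scale), or None if malformed."""
    match = _PEAK_RE.match(text)
    return float(match.group(1)) if match else None

def format_gain(gain_db):
    """Write a gain with two decimal digits and a ' dB' suffix."""
    return f"{gain_db:.2f} dB"

def format_peak(peak):
    """Write a peak level with six decimal digits and no suffix."""
    return f"{peak:.6f}"
```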
Scanning the file for the peak amplitude can be a time-consuming process. Therefore, it's helpful if this single value is stored within the file header. This can be used to check if the required replay gain adjustment will cause the file to clip.


====Data Format====
===Vorbis comments===
The maximum peak amplitude (a single value) should be stored as a 32-bit floating point number, where 1=digital full scale.
A Vorbis comment<ref>[http://www.xiph.org/vorbis/doc/v-comment.html Vorbis comment metadata format]. ReplayGain metadata is documented on the [http://wiki.xiph.org/VorbisComment#Replay_Gain Xiph Wiki].</ref> uses an ASCII <tt>key=value</tt> format. When Vorbis comments are used, the four ReplayGain metadata items are stored as separate comments. The ''keys'' and the formatting of ''values'' are the same as specified for ID3v2. Keys and values are required by the Vorbis comment specification to be separated by '=' (equal character).
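
As an illustration, ReplayGain items could be picked out of a list of Vorbis comment strings as sketched below. The four key names are the ones in common use and are an assumption here, since this section does not list them; the function name is hypothetical. Keys are compared case-insensitively.

```python
# Commonly used key names (assumed; not listed in this section).
RG_KEYS = (
    "REPLAYGAIN_TRACK_GAIN",
    "REPLAYGAIN_TRACK_PEAK",
    "REPLAYGAIN_ALBUM_GAIN",
    "REPLAYGAIN_ALBUM_PEAK",
)

def extract_replaygain(comments):
    """Return {KEY: value} for ReplayGain comments found in 'key=value' strings."""
    found = {}
    for comment in comments:
        key, separator, value = comment.partition("=")
        if separator and key.upper() in RG_KEYS:
            found[key.upper()] = value
    return found
```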


====Uncompressed Files====
===APEv2===
Simply store the maximum absolute sample value held in the file (on any channel). The single sample value should be converted to a 32-bit float, such that digital full scale is equivalent to a value of 1.
The APEv2 metadata format<ref>[http://wiki.hydrogenaudio.org/index.php?title=APEv2_specification APEv2 Specification at Hydrogen Audio Wiki]</ref> also organizes data into key/value pairs. Keys are ASCII format. A flags field allows support for several value formats including UTF-8 and binary. Under APEv2, ReplayGain metadata is stored under the same keys as for ID3v2, with the values stored as ASCII text in the same format.
 
====Compressed files====
Compressed audio does not exist as a waveform until it is decoded. Unfortunately, psychoacoustic coding of a heavily limited file can lead to sample values larger than digital full scale upon decoding. However, it is likely that such values will be brought back within range after scaling by the replay level. Even so, it is necessary to store the peak value of a compressed file as a 32-bit floating-point representation, where +/-1 represent digital full scale, and values outside this range would usually clip.
 
====Implementation====
For uncompressed files, the maximum values must be found and stored. For compressed files, the files must be decoded using a fully compliant decoder that allows peak overflows (i.e. has headroom), and the maximum value stored.
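
A minimal sketch of the peak measurement for integer PCM follows. The function names are hypothetical, the default full-scale value of 32768 assumes 16-bit samples, and the big-endian byte order of the packed float is an assumption, as this section does not specify one.

```python
import struct

def peak_amplitude(samples, full_scale=32768):
    """Largest absolute sample value on any channel, scaled so full scale = 1."""
    return max((abs(s) for s in samples), default=0) / full_scale

def pack_peak(peak):
    """Encode the peak as a 32-bit float (big-endian assumed)."""
    return struct.pack(">f", peak)
```

For compressed files the same function would be applied to the decoder's output before it is clipped to the integer range, so that values above 1 are preserved.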
 
===Replay Gain File Format===
Three values must be stored.
# Peak signal amplitude
# Track = Replay Gain adjustment required to make all tracks equal loudness
# Album = Replay Gain adjustment required to give ideal listening loudness
 
Each audio file format represents a unique situation. All audio files would benefit from the inclusion of Replay Gain information. In the following list, the links take you to a suggested format for storing the 3 values within the file. Where there is no link, I'm awaiting suggestions!
* .ape
* .mp3 - ID3v2, LAME VBR proposed tag specification
* .mpc
* .ogg
* .wav
 
====ID3v2====
The ID3v2 standard<ref>The ID3v2 format is explained at [http://www.id3.org www.id3.org]. The most useful document is the [http://www.id3.org/id3v2.3.0.html ID3v2 v2.3.0 standard]. Whilst this document has been superseded by v2.4.0, the earlier document is complete (rather than an update), and in indexed HTML form. As such, it represents a better technical introduction to ID3v2.</ref> defines a "tag" which is situated before the data in an mp3 file. The original ID3 (v1) tags resided at the end of the file, and contained a few fields of information. The ID3v2 tags can contain virtually limitless amounts of information, and new "frames" within the tags may be defined. ID3 is used with MP3 and MP4 audio files.
 
In the language of the ID3v2 standard document, the Replay Gain tag is thus:<ref>This tag specification is not part of any version of the ID3 specification but is [http://www.id3.org/Replay_Gain_Adjustment acknowledged] as an "in the wild" tag by the ID3 standards organization.</ref>
 
<tt>
<Header for 'Replay Gain Adjustment', ID: "RGAD">
Peak Amplitude $xx $xx $xx $xx
Track Replay Gain Adjustment $xx $xx
Album Replay Gain Adjustment $xx $xx
</tt>
 
Header consists of:
 
<tt>
Frame ID $52 $47 $41 $44 = "RGAD"
Size $00 $00 $00 $08
Flags $40 $00 (%01000000 %00000000)
</tt>
 
In the RGAD frame, the above specified ''flags'' value states that the frame should be preserved if the ID3v2 tag is altered, but discarded if the audio data is altered.
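
The frame layout above could be assembled as in the following sketch. The function name is hypothetical; the two adjustment arguments are the already-packed 16-bit fields, and big-endian encoding of the float and the 16-bit fields is an assumption.

```python
import struct

def build_rgad_frame(peak, track_field, album_field):
    """Build an RGAD frame: 10-byte header followed by the 8-byte body."""
    # Body: 32-bit float peak, then two 16-bit adjustment fields (big-endian assumed).
    body = struct.pack(">fHH", peak, track_field, album_field)
    # Header: frame ID, 4-byte size, flags $40 $00.
    header = b"RGAD" + struct.pack(">I", len(body)) + bytes([0x40, 0x00])
    return header + body
```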


==Player requirements==
[[File:RG_Player_control.gif‎|frame|Figure 8: Possible Replay Gain control panel]]
[[File:RG_Player_control.gif‎|frame|Figure 8: Example ReplayGain control panel]]
 
In practice, scaling and pre-amp can be carried out in a single step, where each sample is multiplied by a fixed amount. The clipping prevention need only be carried out if, after the first two adjustments, the peak signal amplitude is above digital full scale.
 
The three steps are appropriate to software players operating on the digital signal in order to scale it. However, it is possible to send the digital signal to the DAC without level correction, and to place an attenuator in the analogue signal path. The attenuator can then be driven by the Replay Gain value. Thus maximum signal to noise ratio is maintained in the digital signal and DAC process.
 
===Scale audio to match Replay Gain===
The Player reads the Replay Gain value, and scales the audio data as appropriate.


====Reading the Replay Gain====
Loudness normalization, pre-amplification and clipping prevention are the operations performed by a ReplayGain player.
First, the player needs to determine if the user requires Track style level normalization (all tracks same loudness), or Album style level normalization (all tracks "ideal" loudness). This option should be selectable in the Replay Gain control panel, and should default to Track.


Then the player reads the appropriate Replay Gain adjustment value from the file header, and converts it back to its original dB value. See the Replay Gain Data Format for more details.
===Loudness normalization===
To properly normalize loudness, the player needs to determine if the user desires Track style level normalization (all tracks same loudness), or Album style level normalization (all albums same loudness, tracks of an album played at the same relative level as on the original release). This option should be selectable in the ReplayGain control panel (Figure 8). The player reads the corresponding gain metadata value from the file and scales the audio data as appropriate. Scaling the audio data simply means multiplying each sample value by a constant value. This constant is given by:


The player also needs to read (or calculate) the Peak amplitude. This is required for Clipping prevention.
:<math>10^\frac{gain}{20}</math>


====Scaling by the Replay Gain adjustment====
Or, in words, ten raised to the power of one-twentieth of replay gain.<ref> After any such operation, it's a good idea to dither the result. If this calculation and the pre-amp are implemented separately, then dither should only be added to the final result, just before the result is truncated back to 16 bits, or 24, or 8, as limited by the soundcard&mdash;not the file (i.e. after ReplayGain adjustment, an 8-bit file should be sent to a 16-bit soundcard at 16-bits).</ref>
Changing the level of an audio signal simply means multiplying each sample value by a constant value. This constant is given by:
:<tt>scale=10.^(replay_gain/20);</tt>
Or, in words: ten to the power of (the replay gain divided by 20).
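
The same calculation in Python, as a short sketch with hypothetical function names, operating on floating-point samples:

```python
def gain_to_scale(gain_db):
    """Convert a gain in dB to a linear scale factor."""
    return 10.0 ** (gain_db / 20.0)

def apply_gain(samples, gain_db):
    """Multiply every sample by the scale factor for the given gain."""
    scale = gain_to_scale(gain_db)
    return [s * scale for s in samples]
```

A gain of -20 dB therefore multiplies every sample by 0.1, and a gain of 0 dB leaves the samples unchanged.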


After any such operation, it's a good idea to dither the result. If this calculation and the pre-amp are implemented separately, then dither should only be added to the final result, just before the result is truncated back to 16 bits, or 24, or 8, as limited by the soundcard - not the file (i.e. after Replay Gain adjustment, an 8-bit file should be sent to a 16-bit soundcard at 16-bits).
If the file only contains one of the replay gain adjustments (e.g. Album) but the user has requested the other (Track), then the player should use the one that is available (in this case, Album). If neither (Track or Album) gain metadata is available, then the player needs to choose a suitable default gain. Potential choices include unity gain (0 dB) or an average of gains from other tracks in the album or playlist.
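
This fallback behaviour can be sketched as below. The function name and arguments are hypothetical, a missing value is represented by <tt>None</tt>, and the default of 0 dB is only one of the possible choices mentioned above.

```python
def select_gain(mode, track_gain=None, album_gain=None, default_gain=0.0):
    """Pick the gain for the requested mode ('track' or 'album'), with fallbacks."""
    if mode == "track":
        preferred, other = track_gain, album_gain
    else:
        preferred, other = album_gain, track_gain
    if preferred is not None:
        return preferred
    if other is not None:      # use whichever adjustment is available
        return other
    return default_gain        # neither present: caller-chosen default
```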


====If the Replay Gain information is absent...====
===Pre-amplification===
Simply disabling Replay Gain control for tracks without Replay Gain information would cause these tracks to be louder than the others, so bringing back the original problem!
Although the calibration level used by ReplayGain suggests that the average level of an audio track should be 14 dB below full scale, some pop music is dynamically compressed to peak at 0 dB and average around 3 dB below full scale. This means that, when the replay gain is applied, the level of such tracks will be reduced by 11 dB! If users are listening to a mixture of highly compressed and more dynamic tracks, ReplayGain will make the listening experience more pleasurable by bringing the level of the compressed tracks down into line with that of the others. However, if users are only listening to highly compressed music, then they may complain that all their files are now too quiet.<ref>This problem can be especially noticeable on portable players with limited output or gain.</ref>


If neither (Track or Album) Gain adjustment is set, or if the track does not contain Replay Gain information, then the player should use an average of the previous 10 Replay Gains. This represents the typical loudness of tracks in the user's music collection, and is a much better estimate of the likely Replay Gain than 0 dB, or no adjustment at all.
To address this problem, a pre-amp feature should be incorporated into the player. A user-supplied pre-amp setting is an adjustment to the calculated replay gain. It should default to perform no adjustment. This means that casual users will experience a moderate reduction in the loudness of their compressed pop music. Less-compressed material can generally be played at the same loudness without clipping. Normalization of more dynamic material may cause clipping or invoke the [[#Clipping prevention|clipping prevention]] mechanism (see below). Power users and audiophiles can reduce the pre-amp gain to enjoy the full dynamic range of all of their music.


If the file only contains one of the Replay Gain adjustments (e.g. Album) but the user has requested the other (Track), then the player should use the one that is available (in this case, Album).
If enabled, the player should read the user selected pre-amp gain, and scale the audio signal by the appropriate amount. For example, a +6 dB gain requires a scale of 10<sup>6/20</sup>, which is approximately 2. The replay gain and pre-amp scale factors can be combined<ref>Scale factors in  Decibel units are added to produce the same effect as multiplying scale factors in linear units.</ref> for simplicity and ease of processing.
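
A small sketch of combining the two adjustments follows; the function name is hypothetical. Adding the gains in decibels before converting gives the same factor as converting each and multiplying.

```python
import math

def combined_scale(replay_gain_db, preamp_db):
    """Single linear scale factor for replay gain plus pre-amp."""
    return 10.0 ** ((replay_gain_db + preamp_db) / 20.0)

# Adding in dB is equivalent to multiplying the linear factors.
separate = (10.0 ** (-8.0 / 20.0)) * (10.0 ** (6.0 / 20.0))
assert math.isclose(combined_scale(-8.0, 6.0), separate)
```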


===Pre-amp===
===Clipping prevention===
Most users who only play pop music will find that the level has been reduced too far for them. A selectable boost of 6 dB should be included by default, otherwise normal users may be disappointed by low output level. Knowledgeable users, or those playing classical music, will disable this. Some may even choose to decrease the level. For user friendliness, this part should be referred to as the "pre-amp".
ReplayGain's suggestion of a -14 dB average playback level leaves sufficient headroom for the bulk of modern recordings. Nevertheless, there exists the possibility that after application of replay gain and pre-amp adjustment, a track may exceed full scale during its dynamic peaks. Without intervention, this will result in clipping, a severe form of distortion. Factors introducing the possibility of clipping include:


Whilst the SMPTE calibration level we're using suggests that the average level of an audio track should be 20 dB below full scale (to leave room for peaks), some pop music is dynamically compressed to peak at 0 dB and average around -3 dB. This means that, when the Replay Gain is correctly set, the level of such tracks will be reduced by 17 dB! If users are listening to a mixture of highly compressed and not compressed tracks, then Replay Gain will make the listening experience more pleasurable, by bringing the level of the compressed tracks down into line with that of the others. However, if users are only listening to highly compressed music, then they are likely to complain that all their files are now too quiet.
# Recordings from certain genres and certain periods in the history of commercial recordings require additional headroom. Although these recordings can be accommodated through a downwards adjustment of the pre-amp setting, it may be difficult to determine a safe adjustment and it may be undesirable to lower average level to accommodate the rare track which requires it.
# ReplayGain will make loud dynamically compressed tracks quieter, and quiet dynamically uncompressed tracks louder. The average levels will then be similar, but the quiet tracks will actually have louder peaks. If the user pushes the pre-amp gain upwards the peaks of the (originally) quieter tracks will be pushed well over full scale.
# In coded audio (e.g. MP3 files) a file that was hard-limited to digital full scale before encoding will often be pushed over the limit by the psychoacoustic compression. A decoder with headroom can recover the over full scale signal by reducing the gain.


To solve this problem, a Pre-amp should be incorporated into the player. This is basically just an adjustment to the scale factor we calculated on the previous page. It should default to a +6 dB boost. This means that casual users will find little change to the loudness of their compressed pop music (except that the occasional "problem" quiet track will now be clipped, compressed or not as loud as the rest), while power users and audiophiles can reduce the Pre-amp gain to enjoy all their music.
ReplayGain suggests two possible solutions which prevent clipping in these situations. A player should support one or both of these.


If the Pre-amp gain is set too high, peaks will be compressed (see Clipping Prevention [below]). However, this is exactly what radio stations do all the time, and many listeners like this sound.
====Audio limiting====
In situation 2 above, the user clearly wants all the music to sound very loud. To give them their wish, any signal which would peak above digital full scale should be hard limited at just below digital full scale. This is also useful at lower pre-amp gains, where it allows the average level of classical music to be raised to that of pop music, without distorting. The exact nature of the limiting or compression is an implementation choice for the player.<ref>Something like the Hard Limiter found in Cool Edit Pro (Syntrillium) would be appropriate for pop music at least.</ref>


====Implementation====
====Reduced gain====
If enabled, simply read the user selected pre-amp gain, and scale the audio signal by the appropriate amount. For example, a +6 dB gain requires a scale of 10.^(6/20), which is approximately 2. The Replay Gain and Pre-amp scale factors can be multiplied together<ref>Scale factors in  Decibel units are added to produce the same effect as multiplying scale factors in linear units.</ref> for simplicity and ease of processing.
The audiophile user will not want any compression or limiting on the signal. In this case the only option is to automatically and temporarily reduce the pre-amp gain below the user-selected setting for tracks where clipping would otherwise occur. Clipping can be predicted by examining the peak level of the track or album being played.


===Clipping Prevention===
The player must read the peak amplitude metadata. If peak level metadata is unavailable, the player should assume a peak level of 1.0. If the peak level for both track and album is stored as metadata in the file, it is possible to calculate if, following the replay gain adjustment and pre-amp gain, the signal will clip at some point. If it won't, then no further action is necessary.  
The player should, by default, apply hard limiting (NOT CLIPPING) to any signal peaks that would go over full scale after the above two operations. This should be user defeatable, so that audiophile users can choose to decrease the overall level to avoid clipping, rather than limiting the signal.


====Why might the signal clip?====
An overall scale factor for loudness normalization taking into account replay gain, pre-amp setting and clipping prevention through gain reduction is given below.
There are 3 reasons:
# In coded audio (e.g. mp3 files) a file that was hard-limited to digital full scale before encoding will often be pushed over the limit by the psychoacoustic compression. A decoder with headroom can recover the over full scale signal by reducing the gain.
# Replay Gain will make loud dynamically compressed tracks quieter, and quiet dynamically uncompressed tracks louder. The average levels will then be similar, but the quiet tracks will actually have louder peaks. If the user pushes the pre-amp gain to maximum (which would take highly compressed pop music back to its original level), then the peaks of the (originally) quieter tracks will be pushed well over full scale.
# If a track has a very wide dynamic range, then even without turning up the pre-amp, the replay gain itself may instruct the player to turn the track up such that it would clip, simply because the average energy is so low, but the peak amplitude is very high. If anyone does find a recording which causes this with the pre-amp gain set at 0, please let me know!


====What can we do about it?====
:<math>min( 10^\frac{RG + G_{pre-amp}}{20}, \frac{1}{peak amplitude} )</math>
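
The formula above translates directly into the following sketch. The function name is hypothetical, the default peak of 1.0 follows the recommendation for missing peak metadata, and returning the unrestricted factor for a peak of zero (digital silence) is an assumption made to avoid dividing by zero.

```python
def playback_scale(replay_gain_db, preamp_db=0.0, peak=1.0):
    """Scale factor with clipping prevention by gain reduction."""
    wanted = 10.0 ** ((replay_gain_db + preamp_db) / 20.0)
    if peak <= 0.0:
        return wanted              # silence cannot clip (assumed handling)
    return min(wanted, 1.0 / peak)
```

With a stored peak of 1.0, any positive total gain is reduced to a factor of 1, while a track peaking at 0.25 can take the full +6 dB.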
The simple option is to let it clip! However, this isn't a good idea, as it'll sound awful. There are two solutions:
 
In situation 2 above, the user clearly wants all the music to sound very loud. To give them their wish, any signal which would peak above digital full scale should be hard limited at just below digital full scale. This is also useful at lower pre-amp gains, where it allows the average level of classical music to be raised to that of pop music, without distorting. This could be useful for making tapes for the car. The exact type of limiting/compression is up to the player, but something like the Hard Limiter found in Cool Edit Pro (Syntrillium) would be appropriate (for pop music at least).
 
The audiophile user will not want any compression or limiting on the signal. In this case the only option is to reduce the pre-amp gain (so that the scaling of the digital signal is lower than that suggested by the replay level). In order to maintain the consistency of level between tracks, the pre-amp gain should remain at this reduced level for subsequent tracks.
 
====Implementation====
If the Peak Level is stored in the header of the file, it is trivial to calculate if (following the Replay Gain adjustment and Pre-amp gain) the signal will clip at some point. If it won't, then no further action is necessary. If it will, then either the hard limiter should be enabled, or the pre-amp gain should be reduced accordingly before playing the track.


===Hardware implementation===
The above three steps are appropriate to software players operating on the digital signal in order to scale it. However, it is possible to send the digital signal to the DAC without level correction, and to place an attenuator in the analogue signal path. The attenuator can then be driven by the Replay Gain value. The clipping problem can be addressed by providing adequate headroom in the analog circuitry. Bit transparency and maximum signal to noise ratio are maintained in the digital signal and DAC process.<ref>A system using today's 24-bit converters is unlikely to appreciate any overall gain in system performance with such an arrangement. A digitally-controlled analog gain element typically introduces significant noise and distortion.</ref>


==Acknowledgements==
The Replay Gain proposal was developed by David Robinson and was originally published 10 July 2001. Additional updates were published by David Robinson through 10 October 2001.
The [http://replaygain.hydrogenaudio.org/proposal original ReplayGain proposal] (an [http://replay.waybackmachine.org/20090306202649/http://www.replaygain.org/ archive] is also available) was developed by David Robinson and was published 10 July 2001. Additional updates were published by David Robinson through 10 October 2001.
 
The following acknowledgement was included with the original proposal, "The algorithm to calculate an ideal replay gain has grown from my research into human hearing, with many additional ideas drawn from the work of E. Zwicker, and Brian Moore. I am currently completing my PhD at the University of Essex, and have been funded by the EPSRC." Additionally David Robinson credited Glen Sawyer (Snelg) and Jim Casaburi (Walrus) for software contributions and Bob Katz and Matt Ashland for ideas.


This updated ReplayGain specification reflecting current and recommended practice was prepared by Kevin Gross in 2011.


==Contact==
For ReplayGain-related questions or contributions, please post in the [http://www.hydrogenaudio.org/forums/index.php?showforum=1 General Audio] section of the Hydrogen Audio forums.
==Appendix==
# [[ReplayGain legacy metadata formats]]


==Notes==
<references />


==References==
== See also ==
*[http://replaygain.hydrogenaudio.org Original Replay Gain proposal by David Robinson]
: ''This is not a normative part of the specification.''
* [[ReplayGain 2.0 specification]] (draft)

Latest revision as of 14:37, 3 August 2023

Although music is encoded to a digital format with a clearly defined maximum peak amplitude, and although most recordings are normalized to utilize this peak amplitude, not all recordings sound equally loud. This is because once this peak amplitude is reached, perceived loudness can be further increased through signal-processing techniques such as dynamic range compression and equalization.[1] Therefore, the loudness of a given album has more to do with the year of issue or the whim of the producer than the intended emotional effect. Because of this, a random play through a music collection can have one leaping for the volume control every other track.

There is a solution to this annoyance: within each audio file, information can be stored about what volume change would be required to play each track or album at a standard loudness, and players can use this "replay gain" information to automatically nudge the volume up or down as required.

The ReplayGain specification is a standard which defines an appropriate reference level, explains a way of calculating and representing the ideal replay gain for a given track or album, and provides guidance for players to make the required volume adjustment during playback. The standard also specifies a means to prevent clipping when the calculated replay gain exceeds the limits of digital audio, and it describes how the replay gain information is stored within audio files.

Loudness measurement

Loudness is a subjective measure of the intensity of sound. The correlation of perceived loudness to sound pressure level is determined by the peculiarities of the auditory system. ReplayGain attempts to model those peculiarities with the following measurement procedure.

Loudness filter

Figure 1: Loudness filter target response (blue), high-pass response (green) and composite response (red)

The human ear does not perceive sounds of all frequencies as having equal loudness. For example, a full-scale sine wave at 1 kHz sounds much louder than a full scale sine wave at 100 Hz, even though the two have identical energy. To account for this, the signal is filtered by an inverted approximation of the equal loudness curves (sometimes referred to as Fletcher–Munson curves) which describe the sensitivity of the ear as a function of frequency. The desired filter response derived from the equal loudness curves is shown in figure 1 (blue).

At higher frequencies a 10th order IIR filter designed by MATLAB's "yulewalk" function is an excellent approximation to the target. This is cascaded with a 2nd order Butterworth high pass filter, with a high pass frequency of 150 Hz (Figure 1 [green]). The resulting combined response (Figure 1 [red]) is close to the target response, and is used by ReplayGain.

Figure 2: IIR filter topology used by "yulewalk" and Butterworth filter components

The filter topology used for the components of the loudness filter is shown in figure 2. The filter coefficients for 48 and 44.1 kHz sample rates are given for the Butterworth and "yulewalk" components in tables 1 and 2 respectively. When using other sample rates, coefficients must be transformed to maintain the same filter response.

Table 1a: Butterworth filter coefficients (Fs=48 kHz)
b(0) 0.98621192462708
a(1) 1.97223372919527 b(1) -1.97242384925416
a(2) -0.97261396931306 b(2) 0.98621192462708
Table 1b: Butterworth filter coefficients (Fs=44.1 kHz)
b(0) 0.98500175787242
a(1) 1.96977855582618 b(1) -1.97000351574484
a(2) -0.97022847566350 b(2) 0.98500175787242
Table 2a: "Yulewalk" filter coefficients (Fs=48 kHz)
b(0) 0.03857599435200
a(1) 3.84664617118067 b(1) -0.02160367184185
a(2) -7.81501653005538 b(2) -0.00123395316851
a(3) 11.34170355132042 b(3) -0.00009291677959
a(4) -13.05504219327545 b(4) -0.01655260341619
a(5) 12.28759895145294 b(5) 0.02161526843274
a(6) -9.48293806319790 b(6) -0.02074045215285
a(7) 5.87257861775999 b(7) 0.00594298065125
a(8) -2.75465861874613 b(8) 0.00306428023191
a(9) 0.86984376593551 b(9) 0.00012025322027
a(10) -0.13919314567432 b(10) 0.00288463683916
Table 2b: "Yulewalk" filter coefficients (Fs=44.1 kHz)
b(0) 0.05418656406430
a(1) 3.47845948550071 b(1) -0.02911007808948
a(2) -6.36317777566148 b(2) -0.00848709379851
a(3) 8.54751527471874 b(3) -0.00851165645469
a(4) -9.47693607801280 b(4) -0.00834990904936
a(5) 8.81498681370155 b(5) 0.02245293253339
a(6) -6.85401540936998 b(6) -0.02596338512915
a(7) 4.39470996079559 b(7) 0.01624864962975
a(8) -2.19611684890774 b(8) -0.00240879051584
a(9) 0.75104302451432 b(9) 0.00674613682247
a(10) -0.13149317958808 b(10) -0.00187763777362

Input samples from the audio file to be analysed must be run in cascade manner through both of these filter components before being analysed further.
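
As an illustration, the Butterworth component can be applied with the sketch below, using the 48 kHz coefficients from table 1a. The function name is hypothetical. Because the filter topology figure is not reproduced here, the sign convention is an assumption inferred from the signs of the tabulated a(k) values: the feedback terms are added, i.e. y[n] = b(0)x[n] + b(1)x[n-1] + ... + a(1)y[n-1] + a(2)y[n-2] + ...

```python
# Table 1a coefficients (Fs = 48 kHz).
BUTTER_B_48K = (0.98621192462708, -1.97242384925416, 0.98621192462708)
BUTTER_A_48K = (1.97223372919527, -0.97261396931306)  # a(1), a(2)

def iir_filter(b, a, samples):
    """Apply an IIR filter; feedback terms are ADDED (assumed convention)."""
    x_hist = [0.0] * (len(b) - 1)   # x[n-1], x[n-2], ...
    y_hist = [0.0] * len(a)         # y[n-1], y[n-2], ...
    out = []
    for x in samples:
        y = b[0] * x
        y += sum(bk * xk for bk, xk in zip(b[1:], x_hist))
        y += sum(ak * yk for ak, yk in zip(a, y_hist))
        x_hist = [x] + x_hist[:-1]
        y_hist = [y] + y_hist[:-1]
        out.append(y)
    return out
```

Under the same assumption, the "yulewalk" coefficients from table 2 would be passed to the same function, and its output fed into the Butterworth stage to form the cascade.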

RMS level calculation

Next, the energy during each moment of the signal is determined by calculating the Root Mean Square (RMS) of the filtered signal every 50ms.[2]

The signal is chopped into 50ms long blocks. Then, for each block:[3]

  1. Every sample value is squared (multiplied by itself).
  2. The mean average is taken.
  3. The square root of the average is calculated.

For stereo signals, in step 2, the mean average of all squared samples from both channels over the 50ms measurement interval is taken.[4]
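
The block-wise RMS calculation can be sketched as follows. The function name is hypothetical, the input is assumed to be the already-filtered signal as one list of samples per channel, and dropping a trailing block shorter than 50ms is an assumption.

```python
import math

def block_rms(channels, sample_rate, block_seconds=0.05):
    """RMS of each 50 ms block, pooling the squared samples of all channels."""
    block_len = int(round(sample_rate * block_seconds))
    total = len(channels[0])
    values = []
    # A trailing partial block is dropped (assumed handling).
    for start in range(0, total - block_len + 1, block_len):
        squares = [s * s for ch in channels for s in ch[start:start + block_len]]
        values.append(math.sqrt(sum(squares) / len(squares)))
    return values
```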

The result of this calculation is then converted to a decibel representation as follows:

Where:

is the RMS value calculated above
is the maximum peak-to-peak range of the samples in the audio file

Statistical processing

Where the average energy level of a signal varies with time, the louder moments contribute most to perception of overall loudness. For example, in human speech, over half the time is silence, but the perceived loudness of speech is primarily determined by the levels between silences.

A good method to determine the overall perceived loudness is to sort the RMS values into numerical order, and then pick a value near the top of the list. For highly compressed pop music (e.g. Figure 5(c), where there are many values near the top), the choice makes little difference. For speech and classical music (Figures 5(a) and 5(b) respectively), the choice makes a huge difference. The value which most accurately matches human perception of perceived loudness is 95%,[5] so this value is used by ReplayGain.
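
A sketch of this selection is shown below. The function name is hypothetical, and the exact rule for turning 95% into a list index is an assumption, as the text does not specify how the position is rounded.

```python
def overall_loudness(block_values, percentile=0.95):
    """Sort the per-block values and pick the one 95% of the way up the list."""
    if not block_values:
        raise ValueError("no blocks to analyse")
    ordered = sorted(block_values)
    # Index rounding is an assumed convention; clamp to the last element.
    index = min(len(ordered) - 1, int(len(ordered) * percentile))
    return ordered[index]
```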

Reference level

The audio industry does not have a standard for playback system calibration, but in the movie industry a calibration standard has been defined by the Society of Motion Picture and Television Engineers (SMPTE).[6] The standard states that a single channel pink noise signal with an RMS level of -20 dB relative to a full-scale sinusoid[7] should be reproduced at 83 dB SPL.[8]

ReplayGain adapts the SMPTE calibration concept for music playback. Under ReplayGain, audio is played so that its loudness, as measured using the procedures described in Loudness measurement above, matches the loudness of a pink noise signal with an RMS level of -14 dB relative to a full-scale sinusoid,[9] also measured using the procedures described above.

In ReplayGain implementations, the reference level is described in terms of the SMPTE SPL playback level. By the SMPTE definition, the 83 dB SPL reference corresponds to a -20 dB reference signal, i.e. 20 dB of system headroom. The -14 dB reference used by ReplayGain is 6 dB higher and therefore corresponds to an 89 dB SPL playback level on a SMPTE calibrated system, so ReplayGain is said to operate with an 89 dB reference level.

SMPTE cinema calibration calls for a single channel of pink noise reproduced through a single loudspeaker. In music applications, the ideal level of the music is actually the loudness when both speakers are in use. So, ReplayGain is calibrated to two channels of pink noise.[10]

Gain calculation

ReplayGain achieves loudness-compensated playback by applying gain (or attenuation) dependent on the measured loudness of the audio file relative to the established reference level. The gain is calculated as follows:

$$G = L_{\mathrm{ref}} - L_{\mathrm{file}}$$

Where all quantities are expressed in decibels:

$G$ is the replay gain adjustment,
$L_{\mathrm{ref}}$ is the measured loudness of the -14 dB pink noise reference and
$L_{\mathrm{file}}$ is the measured loudness of the audio file.

Replay gain is positive if the loudness of the audio file is lower than the pink noise reference. The gain is negative (representing an attenuation) if the loudness of the audio file is higher than that of the reference. The gain is stored as metadata with the audio file as described below and is used by players to adjust output volume of tracks as they are played as described in Player requirements below.
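As a small illustration of the sign convention described above, the calculation could be written as follows. The function and parameter names are hypothetical; both loudness values are assumed to come from the same measurement procedure and to be expressed in decibels.

```python
def replay_gain_db(reference_loudness_db, file_loudness_db):
    """Gain (in dB) that brings the file's loudness to the reference loudness."""
    return reference_loudness_db - file_loudness_db

# A file measured 6 dB louder than the reference is attenuated by 6 dB;
# a file measured 6 dB quieter is boosted by 6 dB.
louder = replay_gain_db(-14.0, -8.0)    # negative: attenuation
quieter = replay_gain_db(-14.0, -20.0)  # positive: gain
```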

Metadata

For ReplayGain to do its work during playback, four values must be stored as metadata[11] with or within the audio file:

  1. Peak track amplitude
  2. Peak album amplitude
  3. Track replay gain
  4. Album replay gain

If calculated for an individual track, the loudness measurement (as specified above) yields track replay gain. If calculated on an album basis, with all tracks concatenated to make one long audio file, the loudness measurement yields album replay gain.

Replay gain

Under some listening conditions, it's useful to have every track sound equally loud. The problem with a track-by-track approach is that tracks which should be quiet in the context of the album on which they reside will be brought up to the level of all the rest. For casual listening, or in a noisy background, this can be a good thing. For serious listening, it does not respect the intent of the artist or mastering engineer; a tender ballad track will be blasting at the same loudness as a hard rock track on the same album. It's generally ideal to leave the intentional loudness differences between tracks in place, yet still correct for unmusical and annoying loudness differences between albums. To accomplish this, ReplayGain suggests that two different gain adjustments should be stored as metadata with each sound file.

Album replay gain represents the ideal listening gain for an entire album. ReplayGain reads the collection of tracks that comprise an album, and calculates a single replay gain for the whole set. This single gain can be used for playback of all tracks of the album. Intentionally quiet tracks then stay appropriately quieter than the rest. It still solves the basic problem (annoying, unwanted level differences between discs) because quiet or loud discs are still adjusted overall, so the pop CD that's 20 dB louder than the classical CD will be brought into line.

Peak amplitude

Scanning a track or album for the peak amplitude can be a time-consuming process. Therefore, it's helpful if this single value is stored as metadata. This is used to predict whether the required replay gain adjustment will cause clipping during playback.

The maximum peak amplitude value is stored as a floating point number, where 1.0 represents digital full scale. As with replay gain values, separate peak amplitude values are stored per track and per album.

For uncompressed files, scanners simply store the maximum absolute sample value held in the file on any channel for positive or negative excursion. The single sample value should be converted to a floating-point representation, such that digital full scale is equivalent to a value of 1.0.

Psychoacoustically coded audio, such as MP3, does not exist as a sequence of samples until it is decoded. Psychoacoustic coding of a heavily limited file can lead to sample values larger than digital full scale upon decoding. The coded files must be decoded using a fully compliant decoder that allows peak overflows (i.e. has headroom) and may result in peak amplitude values greater than 1.0.
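A peak scan along these lines might look like the sketch below. The name `peak_amplitude` and the default `full_scale=32768` (signed 16-bit integer samples) are assumptions for illustration; for decoded floating-point output, `full_scale=1.0` would be passed instead, and results above 1.0 are possible.

```python
def peak_amplitude(channels, full_scale=32768):
    """Return the largest absolute sample value on any channel, where 1.0 is full scale.

    channels is a list of sample lists (one per channel).
    full_scale is the magnitude treated as digital full scale
    (32768 assumes signed 16-bit integer samples).
    """
    peak = 0
    for samples in channels:
        for s in samples:
            peak = max(peak, abs(s))  # positive or negative excursion
    return peak / full_scale
```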

Metadata format

From the standpoint of metadata storage, each audio file format presents a unique situation. There are three favored schemes defined for storage of ReplayGain metadata: ID3v2, Vorbis comments and APEv2. A survey of file formats is listed below with metadata schemes in order of preference for each:

  • .aac (Advanced Audio Coding raw format) – No metadata support (use .mp4 instead)
  • .aiff, .aif, .aifc (Apple Interchange File Format) – ID3v2 (in "ID3" IFF chunk)
  • .ape, .apl (Monkey's Audio) – APEv2
  • .bwf (Broadcast Wave Format) – ID3v2 (in RIFF chunk)
  • .flac (Free Lossless Audio Codec) – Vorbis comments
  • .mp3 (MPEG audio layer 3) – ID3v2, LAME VBR proposed tag specification
  • .mp4, also .m4a, .m4b, .m4p, .m4r (MPEG-4 Part 14) – ID3v2 (in "ID32" box)
  • .mpc (Musepack) – APEv2
  • .ogg (Ogg Vorbis) – Vorbis comments
  • .tta (True Audio) – ID3v2, APEv2
  • .wma (Windows Media audio) – Vorbis comments in Extended Content Description Object
  • .wav (Windows PCM) – No metadata support (use .bwf instead)
  • .wv (WavPack) – APEv2

ID3v2

The ID3v2 standard[12] defines a tag which is situated before the data in an MP3 file.[13] ID3 is used primarily with MP3 audio files but means of adapting the system to other file types have been developed.

The ID3v2 tag is divided into frames. The preferred means of storing ReplayGain metadata is use of TXXX key/value pair frames. Two other legacy schemes for storing ReplayGain metadata exist: RGAD and RVA2. These formats are documented in the appendix. Players may choose to look for these formats if metadata in the TXXX format is not found in the ID3v2 tag. New scanners may write these older formats in addition to the newer (TXXX) ones if they wish to remain backwards compatible with older players.

ReplayGain uses four TXXX frames. The header of a TXXX frame is coded as follows:

Frame ID       $54 58 58 58 ("TXXX") 
Size           $xx xx xx xx (size of frame excluding this header)
Flags          $40 $00      (discard frame if audio data is altered)

Frame data is coded as follows:

Text encoding  $00          (ISO-8859-1 encoding)
Description    <key string> $00
Value          <value string>
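The frame layout above can be assembled as in the following sketch. It assumes the ID3v2.3 convention of a plain 32-bit big-endian frame size (ID3v2.4 uses a synchsafe size instead); the helper name `build_txxx_frame` is hypothetical, and writing the surrounding ID3v2 tag header is not shown.

```python
import struct

def build_txxx_frame(key, value):
    """Build one TXXX frame: 10-byte header followed by the frame data."""
    body = (
        b"\x00"                      # text encoding: ISO-8859-1
        + key.encode("latin-1")      # description (the key string)
        + b"\x00"                    # terminator after the description
        + value.encode("latin-1")    # value string
    )
    header = (
        b"TXXX"                      # frame ID ($54 58 58 58)
        + struct.pack(">I", len(body))  # size excluding the header (ID3v2.3 style, assumed)
        + b"\x40\x00"                # flags: discard frame if audio data is altered
    )
    return header + body

frame = build_txxx_frame("REPLAYGAIN_TRACK_GAIN", "-6.50 dB")
```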

The four frames associated with ReplayGain metadata use the following key/value pairs:

Table 3: Metadata keys and value formatting

Metadata              Key                    Value format
Track replay gain     REPLAYGAIN_TRACK_GAIN  [-]a.bb dB
Peak track amplitude  REPLAYGAIN_TRACK_PEAK  c.dddddd
Album replay gain     REPLAYGAIN_ALBUM_GAIN  [-]a.bb dB
Peak album amplitude  REPLAYGAIN_ALBUM_PEAK  c.dddddd

Gains are specified textually in decibels. Negative gains (attenuation) are prefixed with a '-'. Positive gains have no prefix. The integral portion of the gain (a) may be one or two numeric (0-9) digits. If there is no integral portion, the field is '0'. The decimal portion of the gain (bb) is two numeric digits. Gains are suffixed with a space followed by 'dB'.

Peak levels are specified textually as a positive decimal. Peak level is a dimensionless quantity with 1.000000 representing full scale. No suffix is included on peak values. The integer field (c) is typically 1 or 0. Six numeric digits in the decimal field (dddddd) are adequate to accurately represent peak values for 16-bit audio data.
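For a scanner, the value formats described above map directly onto fixed-precision string formatting. The function names below are hypothetical, and the sketch does not guard against gains with more than two integer digits.

```python
def format_gain(gain_db):
    """Format a gain as '[-]a.bb dB': two decimals, a space, then 'dB'."""
    return f"{gain_db:.2f} dB"

def format_peak(peak):
    """Format a peak as 'c.dddddd': six decimals, no suffix."""
    return f"{peak:.6f}"

track_gain_text = format_gain(-6.5)
track_peak_text = format_peak(0.9887)
```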

A robust player should be prepared to parse the following variations in either replay gain or peak level metadata:

  • Positive gains with leading '+'
  • More or fewer significant digits than specified in any field
  • Leading zeros or spaces in integer fields
  • Missing or malformed 'dB' suffix (e.g. no space between numeric digits and suffix, alternate capitalization)
  • Alternate capitalization of keys

Other formatting errors indicate more severe problems and should result in the player ignoring the data as if the frame did not exist.
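One way a player might tolerate the variations listed above is a pair of permissive regular expressions, as in this sketch. All names here are hypothetical. Returning `None` stands in for "ignore the data as if the frame did not exist". Only a missing, unspaced, or differently capitalized 'dB' suffix is accepted; other malformed suffixes are rejected.

```python
import re

# Optional sign, digits with optional fraction, optional 'dB' in any case.
_GAIN_RE = re.compile(r"^\s*([+-]?(?:\d+(?:\.\d*)?|\.\d+))\s*(?:dB)?\s*$", re.IGNORECASE)
# Peaks are positive decimals with no suffix.
_PEAK_RE = re.compile(r"^\s*(\d+(?:\.\d*)?|\.\d+)\s*$")

def parse_gain(text):
    """Return the gain in dB as a float, or None if the text is unusable."""
    match = _GAIN_RE.match(text)
    return float(match.group(1)) if match else None

def parse_peak(text):
    """Return the peak as a float, or None if the text is unusable."""
    match = _PEAK_RE.match(text)
    return float(match.group(1)) if match else None

def lookup(tags, key):
    """Find a value in a dict of tags, ignoring the capitalization of keys."""
    wanted = key.upper()
    for name, value in tags.items():
        if name.upper() == wanted:
            return value
    return None
```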

Vorbis comments

A Vorbis comment[14] uses an ASCII key=value format. When Vorbis comments are used, the four ReplayGain metadata items are stored as separate comments. The keys and the formatting of values are the same as specified for ID3v2. The Vorbis comment specification requires keys and values to be separated by '=' (the equals character).
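As an illustration, the four comments could be produced as plain strings like this. The function name `vorbis_comments` is an assumption, and embedding the strings into a FLAC or Ogg container is outside the scope of the sketch.

```python
def vorbis_comments(track_gain_db, track_peak, album_gain_db, album_peak):
    """Return the four ReplayGain items as 'KEY=value' comment strings."""
    return [
        f"REPLAYGAIN_TRACK_GAIN={track_gain_db:.2f} dB",
        f"REPLAYGAIN_TRACK_PEAK={track_peak:.6f}",
        f"REPLAYGAIN_ALBUM_GAIN={album_gain_db:.2f} dB",
        f"REPLAYGAIN_ALBUM_PEAK={album_peak:.6f}",
    ]

comments = vorbis_comments(-6.5, 0.9887, -7.25, 1.0)
```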

APEv2

The APEv2 metadata format[15] also organizes data into key/value pairs. Keys are ASCII format. A flags field allows support for several value formats including UTF-8 and binary. Under APEv2, ReplayGain metadata is stored as ASCII values, using the same keys and value formatting as specified for ID3v2.

Player requirements

Figure 8: Example ReplayGain control panel

Loudness normalization, pre-amplification and clipping prevention are the operations performed by a ReplayGain player.

Loudness normalization

To properly normalize loudness, the player needs to determine if the user desires Track style level normalization (all tracks same loudness), or Album style level normalization (all albums same loudness, tracks of an album played at the same relative level as on the original release). This option should be selectable in the ReplayGain control panel (Figure 8). The player reads the corresponding gain metadata value from the file and scales the audio data as appropriate. Scaling the audio data simply means multiplying each sample value by a constant value. This constant is given by:

$$\text{scale} = 10^{\,\text{replay gain}/20}$$

Or, in words, ten raised to the power of one-twentieth of the replay gain.[16]
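In code, the conversion from a gain in decibels to a linear multiplier, and its application to samples, might look like this sketch. The function names are hypothetical, and dithering of the result (see note 16) is not shown.

```python
def gain_to_scale(gain_db):
    """Ten raised to the power of one-twentieth of the gain in dB."""
    return 10.0 ** (gain_db / 20.0)

def apply_gain(samples, gain_db):
    """Multiply every sample by the constant scale factor."""
    scale = gain_to_scale(gain_db)
    return [s * scale for s in samples]
```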

If the file only contains one of the replay gain adjustments (e.g. Album) but the user has requested the other (Track), then the player should use the one that is available (in this case, Album). If neither (Track or Album) gain metadata is available, then the player needs to choose a suitable default gain. Potential choices include unity gain (0 dB) or an average of gains from other tracks in the album or playlist.
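The fallback rule can be expressed compactly, as in the sketch below. The function name, the use of `None` for a missing value, and the 0 dB default are assumptions; a player could equally substitute an average of other tracks' gains as its default.

```python
def select_gain_db(track_gain_db, album_gain_db, mode="track", default_db=0.0):
    """Choose the gain to apply, falling back to whichever value is available.

    A gain of None means the corresponding metadata is missing.
    """
    if mode == "album":
        preferred, other = album_gain_db, track_gain_db
    else:
        preferred, other = track_gain_db, album_gain_db
    if preferred is not None:
        return preferred
    if other is not None:
        return other       # requested style missing: use the one available
    return default_db      # neither present: player-chosen default
```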

Pre-amplification

Although the calibration level used by ReplayGain suggests that the average level of an audio track should be 14 dB below full scale, some pop music is dynamically compressed to peak at 0 dB and average around 3 dB below full scale. This means that, when the replay gain is applied, the level of such tracks will be reduced by 11 dB! If users are listening to a mixture of highly compressed and more dynamic tracks, ReplayGain will make the listening experience more pleasurable by bringing the level of the compressed tracks down into line with that of the others. However, if users are only listening to highly compressed music, then they may complain that all their files are now too quiet.[17]

To address this problem, a pre-amp feature should be incorporated into the player. A user-supplied pre-amp setting is an adjustment to the calculated replay gain. It should default to perform no adjustment. This means that casual users will experience a moderate reduction in the loudness of their compressed pop music. Less-compressed material can generally be played at the same loudness without clipping. Normalization of more dynamic material may cause clipping or invoke the clipping prevention mechanism (see below). Power users and audiophiles can reduce the pre-amp gain to enjoy the full dynamic range of all of their music.

If enabled, the player should read the user-selected pre-amp gain, and scale the audio signal by the appropriate amount. For example, a +6 dB gain requires a scale of $10^{6/20}$, which is approximately 2. The replay gain and pre-amp scale factors can be combined[18] for simplicity and ease of processing.
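Because gains in decibels add where linear scale factors multiply, the two adjustments can be folded into a single multiplier. The following is a small sketch with a hypothetical function name.

```python
def combined_scale(replay_gain_db, preamp_db=0.0):
    """Single linear multiplier for replay gain plus pre-amp (both in dB)."""
    return 10.0 ** ((replay_gain_db + preamp_db) / 20.0)

# Adding the dB values gives the same result as multiplying the two
# separate linear scale factors.
separate = (10.0 ** (-8.0 / 20.0)) * (10.0 ** (6.0 / 20.0))
together = combined_scale(-8.0, 6.0)
```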

Clipping prevention

ReplayGain's suggestion of a -14 dB average playback level leaves sufficient headroom for the bulk of modern recordings. Nevertheless, there exists the possibility that after application of replay gain and pre-amp adjustment, a track may exceed full scale during its dynamic peaks. Without intervention, this will result in clipping, a severe form of distortion. Factors introducing the possibility of clipping include:

  1. Recordings from certain genres and certain periods in the history of commercial recordings require additional headroom. Although these recordings can be accommodated through a downwards adjustment of the pre-amp setting, it may be difficult to determine a safe adjustment and it may be undesirable to lower average level to accommodate the rare track which requires it.
  2. ReplayGain will make loud dynamically compressed tracks quieter, and quiet dynamically uncompressed tracks louder. The average levels will then be similar, but the quiet tracks will actually have louder peaks. If the user pushes the pre-amp gain upwards the peaks of the (originally) quieter tracks will be pushed well over full scale.
  3. In coded audio (e.g. MP3 files) a file that was hard-limited to digital full scale before encoding will often be pushed over the limit by the psychoacoustic compression. A decoder with headroom can recover the over full scale signal by reducing the gain.

ReplayGain suggests two possible solutions which prevent clipping in these situations. A player should support one or both of these.

Audio limiting

In situation 2 above, the user clearly wants all the music to sound very loud. To give them their wish, any signal which would peak above digital full scale should be hard limited at just below digital full scale. This is also useful at lower pre-amp gains, where it allows the average level of classical music to be raised to that of pop music, without distorting. The exact nature of the limiting or compression is an implementation choice for the player.[19]

Reduced gain

The audiophile user will not want any compression or limiting on the signal. In this case the only option is to automatically and temporarily reduce the pre-amp gain below the user-selected setting for tracks where clipping would otherwise occur. Clipping can be predicted by examining the peak level of the track or album being played.

The player must read the peak amplitude metadata. If peak level metadata is unavailable, the player should assume a peak level of 1.0. If the peak level for both track and album is stored as metadata in the file, it is possible to calculate if, following the replay gain adjustment and pre-amp gain, the signal will clip at some point. If it won't, then no further action is necessary.

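An overall scale factor for loudness normalization taking into account replay gain, pre-amp setting and clipping prevention through gain reduction is given by:

$$\text{scale} = \min\left(10^{(\text{replay gain} + \text{pre-amp})/20},\ \frac{1}{\text{peak amplitude}}\right)$$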
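A sketch of the reduced-gain approach follows. It caps the combined scale so that the stored peak, once scaled, does not exceed full scale (1.0). The function name is hypothetical, and treating a missing peak (`None`) as 1.0 follows the rule stated above.

```python
def playback_scale(replay_gain_db, preamp_db=0.0, peak=1.0):
    """Linear scale factor with clipping prevention through gain reduction."""
    if peak is None:
        peak = 1.0  # peak metadata unavailable: assume full scale
    scale = 10.0 ** ((replay_gain_db + preamp_db) / 20.0)
    if peak > 0.0:
        # Never scale the recorded peak beyond digital full scale.
        scale = min(scale, 1.0 / peak)
    return scale
```

With a stored peak of 1.0, any positive total gain is reduced to unity, while attenuation passes through unchanged.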

Hardware implementation

The above three steps are appropriate to software players operating on the digital signal in order to scale it. However, it is possible to send the digital signal to the DAC without level correction, and to place an attenuator in the analog signal path. The attenuator can then be driven by the replay gain value. The clipping problem can be addressed by providing adequate headroom in the analog circuitry. Bit transparency and maximum signal-to-noise ratio are maintained in the digital signal and DAC process.[20]

Acknowledgements

The original ReplayGain proposal (an archive is also available) was developed by David Robinson and was published 10 July 2001. Additional updates were published by David Robinson through 10 October 2001.

The following acknowledgement was included with the original proposal, "The algorithm to calculate an ideal replay gain has grown from my research into human hearing, with many additional ideas drawn from the work of E. Zwicker, and Brian Moore. I am currently completing my PhD at the University of Essex, and have been funded by the EPSRC." Additionally David Robinson credited Glen Sawyer (Snelg) and Jim Casaburi (Walrus) for software contributions and Bob Katz and Matt Ashland for ideas.

This updated ReplayGain specification reflecting current and recommended practice was prepared by Kevin Gross in 2011.

Contact

For ReplayGain-related questions or contributions, please post in the General Audio section of the Hydrogen Audio forums.

Appendix

  1. ReplayGain legacy metadata formats

Notes

  1. Source: Wikipedia - Loudness war
  2. The block length of 50ms was chosen after studying the effect of values between 25ms and 1s. 25ms was too short to accurately reflect the perceived loudness of some sounds. Beyond 50ms there was little change (after statistical processing). For this reason, 50ms was chosen.
  3. If these steps are read backward, it should be clear why the process is called Root Mean Square averaging.
  4. One could sum channels of a stereo signal to mono before calculating the RMS level, but then any out-of-phase components (having the opposite signal on each channel) would cancel out to zero (i.e. silence). That's not how humans perceive them, so it's not a good solution.
  5. Based on experiments performed by David Robinson, "I tried values from 70% to 95%. For highly compressed pop music, the choice makes little difference. For speech and classical music, the choice makes a huge difference. The value which most accurately matches human perception of perceived loudness is around 95%, so this value is used by Replay Level."
  6. SMPTE RP 200:2002 – Relative and Absolute Sound Pressure Levels for Motion-Picture Multichannel Sound Systems – Applicable for Analog Photographic Film Audio, Digital Photographic Film Audio and D-Cinema
  7. "dB relative to a full-scale sinusoid" is preferred over "dBFS" as a unit of measure in this specification because there is some ambiguity whether the reference for dBFS is a full-scale square wave (peak reference) or a sine wave (RMS reference).
  8. Measured using a C-weighted, slow averaging SPL meter.
  9. The initial ReplayGain proposal used the same -20 dB reference used by SMPTE. The reference was raised to -14 dB early on in ReplayGain development. This reference is used in all current ReplayGain implementations.
  10. In reality, a monophonic pink noise wave file is used, and ReplayGain automatically assumes the file is being played through both speakers, as would any monophonic file.
  11. Metadata is "data about data." For example, the ID3 de facto standard provides a way to store artist, title, album title, track number, and other metadata in data blocks called "tags" immediately before or after the audio data in an MP3 file. Other metadata storage/tagging standards and conventions exist for other audio file formats.
  12. The ID3v2 format is explained at www.id3.org. The most useful document is the ID3v2 v2.3.0 standard. Although this document has been superseded by v2.4.0, the earlier document is complete (rather than an update), and in indexed HTML form. As such, it represents a better technical introduction to ID3v2.
  13. The original ID3 (v1) tags resided at the end of the file, and contained a few fields of information. The ID3v1 tag is not extensible and therefore cannot support ReplayGain metadata.
  14. Vorbis comment metadata format. ReplayGain metadata is documented on the Xiph Wiki.
  15. APEv2 Specification at Hydrogen Audio Wiki
  16. After any such operation, it's a good idea to dither the result. If this calculation and the pre-amp are implemented separately, then dither should only be added to the final result, just before the result is truncated back to 16 bits, or 24, or 8, as limited by the soundcard—not the file (i.e. after ReplayGain adjustment, an 8-bit file should be sent to a 16-bit soundcard at 16-bits).
  17. This problem can be especially noticeable on portable players with limited output or gain.
  18. Scale factors in Decibel units are added to produce the same effect as multiplying scale factors in linear units.
  19. Something like the Hard Limiter found in Cool Edit Pro (Syntrillium) would be appropriate for pop music at least.
  20. A system using today's 24-bit converters is unlikely to appreciate any overall gain in system performance with such an arrangement. A digitally-controlled analog gain element typically introduces significant noise and distortion.

See also

This is not a normative part of the specification.