ReplayGain 2.0 specification

This is a proposed update to the ReplayGain 1.0 specification. This proposal is currently Under Construction. Please discuss this proposal on the discussion page or the General Audio forum. --Notat 23:42, 8 October 2012 (CEST)

Although music is encoded to a digital format with a clearly defined maximum peak amplitude, and although most recordings are normalized to utilize this peak amplitude, not all recordings sound equally loud. This is because once this peak amplitude is reached, perceived loudness can be further increased through signal-processing techniques such as dynamic range compression and equalization.^[1] Therefore, the loudness of a given album has more to do with the year of issue or the whim of the producer than the intended emotional effect. Because of this, a random play through a music collection can have one leaping for the volume control every other track.

There is a solution to this annoyance: within each audio file, information can be stored about what volume change would be required to play each track or album at a standard loudness, and players can use this "replay gain" information to automatically nudge the volume up or down as required.

The ReplayGain specification is a standard which defines an appropriate reference level, explains a way of calculating and representing the ideal replay gain for a given track or album, and provides guidance for players to make the required volume adjustment during playback. The standard also specifies a means to prevent clipping when the calculated replay gain exceeds the limits of digital audio, and it describes how the replay gain information is stored within audio files.

Loudness measurement

Loudness is a subjective measure of the intensity of sound. The correlation of perceived loudness to sound pressure level is determined by the peculiarities of the auditory system.

The original ReplayGain 1.0 specification described a loudness measurement system which included a weighting filter, root mean square (RMS) measurement and statistical processing that model human perception of loudness in both the frequency and time domains.

Since the original ReplayGain proposal in 2001, the science, practice and standards for loudness normalization have advanced significantly. The current industry standard approach to loudness measurement is described by the International Telecommunication Union^[2] (ITU) as BS.1770. The most recent version of this standard is known as ITU BS.1770-3^[3] and was published in August 2012. The ITU work is freely available and is not believed to be encumbered by any patent issues. The ITU BS.1770-2 standard has been adopted in the United States by the ATSC as A/85 and in Europe by the European Broadcasting Union as EBU R 128 for broadcast audio.

BS.1770-3 uses a "K-weighted" RMS measurement. This weighting function is significantly less complex than the inverted Fletcher-Munson weighting used by RG1. A gating function is designed to measure the loudness of foreground components in the audio program. The gate in BS.1770 performs a function similar to the statistical processing in the original RG1 specification.

The computation required for BS.1770-3 loudness measurement is reduced compared to the RG1 technique. Nevertheless, BS.1770 has been shown in several academic studies to be equally or more effective than the RG1 algorithm in modelling human loudness perception of music as well as other program material such as podcasts, television programs and movies.^[4]^[5]^[6]

RG2 uses BS.1770-3 for loudness measurement. It is expected the ITU standard will evolve over time to meet the needs of broadcasters and governments. It is the intent of the ReplayGain community that RG2 follow any future backwards-compatible improvements to loudness measurement using the BS.1770 standard.

Reference level

RG1 is calibrated to a pink noise reference signal with an RMS level 14 dB below a full-scale sinusoid. This reference signal is used to establish a reference level. ReplayGain will apply no gain or attenuation to the reference signal or any program material which has the same loudness measurements as the reference signal.

BS.1770 defines a loudness scale for program material. The units of BS.1770 loudness measurements are Loudness Units [relative to] Full Scale (LUFS). LUFS can be treated like decibels.

The loudness measurement of the RG1 reference signal is -18 LUFS. In order to maintain backwards compatibility with RG1, RG2 uses a -18 LUFS reference.

Gain calculation

RG achieves loudness compensated playback by applying gain (or attenuation) dependent on the measured loudness of the audio file relative to the established reference level. The gain is calculated as follows:

RG = L_{r} - L

Where:

RG is the replay gain adjustment in decibels,
L_{r} is the -18 LUFS reference level,
L is the measured loudness of the audio file in LUFS.

Replay gain is positive if the loudness of the audio file is lower than the reference level. The gain is negative (representing an attenuation) if the loudness of the audio file is higher than the reference level. The gain is stored as metadata with the audio file as described below and is used by players to adjust output volume of tracks as they are played as described in Player requirements below.
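
As a small worked illustration of the gain formula, the sketch below computes the replay gain from a measured loudness. The function name `replay_gain_db` and the example loudness values are illustrative assumptions and are not part of the specification.

```python
REFERENCE_LUFS = -18.0  # RG2 reference level given in this specification


def replay_gain_db(measured_lufs: float, reference_lufs: float = REFERENCE_LUFS) -> float:
    """Return the replay gain in dB, computed as RG = L_r - L."""
    return reference_lufs - measured_lufs


# A loud track measured at -9.5 LUFS needs attenuation.
print(replay_gain_db(-9.5))   # -8.5
# A quiet track measured at -23 LUFS needs positive gain.
print(replay_gain_db(-23.0))  # 5.0
```

A track that already measures -18 LUFS yields a gain of 0 dB, which matches the statement that the reference signal is played back unchanged.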

Metadata

For ReplayGain to do its work during playback, four values must be stored as metadata^[7] with or within the audio file:

Peak track amplitude
Peak album amplitude
Track replay gain
Album replay gain

If calculated for an individual track, the loudness measurement (as specified above) yields track replay gain. If calculated on an album basis, with all tracks concatenated to make one long audio file, the loudness measurement yields album replay gain.

Replay gain

Under some listening conditions, it's useful to have every track sound equally loud. The problem with a track-by-track approach is that tracks which should be quiet in the context of the album on which they reside will be brought up to the level of all the rest. For casual listening, or in a noisy background, this can be a good thing. For serious listening, it does not respect the intent of the artist or mastering engineer; a tender ballad track will be blasting at the same loudness as a hard rock track on the same album. It's generally ideal to leave the intentional loudness differences between tracks in place, yet still correct for unmusical and annoying loudness differences between albums. To accomplish this, ReplayGain suggests that two different gain adjustments should be stored as metadata with each sound file.

Album replay gain represents the ideal listening gain for an entire album. ReplayGain reads the collection of tracks that comprise an album, and calculates a single replay gain for the whole set. This single gain can be used for playback of all tracks of the album. Intentionally quiet tracks then stay appropriately quieter than the rest. It still solves the basic problem (annoying, unwanted level differences between discs) because quiet or loud discs are still adjusted overall, so the pop CD that's 20 dB louder than the classical CD will be brought into line.

Peak amplitude

Scanning a track or album for the peak amplitude can be a time-consuming process. Therefore, it's helpful if this single value is stored as metadata. This is used to predict whether the required replay gain adjustment will cause clipping during playback.

The maximum peak amplitude value is stored as a floating point number, where 1.0 represents digital full scale. As with replay gain values, separate peak amplitude values are stored per track and per album.

For uncompressed files, scanners simply store the maximum absolute sample value held in the file on any channel for positive or negative excursion. The single sample value should be converted to a floating-point representation, such that digital full scale is equivalent to a value of 1.0.
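
The peak scan for uncompressed integer audio might be sketched as follows. The function name `peak_amplitude` is hypothetical, and the choice of 2^(bits-1) as the full-scale divisor (32768 for 16-bit audio) is an assumption; the specification only requires that digital full scale map to 1.0.

```python
def peak_amplitude(samples, bits_per_sample=16):
    """Return the largest absolute sample value, scaled so full scale is 1.0.

    `samples` is an iterable of signed integers covering every channel.
    """
    # Assumed full-scale value: 2**(bits - 1), i.e. 32768 for 16-bit audio.
    full_scale = float(1 << (bits_per_sample - 1))
    return max((abs(s) for s in samples), default=0) / full_scale


print(peak_amplitude([0, 16384, -32768, 1200]))  # 1.0
print(peak_amplitude([0, 8192, -4096]))          # 0.25
```

Both positive and negative excursions are covered because the absolute value is taken before the maximum.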

Psychoacoustically coded audio, such as MP3, does not exist as a sequence of samples until it is decoded. Psychoacoustic coding of a heavily limited file can lead to sample values larger than digital full scale upon decoding. The coded files must be decoded using a fully compliant decoder that allows peak overflows (i.e. has headroom) and may result in peak amplitude values greater than 1.0.

Metadata format

From the standpoint of metadata storage, each audio file format presents a unique situation. There are three favored schemes defined for storage of ReplayGain metadata: ID3v2, Vorbis comments and APEv2. A survey of file formats is listed below with metadata schemes in order of preference for each:

.aac (Advanced Audio Coding raw format) – No metadata support (use .mp4 instead)
.aiff, .aif, .aifc (Audio Interchange File Format) – ID3v2 (in "ID3" IFF chunk)
.ape, .apl (Monkey's Audio) – APEv2
.bwf (Broadcast Wave Format) – ID3v2 (in RIFF chunk)
.flac (Free Lossless Audio Codec) – Vorbis comments
.mp3 (MPEG audio layer 3) – ID3v2, LAME VBR proposed tag specification
.mp4, also .m4a, .m4b, .m4p, .m4r (MPEG-4 Part 14) – ID3v2 (in "ID32" box)
.mpc (Musepack) – APEv2
.ogg (Ogg Vorbis) – Vorbis comments
.tta (True Audio) – ID3v2, APEv2
.wma (Windows Media Audio) – Vorbis comments in Extended Content Description Object
.wav (Windows PCM) – No metadata support (use .bwf instead)
.wv (WavPack) – APEv2

ID3v2

The ID3v2 standard^[8] defines a tag which is situated before the data in an MP3 file.^[9] ID3 is used primarily with MP3 audio files but means of adapting the system to other file types have been developed.

The ID3v2 tag is divided into frames. The preferred means of storing ReplayGain metadata is use of TXXX key/value pair frames. Two other legacy schemes for storing ReplayGain metadata exist: RGAD and RVA2. These formats are documented in the appendix. Players may choose to look for these formats if metadata in the TXXX format is not found in the ID3v2 tag. New scanners may write these older formats in addition to the newer (TXXX) ones if they wish to remain backwards compatible with older players.

ReplayGain uses four TXXX frames. The header of a TXXX frame is coded as follows:

Frame ID       $54 58 58 58 ("TXXX") 
Size           $xx xx xx xx (size of frame excluding this header)
Flags          $40 $00      (discard frame if audio data is altered)

Frame data is coded as follows:

Text encoding  $00          (ISO-8859-1 encoding)
Description    <key string> $00
Value          <value string>
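
The frame layout above can be sketched in code as follows. This is a minimal illustration rather than a complete ID3v2 writer: the function name `build_txxx_frame` is hypothetical, and the size field is assumed to be a plain 32-bit big-endian integer as in ID3v2.3 (ID3v2.4 uses a synchsafe integer instead). The key and value shown are one of the ReplayGain pairs defined in this section.

```python
import struct


def build_txxx_frame(key: str, value: str) -> bytes:
    """Build a single TXXX frame following the layout described above."""
    data = (
        b"\x00"                             # text encoding: ISO-8859-1
        + key.encode("latin-1") + b"\x00"   # description, null-terminated
        + value.encode("latin-1")           # value string
    )
    header = (
        b"TXXX"                             # frame ID
        + struct.pack(">I", len(data))      # size excluding header (assumed plain big-endian)
        + b"\x40\x00"                       # flags: discard frame if audio data is altered
    )
    return header + data


frame = build_txxx_frame("REPLAYGAIN_TRACK_GAIN", "-6.50 dB")
print(len(frame))  # 41
print(frame[:4])   # b'TXXX'
```

The 41 bytes are the 10-byte header plus 31 bytes of frame data (1 encoding byte, a 21-character key, 1 terminator and an 8-character value).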

The four frames associated with ReplayGain metadata use the following key/value pairs:

Table 3: Metadata keys and value formatting
Metadata	Key	Value format
Track replay gain	REPLAYGAIN_TRACK_GAIN	[-]a.bb dB
Peak track amplitude	REPLAYGAIN_TRACK_PEAK	c.dddddd
Album replay gain	REPLAYGAIN_ALBUM_GAIN	[-]a.bb dB
Peak album amplitude	REPLAYGAIN_ALBUM_PEAK	c.dddddd

Gains are specified textually in decibels. Negative gains (attenuation) are prefixed with a '-'. Positive gains have no prefix. The integral portion of the gain (a) may be one or two numeric (0-9) digits. If there is no integral portion, the field is '0'. The decimal portion of the gain (bb) is two numeric digits. Gains are suffixed with a space followed by 'dB'.

Peak levels are specified textually as a positive decimal. Peak level is a dimensionless quantity with 1.000000 representing full scale. No suffix is included on peak values. The integer field (c) is typically 1 or 0. Six numeric digits in the decimal field (dddddd) are adequate to accurately represent peak values for 16-bit audio data.
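
A scanner could produce these value strings along the following lines. The helper names `format_gain` and `format_peak` are illustrative assumptions; only the resulting text formats come from the rules above.

```python
def format_gain(gain_db: float) -> str:
    """Format a gain as '[-]a.bb dB', with no '+' prefix on positive gains."""
    rounded = round(gain_db, 2)
    if rounded == 0:
        rounded = 0.0  # normalise negative zero so '-0.00 dB' is never written
    return f"{rounded:.2f} dB"


def format_peak(peak: float) -> str:
    """Format a peak amplitude as 'c.dddddd' with no suffix."""
    return f"{peak:.6f}"


print(format_gain(-6.5))     # -6.50 dB
print(format_gain(3.2))      # 3.20 dB
print(format_peak(0.98765))  # 0.987650
```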

A robust player should be prepared to parse the following variations in either replay gain or peak level metadata:

Positive gains with leading '+'
More or fewer significant digits than specified in any field
Leading zeros or spaces in integer fields
Missing or malformed 'dB' suffix (e.g. no space between numeric digits and suffix, alternate capitalization)
Alternate capitalization of keys

Other formatting errors indicate more severe problems and should result in the player ignoring the data as if the frame did not exist.
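
One possible tolerant parser for the value strings is sketched below. The function names and regular expressions are assumptions made for illustration; they accept a leading '+', extra or missing digits, leading zeros and spaces, and a missing, unspaced or differently capitalized 'dB' suffix, and they return `None` for anything else so that the caller can treat the frame as absent.

```python
import re

# Optional sign, digits with optional fraction, optional 'dB' in any case.
_GAIN_RE = re.compile(r"^\s*([+-]?(?:\d+\.?\d*|\.\d+))\s*(?:dB)?\s*$", re.IGNORECASE)
# Peaks are positive decimals with no suffix.
_PEAK_RE = re.compile(r"^\s*(\d+\.?\d*|\.\d+)\s*$")


def parse_gain(text: str):
    """Return the gain in dB as a float, or None if the text is unusable."""
    match = _GAIN_RE.match(text)
    return float(match.group(1)) if match else None


def parse_peak(text: str):
    """Return the peak amplitude as a float, or None if the text is unusable."""
    match = _PEAK_RE.match(text)
    return float(match.group(1)) if match else None


print(parse_gain("-6.50 dB"))  # -6.5
print(parse_gain("+3.2dB"))    # 3.2
print(parse_gain(" 07.5 DB"))  # 7.5
print(parse_gain("loud"))      # None
print(parse_peak("1.000000"))  # 1.0
```

Alternate capitalization of keys is a separate concern; a player can handle it by comparing keys case-insensitively before the value is parsed.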

Vorbis comments

A Vorbis comment^[10] uses an ASCII key=value format. When Vorbis comments are used, the four ReplayGain metadata items are stored as separate comments. The keys and formatting for values are the same as specified for ID3v2. Keys and values are required by the Vorbis comment specification to be separated by '=' (equal character).
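
As a brief sketch of the key=value form, the helpers below build and split a single comment string. The function names are hypothetical, and the surrounding Vorbis comment container (vendor string, length fields) is not shown. Keys are upper-cased on reading because Vorbis comment field names are case-insensitive.

```python
def to_vorbis_comment(key: str, value: str) -> str:
    """Join a key and value with the '=' separator."""
    return f"{key}={value}"


def from_vorbis_comment(comment: str):
    """Split on the first '=' and normalise the key to upper case."""
    key, _, value = comment.partition("=")
    return key.upper(), value


comment = to_vorbis_comment("REPLAYGAIN_ALBUM_GAIN", "-7.25 dB")
print(comment)                                               # REPLAYGAIN_ALBUM_GAIN=-7.25 dB
print(from_vorbis_comment("replaygain_track_peak=0.987650"))  # ('REPLAYGAIN_TRACK_PEAK', '0.987650')
```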

APEv2

The APEv2 metadata format^[11] also organizes data into key/value pairs. Keys are ASCII format. A flags field allows support for several value formats including UTF-8 and binary. Under APEv2, ReplayGain metadata is stored using the same keys and data as ASCII values in the same format as specified for ID3v2.

Player requirements

Figure 8: Example ReplayGain control panel

Loudness normalization, pre-amplification and clipping prevention are the operations performed by a ReplayGain player.

Loudness normalization

To properly normalize loudness, the player needs to determine if the user desires Track style level normalization (all tracks same loudness), or Album style level normalization (all albums same loudness, tracks of an album played at the same relative level as on the original release). This option should be selectable in the ReplayGain control panel (Figure 8). The player reads the corresponding gain metadata value from the file and scales the audio data as appropriate. Scaling the audio data simply means multiplying each sample value by a constant value. This constant is given by:

10^{\frac {gain}{20}}

Or, in words, ten raised to the power of the replay gain divided by 20.^[12]
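
The scaling step might look like this in code. The function names are illustrative, and samples are assumed to be floating-point values with full scale at 1.0; dithering and conversion back to integer samples are not shown.

```python
def gain_to_scale(gain_db: float) -> float:
    """Convert a gain in dB to a linear multiplier: 10 ** (gain / 20)."""
    return 10.0 ** (gain_db / 20.0)


def apply_gain(samples, gain_db: float):
    """Multiply every sample by the constant scale factor."""
    scale = gain_to_scale(gain_db)
    return [s * scale for s in samples]


print(gain_to_scale(0.0))             # 1.0
print(round(gain_to_scale(-6.0), 3))  # 0.501
```

A gain of -6 dB therefore roughly halves every sample value, and 0 dB leaves the audio unchanged.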

If the file only contains one of the replay gain adjustments (e.g. Album) but the user has requested the other (Track), then the player should use the one that is available (in this case, Album). If neither (Track or Album) gain metadata is available, then the player needs to choose a suitable default gain. Potential choices include unity gain (0 dB) or an average of gains from other tracks in the album or playlist.
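
The fallback rule can be sketched as below. The function name, the dictionary layout with `"track"` and `"album"` keys, and the use of unity gain as the default are assumptions for illustration; the specification leaves the choice of default to the player.

```python
def select_gain_db(gains: dict, mode: str = "album", default_db: float = 0.0) -> float:
    """Pick the requested gain, fall back to the other one, then to a default.

    `gains` maps "track" and/or "album" to a gain in dB; missing entries
    mean the corresponding metadata was not found in the file.
    """
    other = "track" if mode == "album" else "album"
    for key in (mode, other):
        gain = gains.get(key)
        if gain is not None:
            return gain
    return default_db


print(select_gain_db({"track": -6.5, "album": -7.25}, mode="track"))  # -6.5
print(select_gain_db({"album": -7.25}, mode="track"))                 # -7.25
print(select_gain_db({}, mode="album"))                               # 0.0
```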

Pre-amplification

Although the calibration level used by ReplayGain suggests that the average level of an audio track should be 14 dB below full scale, some pop music is dynamically compressed to peak at 0 dB and average around 3 dB below full scale. This means that, when the replay gain is applied, the level of such tracks will be reduced by 11 dB! If users are listening to a mixture of highly compressed and more dynamic tracks, ReplayGain will make the listening experience more pleasurable by bringing the level of the compressed tracks down into line with that of the others. However, if users are only listening to highly compressed music, then they may complain that all their files are now too quiet.^[13]

To address this problem, a pre-amp feature should be incorporated into the player. A user-supplied pre-amp setting is an adjustment to the calculated replay gain. It should default to perform no adjustment. This means that casual users will experience a moderate reduction in the loudness of their compressed pop music. Less-compressed material can generally be played at the same loudness without clipping. Normalization of more dynamic material may cause clipping or invoke the clipping prevention mechanism (see below). Power users and audiophiles can reduce the pre-amp gain to enjoy the full dynamic range of all of their music.

If enabled, the player should read the user selected pre-amp gain, and scale the audio signal by the appropriate amount. For example, a +6 dB gain requires a scale of 10^{6/20}, which is approximately 2. The replay gain and pre-amp scale factors can be combined^[14] for simplicity and ease of processing.
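
To illustrate how the two adjustments combine, the sketch below adds the gains in decibels before converting to a single linear factor. The function name `combined_scale` and the example values are assumptions.

```python
def combined_scale(replay_gain_db: float, preamp_db: float = 0.0) -> float:
    """Return one linear scale factor for replay gain plus pre-amp gain."""
    # Adding gains in dB is equivalent to multiplying the linear factors.
    return 10.0 ** ((replay_gain_db + preamp_db) / 20.0)


print(round(combined_scale(0.0, 6.0), 3))    # 1.995
print(round(combined_scale(-6.0, 6.0), 3))   # 1.0
```

With a replay gain of -6 dB and a pre-amp setting of +6 dB the two adjustments cancel, so the audio is played at its original level.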

Clipping prevention

ReplayGain's suggestion of a -14 dB average playback level leaves sufficient headroom for the bulk of modern recordings. Nevertheless, there exists the possibility that after application of replay gain and pre-amp adjustment, a track may exceed full scale during its dynamic peaks. Without intervention, this will result in clipping, a severe form of distortion. Factors introducing the possibility of clipping include:

1. Recordings from certain genres and certain periods in the history of commercial recordings require additional headroom. Although these recordings can be accommodated through a downwards adjustment of the pre-amp setting, it may be difficult to determine a safe adjustment and it may be undesirable to lower average level to accommodate the rare track which requires it.
2. ReplayGain will make loud dynamically compressed tracks quieter, and quiet dynamically uncompressed tracks louder. The average levels will then be similar, but the quiet tracks will actually have louder peaks. If the user pushes the pre-amp gain upwards, the peaks of the (originally) quieter tracks will be pushed well over full scale.
3. In coded audio (e.g. MP3 files) a file that was hard-limited to digital full scale before encoding will often be pushed over the limit by the psychoacoustic compression. A decoder with headroom can recover the over full scale signal by reducing the gain.

ReplayGain suggests two possible solutions which prevent clipping in these situations. A player should support one or both of these.

Audio limiting

In situation 2 above, the user clearly wants all the music to sound very loud. To give them their wish, any signal which would peak above digital full scale should be hard limited at just below digital full scale. This is also useful at lower pre-amp gains, where it allows the average level of classical music to be raised to that of pop music, without distorting. The exact type and nature of limiting or compression is an implementation choice for the player.^[15]

Reduced gain

The audiophile user will not want any compression or limiting on the signal. In this case the only option is to automatically and temporarily reduce the pre-amp gain below the user-selected setting for tracks where clipping would otherwise occur. Clipping can be predicted by examining the peak level of the track or album being played.

The player must read the peak amplitude metadata. If peak level metadata is unavailable, the player should assume a peak level of 1.0. If the peak level for both track and album is stored as metadata in the file, it is possible to calculate if, following the replay gain adjustment and pre-amp gain, the signal will clip at some point. If it won't, then no further action is necessary.

An overall scale factor for loudness normalization taking into account replay gain, pre-amp setting and clipping prevention through gain reduction is given below.

min(10^{\frac {RG+G_{pre-amp}}{20}},{\frac {1}{peakamplitude}})
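
The overall scale factor above can be sketched as follows. The function name `playback_scale` is hypothetical; the default peak of 1.0 follows the rule for missing peak metadata, and the guard for a zero peak (a silent track) is an added assumption to avoid division by zero.

```python
def playback_scale(replay_gain_db: float, preamp_db: float, peak_amplitude: float = 1.0) -> float:
    """Return min(10 ** ((RG + G_preamp) / 20), 1 / peak_amplitude)."""
    requested = 10.0 ** ((replay_gain_db + preamp_db) / 20.0)
    if peak_amplitude <= 0.0:
        return requested  # assumed: a silent track cannot clip
    return min(requested, 1.0 / peak_amplitude)


# Quiet track, +6 dB requested, but the peak of 0.9 limits the scale.
print(round(playback_scale(6.0, 0.0, 0.9), 3))   # 1.111
# Loud track being attenuated: the requested scale is already safe.
print(round(playback_scale(-8.5, 0.0, 1.0), 3))  # 0.376
```

In the first case the gain is reduced from the requested factor of about 2 to 1/0.9, so the highest peak lands exactly at full scale.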

Hardware implementation

The above three steps are appropriate to software players operating on the digital signal in order to scale it. However, it is possible to send the digital signal to the DAC without level correction, and to place an attenuator in the analog signal path. The attenuator can then be driven by the replay gain value. The clipping problem can be addressed by providing adequate headroom in the analog circuitry. Bit transparency and maximum signal-to-noise ratio are maintained in the digital signal and DAC process.^[16]

Acknowledgements

The original ReplayGain proposal (an archive is also available) was developed by David Robinson and was published 10 July 2001. Additional updates were published by David Robinson through 10 October 2001.

The following acknowledgement was included with the original proposal: "The algorithm to calculate an ideal replay gain has grown from my research into human hearing, with many additional ideas drawn from the work of E. Zwicker, and Brian Moore. I am currently completing my PhD at the University of Essex, and have been funded by the EPSRC." Additionally, David Robinson credited Glen Sawyer (Snelg) and Jim Casaburi (Walrus) for software contributions and Bob Katz and Matt Ashland for ideas.

This updated ReplayGain specification reflecting current and recommended practice was prepared by Kevin Gross in 2011.

Contact

For ReplayGain-related questions or contributions, please post in the General Audio section of the Hydrogen Audio forums.

Appendix

ReplayGain legacy metadata formats

Notes

[1] Source: Wikipedia - Loudness war
[2] http://www.itu.int/en/Pages/default.aspx
[3] http://www.itu.int/rec/R-REC-BS.1770-3-201208-I/en
[4] Paul Nygren. Achieving equal loudness between audio files. KTH Royal Institute of Technology.
[5] Martin Wolters; Harald Mundt; Jeffrey Riedmiller (May 2010). Loudness Normalization In The Age Of Portable Media Players. Audio Engineering Society.
[6] Esben Skovenborg; Søren H. Nielsen (October 2004). Evaluation of Different Loudness Models with Music and Speech Material. Audio Engineering Society. Archived from the original on 2012-02-08.
[7] Metadata is "data about data." For example, the ID3 de facto standard provides a way to store artist, title, album title, track number, and other metadata in data blocks called "tags" immediately before or after the audio data in an MP3 file. Other metadata storage/tagging standards and conventions exist for other audio file formats.
[8] The ID3v2 format is explained at www.id3.org. The most useful document is the ID3v2 v2.3.0 standard. Although this document has been superseded by v2.4.0, the earlier document is complete (rather than an update), and in indexed HTML form. As such, it represents a better technical introduction to ID3v2.
[9] The original ID3 (v1) tags resided at the end of the file, and contained a few fields of information. The ID3v1 tag is not extensible and therefore cannot support ReplayGain metadata.
[10] Vorbis comment metadata format. ReplayGain metadata is documented on the Xiph Wiki.
[11] APEv2 Specification at Hydrogen Audio Wiki
[12] After any such operation, it's a good idea to dither the result. If this calculation and the pre-amp are implemented separately, then dither should only be added to the final result, just before the result is truncated back to 16 bits, or 24, or 8, as limited by the soundcard, not the file (i.e. after ReplayGain adjustment, an 8-bit file should be sent to a 16-bit soundcard at 16 bits).
[13] This problem can be especially noticeable on portable players with limited output or gain.
[14] Scale factors in decibel units are added to produce the same effect as multiplying scale factors in linear units.
[15] Something like the Hard Limiter found in Cool Edit Pro (Syntrillium) would be appropriate for pop music at least.
[16] A system using today's 24-bit converters is unlikely to appreciate any overall gain in system performance with such an arrangement. A digitally-controlled analog gain element typically introduces significant noise and distortion.
