Joint stereo refers to any stereo-encoding method that goes beyond simple encoding as two independent channels ("simple" or "L/R" stereo or DualMono). These methods exploit the similarities between channels and typically allow for more bits to be effectively used, increasing audio quality for a given bitrate. They are, however, not guaranteed to be perfect and could instead cause audible artifacts (mostly on older encoders).
Some file formats, such as MP3, can switch among these modes on the fly on a frame or sub-frame basis, for the sake of efficiency or quality. For example, a high-bitrate "joint stereo" MP3 file may contain a mixture of SS and MS frames, or it may contain all SS frames or all MS frames. Due to a historical accident, the term as applied in MP3 refers to a file that is allowed to mix coding modes. In other words, a non-"joint stereo" MP3 will never contain a mixture of frame types.
Stereo coding methods or "modes"
Left-Right (L/R) or "Simple" Stereo (SS)
Simple stereo is the most straightforward method of coding a stereo signal: each channel is treated as a completely separate entity. This can be inefficient and may adversely impact quality (as compared to other modes) when both channels contain nearly identical signals (i.e., are mono or nearly so).
Mid-side Stereo (MS)
Mid-side stereo coding calculates a "mid" channel by adding the left and right channels, and a "side" channel by subtracting them, i.e.:
- M = (L + R) / 2, S = (L - R) / 2
- L = M + S, R = M - S
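The two formulas above can be checked with a short round trip. This is a minimal sketch on plain lists of samples; the function names and sample values are made up for illustration and do not come from any codec.

```python
def ms_encode(left, right):
    """Convert L/R samples to mid/side: M = (L + R) / 2, S = (L - R) / 2."""
    mid = [(l + r) / 2 for l, r in zip(left, right)]
    side = [(l - r) / 2 for l, r in zip(left, right)]
    return mid, side


def ms_decode(mid, side):
    """Invert the transform: L = M + S, R = M - S."""
    left = [m + s for m, s in zip(mid, side)]
    right = [m - s for m, s in zip(mid, side)]
    return left, right


# Illustrative samples: the first two are identical in both channels.
left = [0.5, 0.25, -0.75, 1.0]
right = [0.5, 0.25, -0.25, 0.0]

mid, side = ms_encode(left, right)
restored = ms_decode(mid, side)

print("mid: ", mid)
print("side:", side)
print("round trip exact:", restored == (left, right))
```

Wherever the two channels carry the same sample, the side value is exactly zero, which is what makes the side channel cheap to encode for mono-like material.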
Whenever a signal is concentrated in the middle of the stereo image (i.e. more mono-like), mid-side stereo can achieve a significant saving in bitrate, since one can use fewer bits to encode the side-channel. Even more important is the fact that by applying the inverse matrix in the decoder, the quantization noise becomes correlated and falls in the middle of the stereo image, where it is masked by the signal.
Unlike intensity stereo, which destroys phase information, mid-side coding is mathematically lossless (although subsequent lossy compression may cause phase degradation). Correctly implemented mid-side stereo does very little or no damage to the stereo image and increases compression efficiency, either by reducing size or by increasing overall quality. Mid-side is also simple enough to be implemented in FM radio and stereophonic vinyl.
Mid-side stereo can use coefficients other than 1 in encoding and decoding. Allowing different contributions from each channel lets the codec adapt to off-balance sources and retain the bitrate savings. This extension is found in Opus, where an angle can be encoded.
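One way to picture non-unity coefficients is as a rotation of the (L, R) plane by an angle. The sketch below is only an illustration of that idea and is not the Opus bitstream or its angle quantization; the function names and the way the angle is chosen are assumptions. At an angle of π/4 the transform matches ordinary mid-side up to scaling and the sign of the side channel.

```python
import math


def rotate_encode(left, right, theta):
    """Project L/R onto a 'mid' axis at angle theta and a 'side' axis orthogonal to it."""
    c, s = math.cos(theta), math.sin(theta)
    mid = [c * l + s * r for l, r in zip(left, right)]
    side = [-s * l + c * r for l, r in zip(left, right)]
    return mid, side


def rotate_decode(mid, side, theta):
    """Apply the inverse rotation to recover L/R."""
    c, s = math.cos(theta), math.sin(theta)
    left = [c * m - s * x for m, x in zip(mid, side)]
    right = [s * m + c * x for m, x in zip(mid, side)]
    return left, right


# Hypothetical off-balance source: the right channel is the left at half level.
left = [1.0, -0.5, 0.25]
right = [0.5 * l for l in left]

# Angle matched to the panning (an assumption of this sketch).
theta = math.atan2(0.5, 1.0)

_, side_matched = rotate_encode(left, right, theta)
_, side_plain = rotate_encode(left, right, math.pi / 4)

print("side energy, matched angle:", sum(x * x for x in side_matched))
print("side energy, plain M/S:    ", sum(x * x for x in side_plain))
```

With the matched angle the side channel is essentially empty, whereas the fixed π/4 rotation leaves residual energy in it for the same off-balance source.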
Intensity stereo coding is a method that achieves a saving in bitrate by replacing the left and the right signal by a single representing signal plus directional information (in the form of amplitude ratios for each frequency range). This replacement is psychoacoustically justified in the higher frequency range since the human auditory system is insensitive to the signal phase at frequencies above approximately 2 kHz. To maintain the justification, a codec may only apply intensity stereo to higher-frequency parameters.
Intensity stereo is by definition a lossy coding method, so it is primarily useful at low bitrates. For coding at higher bitrates, only mid-side stereo should be used.
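The "single signal plus amplitude ratios" idea can be sketched on a toy spectrum. This is a simplified illustration, not the procedure of any particular codec: the downmix rule, the band layout, and all names are assumptions, and a real encoder would apply it only to the higher-frequency bands.

```python
import math


def band_energy(coeffs):
    return sum(c * c for c in coeffs)


def intensity_encode(left, right, band_edges):
    """Return one carrier spectrum plus a (left_gain, right_gain) pair per band."""
    carrier = [(l + r) / 2 for l, r in zip(left, right)]
    gains = []
    for lo, hi in zip(band_edges, band_edges[1:]):
        e_carrier = band_energy(carrier[lo:hi])
        if e_carrier == 0.0:
            # Nothing survived the downmix in this band.
            gains.append((0.0, 0.0))
            continue
        gains.append((
            math.sqrt(band_energy(left[lo:hi]) / e_carrier),
            math.sqrt(band_energy(right[lo:hi]) / e_carrier),
        ))
    return carrier, gains


def intensity_decode(carrier, gains, band_edges):
    """Rebuild both channels by scaling the carrier band by band."""
    left, right = [], []
    for (lo, hi), (g_l, g_r) in zip(zip(band_edges, band_edges[1:]), gains):
        left.extend(g_l * c for c in carrier[lo:hi])
        right.extend(g_r * c for c in carrier[lo:hi])
    return left, right


# Band 0: right is the left at half level (in phase). Band 1: channels out of phase.
left = [1.0, 0.5, 0.2, -0.4]
right = [0.5, 0.25, -0.2, 0.4]
band_edges = [0, 2, 4]

carrier, gains = intensity_encode(left, right, band_edges)
dec_left, dec_right = intensity_decode(carrier, gains, band_edges)
print("decoded left: ", dec_left)
print("decoded right:", dec_right)
```

The in-phase band comes back with the correct level in each channel, while the out-of-phase band cancels in the downmix and cannot be recovered, which is the phase loss described above.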
Parametric stereo, found in HE-AAC, is similar to intensity stereo, except that the directional information also includes phase and correlation. The phase information makes this algorithm also capable of keeping low frequency location cues (by inter-aural time differences), while the (de-)correlation information helps add ambience by synthesizing some difference between channels.
PS replaces a whole channel with only 2-3 kbit/s of side information. As a result, the remaining channel gets almost double the bitrate to use, so the quality gain can more than make up for the lossiness of the process. It is not useful at high bitrates.
The phase aspect is covered by a few patents filed between 1997 and 2000 (EP1107232A3, EP0797324A2), which should have expired. The ambience part (EP1927266B1) will expire in 2026, so do not expect any new experimental codec to use it yet.
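The "almost double" figure follows from simple arithmetic. The sketch below assumes 3 kbit/s of side information (the upper end of the range quoted above) and a few total bitrates picked purely for illustration.

```python
def per_channel_budget(total_kbps, ps_side_info_kbps=3.0):
    """Compare the per-channel budget of plain stereo with the downmix budget under PS."""
    plain = total_kbps / 2                      # two independently coded channels
    with_ps = total_kbps - ps_side_info_kbps    # one downmix channel plus side info
    return plain, with_ps, with_ps / plain


# Illustrative total bitrates, not taken from any specification.
for total in (24, 32, 48):
    plain, with_ps, ratio = per_channel_budget(total)
    print(f"{total} kbit/s total: {plain:.1f} per channel without PS, "
          f"{with_ps:.1f} for the downmix with PS ({ratio:.2f}x)")
```

The ratio approaches 2 as the total bitrate grows, but at those rates the fixed loss from discarding a channel outweighs the gain.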
The general idea of exploiting the redundancy among channels is called channel coupling.
Surround is structured like stereo in some ways, except now there are many more pairs that can be coupled together. The basic approach is to code together the corresponding pairs of left and right using ordinary joint stereo techniques.
In MPEG Surround, a process similar to parametric stereo is used to merge three streams into two, or two streams into one, plus a small stream of side information. A stream created by merging can itself be merged, creating a hierarchy of merges. For example, a 5.1 stream can be encoded as merges of C/LFE, L/Ls, R/Rs, then these three streams can be mixed down further if needed.
Ambisonics represents an entire sound field. In the raw representation, everything is based on spherical harmonics.
- Multi-mono lossy encoding is unacceptably bad for ambisonics. Each stream does its own thing with the phase, resulting in an incoherent sound image.
- A fixed encoding matrix, such as the one in Opus, is passable. Sources in one of the fixed directions get much better quality (because each goes into only one stream, leaving no chance of phase inconsistencies), and if the underlying codec is given enough bitrate to not mess with phase too much, the rest can be okay too.
- MPEG-H 3D Audio isolates each source in space from the input, storing a representation based on objects. This should not have any preferred direction.
MP3 supports dual-mono, M/S, and intensity methods. LAME does not support intensity stereo.
Some early MP3 encoders didn't make ideal decisions about what mode to use from frame to frame in joint stereo files, or how much bandwidth to allocate to encoding the side channel. This led to a widespread but mistaken belief that an abundance of M/S frames, or the use of joint stereo in general, always negatively impacts channel separation and other measures of audio quality. This is not an issue with modern encoders. Modern, optimized encoders will switch between mid-side coding and simple stereo coding as necessary, depending on the correlation between the left and right channels, and will allocate channel bandwidth appropriately to ensure the best mode is used for each frame.
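A per-frame mode decision can be sketched with a simple energy test. This is a hypothetical heuristic for illustration only and is not the logic used by LAME or any other encoder, which rely on psychoacoustic models; the threshold value, frame size, and function names are all assumptions.

```python
def choose_mode(left, right, threshold=0.1):
    """Pick 'MS' when either the mid or the side channel is nearly empty, else 'LR'."""
    mid_e = sum(((l + r) / 2) ** 2 for l, r in zip(left, right))
    side_e = sum(((l - r) / 2) ** 2 for l, r in zip(left, right))
    total = mid_e + side_e
    if total == 0.0:
        return "MS"  # silent frame: either mode costs nothing
    # Strongly correlated (or anti-correlated) channels leave one of M/S small.
    return "MS" if min(mid_e, side_e) / total < threshold else "LR"


def frame_modes(left, right, frame_size, threshold=0.1):
    """Apply choose_mode independently to consecutive frames."""
    return [
        choose_mode(left[i:i + frame_size], right[i:i + frame_size], threshold)
        for i in range(0, len(left), frame_size)
    ]


# First frame: identical channels. Second frame: unrelated content per channel.
left = [0.5, 0.4, 0.3, 0.2, 0.5, -0.4, 0.0, 0.0]
right = [0.5, 0.4, 0.3, 0.2, 0.0, 0.0, 0.3, -0.2]

modes = frame_modes(left, right, frame_size=4)
print(modes)
```

The point of the sketch is only that the choice is made frame by frame from the signal itself, so a joint stereo file ends up with whichever mode suits each frame.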
LAME M/S is known to better preserve stereo image than dual-mono in most circumstances, given the same bitrate budget. See Lossy.
Vorbis treats stereo information with square polar mapping, which is beneficial when the correlation between the left and right channels is strong (this can also be extended to multichannel coupling). In Vorbis, the spectrum of each channel is normalized against a floor function, which is a rough envelope of the actual spectrum. In the square polar mapping, the (stereo) phase is roughly defined as the difference between the normalized left and right amplitude of a given frequency component. If the original left and right channels are the same within a certain frequency band, apart from an overall scaling factor, then the normalized frequency spectrum is the same left and right and the stereo phase is zero over the whole frequency band. Note that in the context of polar mapping, the term 'phase' (here: 'stereo phase') has a very different meaning from the phase of a periodic wave. Unlike the Fourier transform, the cosine transform used in Vorbis and other encoders only provides amplitudes and no phases of the latter type.
Once the stereo information is represented in polar mapping as a magnitude and stereo phase, Vorbis can use three coupling methods:
- Lossless coupling is mathematically equivalent to independent encoding of the two channels ('dual mono' in MP3), but with the benefit of additional space-saving. It does polar mapping/channel interleaving using the residue vectors.
- In phase stereo, the stereo phase is quantized, i.e. stored at a lower resolution. Especially above 4 kHz, the ear is not very sensitive to phase information. Phase stereo is not currently implemented in the reference encoder due to complexity, but is intended to be re-added later. Note that phase stereo should not be compared to intensity stereo in MP3 coding.
- In point stereo, the stereo phase is discarded completely. All the stereo information comes from the difference in the spectral floors for the left and right channels.
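The magnitude/angle representation behind these coupling methods can be sketched on integer residue values. The decoder below follows the four-case inverse coupling rule as described in the Vorbis I specification (worth checking against the spec before relying on it); the forward `couple` function is written just for this sketch so that the decoder inverts it, and is not the libvorbis encoder's routine.

```python
def couple(a, b):
    """Map a residue pair (a, b) to (magnitude, angle).

    Sketch-only forward step, chosen so that decouple() inverts it exactly.
    """
    if abs(a) >= abs(b):
        magnitude = a
        angle = a - b if a > 0 else b - a
    else:
        magnitude = b
        angle = a - b if b > 0 else b - a
    return magnitude, angle


def decouple(magnitude, angle):
    """Inverse coupling: recover the residue pair from (magnitude, angle)."""
    if magnitude > 0:
        if angle > 0:
            return magnitude, magnitude - angle
        return magnitude + angle, magnitude
    if angle > 0:
        return magnitude, magnitude + angle
    return magnitude - angle, magnitude


# Lossless coupling: every pair survives the round trip.
pairs = [(5, 5), (5, 3), (-2, 4), (0, -7)]
for a, b in pairs:
    m, ang = couple(a, b)
    print((a, b), "->", (m, ang), "->", decouple(m, ang))

# Point stereo: dropping the angle leaves the same value in both channels.
m, _ = couple(5, 3)
point = decouple(m, 0)
print("point stereo:", point)
```

Identical inputs give an angle of zero, and discarding the angle collapses both residues to the magnitude, so in point stereo any remaining level difference has to come from the per-channel floors.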
Ogg Vorbis uses a mix of lossless and point stereo coupling below -q 6, and lossless channel coupling exclusively at high bitrates (-q 6 and up). This can be adjusted via an advanced-encode switch, but is not done for simplicity's sake.
Opus is capable of multi-mono, M/S with a tunable weight factor, and intensity stereo. It avoids multi-mono unless explicitly asked for, and decides between M/S and intensity according to the available bitrate and the audio content. It also calculates the stereo width to decide the total amount of bitrate needed.
With surround input, Opus can only couple channels into joint-stereo pairs. It does take advantage of surround masking.
With ambisonic input, Opus can use a fixed matrix, or do multi-mono.
- See e.g. https://web.archive.org/web/20180714000735/http://jmvalin.ca/papers/aes135_opus_celt.pdf, sections 4.5 [IS frequency], 4.5.1 [M/S angle]
- Purnhagen, Heiko (October 5–8, 2004). "Low Complexity Parametric Stereo Coding in MPEG-4" (PDF). 7th International Conference on Digital Audio Effects: 163–168.
- Phase/ambisonic issue discussed in: Mahé, Pierre; Ragot, Stéphane; Marchand, Sylvain (2 September 2019). First-Order Ambisonic Coding with PCA Matrixing and Quaternion-Based Interpolation. 22nd International Conference on Digital Audio Effects (DAFx-19), Birmingham, UK. p. 284.
- Ogg Vorbis stereo-specific channel coupling at xiph.org.