Joint stereo

From Hydrogenaudio Knowledgebase
Latest revision as of 05:12, 12 August 2023

Joint stereo refers to any stereo-encoding method that goes beyond simple encoding as two independent channels ("simple" or "L/R" stereo, also called dual mono). These methods exploit the similarities between channels and typically allow bits to be used more effectively, increasing audio quality for a given bitrate. They are, however, not guaranteed to be transparent and can instead cause audible artifacts (mostly with older encoders).

Some file formats, such as MP3, can switch among these modes on the fly, on a frame or even sub-frame basis, for the sake of efficiency or quality. For example, a high-bitrate "joint stereo" MP3 file may contain a mixture of SS and MS frames, or it may contain all SS frames or all MS frames. By a historical quirk of naming, the "joint stereo" flag in MP3 merely permits a mixture of coding modes; in other words, a non-"joint stereo" MP3 will never contain a mixture of frame types.

Stereo coding methods or "modes"

Left-Right (L/R) or "Simple" Stereo (SS)

Simple stereo is the most straightforward method of coding a stereo signal: each channel is treated as a completely separate entity. This can be inefficient and may adversely impact quality (as compared to other modes) when both channels contain nearly identical signals (i.e., are mono or nearly so).

Mid-side Stereo (MS)

Mid-side stereo coding calculates a "mid"-channel by addition of left and right channel, and a "side"-channel by subtraction, i.e.:

Encoding
M = (L + R) / 2, S = (L - R) / 2
Decoding
L = M + S, R = M - S
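
The matrix above is small enough to sketch directly. This illustrative Python (not any codec's actual code) shows the round trip is exact:

```python
def ms_encode(left, right):
    """Convert an L/R sample pair to mid/side: M = (L+R)/2, S = (L-R)/2."""
    mid = (left + right) / 2
    side = (left - right) / 2
    return mid, side

def ms_decode(mid, side):
    """Recover the original L/R pair: L = M+S, R = M-S."""
    return mid + side, mid - side

# A near-mono signal yields a small side channel, which is cheap to encode.
m, s = ms_encode(1.0, 0.5)       # m = 0.75, s = 0.25
left, right = ms_decode(m, s)    # exactly 1.0, 0.5 again
```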

Whenever a signal is concentrated in the middle of the stereo image (i.e. more mono-like), mid-side stereo can achieve a significant saving in bitrate, since one can use fewer bits to encode the side-channel. Even more important is the fact that by applying the inverse matrix in the decoder, the quantization noise becomes correlated and falls in the middle of the stereo image, where it is masked by the signal.

Unlike intensity stereo, which destroys phase information, mid-side coding is mathematically lossless (although subsequent lossy compression may cause phase degradation). Correctly implemented mid-side stereo does little or no damage to the stereo image and increases compression efficiency, either by reducing size or by increasing overall quality. Mid-side coding is also simple enough to be implemented in FM radio and stereophonic vinyl.

Mid-side stereo can use coefficients other than the fixed values above in encoding and decoding. Allowing different contributions from each channel lets the codec adapt to off-balance sources while retaining the bitrate savings. This extension is found in Opus, where an angle can be encoded.[1]
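
One way to picture an angle-based extension (a hypothetical illustration, not Opus's actual bitstream math) is as a rotation of the L/R plane: equal weights correspond to plain mid-side (up to scale and a sign flip on the side channel), other angles shift weight toward one channel, and the inverse rotation still reconstructs exactly:

```python
import math

def rotate_encode(left, right, theta):
    """Project L/R onto a rotated basis; theta = pi/4 gives plain
    mid/side up to an overall scale and the sign of the side channel."""
    m = math.cos(theta) * left + math.sin(theta) * right
    s = -math.sin(theta) * left + math.cos(theta) * right
    return m, s

def rotate_decode(m, s, theta):
    """The inverse rotation recovers L/R exactly for any theta."""
    left = math.cos(theta) * m - math.sin(theta) * s
    right = math.sin(theta) * m + math.cos(theta) * s
    return left, right

# An off-balance source: the angle lets more weight sit on the louder channel.
theta = math.radians(30)
l, r = rotate_decode(*rotate_encode(0.9, 0.1, theta), theta)
# l, r match the 0.9, 0.1 inputs up to float rounding
```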

Intensity stereo

Intensity stereo coding is a method that achieves a saving in bitrate by replacing the left and the right signal by a single representing signal plus directional information (in the form of amplitude ratios for each frequency range). This replacement is psychoacoustically justified in the higher frequency range since the human auditory system is insensitive to the signal phase at frequencies above approximately 2 kHz.[2] To maintain the justification, a codec may only apply intensity stereo to higher-frequency parameters.[1]

Intensity stereo is by definition a lossy coding method, so it is primarily useful at low bitrates. For coding at higher bitrates, only mid-side stereo should be used.
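
A toy per-band sketch of the idea (hypothetical code, not any real codec's format): store one combined magnitude per band plus a left/right ratio, then reconstruct by panning. Band magnitudes survive the round trip, but any phase relationship between the channels is gone:

```python
def intensity_encode(left_bands, right_bands):
    """For each band of (non-negative) magnitudes, keep a single summed
    magnitude plus the fraction of it belonging to the left channel."""
    coded = []
    for l, r in zip(left_bands, right_bands):
        total = l + r
        ratio = l / total if total else 0.5  # directional information
        coded.append((total, ratio))
    return coded

def intensity_decode(coded):
    """Reconstruct both channels by panning the single signal."""
    left = [total * ratio for total, ratio in coded]
    right = [total * (1 - ratio) for total, ratio in coded]
    return left, right

# Three frequency bands: centered, panned left-ish, hard right.
L = [4.0, 1.0, 0.0]
R = [4.0, 3.0, 2.0]
coded = intensity_encode(L, R)   # e.g. band 0 -> (8.0, 0.5)
```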

Parametric stereo

Parametric stereo, found in HE-AAC, is similar to intensity stereo, except that the directional information also includes phase and correlation. The phase information makes this algorithm capable of also keeping low-frequency location cues (via inter-aural time differences), while the (de-)correlation information helps add ambience by synthesizing some difference between the channels.[3]

PS replaces a whole channel with only 2-3 kbit/s of side information. As a result, the remaining channel gets almost double the bitrate to use, so the quality gain can more than make up for the lossiness of the process. It is not useful at high bitrates.

The phase aspect is covered by a few patents filed between 1997 and 2000 (EP1107232A3, EP0797324A2), which should have expired by now. The ambience part (EP1927266B1) will not expire until 2026, so do not expect any new experimental codec to use it yet.

More channels

The general idea of exploiting the redundancy among channels is called channel coupling.

Surround

Surround is structured like stereo in some ways, except that there are now many more channel pairs that can be coupled together. The basic approach is to code corresponding left/right pairs together using ordinary joint stereo techniques.

In MPEG Surround, a process similar to parametric stereo is used to merge three streams into two, or two streams into one, plus a small stream of side information. A stream created by merging can itself be merged again, creating a hierarchy of merges. For example, a 5.1 stream can be encoded as the merges C/LFE, L/Ls, and R/Rs; these three streams can then be mixed down further if needed.
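
The hierarchy of merges can be sketched as follows (a toy illustration: a plain average/difference stands in for the real parametric two-to-one process; only the channel pairing mirrors the text above):

```python
def merge(a, b):
    """Two-to-one merge: a downmix plus the side information needed to
    re-split (average/difference stands in for the parametric cues)."""
    downmix = [(x + y) / 2 for x, y in zip(a, b)]
    side = [(x - y) / 2 for x, y in zip(a, b)]
    return downmix, side

def split(downmix, side):
    """Undo a merge, recovering the two original streams."""
    return ([d + s for d, s in zip(downmix, side)],
            [d - s for d, s in zip(downmix, side)])

# 5.1 input: one tiny "frame" per channel, labelled by speaker position.
chans = {name: [float(i)] for i, name in
         enumerate(["L", "R", "C", "LFE", "Ls", "Rs"])}

# First level: merge the related pairs into three streams.
lm, _ = merge(chans["L"], chans["Ls"])
rm, _ = merge(chans["R"], chans["Rs"])
cm, _ = merge(chans["C"], chans["LFE"])

# A merged stream can itself be merged, forming a hierarchy.
front, front_side = merge(lm, rm)
```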

Ambisonics

Ambisonics represents an entire sound field. In the raw representation, everything is based on spherical harmonics.

  • Multi-mono lossy encoding is unacceptably bad for ambisonics. Each stream does its own thing with the phase, resulting in an incoherent sound image.[4]
  • A fixed encoding matrix, such as the one in Opus, is passable. A source from a fixed direction gets much better quality (because it only goes into one stream, so there is no chance of phase inconsistencies), and if the underlying codec is given enough bitrate not to mess with the phase too much, the rest can be acceptable too.
  • MPEG-H 3D Audio isolates each sound source in space from the input and stores an object-based representation. This approach should not have any preferred direction.

By format

MP3

MP3 supports dual-mono, M/S, and intensity methods. LAME does not support intensity stereo.

Some early MP3 encoders didn't make ideal decisions about what mode to use from frame to frame in joint stereo files, or how much bandwidth to allocate to encoding the side channel. This led to a widespread but mistaken belief that an abundance of M/S frames, or the use of joint stereo in general, always negatively impacts channel separation and other measures of audio quality. This is not an issue with modern encoders. Modern, optimized encoders will switch between mid-side coding or simple stereo coding as necessary, depending on the correlation between the left and right channels, and will allocate channel bandwidth appropriately to ensure the best mode is used for each frame.

LAME's M/S mode is known to preserve the stereo image better than dual-mono in most circumstances, given the same bitrate budget. See Lossy.

Vorbis

Vorbis treats stereo information with square polar mapping, which is beneficial when the correlation between the left and right channels is strong (the scheme can also be extended to multichannel coupling). In Vorbis, the spectrum of each channel is normalized against a floor function, which is a rough envelope of the actual spectrum. In the square polar mapping, the (stereo) phase is roughly defined as the difference between the normalized left and right amplitudes of a given frequency component. If the original left and right channels are the same within a certain frequency band, apart from an overall scaling factor, then the normalized spectrum is the same left and right, and the stereo phase is zero over the whole band. Note that in the context of polar mapping, the term 'phase' (here: 'stereo phase') has a very different meaning from the phase of a periodic wave. Unlike the Fourier transform, the cosine transform used in Vorbis and other encoders provides only amplitudes, not phases of the latter type.
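
A rough numeric illustration of the 'stereo phase' described above (hypothetical code; Vorbis's real square polar mapping works on residue vectors and differs in detail):

```python
def stereo_phase(left_mags, right_mags, left_floor, right_floor):
    """Normalize each channel's magnitudes against its floor curve, then
    take the per-component difference as the rough 'stereo phase'."""
    norm_l = [a / f for a, f in zip(left_mags, left_floor)]
    norm_r = [a / f for a, f in zip(right_mags, right_floor)]
    return [l - r for l, r in zip(norm_l, norm_r)]

# The right channel is the left scaled by 0.5.  The floors absorb that
# overall scaling, so the normalized spectra match and the stereo phase
# is zero across the band -- the most coupling-friendly case.
phase = stereo_phase([2.0, 4.0, 8.0], [1.0, 2.0, 4.0],
                     [2.0, 2.0, 2.0], [1.0, 1.0, 1.0])
# phase == [0.0, 0.0, 0.0]
```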

Once the stereo information is represented in polar mapping as a magnitude and stereo phase, Vorbis can use three coupling methods:[5]

  • Lossless coupling is mathematically equivalent to independent encoding of the two channels ('dual mono' in MP3), but with the benefit of additional space-saving. It does polar mapping/channel interleaving using the residue vectors.
  • In phase stereo, the stereo phase is quantized, i.e. stored at a lower resolution. Especially above 4 kHz, the ear is not very sensitive to phase information. Phase stereo is not currently implemented in the reference encoder due to its complexity, but may be re-added later. Note that phase stereo should not be compared to intensity stereo in MP3 coding.
  • In point stereo, the stereo phase is discarded completely. All the stereo information comes from the difference in the spectral floors for the left and right channels.

Ogg Vorbis uses a mixture of lossless and point stereo coupling below -q 6; at -q 6 and above, lossless channel coupling is used exclusively. This can be adjusted via an advanced-encode switch, but by default it is not, for simplicity's sake.

Opus

Opus is capable of multi-mono, M/S with a tunable weight factor, and intensity stereo. It avoids multi-mono unless explicitly asked for, and decides between M/S and intensity stereo based on the available bitrate and the audio content. It also calculates the stereo width to decide the total amount of bitrate needed.

With surround input, Opus can only couple channels into joint-stereo pairs. It does, however, take advantage of surround masking.

With ambisonic input, Opus can use a fixed matrix, or do multi-mono.

External Links

  • joint stereo at Wikipedia (http://en.wikipedia.org/wiki/Joint_stereo)
  • Parametric Stereo at Coding Technologies (http://www.codingtechnologies.com/products/paraSter.htm)

References

  1. a b See e.g. https://web.archive.org/web/20180714000735/http://jmvalin.ca/papers/aes135_opus_celt.pdf, sections 4.5 [IS frequency], 4.5.1 [M/S angle]
  2. http://www.hydrogenaudio.org/forums/index.php?showtopic=1491&view=findpost&p=14091
  3. Purnhagen, Heiko (October 5–8, 2004). "Low Complexity Parametric Stereo Coding in MPEG-4" (PDF). 7th International Conference on Digital Audio Effects: 163–168. http://dafx.de/paper-archive/2004/P_163.PDF
  4. Phase/ambisonic issue discussed in: Mahé, Pierre; Ragot, Stéphane; Marchand, Sylvain (2 September 2019). "First-Order Ambisonic Coding with PCA Matrixing and Quaternion-Based Interpolation". 22nd International Conference on Digital Audio Effects (DAFx-19), Birmingham, UK. p. 284. https://hal.science/hal-02289558
  5. "Ogg Vorbis stereo-specific channel coupling" at xiph.org. http://www.xiph.org/vorbis/doc/stereo.html