SoundStream is the first neural audio codec to work on speech and music, while being able to run in real-time on a smartphone CPU. The main technical ingredient of SoundStream is a neural network, consisting of an encoder, decoder and quantizer, all of which are trained end-to-end.
It delivers state-of-the-art quality over a broad range of bitrates with a single trained model, a significant advance in learnable codecs.
SoundStream's encoder produces vectors that can take an unbounded number of values. To transmit them to the receiver using a limited number of bits, each one must be replaced by the closest vector from a finite set, called a codebook. This approach works well at bitrates around 1 kbps or lower, but quickly reaches its limits at higher bitrates, as the comparative examples in the original article show.
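The codebook idea can be sketched as follows. This is a minimal illustration, not SoundStream's actual implementation: the codebook here is random rather than learned, and the sizes are made up. Each continuous vector is replaced by the index of its nearest codebook entry, so only that index (here, 4 bits for 16 entries) needs to be transmitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical codebook: 16 entries of dimension 4, so each input vector
# can be transmitted with just log2(16) = 4 bits (its index).
codebook = rng.normal(size=(16, 4))

def quantize(vec, codebook):
    """Return the index of the codebook entry closest to vec (L2 distance)."""
    dists = np.linalg.norm(codebook - vec, axis=1)
    return int(np.argmin(dists))

vec = rng.normal(size=4)        # a continuous vector from the encoder
idx = quantize(vec, codebook)   # what gets sent over the wire
reconstruction = codebook[idx]  # what the receiver decodes
```

The receiver only needs the same codebook and the transmitted index to recover an approximation of the original vector; the quality of that approximation depends on how large the codebook is, which is exactly why high bitrates become problematic.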
By splitting the quantization process into several layers, the codebook size can be reduced drastically. This also pushes the decoder to perform well at any bitrate of the incoming audio stream, which is what makes SoundStream "scalable".
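The layered scheme can be sketched as a residual quantizer. Again this is a simplified illustration with made-up sizes and random codebooks, not the trained model: each layer quantizes the residual left over by the previous one, so two codebooks of 16 entries cover 16 × 16 = 256 combinations while storing only 32 vectors, and dropping later layers at decode time simply lowers the bitrate.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical two-layer setup: each layer has its own small codebook.
codebooks = [rng.normal(size=(16, 4)) for _ in range(2)]

def rvq_encode(vec, codebooks):
    """Quantize vec layer by layer; each layer encodes the remaining residual."""
    residual = vec.copy()
    indices = []
    for cb in codebooks:
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        indices.append(idx)
        residual = residual - cb[idx]  # pass what's left to the next layer
    return indices

def rvq_decode(indices, codebooks):
    """Sum the selected entries; using fewer layers means a lower bitrate."""
    return sum(cb[i] for i, cb in zip(indices, codebooks))

vec = rng.normal(size=4)
indices = rvq_encode(vec, codebooks)     # one index per layer
approx = rvq_decode(indices, codebooks)  # full-bitrate reconstruction
coarse = rvq_decode(indices[:1], codebooks)  # lower-bitrate reconstruction
```

Because the decoder sees reconstructions at every depth during training, it learns to produce reasonable audio no matter how many layers' worth of indices actually arrive.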
Efficient compression is necessary whenever one needs to transmit audio, whether when streaming a video, or during a conference call. SoundStream is an important step towards improving machine learning-driven audio codecs.
SoundStream can combine compression with background noise suppression by activating and deactivating denoising dynamically. By integrating SoundStream into Lyra, developers can leverage existing Lyra APIs and tools, gaining both flexibility and better sound quality.
This content was summarized by an experimental AI. Feel free to let me know what you think in the comments!
The original article can be found here.