Picture a scenario where you type “dramatic intro music” and are greeted with a majestic symphony, or jot down “creepy footsteps” and instantly receive top-tier sound effects. That is precisely what Stable Audio, an AI model unveiled by Stability AI on Wednesday, promises to deliver: the ability to synthesize music or sounds from written descriptions. In the not-so-distant future, this technology could pose a challenge to musicians and their job security.
To provide some context, Stability AI, the company behind Stable Diffusion, a latent diffusion image synthesis model released in August 2022, has expanded beyond image generation into audio. It backed Harmonai, an AI lab that introduced the music generator Dance Diffusion in September. Now, in collaboration with Harmonai, Stability AI is venturing into commercial AI audio production with Stable Audio. Judging by the initial samples, it represents a significant leap in audio quality over previous AI audio generators.
Stability showcases the capabilities of its AI model on its promotional page with prompts like “epic trailer music intense tribal percussion and brass” and “lofi hip hop beat melodic chillhop 85 bpm.” It also offers examples of sound effects created using Stable Audio, such as an airline pilot communicating over an intercom and people conversing in a bustling restaurant.
To train the model, Stability partnered with stock music provider AudioSparx and used a dataset of more than 800,000 audio files containing music, sound effects, and single-instrument stems, along with corresponding text metadata. After being trained on roughly 19,500 hours of that audio, Stable Audio can reproduce specific sounds described in text because it has learned to associate textual descriptions with those sounds within its neural network.
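As a rough illustration of what such text-audio training pairs might look like, here is a hypothetical Python sketch; the file names, paths, and descriptions are invented for illustration and are not taken from the AudioSparx dataset.

```python
# Hypothetical sketch of how text-audio pairs might be assembled for training.
# Every path and description below is made up; none comes from the real dataset.
from dataclasses import dataclass

@dataclass
class AudioTextPair:
    audio_path: str   # a music file, sound effect, or single-instrument stem
    description: str  # the text metadata the model learns to associate with it

training_pairs = [
    AudioTextPair("stems/drums_120bpm.wav", "driving rock drum stem, 120 bpm"),
    AudioTextPair("sfx/restaurant_crowd.wav", "people conversing in a busy restaurant"),
    AudioTextPair("music/epic_trailer.wav", "epic trailer music, intense tribal percussion and brass"),
]
```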
Stable Audio comprises several components working in unison to generate customized audio quickly. One component compresses the audio into a condensed form that retains essential features while discarding unnecessary noise, which speeds up both training and generation. Another uses the text metadata describing the music and sounds to guide what audio is generated.
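The following PyTorch sketch shows, in highly simplified form, how two such components could fit together. It is not Stability AI's implementation; every module name, layer size, and the single toy "denoising" update are illustrative assumptions.

```python
# Illustrative sketch (not Stability AI's code): an autoencoder that compresses
# raw audio into a compact latent, plus a denoiser steered by a text embedding.
import torch
import torch.nn as nn

class AudioAutoencoder(nn.Module):
    """Compresses stereo waveforms into compact latents and reconstructs them."""
    def __init__(self, latent_channels=64):
        super().__init__()
        # Strided 1-D convolutions downsample the waveform heavily.
        self.encoder = nn.Sequential(
            nn.Conv1d(2, 32, kernel_size=8, stride=4, padding=2), nn.GELU(),
            nn.Conv1d(32, latent_channels, kernel_size=8, stride=4, padding=2),
        )
        # Transposed convolutions upsample latents back to stereo audio.
        self.decoder = nn.Sequential(
            nn.ConvTranspose1d(latent_channels, 32, kernel_size=8, stride=4, padding=2), nn.GELU(),
            nn.ConvTranspose1d(32, 2, kernel_size=8, stride=4, padding=2),
        )

    def encode(self, audio):   # audio: (batch, 2 channels, samples)
        return self.encoder(audio)

    def decode(self, latent):
        return self.decoder(latent)

class TextConditionedDenoiser(nn.Module):
    """Predicts an update to a noisy latent, conditioned on a text embedding."""
    def __init__(self, latent_channels=64, text_dim=512):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, latent_channels)
        self.net = nn.Conv1d(latent_channels, latent_channels, kernel_size=3, padding=1)

    def forward(self, noisy_latent, text_embedding):
        cond = self.text_proj(text_embedding).unsqueeze(-1)  # broadcast over time
        return self.net(noisy_latent + cond)

# A single, drastically simplified step: start from noise in latent space,
# nudge it toward the prompt embedding, then decode back to stereo audio.
autoencoder = AudioAutoencoder()
denoiser = TextConditionedDenoiser()
text_embedding = torch.randn(1, 512)   # stand-in for a real text encoder's output
latent = torch.randn(1, 64, 256)       # pure noise in the compressed latent space
latent = latent - 0.1 * denoiser(latent, text_embedding)  # one toy update
audio = autoencoder.decode(latent)     # (1, 2, ~4096 samples)
```

In a real latent diffusion system, that single update would be replaced by many iterative denoising steps, but working in the compressed latent space rather than on raw samples is what keeps each step cheap.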
To expedite the process, Stable Audio employs a highly simplified, compressed audio representation to reduce inference time—the time taken by a machine learning model to produce an output after receiving input. According to Stability AI, Stable Audio can generate 95 seconds of stereo audio at a 44.1 kHz sample rate (often referred to as “CD quality”) in less than one second using an Nvidia A100 GPU. The A100 is a powerful data center GPU designed for AI applications, surpassing the capabilities of a typical desktop gaming GPU.
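To make the scale concrete, here is a back-of-the-envelope calculation; the 64x compression factor is purely an assumption for illustration, not a figure Stability AI has published.

```python
# Why latent compression speeds up generation: the raw waveform is enormous,
# while the model only has to operate on the much smaller compressed form.
seconds = 95
sample_rate = 44_100   # "CD quality"
channels = 2           # stereo

raw_values = seconds * sample_rate * channels
compression_factor = 64  # hypothetical downsampling by the autoencoder
latent_values = raw_values // compression_factor

print(f"raw samples:   {raw_values:,}")     # 8,379,000
print(f"latent values: {latent_values:,}")  # roughly 130,000
```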
It’s worth noting that Stable Audio is not the first music generator based on latent diffusion techniques. Prior examples include Riffusion, a hobbyist take on an audio version of Stable Diffusion, though its output quality lags well behind Stable Audio’s samples. In January, Google introduced MusicLM, an AI music generator that outputs 24 kHz audio, and in August, Meta launched a suite of open-source audio tools, including a text-to-music generator called AudioCraft. With its 44.1 kHz stereo output, Stable Audio raises the bar.
Stability plans to offer Stable Audio in two tiers: a free option and a $12 monthly Pro plan. The free tier permits users to generate up to 20 tracks per month, each with a maximum duration of 20 seconds. The Pro plan extends these limits, allowing for 500 track generations per month with track lengths of up to 90 seconds. Future releases from Stability are anticipated to include open-source models based on the Stable Audio architecture and training code for those interested in developing audio generation models.
As it stands, Stable Audio appears to usher in an era of high-quality AI-generated music, given its impressive audio fidelity. Whether musicians will welcome the prospect of being displaced by AI models is another matter; if the reaction of visual artists to image synthesis is any guide, many will resist it. For now, human creativity still outshines anything AI can produce, but that balance may shift in the future. Either way, AI-generated audio may become another tool in the audio professional’s toolkit.