Introduction
AudioCraft is a framework from Meta AI that brings together several generative audio models under one codebase. It includes MusicGen for text-to-music, AudioGen for text-to-sound-effects, and EnCodec for neural audio compression, all accessible through a Python API.
What AudioCraft Does
- Generates music from text descriptions via MusicGen
- Creates sound effects and ambient audio from text prompts via AudioGen
- Compresses audio at low bitrates (down to 1.5 kbps) via the EnCodec neural codec
- Supports melody-conditioned generation, producing music that follows a given tune
- Provides multiple model sizes, from 300M to 3.3B parameters, for different compute budgets
Architecture Overview
MusicGen and AudioGen use a single-stage autoregressive transformer that operates on discrete audio tokens produced by EnCodec. Unlike prior work that cascades several generation stages, they rely on a codebook interleaving pattern that lets a single transformer predict all codebook streams in parallel. EnCodec is a convolutional encoder-decoder with a residual vector quantization (RVQ) bottleneck that compresses audio at bitrates as low as 1.5 kbps while maintaining perceptual quality.
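To make the interleaving idea concrete, here is a minimal sketch of one such pattern, often called a "delay" pattern, in which codebook stream k is shifted right by k steps so that a single decoding step can emit one token per stream. The function names and the `PAD` sentinel are hypothetical and chosen for illustration; this is plain Python over lists, not AudioCraft's own implementation.

```python
PAD = -1  # assumed sentinel for positions that hold no real token


def apply_delay_pattern(codes, pad=PAD):
    """Shift codebook stream k right by k steps.

    codes: list of K token lists, each of length T.
    Returns K lists of length T + K - 1.
    """
    num_codebooks = len(codes)
    num_steps = len(codes[0])
    total = num_steps + num_codebooks - 1
    delayed = []
    for k, stream in enumerate(codes):
        right_pad = total - k - num_steps
        delayed.append([pad] * k + list(stream) + [pad] * right_pad)
    return delayed


def undo_delay_pattern(delayed):
    """Inverse of apply_delay_pattern: realign the streams in time."""
    num_codebooks = len(delayed)
    num_steps = len(delayed[0]) - num_codebooks + 1
    return [row[k:k + num_steps] for k, row in enumerate(delayed)]


if __name__ == "__main__":
    codes = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    for row in apply_delay_pattern(codes):
        print(row)
```

Reading the delayed layout column by column gives the tokens produced at each decoding step: the sequence grows by only K - 1 steps, rather than being flattened into K times as many steps.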
Self-Hosting & Configuration
- Install from PyPI with pip (pip install -U audiocraft) or clone the repository for development
- Requires PyTorch 2.0+; a CUDA-capable GPU is strongly recommended, since generation on CPU is very slow
- Small model (300M) runs on 4 GB VRAM; large model (3.3B) needs 16 GB+
- Pre-trained weights download automatically from Hugging Face on first use
- Gradio demo script included for a web-based generation interface
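The setup notes above can be tied together in a short sketch: pick a checkpoint from the available VRAM, then generate and save a few clips. `pick_checkpoint` and `generate_clips` are hypothetical helpers, and the 16 GB threshold simply mirrors the figures listed above. The AudioCraft calls (`MusicGen.get_pretrained`, `set_generation_params`, `generate`, `audio_write`) follow the upstream README as best understood here; check them against the version you install.

```python
# Hugging Face checkpoint names for the smallest and largest MusicGen models.
SMALL = "facebook/musicgen-small"  # about 300M parameters
LARGE = "facebook/musicgen-large"  # about 3.3B parameters


def pick_checkpoint(vram_gb):
    """Hypothetical helper: use the large model only with 16 GB+ of VRAM."""
    return LARGE if vram_gb >= 16 else SMALL


def generate_clips(descriptions, duration=8.0, vram_gb=4.0, out_prefix="clip"):
    """Generate one clip per text description and write each to disk."""
    # Imported lazily so this module still loads where audiocraft is absent.
    from audiocraft.models import MusicGen
    from audiocraft.data.audio import audio_write

    # Weights are downloaded on first use.
    model = MusicGen.get_pretrained(pick_checkpoint(vram_gb))
    model.set_generation_params(duration=duration)  # clip length in seconds

    wavs = model.generate(descriptions)  # tensor: [batch, channels, samples]
    for i, wav in enumerate(wavs):
        # audio_write adds the file extension itself (WAV by default).
        audio_write(f"{out_prefix}_{i}", wav.cpu(), model.sample_rate,
                    strategy="loudness")
```

A call such as `generate_clips(["lo-fi beat with warm piano"], duration=10)` would be expected to write `clip_0.wav` to the working directory.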
Key Features
- Text-to-music generation with controllable duration up to 30 seconds
- Melody conditioning allows music generation guided by a hummed or recorded tune
- EnCodec neural codec achieves high-quality compression at 1.5-24 kbps
- Single-stage transformer avoids cascaded model complexity
- Stereo and mono generation supported across model sizes
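The 1.5 to 24 kbps range follows from simple arithmetic on the residual vector quantizer: each codebook contributes (frames per second) × (bits per code). The sketch below assumes the figures published for the 24 kHz EnCodec model (75 frames per second, 1024 entries per codebook), which this document does not state; `rvq_bitrate_kbps` is a hypothetical helper.

```python
import math


def rvq_bitrate_kbps(num_codebooks, frame_rate_hz=75, codebook_size=1024):
    """Bitrate of an RVQ token stream in kbps.

    Defaults are assumed values for the 24 kHz EnCodec model.
    """
    bits_per_code = math.log2(codebook_size)  # 10 bits for 1024 entries
    return num_codebooks * frame_rate_hz * bits_per_code / 1000.0


if __name__ == "__main__":
    for n in (2, 4, 8, 16, 32):
        print(n, "codebooks ->", rvq_bitrate_kbps(n), "kbps")
```

Under these assumptions, 2 codebooks give the 1.5 kbps floor and 32 codebooks give the 24 kbps ceiling, so the bitrate is adjusted by choosing how many codebooks to keep.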
Comparison with Similar Tools
- Stable Audio — commercial offering from Stability AI with longer outputs but closed weights
- MusicLM — Google research model with strong quality but no public weights or code
- Bark — generates speech, music, and effects but with less musical coherence than MusicGen
- Riffusion — uses spectrograms with Stable Diffusion for music, creative but lower fidelity
- AIVA — symbolic AI composer for sheet music, different paradigm from waveform generation
FAQ
Q: How long can generated audio clips be? A: MusicGen can generate clips up to 30 seconds. Longer compositions require chunked generation with overlap blending.
Q: Can I fine-tune MusicGen on my own music dataset? A: Yes, AudioCraft includes training code for fine-tuning MusicGen on custom audio data with text descriptions.
Q: What audio formats are supported? A: The models operate on waveform tensors rather than files; MusicGen generates audio at 32 kHz and AudioGen at 16 kHz. Output can be saved to any format supported by torchaudio.
Q: Does AudioCraft support real-time streaming generation? A: The current implementation generates audio offline. Real-time streaming is not natively supported but EnCodec can encode and decode in a streaming fashion.
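The overlap blending mentioned in the first answer can be sketched as a linear cross-fade between consecutive chunks. This is an illustrative sketch only, with a hypothetical `crossfade_concat` helper working on plain lists of mono samples. It assumes each chunk was generated so that its first `overlap` samples cover the same stretch of audio as the previous chunk's tail (for example, by continuing from it). Blending on its own does not make independently generated chunks musically coherent.

```python
def crossfade_concat(chunks, overlap):
    """Join mono sample lists, cross-fading `overlap` samples at each seam."""
    if not chunks:
        return []
    out = list(chunks[0])
    for chunk in chunks[1:]:
        if overlap <= 0:
            out.extend(chunk)
            continue
        if overlap > len(out) or overlap > len(chunk):
            raise ValueError("overlap is longer than a chunk")
        tail = out[-overlap:]
        head = chunk[:overlap]
        blended = []
        for i in range(overlap):
            w = (i + 1) / (overlap + 1)  # weight of the incoming chunk
            blended.append((1 - w) * tail[i] + w * head[i])
        out[-overlap:] = blended      # replace the old tail with the blend
        out.extend(chunk[overlap:])   # append the rest of the new chunk
    return out
```

With 32 kHz audio, an overlap of one second corresponds to `overlap=32000` samples; the right length is a tuning choice rather than a fixed rule.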