Introduction
Demucs is a music source separation library developed at Meta AI. Its latest version, Hybrid Transformer Demucs (HTDemucs), combines convolutional waveform and spectrogram branches with a transformer to separate mixed audio into individual instrument stems with high fidelity. Applications range from karaoke creation to music production and remixing.
What Demucs Does
- Separates music into four default stems: vocals, drums, bass, and other instruments
- Offers a two-stem mode for quick vocal/accompaniment separation
- Processes audio files in MP3, WAV, FLAC, and other common formats
- Supports GPU-accelerated and CPU-only processing
- Provides an experimental 6-stem model (htdemucs_6s) that adds piano and guitar stems
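The relationship between the four-stem and two-stem outputs can be sketched in a few lines. This is a minimal illustration, assuming the two-stem output is the selected stem plus the sum of every remaining stem; the helper name to_two_stems and the toy sample lists are hypothetical and stand in for real audio tensors.

```python
from typing import Dict, List

Stem = List[float]  # mono samples; a stand-in for real audio tensors


def to_two_stems(stems: Dict[str, Stem], keep: str = "vocals") -> Dict[str, Stem]:
    """Collapse a full separation into `keep` plus the sum of every other stem."""
    if keep not in stems:
        raise KeyError(f"unknown stem: {keep!r}")
    length = len(stems[keep])
    rest = [0.0] * length
    for name, samples in stems.items():
        if name == keep:
            continue
        if len(samples) != length:
            raise ValueError("all stems must have the same length")
        # Accumulate the non-selected stems sample by sample.
        rest = [a + b for a, b in zip(rest, samples)]
    return {keep: list(stems[keep]), f"no_{keep}": rest}


# Toy four-stem result with three samples per stem (illustrative values only).
four_stems = {
    "vocals": [0.5, 0.25, 0.0],
    "drums": [0.125, 0.0, 0.25],
    "bass": [0.25, 0.25, 0.25],
    "other": [0.0, 0.5, 0.125],
}
two = to_two_stems(four_stems, keep="vocals")
print(sorted(two))       # ['no_vocals', 'vocals']
print(two["no_vocals"])  # [0.375, 0.75, 0.625]
```

Under this assumption, the accompaniment track is just the remaining stems mixed back together.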
Architecture Overview
HTDemucs is a hybrid, two-branch U-Net. A temporal branch applies convolutions directly to the waveform, while a parallel spectral branch operates on the STFT of the same signal. At the innermost layers, a cross-domain transformer encoder fuses the two branches, using self-attention within each domain and cross-attention between them. The model is trained end-to-end with an L1 loss on the output waveforms, using the MUSDB18-HQ dataset and additional internal training data.
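As a rough illustration of the waveform L1 term in the training objective, it can be written as the mean absolute error between estimated and reference stems. The notation is assumed here and not taken from the Demucs source: S is the number of stems, C the number of audio channels, T the number of samples, x the reference waveform, and x-hat the model's estimate.

```latex
\mathcal{L}_{1}
  = \frac{1}{S\,C\,T}
    \sum_{s=1}^{S} \sum_{c=1}^{C} \sum_{t=1}^{T}
    \left| \hat{x}_{s,c,t} - x_{s,c,t} \right|
```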
Self-Hosting & Configuration
- Install from PyPI with a single pip command
- Works on CPU for basic use; CUDA GPU recommended for faster processing
- Typical GPU processing speed is 5-10x faster than real-time on consumer hardware
- Models are downloaded automatically on first use (approximately 80 MB per model)
- The --overlap and --segment options set the chunk overlap and chunk length, trading speed and memory use for separation quality
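The options above can be assembled programmatically. The sketch below builds an argument list for the Demucs command line; build_demucs_args and run_separation are hypothetical helper names. The flags used (-n, --two-stems, -d, --segment, --overlap, -o) and the demucs.separate.main entry point are assumed to match the installed Demucs version, so check them against demucs --help. Installation is assumed to be python3 -m pip install -U demucs.

```python
from typing import List, Optional


def build_demucs_args(
    track: str,
    model: str = "htdemucs",
    two_stems: Optional[str] = None,
    device: Optional[str] = None,
    segment: Optional[int] = None,
    overlap: Optional[float] = None,
    out_dir: Optional[str] = None,
) -> List[str]:
    """Build a Demucs CLI argument list (flag names assumed; verify with --help)."""
    args = ["-n", model]
    if two_stems:
        args += ["--two-stems", two_stems]   # e.g. "vocals"
    if device:
        args += ["-d", device]               # e.g. "cuda" or "cpu"
    if segment is not None:
        args += ["--segment", str(segment)]  # chunk length in seconds
    if overlap is not None:
        args += ["--overlap", str(overlap)]  # fraction of overlap between chunks
    if out_dir:
        args += ["-o", out_dir]
    args.append(track)
    return args


def run_separation(args: List[str]) -> None:
    """Run Demucs in-process; requires the demucs package to be installed."""
    import demucs.separate  # imported lazily so the builder works without Demucs

    demucs.separate.main(args)


args = build_demucs_args(
    "song.mp3", two_stems="vocals", device="cpu", segment=7, overlap=0.25
)
print(" ".join(["demucs"] + args))
# demucs -n htdemucs --two-stems vocals -d cpu --segment 7 --overlap 0.25 song.mp3
```

Calling run_separation(args) would start the actual separation and download the model on first use, so it is left out of the example.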
Key Features
- Hybrid transformer-convolution architecture achieves state-of-the-art separation quality
- Simple CLI requires just one command to separate a track
- Python API available for integration into audio processing pipelines
- Multiple pre-trained models including the fine-tuned htdemucs_ft for best quality
- Supports segment-based processing for long tracks with limited memory
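Segment-based processing can be illustrated with a generic overlap-add scheme: split the signal into overlapping chunks, process each chunk independently, and blend the results with weights that favor the middle of each chunk. This is a sketch of the general idea under those assumptions, not a copy of the Demucs implementation. The names process_in_segments and triangle_weights are hypothetical, and the identity function stands in for the model.

```python
from typing import Callable, List


def triangle_weights(length: int) -> List[float]:
    """Weights that peak mid-segment, so segment edges contribute least."""
    half = (length + 1) // 2
    ramp = [float(i + 1) for i in range(half)]
    return ramp + ramp[: length - half][::-1]


def process_in_segments(
    samples: List[float],
    model: Callable[[List[float]], List[float]],
    segment: int,
    overlap: float = 0.25,
) -> List[float]:
    """Apply `model` chunk by chunk and cross-fade the overlapping regions."""
    if segment <= 0:
        raise ValueError("segment must be positive")
    if not 0.0 <= overlap < 1.0:
        raise ValueError("overlap must be in [0, 1)")
    stride = max(1, int(segment * (1.0 - overlap)))
    out = [0.0] * len(samples)
    weight_sum = [0.0] * len(samples)
    for start in range(0, len(samples), stride):
        chunk = samples[start : start + segment]
        processed = model(chunk)  # only one chunk is in flight at a time
        weights = triangle_weights(segment)[: len(chunk)]
        for i, (value, w) in enumerate(zip(processed, weights)):
            out[start + i] += w * value
            weight_sum[start + i] += w
    # Normalize by the accumulated weights to undo the cross-fade gain.
    return [o / w for o, w in zip(out, weight_sum)]


# With an identity "model", the blended output should reproduce the input.
signal = [float(i % 7) for i in range(50)]
result = process_in_segments(signal, lambda chunk: list(chunk), segment=16, overlap=0.25)
max_error = max(abs(a - b) for a, b in zip(signal, result))
print(len(result), max_error < 1e-9)
```

A larger overlap means more chunks and more compute but smoother transitions at chunk boundaries. A shorter segment lowers peak memory at the cost of less context per chunk.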
Comparison with Similar Tools
- Spleeter: Deezer's open-source separator, faster but lower quality than Demucs
- Open-Unmix: reference implementation for music separation, lightweight but less accurate
- BSRNN: band-split recurrent network with competitive quality but less accessible
- LALAL.AI: commercial separation service with good quality, no local deployment
- UVR (Ultimate Vocal Remover): GUI tool that wraps multiple models, including Demucs
FAQ
Q: How long does separation take? A: On a modern NVIDIA GPU, Demucs processes a 4-minute song in approximately 30-60 seconds. CPU processing takes 5-10 minutes for the same track.
Q: Can I separate stems other than the default four? A: The htdemucs_6s model provides 6 stems: vocals, drums, bass, guitar, piano, and other. Custom stem configurations require retraining.
Q: Does Demucs work on podcasts or speech audio? A: Demucs is optimized for music separation. For speech separation or noise removal, dedicated speech enhancement models may perform better.
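Q: What audio quality does Demucs output? A: The pre-trained models operate at 44.1 kHz. Demucs resamples the input to that rate when needed and writes stems at the model's sample rate, as WAV files by default. For best results, use high-quality source files (WAV or FLAC at 44.1 kHz).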
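To make the stem sets from the FAQ concrete, the sketch below lists the files a run would be expected to produce. It assumes the default output layout of separated/MODEL/TRACK/STEM.wav and a no_STEM name for the accompaniment in two-stem mode. The MODEL_STEMS table and the expected_outputs helper are hypothetical, so verify the layout against your installed version.

```python
from pathlib import PurePosixPath
from typing import Dict, List, Optional

# Stem names per model, as described in this document (assumed ordering).
MODEL_STEMS: Dict[str, List[str]] = {
    "htdemucs": ["drums", "bass", "other", "vocals"],
    "htdemucs_ft": ["drums", "bass", "other", "vocals"],
    "htdemucs_6s": ["drums", "bass", "other", "vocals", "guitar", "piano"],
}


def expected_outputs(
    track: str,
    model: str = "htdemucs",
    two_stems: Optional[str] = None,
    out_dir: str = "separated",
    ext: str = "wav",
) -> List[str]:
    """List the stem files a run is expected to write (layout is an assumption)."""
    stems = MODEL_STEMS[model]
    if two_stems is not None:
        if two_stems not in stems:
            raise ValueError(f"{model} has no stem named {two_stems!r}")
        stems = [two_stems, f"no_{two_stems}"]
    track_name = PurePosixPath(track).stem  # file name without its extension
    base = PurePosixPath(out_dir) / model / track_name
    return [str(base / f"{stem}.{ext}") for stem in stems]


for path in expected_outputs("music/song.mp3", model="htdemucs_6s"):
    print(path)
print(expected_outputs("music/song.mp3", two_stems="vocals"))
```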