Skills · May 21, 2026 · 1 min read

Demucs — AI-Powered Music Source Separation

Demucs is a state-of-the-art music source separation model from Meta Research that splits audio tracks into vocals, drums, bass, and other instrument stems.

Universal CLI install command
npx tokrepo install d9e3e25f-54cb-11f1-9bc6-00163e2b0d79

Introduction

Demucs is a music source separation library developed at Meta Research. Its latest version, Hybrid Transformer Demucs (HTDemucs), combines temporal convolutions with a transformer to separate mixed audio into individual instrument stems with high fidelity. Typical uses range from karaoke creation to music production and remixing.

What Demucs Does

  • Separates music into four default stems: vocals, drums, bass, and other instruments
  • Offers a two-stem mode for quick vocal/accompaniment separation
  • Processes audio files in MP3, WAV, FLAC, and other common formats
  • Supports GPU-accelerated and CPU-only processing
  • Provides an experimental 6-stem model (htdemucs_6s) that adds guitar and piano stems
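
As a rough illustration of the four-stem and two-stem modes listed above, the sketch below builds the `demucs` command line and lists the stem names each mode should produce. The helper names `demucs_command` and `expected_stems` are hypothetical and exist only for this example; the `--two-stems` flag and the `no_<stem>` naming of the accompaniment follow the Demucs CLI as far as we know, but check `demucs --help` for your installed version.

```python
# Minimal sketch: assemble a Demucs CLI call and predict the stem names.
# Nothing here runs Demucs; it only builds the argument list.

FOUR_STEMS = ["vocals", "drums", "bass", "other"]


def demucs_command(track, two_stems=None):
    """Return the argv list for separating `track` (hypothetical helper)."""
    cmd = ["demucs"]
    if two_stems is not None:
        # Two-stem mode: keep one stem and merge everything else.
        cmd += ["--two-stems", two_stems]
    cmd.append(str(track))
    return cmd


def expected_stems(two_stems=None):
    """Return the stem names we expect in the output folder."""
    if two_stems is None:
        return list(FOUR_STEMS)
    return [two_stems, "no_" + two_stems]


print(demucs_command("song.mp3", two_stems="vocals"))
print(expected_stems("vocals"))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once Demucs is installed.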

Architecture Overview

HTDemucs is a hybrid, two-branch U-Net. A temporal branch processes the waveform directly while a parallel spectral branch operates on STFT representations. At the bottleneck, a transformer encoder applies self-attention within each domain and cross-attention between them to fuse information from the two branches. The model is trained end-to-end with a combination of L1 loss on waveforms and multi-resolution STFT loss, using the MUSDB18-HQ dataset and additional internal training data.

Self-Hosting & Configuration

  • Install from PyPI with a single pip command
  • Works on CPU for basic use; CUDA GPU recommended for faster processing
  • Typical GPU processing speed is 5-10x faster than real-time on consumer hardware
  • Models are downloaded automatically on first use (approximately 80 MB per model)
  • Adjustable overlap (--overlap) and segment length (--segment) trade speed and memory use against separation quality
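
To make the overlap and chunk-size trade-off concrete, here is a minimal sketch of how a long track could be cut into overlapping segments. It is an illustration only, not Demucs's internal code: the function name `chunk_offsets` and the example numbers are assumptions, and a real implementation also has to blend the overlapping regions when stitching the output back together.

```python
# Illustrative sketch of overlapping chunking (not Demucs's own code).

def chunk_offsets(total_samples, segment_samples, overlap=0.25):
    """Return (start, end) sample ranges covering the whole track.

    Consecutive chunks share roughly `overlap` of a segment, so a larger
    overlap means more chunks (slower) but smoother chunk boundaries.
    """
    if not 0 <= overlap < 1:
        raise ValueError("overlap must be in [0, 1)")
    stride = max(1, int(segment_samples * (1 - overlap)))
    return [
        (start, min(start + segment_samples, total_samples))
        for start in range(0, total_samples, stride)
    ]


# Tiny example: a 10-sample "track" cut into 4-sample segments.
for overlap in (0.25, 0.5):
    chunks = chunk_offsets(10, 4, overlap)
    print(overlap, len(chunks), chunks)
```

Shorter segments lower peak memory use, while a higher overlap increases the number of chunks the model has to process.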

Key Features

  • Hybrid transformer-convolution architecture achieves state-of-the-art separation quality
  • Simple CLI needs just one command to separate a track
  • Python API available for integration into audio processing pipelines
  • Multiple pre-trained models including the fine-tuned htdemucs_ft for best quality
  • Supports segment-based processing for long tracks with limited memory
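
For pipeline integration, one option is to call the same entry point the CLI uses from Python. The sketch below assumes the `demucs` package is installed and that `demucs.separate.main` accepts a list of CLI-style arguments, as the project README describes; the wrapper names `separation_args` and `separate_in_process` are hypothetical, and the segment length of 7 seconds is only an example value.

```python
# Minimal sketch: drive Demucs in-process with CLI-style arguments.

def separation_args(track, model="htdemucs_ft", segment=None, device=None):
    """Build the argument list (hypothetical helper, runs nothing)."""
    args = ["-n", model]                      # pretrained model name
    if segment is not None:
        args += ["--segment", str(segment)]   # shorter segments use less memory
    if device is not None:
        args += ["-d", device]                # e.g. "cpu" or "cuda"
    args.append(str(track))
    return args


def separate_in_process(args):
    """Run the separation; requires `pip install demucs`."""
    import demucs.separate  # imported lazily so the helper above works without it
    demucs.separate.main(args)


print(separation_args("song.mp3", segment=7, device="cpu"))
```

Calling `separate_in_process(separation_args("song.mp3"))` downloads the model on first use and then runs the separation.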

Comparison with Similar Tools

  • Spleeter: Deezer's open-source separator, faster but lower quality than Demucs
  • Open-Unmix: reference implementation for music separation, lightweight but less accurate
  • BSRNN: band-split recurrent network with competitive quality but less accessible
  • LALAL.AI: commercial separation service with good quality, no local deployment
  • Ultimate Vocal Remover (UVR): GUI tool that wraps multiple models including Demucs

FAQ

Q: How long does separation take? A: On a modern NVIDIA GPU, Demucs processes a 4-minute song in approximately 30-60 seconds. CPU processing takes 5-10 minutes for the same track.
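
The timings in this answer can be turned into a speed relative to real time with simple arithmetic. The helper name `realtime_factor` is hypothetical; the numbers are the ones quoted above, not new measurements.

```python
# Convert "track length vs. processing time" into a real-time factor.

def realtime_factor(track_seconds, processing_seconds):
    """Values above 1 mean faster than real time."""
    return track_seconds / processing_seconds


track = 4 * 60  # a 4-minute song, in seconds

print("GPU:", realtime_factor(track, 60), "to", realtime_factor(track, 30))
print("CPU:", realtime_factor(track, 600), "to", realtime_factor(track, 300))
```

By this measure the GPU figures correspond to roughly 4-8x real time, while the CPU figures fall below real time.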

Q: Can I separate stems other than the default four? A: The htdemucs_6s model provides 6 stems: vocals, drums, bass, guitar, piano, and other. Custom stem configurations require retraining.

Q: Does Demucs work on podcasts or speech audio? A: Demucs is optimized for music separation. For speech separation or noise removal, dedicated speech enhancement models may perform better.

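Q: What audio quality does Demucs output? A: The pretrained models operate at 44.1 kHz: input is resampled to that rate if needed, and stems are written as 16-bit WAV by default (the --int24, --float32, and --mp3 flags change the output format). For best results, use high-quality source files (WAV or FLAC at 44.1 kHz or higher).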
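
To check the sample rate, channel count, and bit depth of a source file or a separated stem, the standard-library `wave` module is enough for uncompressed PCM WAV files. This is a minimal sketch: `wav_info` and `write_silence` are hypothetical helpers, and the silent test file exists only so the example runs without any real audio.

```python
import os
import tempfile
import wave


def wav_info(path):
    """Return (sample_rate_hz, channels, bits_per_sample) for a PCM WAV file."""
    with wave.open(str(path), "rb") as wf:
        return wf.getframerate(), wf.getnchannels(), wf.getsampwidth() * 8


def write_silence(path, sample_rate=44100, channels=2, seconds=0.1):
    """Write a short 16-bit silent WAV file, used here as a stand-in for a stem."""
    n_frames = int(sample_rate * seconds)
    with wave.open(str(path), "wb") as wf:
        wf.setnchannels(channels)
        wf.setsampwidth(2)  # 2 bytes per sample = 16-bit
        wf.setframerate(sample_rate)
        wf.writeframes(b"\x00" * (n_frames * channels * 2))


with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.wav")
    write_silence(demo_path)
    print(wav_info(demo_path))
```

Note that `wave` does not read 32-bit float WAV, MP3, or FLAC files; those need an audio library such as soundfile or torchaudio.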
