Scripts · May 21, 2026 · 1 min read

AudioCraft — AI Audio Generation by Meta

AudioCraft is a PyTorch library from Meta Research providing code and pre-trained models for audio generation including music, sound effects, and audio compression.

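Agent Ready

This asset can be read and installed directly by an Agent

TokRepo provides a universal CLI command, an install contract, metadata JSON, adapter-specific install plans, and raw content links, so an Agent can assess fit, risk, and next steps.

Native · 98/100 · Policy: Allowed
Agent entry point: Any MCP/CLI Agent
Type: Skill
Install: Single
Trust level: Established
Entry: AudioCraft Overview
Universal CLI install command:
npx tokrepo install 8a0d7a57-54cb-11f1-9bc6-00163e2b0d79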

Introduction

AudioCraft is a unified framework from Meta Research that brings together state-of-the-art generative audio models. It includes MusicGen for text-to-music, AudioGen for text-to-sound-effects, and EnCodec for neural audio compression, all accessible through a clean Python API.

What AudioCraft Does

  • Generates music from text descriptions via MusicGen
  • Creates sound effects and ambient audio from text prompts via AudioGen
  • Compresses audio at very low bitrates with high quality via the EnCodec neural codec
  • Supports melody-conditioned generation to produce music following a given tune
  • Provides multiple model sizes from 300M to 3.3B parameters for different compute budgets

Architecture Overview

MusicGen and AudioGen each use a single-stage autoregressive transformer that operates on the discrete audio tokens produced by EnCodec. Where prior work cascades several generation stages, AudioCraft uses an efficient codebook interleaving pattern, so a single transformer predicts a token for every codebook stream at each step. EnCodec is a convolutional encoder-decoder with a residual vector quantization bottleneck; it compresses audio at bitrates as low as 1.5 kbps while maintaining perceptual quality.
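
The interleaving idea can be illustrated with a toy sketch. It assumes the "delay" layout, in which codebook k is shifted right by k steps so that each row holds one token per codebook. The function names and the PAD placeholder are hypothetical and are not part of the AudioCraft API.

```python
from typing import List, Optional

PAD = None  # hypothetical placeholder for positions that hold no token


def delay_interleave(codes: List[List[int]]) -> List[List[Optional[int]]]:
    """Turn codes[k][t] into rows[s][k], shifting codebook k right by k steps.

    Each row is what a single transformer step would predict:
    one token per codebook, each belonging to a different audio frame.
    """
    num_codebooks = len(codes)
    num_frames = len(codes[0])
    num_steps = num_frames + num_codebooks - 1
    rows: List[List[Optional[int]]] = []
    for step in range(num_steps):
        row: List[Optional[int]] = []
        for k in range(num_codebooks):
            frame = step - k  # codebook k lags k steps behind codebook 0
            row.append(codes[k][frame] if 0 <= frame < num_frames else PAD)
        rows.append(row)
    return rows


def delay_deinterleave(rows: List[List[Optional[int]]], num_frames: int) -> List[List[int]]:
    """Undo delay_interleave and recover the original codes[k][t] layout."""
    num_codebooks = len(rows[0])
    return [[rows[t + k][k] for t in range(num_frames)] for k in range(num_codebooks)]
```

With K codebooks and T frames, this layout needs T + K - 1 steps instead of the K × T steps required to flatten every codebook into one long sequence.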

Self-Hosting & Configuration

  • Install from PyPI with pip or clone the repository for development
  • Requires PyTorch 2.0+; a CUDA-capable GPU is strongly recommended, since generation on CPU is very slow
  • Small model (300M) runs on 4 GB VRAM; large model (3.3B) needs 16 GB+
  • Pre-trained weights download automatically from Hugging Face on first use
  • Gradio demo script included for a web-based generation interface
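
A minimal sketch of the setup above, assuming the audiocraft package is installed from PyPI. pick_checkpoint and generate_clips are hypothetical helpers. The VRAM thresholds are copied from the list above, which does not cover the medium checkpoint, so this sketch only chooses between small and large.

```python
from typing import List


def pick_checkpoint(vram_gb: float) -> str:
    """Choose a MusicGen checkpoint from the VRAM figures quoted above (an assumption)."""
    if vram_gb >= 16:
        return "facebook/musicgen-large"
    if vram_gb >= 4:
        return "facebook/musicgen-small"
    raise ValueError("less than 4 GB of VRAM: below the figure quoted for the small model")


def generate_clips(prompts: List[str], vram_gb: float, seconds: float = 8.0) -> List[str]:
    """Generate one clip per prompt and write clip_0.wav, clip_1.wav, ..."""
    # Imported lazily so pick_checkpoint works without audiocraft installed.
    from audiocraft.models import MusicGen
    from audiocraft.data.audio import audio_write

    # Weights are fetched from Hugging Face on first use.
    model = MusicGen.get_pretrained(pick_checkpoint(vram_gb))
    model.set_generation_params(duration=seconds)
    wavs = model.generate(prompts)  # tensor shaped [batch, channels, samples]

    stems = []
    for i, wav in enumerate(wavs):
        stem = f"clip_{i}"
        # audio_write appends the file extension and normalizes loudness.
        audio_write(stem, wav.cpu(), model.sample_rate, strategy="loudness")
        stems.append(stem)
    return stems
```

A call such as generate_clips(["lo-fi piano with soft rain"], vram_gb=8) would then load the small checkpoint and write clip_0.wav to the working directory.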

Key Features

  • Text-to-music generation with controllable duration up to 30 seconds
  • Melody conditioning allows music generation guided by a hummed or recorded tune
  • EnCodec neural codec achieves high-quality compression at 1.5-24 kbps
  • Single-stage transformer avoids cascaded model complexity
  • Stereo and mono generation supported across model sizes
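
The 1.5-24 kbps range follows from simple arithmetic on the residual vector quantizer. The sketch below assumes the commonly cited settings for the 24 kHz EnCodec model (75 latent frames per second, 1024 entries per codebook); these figures are not stated in this page, and rvq_bitrate_kbps is a hypothetical helper.

```python
import math


def rvq_bitrate_kbps(frame_rate_hz: float, num_codebooks: int, codebook_size: int) -> float:
    """Bitrate of a residual vector quantizer: frames/s × codebooks × bits per index."""
    bits_per_frame = num_codebooks * math.log2(codebook_size)
    return frame_rate_hz * bits_per_frame / 1000.0


# Assumed 24 kHz EnCodec settings: 75 frames/s, 1024-entry codebooks (10 bits each).
lowest = rvq_bitrate_kbps(75, 2, 1024)    # 2 codebooks
highest = rvq_bitrate_kbps(75, 32, 1024)  # 32 codebooks
print(lowest, highest)
```

Under these assumptions, adding or dropping codebooks is what moves the codec between bitrates, with each extra codebook costing 0.75 kbps.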

Comparison with Similar Tools

  • Stable Audio: commercial offering from Stability AI with longer outputs; the hosted models are closed, though a separate Stable Audio Open model has public weights
  • MusicLM: Google research model with strong quality but no public weights or code
  • Bark: generates speech, music, and effects but with less musical coherence than MusicGen
  • Riffusion: generates spectrogram images with Stable Diffusion and converts them to audio; creative but lower fidelity
  • AIVA: symbolic AI composer that outputs sheet music, a different paradigm from waveform generation

FAQ

Q: How long can generated audio clips be? A: MusicGen can generate clips up to 30 seconds. Longer compositions require chunked generation with overlap blending.
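
Overlap blending can be sketched as a linear crossfade between consecutive chunks. This is a generic illustration on plain lists of samples, not AudioCraft's own implementation. crossfade_concat is a hypothetical helper, and it assumes each chunk was generated so that its first `overlap` samples cover the same stretch of music as the previous chunk's last `overlap` samples.

```python
from typing import List, Sequence


def crossfade_concat(chunks: Sequence[Sequence[float]], overlap: int) -> List[float]:
    """Join chunks, linearly crossfading `overlap` samples at each boundary."""
    if overlap < 0:
        raise ValueError("overlap must be non-negative")
    out = list(chunks[0])
    for chunk in chunks[1:]:
        if len(chunk) < overlap or len(out) < overlap:
            raise ValueError("chunk shorter than the overlap region")
        head = len(out) - overlap  # where the overlap region starts in `out`
        for i in range(overlap):
            w = (i + 1) / (overlap + 1)  # weight of the incoming chunk, rising toward 1
            out[head + i] = (1.0 - w) * out[head + i] + w * chunk[i]
        out.extend(chunk[overlap:])  # append the part of the chunk past the overlap
    return out
```

For example, two 30-second chunks with a 5-second overlap would yield 55 seconds of audio, with the seam spread across the shared 5 seconds.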

Q: Can I fine-tune MusicGen on my own music dataset? A: Yes, AudioCraft includes training code for fine-tuning MusicGen on custom audio data with text descriptions.

Q: What audio formats are supported? A: AudioCraft works with waveform tensors internally; MusicGen generates audio at 32 kHz and AudioGen at 16 kHz. Output is typically written as WAV and can be saved to other formats supported by torchaudio.

Q: Does AudioCraft support real-time streaming generation? A: The current implementation generates audio offline. Real-time streaming is not natively supported but EnCodec can encode and decode in a streaming fashion.
