AI Music & Audio Generation Pack
Ten picks for the musician, podcaster, and creator generating music or sound with AI: Bark and AudioCraft for generation, Cartesia and Chatterbox for vocals, MuseScore and LMMS for arrangement, Tone.js and howler.js for the web, Demucs for source separation, Audacity for cleanup and mastering — in install order.
What's in this pack
This is the rig for a musician, podcaster, or game/web creator who wants to generate audio with AI and finish it in tools they control, rather than locking the master to a SaaS web app. Every pick is either fully open-source or has a real API (no copy-paste-from-website workflows). Nine of the ten are open-source: the libraries and models ship MIT-licensed code, the desktop apps (MuseScore, LMMS, Audacity) are GPL, and Cartesia is the one proprietary cloud API.
The stack covers all five stages of an audio production pipeline. You don't need every tool — pick the row that matches your output (music, voice, SFX, score, web playback) and chain through.
Install in this pipeline order
Stage 1 — Generate
- Bark — transformer text-to-audio by Suno's research team. Speech, music, background noise, and sound effects from text prompts in 12+ languages, with non-speech tags like [laughs] and [music]. MIT licensed, runs locally on ~12 GB VRAM. Start here when you want one model that does everything roughly.
- AudioCraft (MusicGen) — Meta's PyTorch library for music and sound effect generation. Higher musical coherence than Bark for instrumental tracks, conditioned on text prompts or melody. The right pick when you specifically want music, not voice.
- Cartesia Sonic TTS — state-space-model voice with 75 ms time-to-first-audio, 100+ voices, 5 s cloning, streaming WebSocket. Cloud API. Use this when you need real-time vocal delivery (live agents, fast iteration on lyric takes).
- Chatterbox — open-source TTS by Resemble AI with fine-grained control over prosody, emotion, expressiveness. The self-hosted alternative to Cartesia/ElevenLabs when you want lyric or narration vocals that don't sound like a GPS voice.
Stage 2 — Arrange
- MuseScore — free open-source notation. The bridge between generated MIDI/melody ideas and a real arrangement. Export to MIDI, MusicXML, audio.
- LMMS — free cross-platform DAW with built-in synths, beat sequencer, and effects chain. Where AI-generated stems become a song. The open alternative to FL Studio / Ableton when you don't want to pay $200 just to layer four tracks.
Stage 3 — Mix on the web (optional, for shippable creators)
- Tone.js — Web Audio framework for interactive music. Use it when your output isn't a WAV but an experience (generative web music, interactive loops, browser instruments).
- howler.js — cross-browser audio playback library. Pair with Tone.js (Tone for synthesis, Howler for playback of finished assets). Three-line API that solves every browser audio bug you'd otherwise spend a weekend on.
Stage 4 — Repair / Source Separation
- Demucs — AI music source separation by Meta. Splits any track into drums / bass / vocals / other. The vocal removal step (karaoke from anything, isolate AI-generated vocals from generated backing, fix bleed in mixes).
Stage 5 — Master & Export
- Audacity — the cross-platform audio editor that ships every podcast and YouTube voiceover on Earth. Noise reduction, normalization, EQ, limiter, export to MP3/WAV/FLAC. Boring on purpose — the master should be predictable.
How they chain together
Text prompt / lyrics
│
├─ Bark (any audio) ──┐
├─ MusicGen (music) ──┤
├─ Cartesia (voice) ──┼─→ stems (WAV)
└─ Chatterbox (voice) ┘
│
┌────────────────────┘
▼
MuseScore (score / MIDI ideas) → LMMS (DAW arrange + layer)
│
├─ Demucs (separate / extract stems if needed)
│
▼
Audacity (cleanup, EQ, limiter, master)
│
├─ WAV / MP3 → ship to Spotify / YouTube / podcast host
└─ Tone.js + howler.js → ship to a web page
The critical hinge is Stage 2 (LMMS) — without a DAW, generated stems stay one-shot novelties. With a DAW, four Bark/MusicGen takes become a real song with structure.
Tradeoffs you'll hit
- Bark vs MusicGen — Bark is broader (voice + music + SFX) but musically looser. MusicGen is narrower (instrumental music) but more coherent. If your output is songs, use MusicGen for backing and Bark or Cartesia for vocals. If your output is podcast intros, sound effects, or atmosphere, Bark alone is enough.
- Cartesia vs Chatterbox — Cartesia is fastest (75 ms TTF audio) and best-sounding, but cloud API with usage costs. Chatterbox is self-hosted with no per-request fee. Cartesia for production live agents; Chatterbox for batch vocal generation where latency doesn't matter.
- Tone.js vs howler.js — Tone.js synthesizes (oscillators, instruments, scheduling). Howler.js plays back finished files cross-browser. Most projects need both. If you're not generating audio at runtime, just use Howler.
- Demucs as offensive vs defensive tool — offensive: pull stems out of any reference track to study or remix. Defensive: separate AI-generated vocals from AI-generated backing when they share artifacts in the same render.
- Suno/Udio web UI vs this stack — Suno's web app is faster for a single 30-second meme. This stack wins the moment you want to iterate (regenerate just the chorus), own the master (no DRM, your WAV), or compose at scale (batch 50 prompts overnight).
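The "batch 50 prompts overnight" case above might look like the following sketch. It assumes the `audiocraft` package is installed and uses the `facebook/musicgen-small` checkpoint; the prompt text, the `takes/` folder, and the helper names `plan_takes` and `render` are hypothetical. Planning is plain Python, and AudioCraft is imported lazily so the plan can be inspected on a machine without a GPU.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Take:
    prompt: str
    index: int        # which variation of the prompt this is
    stem_name: Path   # output path WITHOUT extension (audio_write appends it)


def plan_takes(prompts, variations, out_dir="takes"):
    """Expand each prompt into `variations` numbered takes with stable filenames."""
    takes = []
    for p_idx, prompt in enumerate(prompts):
        for v_idx in range(variations):
            name = f"p{p_idx:02d}_v{v_idx:02d}"
            takes.append(Take(prompt, v_idx, Path(out_dir) / name))
    return takes


def render(takes, seconds=8, batch_size=4):
    """Render takes with MusicGen. Needs `audiocraft` and, realistically, a GPU."""
    # Imported here so planning works on machines without AudioCraft installed.
    from audiocraft.models import MusicGen
    from audiocraft.data.audio import audio_write

    model = MusicGen.get_pretrained("facebook/musicgen-small")
    model.set_generation_params(duration=seconds)
    for start in range(0, len(takes), batch_size):
        batch = takes[start:start + batch_size]
        wavs = model.generate([t.prompt for t in batch])  # one waveform per prompt
        for take, wav in zip(batch, wavs):
            take.stem_name.parent.mkdir(parents=True, exist_ok=True)
            audio_write(str(take.stem_name), wav.cpu(), model.sample_rate,
                        strategy="loudness")


if __name__ == "__main__":
    jobs = plan_takes(["lo-fi chorus, 90 bpm, warm Rhodes"], variations=4)
    print(len(jobs), "takes planned; first file:", jobs[0].stem_name)
    # render(jobs)  # uncomment on a machine with AudioCraft installed
```

Stable, numbered filenames are the point of the design: they let you regenerate only the takes you disliked and keep the rest for the DAW session.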
Common pitfalls
- Bark VRAM — full model needs ~12 GB VRAM. On 8 GB GPUs set SUNO_USE_SMALL_MODELS=True. CPU mode works but is 10× slower.
- AudioCraft license confusion — MusicGen weights are CC-BY-NC for some checkpoints. Read the model card before you ship commercially.
- Demucs is slow on CPU — a 4-minute song takes ~3 minutes on CPU, ~20 seconds on a 3060. Batch overnight on CPU; interactive only with GPU.
- Audacity loudness war — don't push the limiter past -1 dBTP. Spotify normalizes playback loudness (to around -14 LUFS by default), so a master crushed for volume gets turned down anyway and keeps only the distortion.
- Cartesia streaming + browser — WebSocket audio chunks need careful buffering; use Tone.js or Howler.js for client-side playback rather than raw <audio> tags.
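The -1 dBTP ceiling from the Audacity pitfall can be sanity-checked on an exported file before upload. This sketch uses only the Python standard library to read a 16-bit PCM WAV, report its sample peak in dBFS, and compute the gain needed to land on a target ceiling. One assumption to keep in mind: sample peak is only a lower bound on true peak (dBTP), which needs oversampling to measure, so a pass here is necessary rather than sufficient. The function names are hypothetical.

```python
import array
import math
import sys
import wave


def sample_peak_dbfs(path):
    """Return the sample peak of a 16-bit PCM WAV in dBFS (0.0 = full scale)."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        frames = wav.readframes(wav.getnframes())
    samples = array.array("h", frames)  # signed 16-bit, channels interleaved
    if sys.byteorder == "big":
        samples.byteswap()              # WAV data is little-endian
    peak = max((abs(s) for s in samples), default=0)
    if peak == 0:
        return float("-inf")            # digital silence
    return 20 * math.log10(peak / 32768)


def gain_to_ceiling_db(peak_dbfs, ceiling_db=-1.0):
    """Gain in dB that puts the peak on the ceiling (negative means turn down)."""
    return ceiling_db - peak_dbfs
```

A negative result from `gain_to_ceiling_db` means the export is over the ceiling; lowering the limiter's output level in Audacity by at least that amount is the matching fix.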
Frequently asked questions
Can I actually replace Suno or Udio with this stack?
For one-off 30-second clips, no — Suno's web app is faster. For everything else (iterating just the chorus, owning the master file, batch generating 50 takes, fine-tuning vocals separately from backing), yes. The stack gives you a producer's workflow instead of a slot-machine UI. MusicGen and Bark together cover the generation surface; LMMS gives you the arrangement layer Suno's UI hides; Demucs lets you pull stems Suno never exposes.
Which voice model should I pick for AI singing?
None of these are tuned for singing specifically; they're speech models. For sung vocals, Bark with the right voice preset and lyrics wrapped in ♪ symbols (the marker Bark's README documents for song lyrics) is the loose creative option. Cartesia and Chatterbox produce more controlled but distinctly spoken-sounding output; you can pitch-shift them in LMMS to fake melody but the result feels like talking through autotune. Real AI singing today still routes through Suno's hosted model. This pack is honest about that gap.
What's the minimum hardware to run the local-only path?
Apple Silicon Mac (M1+) or a desktop with 12 GB VRAM (RTX 3060 or better) runs Bark, MusicGen, Demucs, and Chatterbox locally at usable speeds. On 8 GB cards use small-model flags. CPU-only is possible for all four but expect 10× slower generation — fine for overnight batches, painful for iteration.
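The small-model fallback mentioned above is controlled by environment variables that Bark reads when it is imported, so they have to be set before `import bark`. A minimal sketch, assuming the `bark` package from Suno's repository and `scipy` are installed. The 12 GB threshold comes from this page; enabling `SUNO_OFFLOAD_CPU` below 8 GB is an assumption made for illustration, and the helper names are hypothetical.

```python
import os


def bark_env_for(vram_gb):
    """Pick Bark environment flags for a GPU with `vram_gb` of memory."""
    env = {}
    if vram_gb < 12:
        env["SUNO_USE_SMALL_MODELS"] = "True"   # smaller checkpoints, lower quality
    if vram_gb < 8:
        # Assumed threshold: keep idle sub-models on the CPU to save VRAM.
        env["SUNO_OFFLOAD_CPU"] = "True"
    return env


def generate(text, out_path, vram_gb):
    """Generate audio with Bark. Needs the `bark` and `scipy` packages."""
    os.environ.update(bark_env_for(vram_gb))    # must happen before importing bark
    from bark import SAMPLE_RATE, generate_audio, preload_models
    from scipy.io.wavfile import write as write_wav

    preload_models()
    audio = generate_audio(text)                # NumPy array sampled at SAMPLE_RATE
    write_wav(out_path, SAMPLE_RATE, audio)
```

A call such as `generate("Welcome back [laughs] to the show.", "intro.wav", vram_gb=8)` would then run with the small checkpoints on an 8 GB card.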
How do I get clean stems out of AI-generated music?
Generate four short variations of the same prompt in MusicGen, run each through Demucs to separate drums / bass / vocals / other, then re-layer the best parts in LMMS. This is the cheat code: generation models give you mediocre full mixes, but Demucs lets you cherry-pick the one good drum line from take 3 and the bass from take 1. Cleaner than re-rolling for hours hoping the whole take lands.
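The four-takes workflow above can be scripted. This sketch builds one Demucs command per take and computes where each stem should land, assuming the `demucs` CLI is installed, its default `htdemucs` model, and its usual `<out>/<model>/<track>/<stem>.wav` output layout. The take file names and the `picks` mapping are hypothetical.

```python
import subprocess
from pathlib import Path

STEMS = ("drums", "bass", "vocals", "other")


def demucs_command(track, out_dir="separated", model="htdemucs", device=None):
    """Build the CLI call that splits one track into four stems."""
    cmd = ["demucs", "-n", model, "-o", str(out_dir)]
    if device:                       # e.g. "cpu" or "cuda"
        cmd += ["-d", device]
    return cmd + [str(track)]


def stem_path(track, stem, out_dir="separated", model="htdemucs"):
    """Where Demucs is expected to write a given stem for a given track."""
    if stem not in STEMS:
        raise ValueError(f"unknown stem: {stem}")
    return Path(out_dir) / model / Path(track).stem / f"{stem}.wav"


def separate_all(tracks, **kwargs):
    """Run Demucs on every take, one after another."""
    for track in tracks:
        subprocess.run(demucs_command(track, **kwargs), check=True)


if __name__ == "__main__":
    takes = [f"takes/take{i}.wav" for i in range(1, 5)]
    # separate_all(takes)  # uncomment once the takes exist and Demucs is installed
    picks = {"drums": takes[2], "bass": takes[0]}  # drums from take 3, bass from take 1
    for stem, take in picks.items():
        print(stem, "->", stem_path(take, stem))
```

The printed paths are the files to drag into LMMS; keeping the choice in a small mapping makes it easy to swap one stem without re-running the separation.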
Do I need both Tone.js and howler.js?
Only if you're shipping audio to a website. Howler.js is for playing finished files (your mastered WAV from Audacity) with reliable cross-browser autoplay handling. Tone.js is for synthesizing or sequencing audio in the browser (generative music, interactive instruments). Static music site: Howler only. Generative web instrument: both — Tone synthesizes, Howler plays back any baked samples.