TOKREPO · ARSENAL

Stable

AI Music & Audio Generation Pack

Ten picks for the musician, podcaster, and creator generating music or sound with AI: Bark and AudioCraft for generation, Cartesia and Chatterbox for vocals, MuseScore and LMMS for arrangement, Tone.js and howler.js for the web, Demucs for source separation, Audacity for cleanup and mastering — in install order.

10 assets

What's in this pack

This is the rig for a musician, podcaster, or game/web creator who wants to generate audio with AI and finish it in tools they control, rather than locking the master into a SaaS web app. Every pick is either fully open-source or has a real API (no copy-paste-from-website workflows). Nine of ten are open-source under MIT or GPL licenses; Cartesia is the one proprietary cloud API.

The stack covers all five stages of an audio production pipeline. You don't need every tool: pick the tools that match your output (music, voice, SFX, score, web playback) and chain them together.

Install in this pipeline order

Stage 1 — Generate

Bark — transformer text-to-audio model from Suno's research team. Generates speech, music, background noise, and sound effects from text prompts in 12+ languages, with non-speech tags like [laughs] and [music]. MIT licensed, runs locally on ~12 GB VRAM. Start here when you want one model that covers everything, if only roughly.
AudioCraft (MusicGen) — Meta's PyTorch library for music and sound effect generation. Higher musical coherence than Bark for instrumental tracks, conditioned on text prompts or melody. The right pick when you specifically want music, not voice.
Cartesia Sonic TTS — state-space-model voice with 75 ms time-to-first-audio, 100+ voices, 5 s cloning, streaming WebSocket. Cloud API. Use this when you need real-time vocal delivery (live agents, fast iteration on lyric takes).
Chatterbox — open-source TTS by Resemble AI with fine-grained control over prosody, emotion, expressiveness. The self-hosted alternative to Cartesia/ElevenLabs when you want lyric or narration vocals that don't sound like a GPS voice.
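
A minimal sketch of a single Bark run, assuming the bark package and scipy are installed as described in Bark's README. The helper build_prompt, the tag whitelist, and the output filename are hypothetical and only there for illustration. The small-model switch is an environment variable that Bark reads at import time, which is why the import sits inside the function.

```python
import os

# Non-speech tags mentioned above; Bark reads them as cues inside the prompt.
NON_SPEECH_TAGS = ("[laughs]", "[music]")


def build_prompt(text: str, tags: tuple[str, ...] = ()) -> str:
    """Prefix a text prompt with non-speech tags (hypothetical helper)."""
    unknown = [t for t in tags if t not in NON_SPEECH_TAGS]
    if unknown:
        raise ValueError(f"tags not covered by this sketch: {unknown}")
    return " ".join([*tags, text.strip()])


def render(prompt: str, out_path: str, small_models: bool = False) -> None:
    """Generate one clip with Bark and write it to a WAV file."""
    if small_models:
        # Has to be set before bark is imported to take effect.
        os.environ["SUNO_USE_SMALL_MODELS"] = "True"
    # Imported lazily so build_prompt works without Bark installed.
    from bark import SAMPLE_RATE, generate_audio, preload_models
    from scipy.io.wavfile import write as write_wav

    preload_models()  # downloads and caches the weights on first use
    audio = generate_audio(prompt)  # NumPy float array at SAMPLE_RATE
    write_wav(out_path, SAMPLE_RATE, audio)


if __name__ == "__main__":
    render(build_prompt("Welcome back to the show.", ("[music]",)), "intro.wav")
```

Pass small_models=True on cards with less VRAM than the full model needs.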

Stage 2 — Arrange

MuseScore — free open-source notation. The bridge between generated MIDI/melody ideas and a real arrangement. Export to MIDI, MusicXML, audio.
LMMS — free cross-platform DAW with built-in synths, beat sequencer, and effects chain. Where AI-generated stems become a song. The open alternative to FL Studio / Ableton when you don't want to pay $200 just to layer four tracks.

Stage 3 — Mix on the web (optional, for creators shipping to the browser)

Tone.js — Web Audio framework for interactive music. Use it when your output isn't a WAV but an experience (generative web music, interactive loops, browser instruments).
howler.js — cross-browser audio playback library. Pair with Tone.js (Tone for synthesis, Howler for playback of finished assets). Its small API handles the common browser audio headaches, such as codec fallback and mobile autoplay unlocking, that you'd otherwise spend a weekend on.

Stage 4 — Repair / Source Separation

Demucs — AI music source separation by Meta. Splits any track into drums / bass / vocals / other. This is the vocal-removal step: make karaoke from anything, isolate AI-generated vocals from generated backing, or fix bleed in mixes.
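
The separation step can be scripted by calling the Demucs command line from Python. This is a sketch that assumes Demucs is installed in the current environment; the function name demucs_command and the file names are hypothetical, while -o, -d, and --two-stems are options documented in the Demucs README.

```python
import subprocess
import sys
from pathlib import Path


def demucs_command(track: str, out_dir: str = "separated",
                   vocals_only: bool = False, device: str = "") -> list[str]:
    """Build a Demucs CLI invocation, run as `python -m demucs`."""
    cmd = [sys.executable, "-m", "demucs", "-o", out_dir]
    if vocals_only:
        # Two-stem mode: writes vocals.wav and no_vocals.wav only.
        cmd.append("--two-stems=vocals")
    if device:
        cmd += ["-d", device]  # for example "cpu" or "cuda"
    cmd.append(str(Path(track)))
    return cmd


if __name__ == "__main__":
    # Karaoke-style split of one file; raises if Demucs exits with an error.
    subprocess.run(demucs_command("song.mp3", vocals_only=True), check=True)
```

Leaving vocals_only off yields the full drums / bass / vocals / other split.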

Stage 5 — Master & Export

Audacity — the cross-platform audio editor behind countless podcasts and YouTube voiceovers. Noise reduction, normalization, EQ, limiter, export to MP3/WAV/FLAC. Boring on purpose: the master should be predictable.

How they chain together

Text prompt / lyrics
   │
   ├─ Bark (any audio) ──┐
   ├─ MusicGen (music) ──┤
   ├─ Cartesia (voice) ──┼─→ stems (WAV)
   └─ Chatterbox (voice) ┘
                          │
     ┌────────────────────┘
     ▼
MuseScore (score / MIDI ideas) → LMMS (DAW arrange + layer)
     │
     ├─ Demucs (separate / extract stems if needed)
     │
     ▼
Audacity (cleanup, EQ, limiter, master)
     │
     ├─ WAV / MP3 → ship to Spotify / YouTube / podcast host
     └─ Tone.js + howler.js → ship to a web page

The critical hinge is Stage 2 (LMMS) — without a DAW, generated stems stay one-shot novelties. With a DAW, four Bark/MusicGen takes become a real song with structure.

Tradeoffs you'll hit

Bark vs MusicGen — Bark is broader (voice + music + SFX) but musically looser. MusicGen is narrower (instrumental music) but more coherent. If your output is songs, use MusicGen for backing and Bark or Cartesia for vocals. If your output is podcast intros, sound effects, or atmosphere, Bark alone is enough.
Cartesia vs Chatterbox — Cartesia is the fastest (75 ms time-to-first-audio) and the best-sounding, but it is a cloud API with usage costs. Chatterbox is self-hosted with no per-request fee. Use Cartesia for production live agents and Chatterbox for batch vocal generation where latency doesn't matter.
Tone.js vs howler.js — Tone.js synthesizes (oscillators, instruments, scheduling). Howler.js plays back finished files cross-browser. Most projects need both. If you're not generating audio at runtime, just use Howler.
Demucs as offensive vs defensive tool — offensive: pull stems out of any reference track to study or remix. Defensive: separate AI-generated vocals from AI-generated backing when they share artifacts in the same render.
Suno/Udio web UI vs this stack — Suno's web app is faster for a single 30-second meme. This stack wins the moment you want to iterate (regenerate just the chorus), own the master (no DRM, your WAV), or compose at scale (batch 50 prompts overnight).
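
The "batch 50 prompts overnight" case can be sketched as a job planner plus a resumable loop. This assumes the audiocraft package is installed; plan_batch, run_batch, the file naming scheme, and the choice of the facebook/musicgen-small checkpoint are illustrative assumptions, not part of the pack.

```python
from pathlib import Path


def plan_batch(prompts: list[str], takes: int, out_dir: str) -> list[tuple[str, Path]]:
    """Pair every prompt with `takes` output stems, e.g. out/p000_t1."""
    root = Path(out_dir)
    return [
        (prompt, root / f"p{p:03d}_t{t}")
        for p, prompt in enumerate(prompts)
        for t in range(takes)
    ]


def run_batch(jobs: list[tuple[str, Path]], duration_s: float = 8.0,
              checkpoint: str = "facebook/musicgen-small") -> None:
    """Render each planned job with MusicGen, skipping finished files."""
    # Imported lazily so plan_batch works without AudioCraft installed.
    from audiocraft.data.audio import audio_write
    from audiocraft.models import MusicGen

    model = MusicGen.get_pretrained(checkpoint)
    model.set_generation_params(duration=duration_s)
    for prompt, stem in jobs:
        if stem.with_suffix(".wav").exists():
            continue  # lets an interrupted overnight run resume
        stem.parent.mkdir(parents=True, exist_ok=True)
        wav = model.generate([prompt])  # tensor: [batch, channels, samples]
        # audio_write appends the .wav suffix to the stem name itself.
        audio_write(str(stem), wav[0].cpu(), model.sample_rate, strategy="loudness")


if __name__ == "__main__":
    run_batch(plan_batch(["lo-fi piano loop", "80s synthwave chorus"], 4, "takes"))
```

Planning the file names up front is a design choice: it keeps reruns idempotent and makes it easy to regenerate only the takes you deleted.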

Common pitfalls

Bark VRAM — full model needs ~12 GB VRAM. On 8 GB GPUs set SUNO_USE_SMALL_MODELS=True. CPU mode works but is 10× slower.
AudioCraft license confusion — the AudioCraft code is MIT-licensed, but the released MusicGen weights are CC-BY-NC 4.0. Read the model card before you ship commercially.
Demucs is slow on CPU — a 4-minute song takes ~3 minutes on CPU, ~20 seconds on a 3060. Batch overnight on CPU; interactive only with GPU.
Audacity loudness war — keep the limiter ceiling at or below -1 dBTP. Spotify's loudness normalization turns loud masters down anyway, so pushing harder adds distortion without adding volume.
Cartesia streaming + browser — WebSocket audio chunks need careful buffering; use Tone.js or Howler.js for client-side playback rather than raw <audio> tags.
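
The -1 dB ceiling can be sanity-checked on an exported file with the standard library alone. This sketch measures sample peak in dBFS on 16-bit PCM WAV files only; true peak (dBTP) requires oversampling and can read higher, so treat the result as a lower bound. The function names and master.wav are hypothetical.

```python
import math
import struct
import wave


def peak_dbfs(samples, full_scale: int = 32768) -> float:
    """Sample peak relative to full scale, in dBFS (-inf for silence)."""
    peak = max((abs(s) for s in samples), default=0)
    if peak == 0:
        return float("-inf")
    return 20 * math.log10(peak / full_scale)


def read_pcm16(path: str) -> tuple[int, ...]:
    """Read all samples (channels interleaved) from a 16-bit PCM WAV."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        raw = wav.readframes(wav.getnframes())
    return struct.unpack(f"<{len(raw) // 2}h", raw)


def within_ceiling(path: str, ceiling_db: float = -1.0) -> bool:
    """True if the file's sample peak sits at or below the ceiling."""
    return peak_dbfs(read_pcm16(path)) <= ceiling_db


if __name__ == "__main__":
    print("master.wav under ceiling:", within_ceiling("master.wav"))
```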

INSTALL · ONE COMMAND

$ tokrepo install pack/ai-music-audio-generation

hand it to your agent — or paste it in your terminal

What's inside

10 assets in this pack

Skill#01

Bark — AI Text-to-Audio with Music & Effects

Bark is a transformer text-to-audio model by Suno that generates speech, music, and sound effects. 39.1K+ GitHub stars. 12+ languages, 100+ voice presets, non-speech audio. MIT licensed.

by Script Depot·359 views

$ tokrepo install bark-ai-text-audio-music-effects-814b8972

Skill#02

AudioCraft — AI Audio Generation by Meta

AudioCraft is a PyTorch library from Meta Research providing code and pre-trained models for audio generation including music, sound effects, and audio compression.

by Script Depot·129 views

$ tokrepo install audiocraft-ai-audio-generation-meta-8a0d7a57

Skill#03

Cartesia Sonic TTS — 75ms Time-to-First-Audio

Cartesia Sonic is a state-space-model TTS with 75ms time-to-first-audio. 100+ voices, 5s cloning, streaming WebSocket. Lowest-latency TTS.

by Cartesia·224 views

$ tokrepo install cartesia-sonic-tts-75ms-time-to-first-audio

Skill#04

Chatterbox — State-of-the-Art Open Source Text-to-Speech

A high-quality open-source TTS model by Resemble AI that delivers natural-sounding speech with fine-grained control over prosody, emotion, and expressiveness.

by Script Depot·259 views

$ tokrepo install chatterbox-state-art-open-source-text-speech-a6af5d44

Skill#05

MuseScore — Free Open Source Music Notation Software

MuseScore is a free, open-source music notation application for composing, arranging, and engraving sheet music. It runs on Windows, macOS, and Linux, supports MusicXML import/export, MIDI playback, and produces professional-quality scores.

by AI Open Source·265 views

$ tokrepo install musescore-free-open-source-music-notation-software-7185dcfa

Skill#06

LMMS — Free Cross-Platform Digital Audio Workstation

LMMS (Linux MultiMedia Studio) is a free, open-source digital audio workstation for music production. It includes synthesizers, sample playback, beat sequencing, and an effects chain, providing a complete environment for creating music without any cost.

by Script Depot·258 views

$ tokrepo install lmms-free-cross-platform-digital-audio-workstation-c9a9b225

Skill#07

Tone.js — Web Audio Framework for Interactive Music

A TypeScript framework built on the Web Audio API that provides scheduling, synthesis, and effects for creating interactive music in the browser.

by Script Depot·195 views

$ tokrepo install tone-js-web-audio-framework-interactive-music-09935623

Script#08

howler.js — Cross-Browser Audio Library for the Web

A JavaScript audio library that provides a simple, consistent API for playing sound in any browser using the Web Audio API with HTML5 Audio fallback.

by AI Open Source·109 views

$ tokrepo install howler-js-cross-browser-audio-library-web-d9fc60d5

Skill#09

Demucs — AI-Powered Music Source Separation

Demucs is a state-of-the-art music source separation model from Meta Research that splits audio tracks into vocals, drums, bass, and other instrument stems.

by Script Depot·167 views

$ tokrepo install demucs-ai-powered-music-source-separation-d9e3e25f

Skill#10

Audacity — Free Cross-Platform Audio Editor

Audacity is a free, open-source digital audio editor and recorder for Windows, macOS, and Linux. It supports multi-track editing, a wide range of audio formats, real-time effects, and plugin extensibility for recording, editing, and mastering audio.

by AI Open Source·199 views

$ tokrepo install audacity-free-cross-platform-audio-editor-44f450b6

FAQ

Frequently asked questions

Can I actually replace Suno or Udio with this stack?

For one-off 30-second clips, no — Suno's web app is faster. For everything else (iterating just the chorus, owning the master file, batch generating 50 takes, fine-tuning vocals separately from backing), yes. The stack gives you a producer's workflow instead of a slot-machine UI. MusicGen and Bark together cover the generation surface; LMMS gives you the arrangement layer Suno's UI hides; Demucs lets you pull stems Suno never exposes.

Which voice model should I pick for AI singing?

None of these are tuned for singing specifically; they're speech models. For sung vocals, Bark with the right voice preset and lyrics wrapped in ♪ markers is the loose creative option. Cartesia and Chatterbox produce more controlled but distinctly spoken-sounding output; you can pitch-shift them in LMMS to fake a melody, but the result feels like talking through autotune. Real AI singing today still routes through Suno's hosted model. This pack is honest about that gap.

What's the minimum hardware to run the local-only path?

Apple Silicon Mac (M1+) or a desktop with 12 GB VRAM (RTX 3060 or better) runs Bark, MusicGen, Demucs, and Chatterbox locally at usable speeds. On 8 GB cards use small-model flags. CPU-only is possible for all four but expect 10× slower generation — fine for overnight batches, painful for iteration.

How do I get clean stems out of AI-generated music?

Generate four short variations of the same prompt in MusicGen, run each through Demucs to separate drums / bass / vocals / other, then re-layer the best parts in LMMS. This is the cheat code: generation models give you mediocre full mixes, but Demucs lets you cherry-pick the one good drum line from take 3 and the bass from take 1. Cleaner than re-rolling for hours hoping the whole take lands.
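
The cherry-picking step can be sketched as a lookup from each stem to the take it should come from. This assumes Demucs's default output layout of out_dir/model/track/stem.wav and the htdemucs model name; pick_stems and the take names are hypothetical, so check the paths against your own separated folder before importing into LMMS.

```python
from pathlib import Path

STEMS = ("drums", "bass", "vocals", "other")


def pick_stems(choices: dict[str, str], out_dir: str = "separated",
               model: str = "htdemucs") -> dict[str, Path]:
    """Map each chosen stem to the WAV file from the take picked for it."""
    unknown = [stem for stem in choices if stem not in STEMS]
    if unknown:
        raise ValueError(f"unknown stems: {unknown}")
    return {
        stem: Path(out_dir) / model / take / f"{stem}.wav"
        for stem, take in choices.items()
    }


if __name__ == "__main__":
    # Drum line from take 3, bass from take 1, as in the answer above.
    for stem, path in pick_stems({"drums": "take3", "bass": "take1"}).items():
        print(f"{stem}: {path}")
```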

Do I need both Tone.js and howler.js?

Only if you're shipping audio to a website. Howler.js is for playing finished files (your mastered WAV from Audacity) with reliable cross-browser autoplay handling. Tone.js is for synthesizing or sequencing audio in the browser (generative music, interactive instruments). Static music site: Howler only. Generative web instrument: both — Tone synthesizes, Howler plays back any baked samples.
