Stable

Voice Cloning + Podcast — One Person Runs the Whole Show

Ten picks for the indie podcaster, voice actor, or YouTuber running a whole show solo — Audacity for capture and noise cleanup, Whisper / whisper.cpp for transcription, ElevenLabs / OpenVoice / GPT-SoVITS / Fish Speech / Coqui TTS for voice clone and multilingual dubbing, KrillinAI for one-click 100-language video dub, VideoCaptioner for subtitle baking. Recording → cleanup → clone → dub → publish, in one rig.

10 assets

About this pack

What's in this pack

This is the rig an indie podcaster, voice actor, or YouTuber would build to run a whole show without a producer, sound engineer, or translation agency. Ten picks, in an opinionated order, and every one of them is either open-source or has a serious free tier. The point is not "all the tools that exist"; it is "the smallest set that lets one person record on Monday and ship a localized, captioned, denoised, voice-cloned cut by Friday".

Five layers, with more than one pick wherever there's a real tradeoff:

Layer	Picks	Why
1. Record + clean	Audacity	Free audio editor. Records multi-track, removes hiss/breath/click, exports anything.
2. Transcribe	Whisper (cloud) · whisper.cpp (local)	Cloud Whisper for highest accuracy; whisper.cpp for offline / sensitive / batch / mobile.
3. Clone your voice	ElevenLabs · OpenVoice · GPT-SoVITS	ElevenLabs = top fidelity, paid. OpenVoice = instant tone+style clone, MIT. GPT-SoVITS = few-shot clone you self-host.
4. Dub into other languages	Fish Speech · Coqui TTS · KrillinAI	Fish Speech does 80+ languages. Coqui TTS = pluggable engine. KrillinAI takes a video file and dubs the whole thing in one click.
5. Caption + ship	VideoCaptioner	Burns word-level subtitles into vertical cuts for TikTok / Reels / Shorts.

The pack is sized for one operator. If you're running a 3-person podcast network with editors, swap Audacity for Reaper / Adobe Audition (paid), swap KrillinAI for a human translation pass, and add a publish/scheduling tool. For everyone else, this is the rig.

Install in this order

Do NOT install the voice clone tools first. You need a clean recording before cloning gives a usable result.

# Stage 1 — capture and clean (Monday)
tokrepo install audacity

# Stage 2 — get a transcript so you can edit by text, not by waveform (Monday night)
tokrepo install whisper-cpp        # local, free, ~5x realtime on M-series
# OR
tokrepo install whisper            # OpenAI API, highest accuracy

# Stage 3 — clone your own voice (Tuesday — you only do this once)
tokrepo install elevenlabs-python-sdk  # 3 min of clean audio → studio-grade clone
# OR — if you want to self-host / not pay per character
tokrepo install openvoice              # instant clone, MIT
tokrepo install gpt-sovits             # few-shot, GPU recommended

# Stage 4 — dub a clip into other languages (Wednesday)
tokrepo install fish-speech            # multilingual TTS, 80+ languages
tokrepo install coqui-tts              # self-hosted alternative
tokrepo install krillinai              # full-video dub, subtitles+voice, one command

# Stage 5 — publish (Thursday)
tokrepo install videocaptioner         # burn animated captions for social cuts

The TokRepo CLI drops each asset as a skill file in your repo. Claude Code / Cursor / Codex CLI read the skill and can wire up the script for you: "take episode-12.wav, denoise it with a scripted Audacity pass, transcribe with whisper.cpp, dub the first 60 seconds into Spanish with KrillinAI, burn captions with VideoCaptioner, output ep12-es.mp4" becomes a single agent prompt.

How they fit together

[ Mic / Riverside / Zoom recording ]
             │
             ▼
   ┌─────────────────────┐
   │ Audacity            │  noise gate, EQ, normalize, click removal
   └─────────────────────┘
             │  clean WAV
             ▼
   ┌───────────────────────┐
   │ Whisper / whisper.cpp │  transcript + word timestamps
   └───────────────────────┘
             │  edit by deleting text, not waveform
             ▼
   ┌─────────────────────────────────┐
   │ Voice clone (one of):          │
   │   ElevenLabs · OpenVoice ·     │  → your-voice model
   │   GPT-SoVITS                   │
   └─────────────────────────────────┘
             │
             ├──► re-record a flub: type the line, your-voice speaks it
             │
             ▼
   ┌─────────────────────────────────┐
   │ Multilingual dub (one of):     │
   │   Fish Speech (TTS engine) ·   │
   │   Coqui TTS · KrillinAI        │  → ES / JA / DE / FR audio track
   │   (full video pipeline)        │
   └─────────────────────────────────┘
             │
             ▼
   ┌─────────────────────┐
   │ VideoCaptioner      │  word-by-word burned captions, vertical cut
   └─────────────────────┘
             │
             ▼
   [ YouTube / Spotify / TikTok / Reels / Shorts ]

The big unlock here is editing by transcript, not by waveform. Once Whisper gives you a word-timestamped transcript, removing an "um" or "uh" becomes deleting a word from a text file and re-rendering. That is where most of the speed-up comes from: not the cloning, not the dubbing, but never having to scrub through a 90-minute waveform.
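A minimal sketch of edit-by-transcript, assuming the transcript is already available as a list of words with start and end times in seconds. The `Word` type, the `FILLERS` set, and `keep_ranges` are hypothetical names for illustration, not part of Whisper or any tool in this pack. The function only computes which time ranges to keep; rendering the cut is left to your editor or FFmpeg.

```python
from typing import List, NamedTuple, Tuple


class Word(NamedTuple):
    text: str
    start: float  # seconds from the start of the recording
    end: float    # seconds from the start of the recording


# Hypothetical filler list; extend it with whatever you habitually say.
FILLERS = {"um", "uh"}


def keep_ranges(words: List[Word]) -> List[Tuple[float, float]]:
    """Return (start, end) ranges to keep after dropping filler words.

    Consecutive kept words are merged into one range, so natural pauses
    between them survive. A filler word closes the current range.
    """
    ranges: List[Tuple[float, float]] = []
    extend_last = False
    for w in words:
        if w.text.lower().strip(".,!?") in FILLERS:
            extend_last = False  # next kept word starts a new range
            continue
        if extend_last:
            ranges[-1] = (ranges[-1][0], w.end)
        else:
            ranges.append((w.start, w.end))
        extend_last = True
    return ranges
```

Each returned range maps to one segment to copy from the clean WAV; concatenating the segments in order gives the edited take.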

Tradeoffs you'll hit

ElevenLabs vs OpenVoice vs GPT-SoVITS for cloning your own voice. ElevenLabs is the fidelity ceiling: 3 minutes of clean audio gets you a clone friends can't tell apart, but it costs $5–$330/month plus character overage, and your voice model lives on their servers. OpenVoice is MIT-licensed and runs on a consumer GPU; quality is "good enough for podcast intros, not narration". GPT-SoVITS is the strongest open option but needs a fine-tune pass per voice. Pick ElevenLabs for the fastest result, and OpenVoice or GPT-SoVITS if licensing or recurring cost matters.
Cloud Whisper vs whisper.cpp. Cloud is the most accurate, especially on Chinese/Japanese/proper nouns. whisper.cpp runs on a MacBook with no internet, no per-minute cost, no data leaving your machine. Podcasts with named guests → cloud. Locked-down corporate / journalism with sources → local.
KrillinAI vs DIY (Fish Speech + Coqui). KrillinAI takes a video file and gives you the same video in a new language, lips kind of synced, subtitles included — one command. The DIY path (extract audio → transcribe → translate → re-TTS → mux back in) gives you control over each step but is 5x the integration work. Use KrillinAI for first pass; drop down to DIY when one step needs tuning.
Multilingual fidelity reality check. Chinese/Japanese/Korean clones from English-trained voice models will sound "foreign-accented". Fish Speech is the strongest multilingual TTS in this pack. For mission-critical localization (paid clients) you still want a native voice actor for the target language; clones get you to draft quality, not broadcast.
Realtime vs offline. Nothing in this pack is realtime. This is a production studio, not a live-stream rig. If you need live, look at the Voice AI Stack pack instead.

Common pitfalls (and the ethical one)

You don't own the rights to clone someone else's voice. Cloning a guest, a public figure, a deceased person, or any voice you don't have explicit written consent from is a fast track to a lawsuit, a platform ban, and (in many jurisdictions) criminal liability. ElevenLabs requires a consent-recording before voice cloning. OpenVoice and GPT-SoVITS do not enforce this — you must. Get written consent before you clone anyone, and log it.
Model bias generates accents you didn't want. Voice cloning models trained predominantly on American English will make your Indian-English / Australian / Scottish accent sound subtly "American". Test the clone across your whole accent range before committing to a season of episodes.
Proper-noun transcription errors. Whisper mis-hears names it has rarely seen. "Linus Torvalds" comes out right about 90% of the time; "Anthropic" comes out "and topic". Build a custom vocabulary or a post-process replace list for every recurring name on your show.
Long-audio cost. Transcribing a 2-hour podcast through cloud Whisper is cheap ($0.72 at $0.006/min). Dubbing the same episode through ElevenLabs is not: at the multilingual rate and roughly 100k characters per hour of speech, expect about $20–60 per language per episode. Run the math before you promise "every episode in 10 languages".
VAD before everything. If you skip voice-activity detection and feed silent gaps to Whisper, you'll get the famous hallucinated line "Thank you for watching!" baked into your subtitles. Add a 30-line silero-vad pass before any STT call.
Not keeping the original master. Voice clone + re-mix + re-dub is a destructive chain. Always keep the original multi-track Audacity project — clients, lawyers, and future-you will all need it.
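The long-audio cost pitfall above is easy to check with a few lines of arithmetic. This is a rough sketch, not a pricing calculator: the per-minute transcription rate and the characters-per-hour estimate are the figures quoted in this section, while `usd_per_1k_chars` is a hypothetical placeholder you should replace with the rate on your own plan.

```python
from decimal import Decimal

WHISPER_USD_PER_MIN = Decimal("0.006")  # cloud transcription rate quoted in this section
CHARS_PER_HOUR = 100_000                # rough speech-to-characters estimate quoted in this section


def transcription_cost(minutes: int) -> Decimal:
    """Cost in USD of transcribing `minutes` of audio at the quoted rate."""
    return WHISPER_USD_PER_MIN * minutes


def dubbing_cost(hours: int, languages: int, usd_per_1k_chars: Decimal) -> Decimal:
    """Cost in USD of re-synthesizing `hours` of speech in `languages` languages.

    usd_per_1k_chars is an assumption supplied by the caller, not a published price.
    """
    chars = CHARS_PER_HOUR * hours
    return Decimal(chars) / 1000 * usd_per_1k_chars * languages


if __name__ == "__main__":
    print("transcribe 2 h:", transcription_cost(120))
    print("dub 2 h into 10 languages:", dubbing_cost(2, 10, Decimal("0.10")))
```

Money is kept in `Decimal` rather than floats so the totals do not pick up binary rounding noise when you multiply across many episodes and languages.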

Ethical disclaimer

Voice cloning has legitimate uses: re-recording your own flubs, accessibility narration, dubbing your own content into languages you don't speak, voice preservation for ALS patients. It also has obvious abuses: impersonation fraud, non-consensual deepfakes, putting words in a public figure's mouth. This pack ships the tools. The rules are on you. Get explicit written consent before cloning any voice that isn't your own. Disclose AI-generated audio in your show notes. Many platforms (YouTube, TikTok, Spotify) now require disclosure of synthetic media and will demonetize / remove content that hides it. Build the disclosure into your publish step from day one.

INSTALL · ONE COMMAND

$ tokrepo install pack/voice-clone-podcast-studio

hand it to your agent — or paste it in your terminal

What's inside

10 assets in this pack

Skill#01

Audacity — Free Cross-Platform Audio Editor

Audacity is a free, open-source digital audio editor and recorder for Windows, macOS, and Linux. It supports multi-track editing, a wide range of audio formats, real-time effects, and plugin extensibility for recording, editing, and mastering audio.

by AI Open Source·206 views

$ tokrepo install audacity-free-cross-platform-audio-editor-44f450b6

Skill#02

Whisper — OpenAI Speech-to-Text

OpenAI's open-source speech recognition model. Transcribe audio/video to text with word-level timestamps in 99 languages. Essential for subtitle generation.

by OpenAI·417 views

$ tokrepo install whisper-openai-speech-text-eb0f9dd6

Skill#03

whisper.cpp — Local Speech-to-Text in Pure C/C++

High-performance port of OpenAI Whisper in C/C++. No Python, no GPU required. Runs on CPU, Apple Silicon, CUDA, and even Raspberry Pi. Real-time transcription.

by Script Depot·2132 views

$ tokrepo install whisper-cpp-local-speech-text-pure-c-c-e1fd7c46

Script#04

ElevenLabs Python SDK — AI Text-to-Speech

Official ElevenLabs Python SDK for AI voice generation. Create realistic voiceovers with 30+ languages, voice cloning, and streaming support.

by ElevenLabs·350 views

$ tokrepo install elevenlabs-python-sdk-ai-text-speech-16d32da9

Skill#05

OpenVoice — Instant Voice Cloning with Tone and Style Control

OpenVoice is an open-source voice cloning framework from MyShell AI that reproduces a speaker's voice from a short audio sample while giving independent control over emotion, accent, rhythm, and language.

by AI Open Source·207 views

$ tokrepo install openvoice-instant-voice-cloning-tone-style-control-ae7169ee

Skill#06

GPT-SoVITS — Few-Shot Voice Cloning and Text-to-Speech

An open-source TTS system that can clone any voice from just one minute of audio data, combining GPT-style language modeling with VITS synthesis for natural speech generation.

by AI Open Source·348 views

$ tokrepo install gpt-sovits-few-shot-voice-cloning-text-speech-8b48f7ce

Skill#07

Fish Speech — Multilingual TTS for 80+ Languages

Fish Speech is a state-of-the-art open-source TTS system supporting 80+ languages. 29K+ GitHub stars. 4B dual-AR model, voice cloning, emotional control with 15K+ tags, real-time inference.

by AI Open Source·419 views

$ tokrepo install fish-speech-multilingual-tts-80-languages-88c15e9c

Script#08

Coqui TTS — Deep Learning Text-to-Speech Engine

Generate speech in 1100+ languages with voice cloning. XTTS v2 streams with under 200ms latency. 44K+ GitHub stars.

by TokRepo Picks·506 views

$ tokrepo install coqui-tts-deep-learning-text-speech-engine-a059dce2

Skill#09

KrillinAI — AI Video Translation and Dubbing in 100 Languages

An open-source tool that uses LLMs to translate and dub video content into over 100 languages with one-click deployment, optimized for YouTube, TikTok, and other platforms.

by AI Open Source·218 views

$ tokrepo install krillinai-ai-video-translation-dubbing-100-languages-e0ea662e

Skill#10

VideoCaptioner — AI Subtitle Pipeline

LLM-powered video subtitle tool: Whisper transcription + AI correction + 99-language translation + styled subtitle export. 13,800+ stars.

by Script Depot·450 views

$ tokrepo install videocaptioner-ai-subtitle-pipeline-d12d8441

FAQ

Frequently asked questions

Is it legal to clone my own voice?

Cloning your own voice for your own use is legal in essentially every jurisdiction. The trouble starts when (1) you clone a voice you don't have rights to, such as a guest, a celebrity, or a deceased person; (2) a clone is used to impersonate someone for fraud or defamation, which includes your own clone ending up in someone else's hands; or (3) you hide that audio is AI-generated on a platform that requires disclosure (YouTube, TikTok, Spotify, Meta all do now). For your own podcast intros, narration patches, and translated dubs of your own content, you're fine. For anything involving a second person, get written consent.

ElevenLabs vs Fish Speech vs OpenVoice — which one for what?

ElevenLabs is the quality leader for English/Spanish/German and a paid SaaS — pick it when fidelity matters more than recurring cost and you're okay with a cloud dependency. Fish Speech is the best open multilingual TTS in this pack — it covers 80+ languages including strong Chinese and Japanese, runs on your GPU, and is what you reach for when ElevenLabs sounds "too foreign" in your target language. OpenVoice is the fastest open clone — 3-second reference audio, MIT-licensed, runs on a consumer GPU, but quality tops out around "good podcast intro" not "broadcast narration". Typical setup: ElevenLabs for your main voice clone, Fish Speech for Chinese/Japanese dubs, OpenVoice for one-off character voices.

Which voice clone has the best Chinese quality?

For Chinese specifically: GPT-SoVITS and Fish Speech are both stronger than ElevenLabs out of the box, because they're trained on much more Chinese data. GPT-SoVITS in particular has a strong Chinese community and most public few-shot tutorials are Chinese-language. ElevenLabs has improved Chinese significantly in the last year but still has noticeable English-influenced tonal artifacts on the 4 tones. For a Chinese-language podcast or dub track, fine-tune GPT-SoVITS or Fish Speech on ~30 minutes of clean Mandarin reference; for a single Chinese sentence in an otherwise English show, ElevenLabs is fine.

Can I really dub a 1-hour podcast in one click with this?

Technically yes with KrillinAI — feed it episode.mp4, pick target language, get back episode-es.mp4 with translated subtitles and dubbed audio. Realistically you'll want a human review pass before publishing, because (1) translation will mangle a few cultural references and inside jokes, (2) the clone will mispronounce proper nouns and acronyms specific to your domain, (3) lip-sync on long-form podcast video is convincing for 80% of clips and visibly off for 20%. Workflow that actually works: KrillinAI for the first pass on a 5-minute promo clip; if quality is good, batch the whole episode; review the transcript for terminology fixes; re-render. End-to-end for a 1-hour episode: ~3 hours human time vs ~3 days for an outsourced translation agency.
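The "review the transcript for terminology fixes" step can be partly automated with a replace list applied before re-rendering. A minimal sketch under stated assumptions: the transcript is plain text, and `FIXES` and `fix_terms` are hypothetical names, with the single entry taken from the "and topic" mis-hearing mentioned earlier on this page.

```python
import re
from typing import Dict

# Hypothetical replace list: mis-heard phrase -> correct spelling.
FIXES: Dict[str, str] = {
    "and topic": "Anthropic",
}


def fix_terms(text: str, fixes: Dict[str, str] = FIXES) -> str:
    """Replace known mis-transcriptions, matching whole words case-insensitively."""
    for wrong, right in fixes.items():
        pattern = re.compile(r"\b" + re.escape(wrong) + r"\b", re.IGNORECASE)
        # A function replacement avoids backslash interpretation in `right`.
        text = pattern.sub(lambda _m, r=right: r, text)
    return text
```

A blind replace will also rewrite legitimate uses of a phrase ("and topic" can be real English), so keep the list short and skim the diff before re-rendering.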

What's the fastest video editor for podcast-to-social repurposing?

If you mean cutting 60-second vertical clips out of a 90-minute episode for TikTok/Shorts/Reels: VideoCaptioner is the unlock here, because the big time sink is not the cut — it's animating word-by-word captions on every clip. VideoCaptioner takes the transcript Whisper already gave you and burns animated word-level subtitles into a vertical export. Combine with a simple FFmpeg crop or Shotcut/Kdenlive for the cut itself. If you want a single GUI that does cut + caption + export, OpenCut and Shotcut both work but you'll spend more time per clip. The fast path: edit-by-transcript in Audacity / a text editor, render the cut with FFmpeg, caption with VideoCaptioner, ship.
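The "simple FFmpeg crop" for the cut itself can be sketched as a small command builder. This assumes a landscape source video and an `ffmpeg` binary on your PATH; the function name and file names are hypothetical. It only builds the argument list, so nothing runs until you hand it to `subprocess.run`.

```python
from typing import List


def vertical_clip_cmd(src: str, dst: str, start_s: float, duration_s: float) -> List[str]:
    """Build an ffmpeg argv that cuts one clip and center-crops it to 9:16."""
    return [
        "ffmpeg",
        "-ss", str(start_s),       # seek to the clip start (placed before -i)
        "-i", src,
        "-t", str(duration_s),     # clip length in seconds
        "-vf", "crop=ih*9/16:ih",  # keep full height, crop width to 9:16; crop is centered by default
        "-c:a", "copy",            # pass the audio stream through unchanged
        dst,
    ]


if __name__ == "__main__":
    # Print the command instead of running it; use subprocess.run(cmd, check=True) to execute.
    print(" ".join(vertical_clip_cmd("episode-12.mp4", "clip-01.mp4", 600, 60)))
```

The cropped clip is then the input for the captioning step; the start and duration values would normally come from the timestamps in your transcript.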
