Voice Cloning + Podcast — One Person Runs the Whole Show
Ten picks for the indie podcaster, voice actor, or YouTuber running a whole show solo — Audacity for capture and noise cleanup, Whisper / whisper.cpp for transcription, ElevenLabs / OpenVoice / GPT-SoVITS / Fish Speech / Coqui TTS for voice clone and multilingual dubbing, KrillinAI for one-click 100-language video dub, VideoCaptioner for subtitle baking. Recording → cleanup → clone → dub → publish, in one rig.
What's in this pack
This is the rig an indie podcaster, voice actor, or YouTuber would build to run a whole show without a producer, sound engineer, or translation agency. Ten picks in an opinionated order, every one of them either open-source or backed by a serious free tier. The point is not "all the tools that exist"; it's "the smallest set that lets one person record on Monday and ship a localized, captioned, denoised, voice-cloned cut by Friday".
Five layers, with two or three picks wherever there's a real tradeoff:
| Layer | Picks | Why |
|---|---|---|
| 1. Record + clean | Audacity | Free DAW. Records multi-track, removes hiss/breath/click, exports anything. |
| 2. Transcribe | Whisper (cloud) · whisper.cpp (local) | Cloud Whisper for highest accuracy; whisper.cpp for offline / sensitive / batch / mobile. |
| 3. Clone your voice | ElevenLabs · OpenVoice · GPT-SoVITS | ElevenLabs = top fidelity, paid. OpenVoice = instant tone+style clone, MIT. GPT-SoVITS = few-shot clone you self-host. |
| 4. Dub into other languages | Fish Speech · Coqui TTS · KrillinAI | Fish Speech does 80+ languages. Coqui TTS = pluggable engine. KrillinAI takes a video file and dubs the whole thing in one click. |
| 5. Caption + ship | VideoCaptioner | Burns word-level subtitles into vertical cuts for TikTok / Reels / Shorts. |
The pack is sized for one operator. If you're running a 3-person podcast network with editors, swap Audacity for Reaper / Adobe Audition (paid), swap KrillinAI for a human translation pass, and add a publish/scheduling tool. For everyone else, this is the rig.
Install in this order
Do NOT install the voice clone tools first. You need a clean recording before cloning gives a usable result.
# Stage 1 — capture and clean (Monday)
tokrepo install audacity
# Stage 2 — get a transcript so you can edit by text, not by waveform (Monday night)
tokrepo install whisper-cpp # local, free, ~5x realtime on M-series
# OR
tokrepo install whisper # OpenAI API, highest accuracy
# Stage 3 — clone your own voice (Tuesday — you only do this once)
tokrepo install elevenlabs-python-sdk # 3 min of clean audio → studio-grade clone
# OR — if you want to self-host / not pay per character
tokrepo install openvoice # instant clone, MIT
tokrepo install gpt-sovits # few-shot, GPU recommended
# Stage 4 — dub a clip into other languages (Wednesday)
tokrepo install fish-speech # multilingual TTS, 80+ languages
tokrepo install coqui-tts # self-hosted alternative
tokrepo install krillinai # full-video dub, subtitles+voice, one command
# Stage 5 — publish (Thursday)
tokrepo install videocaptioner # burn animated captions for social cuts
The TokRepo CLI drops each asset as a skill file in your repo. Claude Code / Cursor / Codex CLI read the skill and can wire up the script for you — "take episode-12.wav, denoise it in Audacity headless, transcribe with whisper.cpp, dub the first 60 seconds into Spanish with KrillinAI, burn captions with VideoCaptioner, output ep12-es.mp4" becomes a single agent prompt.
How they fit together
[ Mic / Riverside / Zoom recording ]
│
▼
┌─────────────────────┐
│ Audacity │ noise gate, EQ, normalize, click removal
└─────────────────────┘
│ clean WAV
▼
┌─────────────────────┐
│ Whisper / whisper.cpp │ transcript + word timestamps
└─────────────────────┘
│ edit by deleting text, not waveform
▼
┌─────────────────────────────────┐
│ Voice clone (one of): │
│ ElevenLabs · OpenVoice · │ → your-voice model
│ GPT-SoVITS │
└─────────────────────────────────┘
│
├──► re-record a flub: type the line, your-voice speaks it
│
▼
┌─────────────────────────────────┐
│ Multilingual dub (one of): │
│ Fish Speech (TTS engine) · │
│ Coqui TTS · KrillinAI │ → ES / JA / DE / FR audio track
│ (full video pipeline) │
└─────────────────────────────────┘
│
▼
┌─────────────────────┐
│ VideoCaptioner │ word-by-word burned captions, vertical cut
└─────────────────────┘
│
▼
[ YouTube / Spotify / TikTok / Reels / Shorts ]
The big unlock here is editing by transcript, not by waveform. Once Whisper gives you a timestamped transcript, removing an um/uh becomes deleting a word from a text file and re-rendering. That's where the 5x speed-up actually comes from — not the cloning, not the dubbing, but never having to scrub through a 90-minute waveform.
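To make the edit-by-transcript idea concrete, here is a minimal sketch in Python. It assumes you have already parsed the word-level timestamps into simple records; the `Word` class, the `FILLERS` set, and the `pad` value are illustrative names and defaults, not part of Whisper or any tool in this pack. It drops filler words and returns the time spans worth keeping, which you could then hand to whatever renders the cut.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float    # seconds


# Assumed filler list; extend it for your own verbal tics.
FILLERS = {"um", "uh", "erm"}


def keep_segments(words, fillers=FILLERS, pad=0.02):
    """Return merged (start, end) spans covering every non-filler word."""
    segments = []
    for w in words:
        if w.text.lower().strip(".,!?") in fillers:
            continue  # deleting the word from the text deletes its audio
        start, end = max(0.0, w.start - pad), w.end + pad
        if segments and start <= segments[-1][1]:
            # Touches or overlaps the previous kept span: extend it.
            segments[-1] = (segments[-1][0], max(segments[-1][1], end))
        else:
            segments.append((start, end))
    return segments


words = [
    Word("So", 0.00, 0.20),
    Word("um", 0.25, 0.60),
    Word("welcome", 0.70, 1.10),
    Word("back", 1.10, 1.40),
]
print(keep_segments(words))
```

The small `pad` keeps word onsets from being clipped. Tune it by ear, since timestamp precision varies by model and audio quality.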
Tradeoffs you'll hit
- ElevenLabs vs OpenVoice vs GPT-SoVITS for cloning your own voice. ElevenLabs is the fidelity ceiling — 3 minutes of clean audio gets you a clone friends can't tell apart, but it's $5–$330/month + character overage and your voice model lives on their servers. OpenVoice is MIT-licensed and runs on a consumer GPU; quality is "good enough for podcast intros, not narration". GPT-SoVITS is the strongest open option but needs a fine-tune pass per voice. Pick ElevenLabs for fastest result, OpenVoice/GPT-SoVITS if licensing or recurring cost matters.
- Cloud Whisper vs whisper.cpp. Cloud is the most accurate, especially on Chinese/Japanese/proper nouns. whisper.cpp runs on a MacBook with no internet, no per-minute cost, no data leaving your machine. Podcasts with named guests → cloud. Locked-down corporate / journalism with sources → local.
- KrillinAI vs DIY (Fish Speech + Coqui). KrillinAI takes a video file and gives you the same video in a new language, lips kind of synced, subtitles included — one command. The DIY path (extract audio → transcribe → translate → re-TTS → mux back in) gives you control over each step but is 5x the integration work. Use KrillinAI for first pass; drop down to DIY when one step needs tuning.
- Multilingual fidelity reality check. Chinese/Japanese/Korean clones from English-trained voice models will sound "foreign-accented". Fish Speech is the strongest multilingual TTS in this pack. For mission-critical localization (paid clients) you still want a native voice actor for the target language; clones get you to draft quality, not broadcast.
- Realtime vs offline. Nothing in this pack is realtime; this is a production studio, not a live-stream rig. If you need live, look at the Voice AI Stack pack instead.
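For the DIY dubbing path mentioned above, the two ends of the chain (extract audio, mux the dubbed track back in) are plain FFmpeg calls. The sketch below only builds the argument lists. The function names are hypothetical, and the middle steps (transcribe, translate, re-TTS) are left out because they depend on which tools you pick. The 16 kHz mono WAV is an assumption that suits whisper.cpp; adjust it for your STT engine.

```python
def extract_audio_cmd(video, wav):
    """argv that pulls a mono 16 kHz WAV out of a video file."""
    return ["ffmpeg", "-y", "-i", video, "-vn", "-ac", "1", "-ar", "16000", wav]


def mux_dub_cmd(video, dub_audio, out):
    """argv that keeps the original video stream and swaps in the dubbed audio."""
    return [
        "ffmpeg", "-y",
        "-i", video, "-i", dub_audio,
        "-map", "0:v:0",   # video from the first input
        "-map", "1:a:0",   # audio from the second input
        "-c:v", "copy",    # do not re-encode the picture
        "-shortest",       # stop at the shorter of the two streams
        out,
    ]


print(" ".join(extract_audio_cmd("episode.mp4", "episode.wav")))
print(" ".join(mux_dub_cmd("episode.mp4", "episode-es.wav", "episode-es.mp4")))
```

Run each list with `subprocess.run(cmd, check=True)` once FFmpeg is on your PATH. Keeping the commands as data makes it easy to log exactly what was run for each episode.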
Common pitfalls (and the ethical one)
- You don't own the rights to clone someone else's voice. Cloning a guest, a public figure, a deceased person, or any voice you don't have explicit written consent from is a fast track to a lawsuit, a platform ban, and (in many jurisdictions) criminal liability. ElevenLabs requires a consent-recording before voice cloning. OpenVoice and GPT-SoVITS do not enforce this — you must. Get written consent before you clone anyone, and log it.
- Model bias generates accents you didn't want. Voice cloning models trained predominantly on American English will make your Indian-English / Australian / Scottish accent sound subtly "American". Test the clone across your whole accent range before committing to a season of episodes.
- Proper-noun transcription error rate. Whisper hallucinates names. "Linus Torvalds" comes out "Linus Torvalds" 90% of the time; "Anthropic" comes out "and topic". Build a custom vocabulary / post-process replace list for every recurring name on your show.
- Long-audio cost. Transcribing a 2-hour podcast through cloud Whisper is cheap ($0.72 at $0.006/min). Dubbing is a different order of magnitude: at roughly 100k characters per hour of speech, ElevenLabs at the multilingual rate comes to about $20–60 per language per episode. Run the math before you promise "every episode in 10 languages".
- VAD before everything. If you skip voice-activity detection and feed silent gaps to Whisper, you'll get the famous hallucinated transcript "Thank you for watching!" baked into your subtitles. Add a 30-line silero-vad pass before any STT call.
- Not keeping the original master. Voice clone + re-mix + re-dub is a destructive chain. Always keep the original multi-track Audacity project; clients, lawyers, and future-you will all need it.
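The cost arithmetic in the long-audio bullet above is worth scripting before you commit to a release schedule. This is a rough sketch: the $0.006/min transcription rate and the ~100k characters per hour figure are the ones quoted in this pack, the function names are made up for illustration, and current pricing should be checked against each provider's pricing page.

```python
WHISPER_USD_PER_MIN = 0.006   # rate quoted above; verify before budgeting
CHARS_PER_HOUR = 100_000      # rough speech-to-characters estimate from above


def transcription_cost(minutes, usd_per_min=WHISPER_USD_PER_MIN):
    """Cloud transcription cost in USD for a recording of the given length."""
    return minutes * usd_per_min


def dubbing_chars(hours, languages, chars_per_hour=CHARS_PER_HOUR):
    """Total characters sent to a per-character TTS service."""
    return int(hours * chars_per_hour * languages)


print(f"Transcribe 2 h: ${transcription_cost(120):.2f}")
print(f"Dub 2 h into 10 languages: {dubbing_chars(2, 10):,} characters")
```

At these numbers a 2-hour episode costs $0.72 to transcribe, while dubbing it into 10 languages means pushing about 2,000,000 characters through a paid TTS plan. That second figure is the one to compare against your plan's monthly quota.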
Ethical disclaimer
Voice cloning has legitimate uses: re-recording your own flubs, accessibility narration, dubbing your own content into languages you don't speak, voice preservation for ALS patients. It also has obvious abuses: impersonation fraud, non-consensual deepfakes, putting words in a public figure's mouth. This pack ships the tools. The rules are on you. Get explicit written consent before cloning any voice that isn't your own. Disclose AI-generated audio in your show notes. Many platforms (YouTube, TikTok, Spotify) now require disclosure of synthetic media and will demonetize / remove content that hides it. Build the disclosure into your publish step from day one.
10 assets in this pack
Frequently asked questions
Is it legal to clone my own voice?
Cloning your own voice for your own use is legal in essentially every jurisdiction. The trouble starts when you (1) clone a voice you don't have rights to — a guest, a celebrity, a deceased person; (2) use a clone to impersonate someone for fraud or defamation, even your own clone in someone else's hands; or (3) hide that audio is AI-generated on a platform that requires disclosure (YouTube, TikTok, Spotify, Meta all do now). For your own podcast intros, narration patches, and translated dubs of your own content, you're fine. For anything involving a second person, get written consent.
ElevenLabs vs Fish Speech vs OpenVoice — which one for what?
ElevenLabs is the quality leader for English/Spanish/German and a paid SaaS — pick it when fidelity matters more than recurring cost and you're okay with a cloud dependency. Fish Speech is the best open multilingual TTS in this pack — it covers 80+ languages including strong Chinese and Japanese, runs on your GPU, and is what you reach for when ElevenLabs sounds "too foreign" in your target language. OpenVoice is the fastest open clone — 3-second reference audio, MIT-licensed, runs on a consumer GPU, but quality tops out around "good podcast intro" not "broadcast narration". Typical setup: ElevenLabs for your main voice clone, Fish Speech for Chinese/Japanese dubs, OpenVoice for one-off character voices.
Which voice clone has the best Chinese quality?
For Chinese specifically: GPT-SoVITS and Fish Speech are both stronger than ElevenLabs out of the box, because they're trained on much more Chinese data. GPT-SoVITS in particular has a strong Chinese community and most public few-shot tutorials are Chinese-language. ElevenLabs has improved Chinese significantly in the last year but still has noticeable English-influenced tonal artifacts on the 4 tones. For a Chinese-language podcast or dub track, fine-tune GPT-SoVITS or Fish Speech on ~30 minutes of clean Mandarin reference; for a single Chinese sentence in an otherwise English show, ElevenLabs is fine.
Can I really dub a 1-hour podcast in one click with this?
Technically yes with KrillinAI — feed it episode.mp4, pick target language, get back episode-es.mp4 with translated subtitles and dubbed audio. Realistically you'll want a human review pass before publishing, because (1) translation will mangle a few cultural references and inside jokes, (2) the clone will mispronounce proper nouns and acronyms specific to your domain, (3) lip-sync on long-form podcast video is convincing for 80% of clips and visibly off for 20%. Workflow that actually works: KrillinAI for the first pass on a 5-minute promo clip; if quality is good, batch the whole episode; review the transcript for terminology fixes; re-render. End-to-end for a 1-hour episode: ~3 hours human time vs ~3 days for an outsourced translation agency.
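The terminology-fix step in that workflow can be as small as a replace list applied to the transcript before you re-render. A minimal sketch follows. The `FIXES` entries and the `fix_names` helper are hypothetical, and you would build the real list from the names your own show keeps getting wrong.

```python
import re

# Hypothetical replace list: mis-heard form -> correct spelling.
FIXES = {
    "and topic": "Anthropic",
}


def fix_names(text, fixes=FIXES):
    """Replace known mis-transcriptions, matching whole words only."""
    for wrong, right in fixes.items():
        pattern = re.compile(r"\b" + re.escape(wrong) + r"\b", re.IGNORECASE)
        # A function replacement avoids surprises if `right` contains backslashes.
        text = pattern.sub(lambda m, r=right: r, text)
    return text


print(fix_names("We talked to and topic about safety."))
```

A blunt replace like this will also rewrite a legitimate "and topic", so keep the list short, specific to your show, and skim the diff before publishing.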
What's the fastest video editor for podcast-to-social repurposing?
If you mean cutting 60-second vertical clips out of a 90-minute episode for TikTok/Shorts/Reels: VideoCaptioner is the unlock here, because the big time sink is not the cut — it's animating word-by-word captions on every clip. VideoCaptioner takes the transcript Whisper already gave you and burns animated word-level subtitles into a vertical export. Combine with a simple FFmpeg crop or Shotcut/Kdenlive for the cut itself. If you want a single GUI that does cut + caption + export, OpenCut and Shotcut both work but you'll spend more time per clip. The fast path: edit-by-transcript in Audacity / a text editor, render the cut with FFmpeg, caption with VideoCaptioner, ship.
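The "simple FFmpeg crop" mentioned above might look like the sketch below. It builds the argument list for a centered 9:16 crop of a landscape source. `vertical_clip_cmd` is an illustrative name, and the crop assumes the speaker sits in the middle of the frame.

```python
def vertical_clip_cmd(src, out, start, duration):
    """argv for a centered 9:16 clip cut from a landscape video."""
    return [
        "ffmpeg", "-y",
        "-ss", str(start),          # seek to the clip start (seconds)
        "-i", src,
        "-t", str(duration),        # clip length (seconds)
        "-vf", "crop=ih*9/16:ih",   # full height, 9:16 width, centered by default
        out,
    ]


cmd = vertical_clip_cmd("episode.mp4", "clip-01.mp4", start=125.0, duration=60)
print(" ".join(cmd))
```

Run it with `subprocess.run(cmd, check=True)`, then hand the resulting clip to the captioning step. If the speaker is off-center, the crop filter also accepts explicit x and y offsets.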