TTS + STT Pack
Ten picks for the developer building voicebots, transcription pipelines, or audiobook narrators: Whisper variants (whisper.cpp / Faster Whisper / WhisperX) for STT, ElevenLabs / Coqui / Bark / StyleTTS 2 / Kokoro for TTS, plus OpenVoice for voice cloning. Complements voice-ai-stack: the components live here, the realtime substrate there.
What's in this pack
This is the components catalog for voice apps. Where the Voice AI Stack pack gives you the realtime substrate (LiveKit, Moshi, OpenAI Realtime, Zonos) for speech-to-speech agents, this pack gives you the discrete STT and TTS engines you compose into the classic cascade architecture: microphone → STT → LLM → TTS → speaker.
The cascade is not dead. It's the right call when you need:
- Precise control over the LLM step — tool calls, structured output, RAG retrieval, anything where you need to inspect or transform the text.
- Cost-sensitive workloads at scale — speech-to-speech models are still 3-5x more expensive per minute than a well-tuned cascade.
- Non-realtime use cases — transcription pipelines, audiobook generation, podcast post-production, voiceovers for video. Latency is not the constraint.
- Self-hosted or air-gapped deployments — every component here has an open-source option you can run on your own GPU or even CPU.
Ten picks, grouped by layer:
| Layer | Pick | When to reach for it |
|---|---|---|
| STT — canonical | Whisper | The reference. Batch transcription, multilingual, well-known accuracy. |
| STT — local | whisper.cpp | Pure C/C++ port. CPU, Apple Silicon, no Python. Mobile and edge. |
| STT — fast | Faster Whisper | Up to 4x faster than reference Whisper via CTranslate2. Same accuracy, much less GPU time. |
| STT — diarized | WhisperX | Batched inference at up to 70x realtime, plus word-level timestamps and speaker diarization. Meetings, podcasts. |
| TTS — commercial | ElevenLabs Python SDK | Highest perceived quality, streaming, voice cloning. Pay per character. |
| TTS — open framework | Coqui TTS | Deep-learning TTS engine with multiple model architectures. Self-host. |
| TTS — expressive | Bark | Suno's transformer model. Music, sound effects, non-speech audio. MIT. |
| TTS — human-level | StyleTTS 2 | Style diffusion for naturalness that rivals proprietary engines. |
| TTS — lightweight | Kokoro | 82M parameters, 9 languages, runs comfortably on a laptop CPU. |
| Cloning | OpenVoice | Instant voice cloning with separate tone and style control. |
Install in this order
# 1. Pick your STT first — it sets your latency floor
tokrepo install whisper-cpp # local, CPU
# or
tokrepo install faster-whisper # GPU, batch + streaming
# or
tokrepo install whisperx # transcription with diarization
# 2. Add the TTS engine matched to your quality bar
tokrepo install elevenlabs-python-sdk # ship quality, pay per char
# or
tokrepo install coqui-tts # self-host, decent quality
# or
tokrepo install kokoro # lightweight, runs anywhere
# 3. Optional — voice cloning for branded narrators
tokrepo install openvoice
The TokRepo CLI drops a skill into your repo per asset. For Claude Code, Cursor, or Codex CLI the skills include working Python snippets and dependency lists; you wire them into your own app loop.
How the cascade actually fits together
[ Microphone / audio file ]
│
▼
[ STT — Whisper variant ]
│ text + word timestamps
▼
[ LLM — your choice ]
│ reply text + tool calls
▼
[ Text normalizer ]
│ numbers, dates, emoji stripped
▼
[ TTS — ElevenLabs / Coqui / Bark / Kokoro ]
│ streaming audio frames
▼
[ Speaker / output file ]
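The diagram can be sketched as a small generator that wires the stages together. This is a minimal illustration, not any engine's real API: the stage signatures (`Transcribe`, `Generate`, `Normalize`, `Synthesize`) and the `fake_*` stubs are hypothetical, and a real deployment would wrap its chosen STT, LLM, and TTS clients to match them.

```python
from typing import Callable, Iterable, Iterator

# Hypothetical stage signatures; real engines would be wrapped to match them.
Transcribe = Callable[[bytes], str]          # audio in -> text
Generate = Callable[[str], Iterable[str]]    # prompt -> streamed text chunks
Normalize = Callable[[str], str]             # text -> speakable text
Synthesize = Callable[[str], bytes]          # text -> audio bytes


def run_cascade(audio: bytes, stt: Transcribe, llm: Generate,
                normalize: Normalize, tts: Synthesize) -> Iterator[bytes]:
    """Yield synthesized audio for each LLM chunk as soon as it is available."""
    text = stt(audio)
    for chunk in llm(text):
        spoken = normalize(chunk)
        if spoken.strip():  # skip chunks that normalize to nothing
            yield tts(spoken)


# Stub stages so the wiring can be exercised without any model installed.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(prompt: str) -> Iterable[str]:
    yield f"You said: {prompt}."
    yield " "
    yield "Anything else?"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


frames = list(run_cascade(b"hello", fake_stt, fake_llm, str.strip, fake_tts))
```

Because `run_cascade` is a generator, the caller can start playing the first frame while the LLM is still producing later chunks, which is what makes the text step inspectable without giving up streaming.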
A few things every shipping cascade does right:
- Stream both ends. STT emits partial hypotheses every ~200ms; TTS emits audio after the first ~100ms of LLM output. Wire the LLM to stream tokens. End-to-end perceived latency drops from "send-then-wait" to "trickle".
- Normalize before TTS. `$1,234.56` reads as one-comma-two-three-four-point-five-six on most engines. A 20-line normalizer for currency, dates, abbreviations, and URLs is worth a week of "why does my agent sound dumb".
- Cache the boot. Whisper-large takes ~3 seconds to load weights cold. Keep the model warm in a long-lived process; the first transcription should not pay this cost.
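A normalizer of the kind described in the list can be sketched in a few lines. This is an illustrative sketch under assumptions: the function name `normalize_for_tts`, the abbreviation table, and the choice to replace URLs with the words "a link" are all hypothetical, dates are left out for brevity, and plain integers are left for the TTS engine to read.

```python
import re

# Hypothetical abbreviation table; extend it for your domain.
ABBREVIATIONS = {"Dr.": "Doctor", "e.g.": "for example", "approx.": "approximately"}

# Dollar amounts such as $5, $1234, or $1,234.56 (exactly two cent digits).
_CURRENCY = re.compile(r"\$(\d+(?:,\d{3})*)(?:\.(\d{2}))?")
_URL = re.compile(r"https?://\S+")


def _speak_currency(match: re.Match) -> str:
    dollars = int(match.group(1).replace(",", ""))
    cents = int(match.group(2)) if match.group(2) else 0
    spoken = f"{dollars} dollar" + ("" if dollars == 1 else "s")
    if cents:
        spoken += f" and {cents} cent" + ("" if cents == 1 else "s")
    return spoken


def normalize_for_tts(text: str) -> str:
    """Rewrite URLs, currency, and abbreviations into speakable text."""
    text = _URL.sub("a link", text)
    text = _CURRENCY.sub(_speak_currency, text)
    for short, spoken in ABBREVIATIONS.items():
        text = text.replace(short, spoken)
    return re.sub(r"\s+", " ", text).strip()


example = normalize_for_tts("Dr. Lee paid $1,234.56 at https://example.com/pay today.")
```

The sketch removes the thousands separator so the engine sees `1234` instead of `1,234`. Whether an engine then reads the number naturally varies, so check the output against the engine you ship.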
Tradeoffs you'll hit
- Whisper-large vs medium vs tiny. Tiny runs on a Raspberry Pi; large needs a GPU. Most production teams settle on medium plus VAD-aware chunking — it's the accuracy/cost knee. Faster Whisper makes large affordable; whisper.cpp makes tiny/base usable on CPU.
- ElevenLabs vs open-source TTS. ElevenLabs sounds noticeably better but costs $30-330/month plus per-character overages. Coqui + StyleTTS 2 reach "good enough for production" but require GPU. The cutoff: under 100k chars/day, run ElevenLabs; above, self-host.
- Bark vs Kokoro vs StyleTTS. Bark is expressive (laughs, music, effects) but slow and not always controllable. Kokoro is fast and tiny but neutral-sounding. StyleTTS 2 is human-level natural but needs the most VRAM. Match the engine to the artifact — Bark for game NPCs, Kokoro for IVR, StyleTTS for audiobooks.
- Voice cloning ethics. OpenVoice and ElevenLabs both support consent-based cloning. Always require explicit opt-in and log the consent. Unconsented cloning is the one easy way to lose a deal or a lawsuit.
Common pitfalls
- No VAD on the STT input. Sending continuous silence to Whisper produces hallucinated transcripts ("Thank you for watching!" is the famous one). Run a 30-line `webrtcvad` or `silero-vad` filter before Whisper. This single change kills the most common cascade bug.
- Sending the whole LLM reply to TTS at once. You're paying for the full LLM latency and the full TTS latency sequentially. Stream the LLM tokens into a sentence-buffer; flush a sentence to TTS the moment a `.`, `?`, or `!` arrives.
- Ignoring sample rate mismatches. Whisper expects 16kHz mono. TTS engines output 22.05/24/48 kHz. Resample at the boundaries; mismatched rates produce chipmunk or sub-bass artifacts that QA will blame on the model.
- Treating WhisperX as a drop-in for Whisper. WhisperX needs `pyannote` for diarization, which means a Hugging Face token and a license agreement. Plan the auth before you depend on it in production.
- Forgetting to log audio + transcript pairs. Voice apps regress silently: a TTS update or STT version bump can quietly degrade quality. Sample 1% of sessions, store the audio and transcript, and review weekly. Without this you'll only hear about regressions from angry users.
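The sentence-buffer idea from the list can be sketched as a generator that sits between the LLM token stream and the TTS call. The function name `sentences_from_tokens` is hypothetical. One deliberate deviation, labeled as an assumption: this sketch flushes when terminal punctuation is followed by whitespace, not the instant the punctuation arrives, so that a decimal such as `3.14` split across tokens is not cut in half.

```python
import re
from typing import Iterable, Iterator

# A sentence boundary: ., ?, or ! followed by at least one whitespace character.
_SENTENCE_END = re.compile(r"(?<=[.?!])\s+")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Accumulate streamed tokens and yield each sentence once it is complete."""
    buffer = ""
    for token in tokens:
        buffer += token
        parts = _SENTENCE_END.split(buffer)
        # Every part except the last is a finished sentence.
        for sentence in parts[:-1]:
            if sentence.strip():
                yield sentence.strip()
        buffer = parts[-1]
    # Flush whatever remains when the LLM stream ends.
    if buffer.strip():
        yield buffer.strip()


streamed = ["Hel", "lo the", "re. How", " are you", "? I am", " fine"]
sentences = list(sentences_from_tokens(streamed))
```

Each yielded sentence can go to TTS immediately while the LLM keeps generating. Abbreviations such as "Dr. Lee" will still trigger an early flush in this sketch; normalizing abbreviations first, or keeping a small exception list, are possible refinements.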
Frequently asked questions
Why pick a cascade over a speech-to-speech model like Moshi or OpenAI Realtime?
Three reasons. First, control: a cascade lets you intercept the text between STT and TTS for tool calls, RAG, content filtering, or LLM routing, which audio-native models still struggle with. Second, cost: at scale, cascading Whisper + GPT-4o-mini + Kokoro can be 5-10x cheaper per minute than the Realtime API. Third, fit: for non-conversational use cases (transcription, audiobook generation, podcast post-production) there's no realtime dialogue to preserve. The Voice AI Stack pack covers the speech-to-speech case; this pack covers everything else.
Which Whisper variant should I actually use?
Start with Faster Whisper if you have a GPU — same accuracy as canonical Whisper at 4x throughput, lower VRAM. Start with whisper.cpp if you're on CPU, Apple Silicon, or edge hardware — it's the only practical option there. Use WhisperX when you need speaker diarization or word-level timestamps (meetings, podcasts, captioning). Use canonical OpenAI Whisper only when you need the reference implementation for paper reproductions or when CTranslate2 doesn't support a model you care about.
How much does a self-hosted TTS actually cost compared to ElevenLabs?
Rough back-of-envelope: a Coqui or StyleTTS 2 setup on a single A10G ($0.75/hr on AWS) can serve roughly 200 minutes of audio per GPU-hour at decent quality. That's about $0.004 per minute. ElevenLabs at the Creator tier is closer to $0.03 per minute equivalent. The break-even is around 25-50 hours of audio per day; under that, ElevenLabs is operationally cheaper because you skip the inference infra. Above that, self-host wins. Kokoro shifts the math further, since it runs on CPU at usable speed.
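The arithmetic behind this answer can be made explicit. The sketch below uses only the prices quoted in the answer ($0.75 per GPU-hour, $0.004 and $0.03 per audio minute). The function names are hypothetical, and the always-on single-GPU model is a simplifying assumption that ignores engineering time, redundancy, and idle capacity.

```python
GPU_COST_PER_HOUR = 0.75    # A10G hourly price quoted above (USD)
SELF_HOST_PER_MIN = 0.004   # self-hosted cost per audio minute quoted above
HOSTED_PER_MIN = 0.03       # ElevenLabs Creator-tier equivalent quoted above


def implied_audio_minutes_per_gpu_hour() -> float:
    """Throughput needed for the quoted per-minute self-hosting cost."""
    return GPU_COST_PER_HOUR / SELF_HOST_PER_MIN


def daily_cost_hosted(audio_hours_per_day: float) -> float:
    """Daily spend when every audio minute is billed at the hosted rate."""
    return audio_hours_per_day * 60 * HOSTED_PER_MIN


def daily_cost_always_on_gpu(gpus: int = 1) -> float:
    """Daily spend for GPUs kept warm around the clock, regardless of load."""
    return gpus * GPU_COST_PER_HOUR * 24


def raw_break_even_hours_per_day() -> float:
    """Audio hours per day at which one always-on GPU matches the hosted bill."""
    return daily_cost_always_on_gpu() / (60 * HOSTED_PER_MIN)


print(f"{implied_audio_minutes_per_gpu_hour():.1f} audio minutes per GPU-hour")
print(f"${daily_cost_always_on_gpu():.2f} per day for one always-on GPU")
print(f"{raw_break_even_hours_per_day():.1f} audio hours per day to break even on GPU cost alone")
```

On GPU cost alone the break-even lands below the 25-50 hours per day quoted above. One reading, offered as an assumption, is that the quoted range also prices in the operational overhead of running inference infrastructure.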
Does this work with Claude Code, Cursor, Codex CLI?
Yes. Every entry in this pack is installed as a TokRepo skill, which means it drops a .md skill file plus example Python into your repo for whichever agent CLI you're using. The agent then has full context — API key handling, streaming code, sample-rate conversion, the lot — and can wire it into your app. The Codex CLI and Cursor entries in TokRepo each have voice-agent examples that compose several of these picks.
Can I evaluate TTS and STT quality automatically?
Yes, but the metrics matter. For STT: word error rate (WER) against a held-out transcript set is standard; use jiwer for the math. For TTS: there's no single number — MOS (mean opinion score) needs humans, but UTMOS and NISQA give automated estimates. The realistic eval loop: keep a 50-clip golden set, run WER for STT changes, run a small human MOS panel for TTS changes (5 reviewers, 30 minutes). Don't ship without it — TTS and STT updates regress in directions metrics don't catch.
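For the STT half of that loop, word error rate is the word-level edit distance divided by the number of reference words. The dependency-free sketch below shows the computation that `jiwer` packages up; the function name `word_error_rate` is hypothetical, and lowercasing plus whitespace splitting is this sketch's own simplification of text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # Levenshtein distance over words, keeping one row of the table at a time.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            current.append(min(previous[j] + 1,     # deletion of a reference word
                               current[j - 1] + 1,  # insertion of a hypothesis word
                               substitution))
        previous = current
    return previous[-1] / len(ref)


# One substitution (sat -> sit) and one deletion (the) over six reference words.
score = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

Run this over every clip in the golden set before and after an STT change and compare the averages. With `jiwer` installed, `jiwer.wer(reference, hypothesis)` is the equivalent call, though its default text handling may differ from the lowercasing used here.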