TOKREPO · AI Arsenal
New · this week

Pack TTS + STT

Ten picks for the developer building voicebots, transcription pipelines, or audiobook narrators: Whisper variants (whisper.cpp / Faster Whisper / WhisperX) for STT, ElevenLabs / Coqui / Bark / StyleTTS 2 / Kokoro for TTS, plus OpenVoice for voice cloning. Complements voice-ai-stack: the components live here, the realtime substrate lives there.

10 resources

What's in this pack

This is the components catalog for voice apps. Where the Voice AI Stack pack gives you the realtime substrate (LiveKit, Moshi, OpenAI Realtime, Zonos) for speech-to-speech agents, this pack gives you the discrete STT and TTS engines you compose into the classic cascade architecture: microphone → STT → LLM → TTS → speaker.

The cascade is not dead. It's the right call when you need:

  • Precise control over the LLM step — tool calls, structured output, RAG retrieval, anything where you need to inspect or transform the text.
  • Cost-sensitive workloads at scale — speech-to-speech models are still 3-5x more expensive per minute than a well-tuned cascade.
  • Non-realtime use cases — transcription pipelines, audiobook generation, podcast post-production, voiceovers for video. Latency is not the constraint.
  • Self-hosted or air-gapped deployments — every component here has an open-source option you can run on your own GPU or even CPU.

Ten picks, grouped by layer:

Layer | Pick | When to reach for it
STT: canonical | Whisper | The reference. Batch transcription, multilingual, well-known accuracy.
STT: local | whisper.cpp | Pure C/C++ port. CPU, Apple Silicon, no Python. Mobile and edge.
STT: fast | Faster Whisper | Up to 4x faster via CTranslate2. Same accuracy, much less GPU time.
STT: diarized | WhisperX | Up to 70x realtime, plus word-level timestamps and speaker diarization. Meetings, podcasts.
TTS: commercial | ElevenLabs Python SDK | Highest perceived quality, streaming, voice cloning. Pay per character.
TTS: open framework | Coqui TTS | Deep-learning TTS engine with multiple model architectures. Self-host.
TTS: expressive | Bark | Suno's transformer model. Music, sound effects, non-speech audio. MIT.
TTS: human-level | StyleTTS 2 | Style diffusion for naturalness that rivals proprietary engines.
TTS: lightweight | Kokoro | 82M parameters, 9 languages, runs comfortably on a laptop CPU.
Cloning | OpenVoice | Instant voice cloning with separate tone and style control.

Install in this order

# 1. Pick your STT first — it sets your latency floor
tokrepo install whisper-cpp           # local, CPU
# or
tokrepo install faster-whisper        # GPU, batch + streaming
# or
tokrepo install whisperx              # transcription with diarization

# 2. Add the TTS engine matched to your quality bar
tokrepo install elevenlabs-python-sdk # ship quality, pay per char
# or
tokrepo install coqui-tts             # self-host, decent quality
# or
tokrepo install kokoro                # lightweight, runs anywhere

# 3. Optional — voice cloning for branded narrators
tokrepo install openvoice

The TokRepo CLI drops a skill into your repo per asset. For Claude Code, Cursor, or Codex CLI the skills include working Python snippets and dependency lists; you wire them into your own app loop.

How the cascade actually fits together

[ Microphone / audio file ]
        │
        ▼
[ STT — Whisper variant ]
        │  text + word timestamps
        ▼
[ LLM — your choice ]
        │  reply text + tool calls
        ▼
[ Text normalizer ]
        │  numbers, dates, emoji stripped
        ▼
[ TTS — ElevenLabs / Coqui / Bark / Kokoro ]
        │  streaming audio frames
        ▼
[ Speaker / output file ]

A few things every shipping cascade does right:

  1. Stream both ends. STT emits partial hypotheses every ~200ms; TTS emits audio after the first ~100ms of LLM output. Wire the LLM to stream tokens. End-to-end perceived latency drops from "send-then-wait" to "trickle".
  2. Normalize before TTS. $1,234.56 reads as one-comma-two-three-four-point-five-six on most engines. A 20-line normalizer for currency, dates, abbreviations, and URLs is worth a week of "why does my agent sound dumb".
  3. Cache the boot. Whisper-large takes ~3 seconds to load weights cold. Keep the model warm in a long-lived process; the first transcription should not pay this cost.
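The normalizer from point 2 can be sketched in plain Python. This is a minimal illustration, not the normalizer any of these engines ship: the function names (`number_to_words`, `normalize_for_tts`), the abbreviation table, and the choice to handle only US-dollar amounts below one million are all assumptions made for the example. Dates and URLs would need their own rules.

```python
import re

_ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
         "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
         "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
_TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
         "eighty", "ninety"]


def number_to_words(n: int) -> str:
    """Spell out an integer in the range 0..999,999 (hypothetical helper)."""
    if not 0 <= n < 1_000_000:
        raise ValueError("only 0..999,999 is supported in this sketch")
    if n < 20:
        return _ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return _TENS[tens] + ("-" + _ONES[ones] if ones else "")
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        tail = " " + number_to_words(rest) if rest else ""
        return _ONES[hundreds] + " hundred" + tail
    thousands, rest = divmod(n, 1000)
    tail = " " + number_to_words(rest) if rest else ""
    return number_to_words(thousands) + " thousand" + tail


# "$" followed by digits (with optional thousands separators) and optional cents.
_CURRENCY = re.compile(r"\$(\d+(?:,\d{3})*)(?:\.(\d{2})\b)?")

# Illustrative abbreviation table; extend it for your own domain.
_ABBREVIATIONS = {"e.g.": "for example", "approx.": "approximately"}


def _speak_dollars(match: re.Match) -> str:
    dollars = int(match.group(1).replace(",", ""))
    cents = int(match.group(2)) if match.group(2) else 0
    if dollars >= 1_000_000:
        return match.group(0)  # out of range for this sketch: leave untouched
    parts = [number_to_words(dollars) + (" dollar" if dollars == 1 else " dollars")]
    if cents:
        parts.append(number_to_words(cents) + (" cent" if cents == 1 else " cents"))
    return " and ".join(parts)


def normalize_for_tts(text: str) -> str:
    """Rewrite currency amounts and a few abbreviations into speakable words."""
    text = _CURRENCY.sub(_speak_dollars, text)
    for abbreviation, spoken in _ABBREVIATIONS.items():
        text = text.replace(abbreviation, spoken)
    return text


print(normalize_for_tts("Your total is $1,234.56."))
```

Run the normalizer on the LLM reply before it reaches the TTS engine, so the engine only ever sees words it can pronounce.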

Tradeoffs you'll hit

  • Whisper-large vs medium vs tiny. Tiny runs on a Raspberry Pi; large realistically needs a GPU. A common production choice is medium plus VAD-aware chunking, which sits at the knee of the accuracy/cost curve. Faster Whisper makes large affordable; whisper.cpp makes tiny and base usable on CPU.
  • ElevenLabs vs open-source TTS. ElevenLabs sounds noticeably better but costs $30-330/month plus per-character overages. Coqui + StyleTTS 2 reach "good enough for production" but require GPU. The cutoff: under 100k chars/day, run ElevenLabs; above, self-host.
  • Bark vs Kokoro vs StyleTTS. Bark is expressive (laughs, music, effects) but slow and not always controllable. Kokoro is fast and tiny but neutral-sounding. StyleTTS 2 is human-level natural but needs the most VRAM. Match the engine to the artifact — Bark for game NPCs, Kokoro for IVR, StyleTTS for audiobooks.
  • Voice cloning ethics. OpenVoice and ElevenLabs both support consent-based cloning. Always require explicit opt-in and log the consent. Unconsented cloning is the one easy way to lose a deal or a lawsuit.
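The "VAD-aware chunking" in the Whisper sizing tradeoff can be illustrated with a plain energy gate. This is a deliberately simplified stand-in, not webrtcvad or silero-vad: it only measures loudness, so it will treat loud background noise as speech. The function names, the 30 ms frame size, the RMS threshold of 500, and the hangover length are all assumptions chosen for the example, and the input is assumed to be 16-bit mono PCM samples as Python integers.

```python
import math
from typing import List, Sequence, Tuple


def frame_rms(frame: Sequence[int]) -> float:
    """Root-mean-square amplitude of one frame of PCM samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def speech_chunks(
    samples: Sequence[int],
    sample_rate: int = 16000,
    frame_ms: int = 30,
    threshold: float = 500.0,   # assumed loudness cutoff; tune per microphone
    hangover_frames: int = 10,  # silent frames tolerated inside one chunk
) -> List[Tuple[int, int]]:
    """Return (start, end) sample offsets of regions loud enough to be speech."""
    frame_len = sample_rate * frame_ms // 1000
    chunks: List[Tuple[int, int]] = []
    start = None          # offset where the current chunk began
    last_speech_end = 0   # end offset of the most recent loud frame
    silent_run = 0        # consecutive quiet frames since the last loud one

    for offset in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[offset:offset + frame_len]
        if frame_rms(frame) >= threshold:
            if start is None:
                start = offset
            last_speech_end = offset + frame_len
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run > hangover_frames:
                # Enough silence: close the chunk at the last loud frame.
                chunks.append((start, last_speech_end))
                start = None
                silent_run = 0

    if start is not None:
        chunks.append((start, last_speech_end))
    return chunks


# 0.9 s of silence, 0.9 s of a loud square wave, 0.9 s of silence.
audio = [0] * 14400 + [3000, -3000] * 7200 + [0] * 14400
print(speech_chunks(audio))
```

Only the returned regions would be handed to the STT engine; everything else is dropped before it can be transcribed.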

Common pitfalls

  • No VAD on the STT input. Sending continuous silence to Whisper produces hallucinated transcripts ("Thank you for watching!" is the famous one). Run a 30-line webrtcvad or silero-vad filter before Whisper. This single change kills the most common cascade bug.
  • Sending the whole LLM reply to TTS at once. You're paying for the full LLM latency and the full TTS latency sequentially. Stream the LLM tokens into a sentence buffer; flush a sentence to TTS the moment a ".", "?", or "!" arrives.
  • Ignoring sample rate mismatches. Whisper expects 16 kHz mono. TTS engines typically output 22.05, 24, or 48 kHz. Resample at the boundaries; audio played at the wrong rate comes out sped up and pitched up (chipmunk) or slowed down and pitched down, and QA will blame the artifacts on the model.
  • Treating WhisperX as a drop-in for Whisper. WhisperX needs pyannote for diarization, which means a Hugging Face token and a license agreement. Plan the auth before you depend on it in production.
  • Forgetting to log audio + transcript pairs. Voice apps regress silently — a TTS update or STT version bump can quietly degrade quality. Sample 1% of sessions, store the audio and transcript, and review weekly. Without this you'll only hear about regressions from angry users.
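The sentence buffer from the second pitfall can be sketched as a small generator. It is a minimal illustration under stated assumptions: `sentences_from_tokens` is a hypothetical name, the token stream is any iterable of strings, and the caller is expected to pass each yielded sentence to whatever TTS call it uses.

```python
from typing import Iterable, Iterator

SENTENCE_ENDINGS = (".", "?", "!")


def _first_ending(text: str) -> int:
    """Index of the first sentence-ending character, or -1 if there is none."""
    positions = [text.find(ch) for ch in SENTENCE_ENDINGS if ch in text]
    return min(positions) if positions else -1


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Accumulate streamed tokens and yield each sentence as soon as it closes."""
    buffer = ""
    for token in tokens:
        buffer += token
        while True:
            cut = _first_ending(buffer)
            if cut < 0:
                break
            sentence, buffer = buffer[:cut + 1].strip(), buffer[cut + 1:]
            # Skip fragments that are only punctuation, such as the tail of "...".
            if any(ch.isalnum() for ch in sentence):
                yield sentence
    tail = buffer.strip()
    if tail:
        yield tail  # flush whatever is left when the stream ends


stream = ["Hel", "lo the", "re. How", " are you", "? Fine"]
for sentence in sentences_from_tokens(stream):
    print(sentence)  # in a real app, send this sentence to the TTS engine
```

Splitting on every period also splits decimals such as "3.50", so this works best when the text normalizer runs on each sentence afterwards, or when the split rule is tightened to require whitespace after the punctuation.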
INSTALL · ONE COMMAND
$ tokrepo install pack/tts-stt-voice-stack
pass it to your agent, or paste it into your terminal
What's included

10 resources ready to install

Skill #01
Whisper — OpenAI Speech-to-Text

OpenAI's open-source speech recognition model. Transcribe audio/video to text with word-level timestamps in 99 languages. Essential for subtitle generation.

by OpenAI · 215 views
$ tokrepo install whisper-openai-speech-text-eb0f9dd6
Skill #02
whisper.cpp — Local Speech-to-Text in Pure C/C++

High-performance port of OpenAI Whisper in C/C++. No Python, no GPU required. Runs on CPU, Apple Silicon, CUDA, and even Raspberry Pi. Real-time transcription.

by Script Depot · 1601 views
$ tokrepo install whisper-cpp-local-speech-text-pure-c-c-e1fd7c46
Skill #03
Faster Whisper — 4x Faster Speech-to-Text

Faster Whisper is a reimplementation of OpenAI Whisper using CTranslate2, up to 4x faster with less memory. 21.8K+ GitHub stars. GPU/CPU, 8-bit quantization, word timestamps, VAD. MIT licensed.

by Script Depot · 202 views
$ tokrepo install faster-whisper-4x-faster-speech-text-24576b2c
Skill #04
WhisperX — 70x Faster Speech Recognition

WhisperX provides 70x realtime speech recognition with word-level timestamps and speaker diarization. 21K+ GitHub stars. Batched inference, under 8GB VRAM. BSD-2-Clause.

by Script Depot · 237 views
$ tokrepo install whisperx-70x-faster-speech-recognition-c43ad870
Script #05
ElevenLabs Python SDK — AI Text-to-Speech

Official ElevenLabs Python SDK for AI voice generation. Create realistic voiceovers with 30+ languages, voice cloning, and streaming support.

by ElevenLabs · 193 views
$ tokrepo install elevenlabs-python-sdk-ai-text-speech-16d32da9
Script #06
Coqui TTS — Deep Learning Text-to-Speech Engine

Generate speech in 1100+ languages with voice cloning. XTTS v2 streams with under 200ms latency. 44K+ GitHub stars.

by TokRepo精选 · 284 views
$ tokrepo install coqui-tts-deep-learning-text-speech-engine-a059dce2
Skill #07
Bark — AI Text-to-Audio with Music & Effects

Bark is a transformer text-to-audio model by Suno that generates speech, music, and sound effects. 39.1K+ GitHub stars. 12+ languages, 100+ voice presets, non-speech audio. MIT licensed.

by Script Depot · 201 views
$ tokrepo install bark-ai-text-audio-music-effects-814b8972
Skill #08
StyleTTS 2 — Human-Level Text-to-Speech via Style Diffusion

A TTS system that achieves human-level speech synthesis through style diffusion and adversarial training with large speech language models. Fast inference with natural prosody.

by Script Depot · 106 views
$ tokrepo install styletts-2-human-level-text-speech-via-style-diffusion-e7a8aaaf
Skill #09
OpenVoice — Instant Voice Cloning with Tone and Style Control

OpenVoice is an open-source voice cloning framework from MyShell AI that reproduces a speaker's voice from a short audio sample while giving independent control over emotion, accent, rhythm, and language.

by AI Open Source · 89 views
$ tokrepo install openvoice-instant-voice-cloning-tone-style-control-ae7169ee
Skill #10
Kokoro — Lightweight 82M TTS in 9 Languages

Kokoro is an 82M parameter text-to-speech model delivering quality comparable to larger models. 6.2K+ GitHub stars. Supports English, Spanish, French, Japanese, Chinese, and more. Apache 2.0.

by Script Depot · 208 views
$ tokrepo install kokoro-lightweight-82m-tts-9-languages-44809dfb
Frequently asked questions

Why pick a cascade over a speech-to-speech model like Moshi or OpenAI Realtime?

Three reasons. First, control: a cascade lets you intercept the text between STT and TTS for tool calls, RAG, content filtering, or LLM routing, which audio-native models still struggle with. Second, cost: at scale, cascading Whisper + GPT-4o-mini + Kokoro can be 5-10x cheaper per minute than the OpenAI Realtime API. Third, fit: for non-conversational use cases (transcription, audiobook generation, podcast post-production) there's no realtime dialogue to preserve. The Voice AI Stack pack covers the speech-to-speech case; this pack covers everything else.

Which Whisper variant should I actually use?

Start with Faster Whisper if you have a GPU: it matches canonical Whisper's accuracy at up to 4x the throughput, with lower VRAM. Start with whisper.cpp if you're on CPU, Apple Silicon, or edge hardware; it's usually the most practical option there. Use WhisperX when you need speaker diarization or alignment-refined word-level timestamps (meetings, podcasts, captioning). Use canonical OpenAI Whisper only when you need the reference implementation for paper reproductions or when CTranslate2 doesn't support a model you care about.

How much does a self-hosted TTS actually cost compared to ElevenLabs?

Rough back-of-envelope: a Coqui or StyleTTS 2 setup on a single A10G (about $0.75/hr on AWS) can serve roughly 200 minutes of audio per GPU-hour at decent quality. That's about $0.004 per minute ($0.75 / 200 min). ElevenLabs at the Creator tier is closer to $0.03 per minute equivalent. The break-even is around 25-50 hours of audio per day; under that, ElevenLabs is operationally cheaper because you skip the inference infra. Above that, self-host wins. Kokoro shifts the math further because it runs on CPU at usable speed.
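The per-minute figure can be reproduced in a few lines. This sketch assumes a throughput of about 200 minutes of audio per GPU-hour, which is what the quoted $0.004 per minute implies at $0.75/hr. The function names are hypothetical and the prices are the rough figures from the answer above, not current list prices. It covers raw compute only; engineering and on-call time, which push the practical break-even higher, are not modeled.

```python
GPU_DOLLARS_PER_HOUR = 0.75          # rough A10G price quoted above
AUDIO_MINUTES_PER_GPU_HOUR = 200.0   # assumed throughput (see lead-in)
ELEVENLABS_DOLLARS_PER_MINUTE = 0.03 # rough per-minute equivalent quoted above


def self_host_cost_per_minute(
    gpu_dollars_per_hour: float = GPU_DOLLARS_PER_HOUR,
    audio_minutes_per_gpu_hour: float = AUDIO_MINUTES_PER_GPU_HOUR,
) -> float:
    """Compute-only cost of one minute of generated audio on a busy GPU."""
    return gpu_dollars_per_hour / audio_minutes_per_gpu_hour


def daily_compute_cost(audio_hours_per_day: float, dollars_per_minute: float) -> float:
    """Cost of a day's audio at a flat per-minute rate."""
    return audio_hours_per_day * 60 * dollars_per_minute


self_hosted = self_host_cost_per_minute()
print(f"self-host:  ${self_hosted:.5f} per audio minute")
print(f"ElevenLabs: ${ELEVENLABS_DOLLARS_PER_MINUTE:.5f} per audio minute")
print(f"ratio:      {ELEVENLABS_DOLLARS_PER_MINUTE / self_hosted:.1f}x")

for hours in (5, 25, 50):
    hosted = daily_compute_cost(hours, self_hosted)
    vendor = daily_compute_cost(hours, ELEVENLABS_DOLLARS_PER_MINUTE)
    print(f"{hours:>3} h/day: self-host ${hosted:.2f} vs ElevenLabs ${vendor:.2f}")
```

Swap in your own GPU price and measured throughput; the per-minute figure is very sensitive to how many minutes of audio one GPU-hour actually produces.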

Does this work with Claude Code, Cursor, Codex CLI?

Yes. Every entry in this pack is installed as a TokRepo skill, which means it drops a .md skill file plus example Python into your repo for whichever agent CLI you're using. The agent then has full context — API key handling, streaming code, sample-rate conversion, the lot — and can wire it into your app. The Codex CLI and Cursor entries in TokRepo each have voice-agent examples that compose several of these picks.

Can I evaluate TTS and STT quality automatically?

Yes, but the metrics matter. For STT: word error rate (WER) against a held-out transcript set is standard; use jiwer for the math. For TTS: there's no single number — MOS (mean opinion score) needs humans, but UTMOS and NISQA give automated estimates. The realistic eval loop: keep a 50-clip golden set, run WER for STT changes, run a small human MOS panel for TTS changes (5 reviewers, 30 minutes). Don't ship without it — TTS and STT updates regress in directions metrics don't catch.
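To show what the WER number measures, here is a dependency-free sketch of the calculation: word-level edit distance divided by the reference length. It is an illustration of the math only, with a hypothetical `wer` function and simple lowercase-and-split tokenization. In a real eval loop, jiwer (recommended above) handles normalization and reporting.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # Standard dynamic-programming edit distance, kept to two rows of memory.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = previous[j] + 1       # reference word missing from hypothesis
            insertion = current[j - 1] + 1   # extra word in hypothesis
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


golden = "the cat sat on the mat"
print(wer(golden, "the cat sat on mat"))      # one deleted word out of six
print(wer(golden, "the cat sat on the mat"))  # identical transcripts
```

Averaging this over the golden set before and after an STT change gives the regression signal described above.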
