[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"pack-detail-tts-stt-voice-stack-es":3,"seo:pack:tts-stt-voice-stack:es":98},{"code":4,"message":5,"data":6},200,"Operation successful",{"pack":7},{"slug":8,"icon":9,"tone":10,"status":11,"status_label":12,"title":13,"description":14,"items":15,"install_cmd":97},"tts-stt-voice-stack","🎙️","#F97316","new","New · this week","TTS + STT Pack","Ten picks for the dev building voicebots, transcription pipelines, or audiobook narrators: Whisper variants (whisper.cpp \u002F Faster Whisper \u002F WhisperX) for STT, ElevenLabs \u002F Coqui \u002F Bark \u002F StyleTTS 2 \u002F Kokoro for TTS, plus OpenVoice for voice cloning. Complements voice-ai-stack: the components live here, the realtime substrate there.",[16,28,36,43,50,60,68,75,82,90],{"id":17,"uuid":18,"slug":19,"title":20,"description":21,"author_name":22,"view_count":23,"vote_count":24,"lang_type":25,"type":26,"type_label":27},105,"eb0f9dd6-2172-4c9f-aca9-97846b0f4d86","whisper-openai-speech-text-eb0f9dd6","Whisper — OpenAI Speech-to-Text","OpenAI's open-source speech recognition model. Transcribe audio\u002Fvideo to text with word-level timestamps in 99 languages. Essential for subtitle generation.","OpenAI",221,0,"en","skill","Skill",{"id":29,"uuid":30,"slug":31,"title":32,"description":33,"author_name":34,"view_count":35,"vote_count":24,"lang_type":25,"type":26,"type_label":27},390,"e1fd7c46-bbda-4956-8649-9c3ed579ff25","whisper-cpp-local-speech-text-pure-c-c-e1fd7c46","whisper.cpp — Local Speech-to-Text in Pure C\u002FC++","High-performance port of OpenAI Whisper in C\u002FC++. No Python, no GPU required. Runs on CPU, Apple Silicon, CUDA, and even Raspberry Pi. 
Real-time transcription.","Script Depot",1602,{"id":37,"uuid":38,"slug":39,"title":40,"description":41,"author_name":34,"view_count":42,"vote_count":24,"lang_type":25,"type":26,"type_label":27},270,"24576b2c-a9d1-4f7a-9696-b1e5c50a17f3","faster-whisper-4x-faster-speech-text-24576b2c","Faster Whisper — 4x Faster Speech-to-Text","Faster Whisper is a reimplementation of OpenAI Whisper using CTranslate2, up to 4x faster with less memory. 21.8K+ GitHub stars. GPU\u002FCPU, 8-bit quantization, word timestamps, VAD. MIT licensed.",202,{"id":44,"uuid":45,"slug":46,"title":47,"description":48,"author_name":34,"view_count":49,"vote_count":24,"lang_type":25,"type":26,"type_label":27},287,"c43ad870-8c99-471a-898e-b07140faf532","whisperx-70x-faster-speech-recognition-c43ad870","WhisperX — 70x Faster Speech Recognition","WhisperX provides 70x realtime speech recognition with word-level timestamps and speaker diarization. 21K+ GitHub stars. Batched inference, under 8GB VRAM. BSD-2-Clause.",237,{"id":51,"uuid":52,"slug":53,"title":54,"description":55,"author_name":56,"view_count":57,"vote_count":24,"lang_type":25,"type":58,"type_label":59},106,"16d32da9-c5fb-43ae-b881-8444b2dcd35b","elevenlabs-python-sdk-ai-text-speech-16d32da9","ElevenLabs Python SDK — AI Text-to-Speech","Official ElevenLabs Python SDK for AI voice generation. Create realistic voiceovers with 30+ languages, voice cloning, and streaming support.","ElevenLabs",194,"script","Script",{"id":61,"uuid":62,"slug":63,"title":64,"description":65,"author_name":66,"view_count":67,"vote_count":24,"lang_type":25,"type":58,"type_label":59},423,"a059dce2-6275-4ea0-a57b-e885248d8e95","coqui-tts-deep-learning-text-speech-engine-a059dce2","Coqui TTS — Deep Learning Text-to-Speech Engine","Generate speech in 1100+ languages with voice cloning. XTTS v2 streams with under 200ms latency. 
44K+ GitHub stars.","TokRepo精选",286,{"id":69,"uuid":70,"slug":71,"title":72,"description":73,"author_name":34,"view_count":74,"vote_count":24,"lang_type":25,"type":26,"type_label":27},279,"814b8972-5d48-4379-9756-9a3d8ed686f7","bark-ai-text-audio-music-effects-814b8972","Bark — AI Text-to-Audio with Music & Effects","Bark is a transformer text-to-audio model by Suno that generates speech, music, and sound effects. 39.1K+ GitHub stars. 12+ languages, 100+ voice presets, non-speech audio. MIT licensed.",201,{"id":76,"uuid":77,"slug":78,"title":79,"description":80,"author_name":34,"view_count":81,"vote_count":24,"lang_type":25,"type":26,"type_label":27},2462,"e7a8aaaf-453a-11f1-9bc6-00163e2b0d79","styletts-2-human-level-text-speech-via-style-diffusion-e7a8aaaf","StyleTTS 2 — Human-Level Text-to-Speech via Style Diffusion","A TTS system that achieves human-level speech synthesis through style diffusion and adversarial training with large speech language models. Fast inference with natural prosody.",108,{"id":83,"uuid":84,"slug":85,"title":86,"description":87,"author_name":88,"view_count":89,"vote_count":24,"lang_type":25,"type":26,"type_label":27},2265,"ae7169ee-42b9-11f1-9bc6-00163e2b0d79","openvoice-instant-voice-cloning-tone-style-control-ae7169ee","OpenVoice — Instant Voice Cloning with Tone and Style Control","OpenVoice is an open-source voice cloning framework from MyShell AI that reproduces a speaker's voice from a short audio sample while giving independent control over emotion, accent, rhythm, and language.","AI Open Source",90,{"id":91,"uuid":92,"slug":93,"title":94,"description":95,"author_name":34,"view_count":96,"vote_count":24,"lang_type":25,"type":26,"type_label":27},275,"44809dfb-1735-4aae-af74-f21f4b805d0f","kokoro-lightweight-82m-tts-9-languages-44809dfb","Kokoro — Lightweight 82M TTS in 9 Languages","Kokoro is an 82M parameter text-to-speech model delivering quality comparable to larger models. 6.2K+ GitHub stars. 
Supports English, Spanish, French, Japanese, Chinese, and more. Apache 2.0.",208,"tokrepo install pack\u002Ftts-stt-voice-stack",{"pageType":99,"pageKey":8,"locale":25,"title":100,"metaDescription":101,"h1":102,"tldr":103,"bodyMarkdown":104,"faq":105,"schema":121,"internalLinks":130,"citations":143,"wordCount":156,"generatedAt":157},"pack","TTS + STT Voice Stack — Whisper, ElevenLabs, Coqui, Bark, StyleTTS","Ten picks for shipping voicebots, transcription pipelines, and audiobook narrators. Whisper variants for STT, ElevenLabs\u002FCoqui\u002FBark\u002FStyleTTS 2\u002FKokoro for TTS, OpenVoice for cloning. Install via TokRepo.","TTS + STT Voice Stack","Ten components for the cascade architecture — pick a Whisper variant for STT, an LLM in the middle, and a TTS engine sized to your latency budget. Bark for expressive audio, Kokoro for laptop-CPU narrators, ElevenLabs when quality wins, Coqui\u002FStyleTTS for self-hosted, OpenVoice for cloning.","## What's in this pack\n\nThis is the **components catalog** for voice apps. Where the [Voice AI Stack pack](\u002Fen\u002Fpacks\u002Fvoice-ai-stack) gives you the realtime substrate (LiveKit, Moshi, OpenAI Realtime, Zonos) for speech-to-speech agents, this pack gives you the discrete STT and TTS engines you compose into the classic **cascade architecture**: `microphone → STT → LLM → TTS → speaker`.\n\nThe cascade is not dead. It's the right call when you need:\n\n- **Precise control over the LLM step** — tool calls, structured output, RAG retrieval, anything where you need to inspect or transform the text.\n- **Cost-sensitive workloads at scale** — speech-to-speech models are still 3-5x more expensive per minute than a well-tuned cascade.\n- **Non-realtime use cases** — transcription pipelines, audiobook generation, podcast post-production, voiceovers for video. 
Latency is not the constraint.\n- **Self-hosted or air-gapped deployments** — every layer here (STT, TTS, cloning) has an open-source option you can run on your own GPU or even CPU.\n\nTen picks, grouped by layer:\n\n| Layer | Pick | When to reach for it |\n|---|---|---|\n| STT — canonical | Whisper | The reference. Batch transcription, multilingual, well-known accuracy. |\n| STT — local | whisper.cpp | Pure C\u002FC++ port. CPU, Apple Silicon, no Python. Mobile and edge. |\n| STT — fast | Faster Whisper | Up to 4x speedup via CTranslate2. Same accuracy, much less GPU time. |\n| STT — diarized | WhisperX | Up to 70x realtime with batched inference, plus word-level timestamps and speaker diarization. Meetings, podcasts. |\n| TTS — commercial | ElevenLabs Python SDK | Highest perceived quality, streaming, voice cloning. Pay per character. |\n| TTS — open framework | Coqui TTS | Deep-learning TTS engine with multiple model architectures. Self-host. |\n| TTS — expressive | Bark | Suno's transformer model. Music, sound effects, non-speech audio. MIT. |\n| TTS — human-level | StyleTTS 2 | Style diffusion for naturalness that rivals proprietary engines. |\n| TTS — lightweight | Kokoro | 82M parameters, 9 languages, runs comfortably on a laptop CPU. |\n| Cloning | OpenVoice | Instant voice cloning with separate tone and style control. |\n\n## Install in this order\n\n```bash\n# 1. Pick your STT first — it sets your latency floor\ntokrepo install whisper-cpp           # local, CPU\n# or\ntokrepo install faster-whisper        # GPU, batch + streaming\n# or\ntokrepo install whisperx              # transcription with diarization\n\n# 2. Add the TTS engine matched to your quality bar\ntokrepo install elevenlabs-python-sdk # ship quality, pay per char\n# or\ntokrepo install coqui-tts             # self-host, decent quality\n# or\ntokrepo install kokoro                # lightweight, runs anywhere\n\n# 3. 
Optional — voice cloning for branded narrators\ntokrepo install openvoice\n```\n\nThe TokRepo CLI drops a skill into your repo per asset. For Claude Code, Cursor, or Codex CLI the skills include working Python snippets and dependency lists; you wire them into your own app loop.\n\n## How the cascade actually fits together\n\n```\n[ Microphone \u002F audio file ]\n        │\n        ▼\n[ STT — Whisper variant ]\n        │  text + word timestamps\n        ▼\n[ LLM — your choice ]\n        │  reply text + tool calls\n        ▼\n[ Text normalizer ]\n        │  numbers, dates, emoji stripped\n        ▼\n[ TTS — ElevenLabs \u002F Coqui \u002F Bark \u002F Kokoro ]\n        │  streaming audio frames\n        ▼\n[ Speaker \u002F output file ]\n```\n\nA few things every shipping cascade does right:\n\n1. **Stream both ends.** STT emits partial hypotheses every ~200ms; TTS emits audio after the first ~100ms of LLM output. Wire the LLM to stream tokens. End-to-end perceived latency drops from \"send-then-wait\" to \"trickle\".\n2. **Normalize before TTS.** `$1,234.56` reads as one-comma-two-three-four-point-five-six on most engines. A 20-line normalizer for currency, dates, abbreviations, and URLs is worth a week of \"why does my agent sound dumb\".\n3. **Cache the boot.** Whisper-large takes ~3 seconds to load weights cold. Keep the model warm in a long-lived process; the first transcription should not pay this cost.\n\n## Tradeoffs you'll hit\n\n- **Whisper-large vs medium vs tiny.** Tiny runs on a Raspberry Pi; large needs a GPU. Most production teams settle on medium plus VAD-aware chunking — it's the accuracy\u002Fcost knee. Faster Whisper makes large affordable; whisper.cpp makes tiny\u002Fbase usable on CPU.\n- **ElevenLabs vs open-source TTS.** ElevenLabs sounds noticeably better but costs $30-330\u002Fmonth plus per-character overages. Coqui + StyleTTS 2 reach \"good enough for production\" but require GPU. 
The cutoff: under 100k chars\u002Fday, run ElevenLabs; above, self-host.\n- **Bark vs Kokoro vs StyleTTS.** Bark is *expressive* (laughs, music, effects) but slow and not always controllable. Kokoro is *fast and tiny* but neutral-sounding. StyleTTS 2 is *human-level natural* but needs the most VRAM. Match the engine to the artifact — Bark for game NPCs, Kokoro for IVR, StyleTTS for audiobooks.\n- **Voice cloning ethics.** OpenVoice and ElevenLabs both support consent-based cloning. Always require explicit opt-in and log the consent. Unconsented cloning is the one easy way to lose a deal or a lawsuit.\n\n## Common pitfalls\n\n- **No VAD on the STT input.** Sending continuous silence to Whisper produces hallucinated transcripts (\"Thank you for watching!\" is the famous one). Run a 30-line `webrtcvad` or `silero-vad` filter before Whisper. This single change kills the most common cascade bug.\n- **Sending the whole LLM reply to TTS at once.** You're paying for the full LLM latency *and* the full TTS latency sequentially. Stream the LLM tokens into a sentence-buffer; flush a sentence to TTS the moment a `. `, `? `, or `! ` arrives.\n- **Ignoring sample rate mismatches.** Whisper expects 16kHz mono. TTS engines output 22.05\u002F24\u002F48 kHz. Resample at the boundaries; mismatched rates produce chipmunk or sub-bass artifacts that QA will blame on the model.\n- **Treating WhisperX as a drop-in for Whisper.** WhisperX needs `pyannote` for diarization, which means a Hugging Face token and a license agreement. Plan the auth before you depend on it in production.\n- **Forgetting to log audio + transcript pairs.** Voice apps regress silently — a TTS update or STT version bump can quietly degrade quality. Sample 1% of sessions, store the audio and transcript, and review weekly. 
Without this you'll only hear about regressions from angry users.",[106,109,112,115,118],{"q":107,"a":108},"Why pick a cascade over a speech-to-speech model like Moshi or OpenAI Realtime?","Three reasons. First, control — a cascade lets you intercept the text between STT and TTS for tool calls, RAG, content filtering, or LLM routing, which audio-native models still struggle with. Second, cost — at scale, cascading Whisper + GPT-4o-mini + Kokoro can be 5-10x cheaper per minute than the Realtime API. Third, fit — for non-conversational use cases (transcription, audiobook generation, podcast post-production) there's no realtime dialogue to preserve. The [Voice AI Stack pack](\u002Fen\u002Fpacks\u002Fvoice-ai-stack) covers the speech-to-speech case; this pack covers everything else.",{"q":110,"a":111},"Which Whisper variant should I actually use?","Start with Faster Whisper if you have a GPU — same accuracy as canonical Whisper at up to 4x throughput, lower VRAM. Start with whisper.cpp if you're on CPU, Apple Silicon, or edge hardware — it's the most practical option there. Use WhisperX when you need speaker diarization or word-level timestamps (meetings, podcasts, captioning). Use canonical OpenAI Whisper only when you need the reference implementation for paper reproductions or when CTranslate2 doesn't support a model you care about.",{"q":113,"a":114},"How much does a self-hosted TTS actually cost compared to ElevenLabs?","Rough back-of-envelope: a Coqui or StyleTTS 2 setup on a single A10G ($0.75\u002Fhr on AWS) can serve roughly 200 minutes of audio per GPU-hour at decent quality. That's $0.75 \u002F 200, or about $0.004 per minute of audio. ElevenLabs at the Creator tier is closer to $0.03 per minute equivalent. The break-even is around 25-50 hours of audio per day; under that, ElevenLabs is operationally cheaper because you skip the inference infra. Above that, self-host wins. 
Kokoro shifts the math further — it runs on CPU at usable speed.",{"q":116,"a":117},"Does this work with Claude Code, Cursor, Codex CLI?","Yes. Every entry in this pack is installed as a TokRepo skill, which means it drops a `.md` skill file plus example Python into your repo for whichever agent CLI you're using. The agent then has full context — API key handling, streaming code, sample-rate conversion, the lot — and can wire it into your app. The Codex CLI and Cursor entries in TokRepo each have voice-agent examples that compose several of these picks.",{"q":119,"a":120},"Can I evaluate TTS and STT quality automatically?","Yes, but the metrics matter. For STT: word error rate (WER) against a held-out transcript set is standard; use `jiwer` for the math. For TTS: there's no single number — MOS (mean opinion score) needs humans, but UTMOS and NISQA give automated estimates. The realistic eval loop: keep a 50-clip golden set, run WER for STT changes, run a small human MOS panel for TTS changes (5 reviewers, 30 minutes). 
Don't ship without it — TTS and STT updates regress in directions metrics don't catch.",{"@context":122,"@type":123,"name":102,"description":124,"numberOfItems":125,"inLanguage":25,"publisher":126},"https:\u002F\u002Fschema.org","CollectionPage","Ten TTS and STT components for cascade voice apps — Whisper variants, ElevenLabs, Coqui, Bark, StyleTTS 2, Kokoro, OpenVoice.",10,{"@type":127,"name":128,"url":129},"Organization","TokRepo","https:\u002F\u002Ftokrepo.com",[131,135,139],{"url":132,"anchor":133,"reason":134},"\u002Fen\u002Fpacks\u002Fvoice-ai-stack","Voice AI Stack","the realtime substrate (LiveKit, Moshi, OpenAI Realtime) — pair this pack's components with that one's runtime",{"url":136,"anchor":137,"reason":138},"\u002Fen\u002Fpacks\u002Fml-engineer-rag-eval","ML Engineer RAG + Eval pack","eval methodology carries over from RAG to STT\u002FTTS quality tracking",{"url":140,"anchor":141,"reason":142},"\u002Fen\u002Fpacks\u002Fcontent-creator-ai-studio","Content Creator AI Studio","TTS picks here power the voiceover pipelines in that pack",[144,148,152],{"claim":145,"source_name":146,"source_url":147},"Whisper is OpenAI's open-source speech recognition model","openai\u002Fwhisper","https:\u002F\u002Fgithub.com\u002Fopenai\u002Fwhisper",{"claim":149,"source_name":150,"source_url":151},"whisper.cpp ports Whisper inference to pure C\u002FC++ with CPU and Apple Silicon support","ggerganov\u002Fwhisper.cpp","https:\u002F\u002Fgithub.com\u002Fggerganov\u002Fwhisper.cpp",{"claim":153,"source_name":154,"source_url":155},"Bark is Suno's transformer text-to-audio model supporting speech, music, and sound effects","suno-ai\u002Fbark","https:\u002F\u002Fgithub.com\u002Fsuno-ai\u002Fbark",920,"2026-05-22T10:00:00Z"]