Voice AI Stack
Zonos, Moshi, OpenAI Realtime, LiveKit Agents — real-time voice agents and TTS that ship to production.
What's in this pack
Voice AI is the area where the gap between "demo on a laptop" and "shipping to users" is widest. Latency, turn-taking, interruptions, and barge-in all have to work — and by default, they don't. This pack collects the six assets that teams actually shipping voice products in 2026 are running.
| # | Asset | Layer | Why it's here |
|---|---|---|---|
| 1 | OpenAI Realtime API | speech-to-speech | hosted, sub-300ms turn latency, no STT/TTS plumbing |
| 2 | Moshi | speech-to-speech | open-source full-duplex; data stays local |
| 3 | Zonos | TTS | high-quality open-source TTS with voice cloning |
| 4 | LiveKit Agents | infra | WebRTC + agent orchestration, the production substrate |
| 5 | Voice agent patterns | design | turn-taking, barge-in, end-of-utterance detection |
| 6 | Latency budget worksheet | ops | end-to-end <800ms checklist by component |
Why this matters
A 1.5-second response feels broken in voice. A 600ms response feels human. The difference is architectural, not just compute — it comes from how you compose STT, the LLM, TTS, and the network layer.
Three architectural choices decide whether your voice agent feels alive:
- Speech-to-speech vs cascade. A traditional cascade (audio → STT → LLM → TTS → audio) has four sequential stages and typically lands at 1.2-2.0s per turn. Speech-to-speech models (OpenAI Realtime, Moshi) cut that to 200-400ms by skipping the text intermediate. Pick speech-to-speech for conversational use cases; pick a cascade only when you need precise control over the LLM step (e.g. complex multi-step tool orchestration, which audio-native models still handle less reliably).
- Streaming vs non-streaming TTS. Non-streaming TTS waits for the full text before generating any audio. Streaming TTS starts emitting audio as soon as the first tokens arrive, typically within ~100ms. For a 5-second response, that's 4-5 seconds of perceived-latency difference (see the sketch after this list). Zonos and most production TTS engines support streaming; use it.
- WebRTC vs WebSocket. WebRTC handles packet loss, jitter, and adaptive bitrate. WebSocket doesn't. On real cellular networks the difference between a working call and a stuttering call is which transport you picked. LiveKit Agents wraps the agent loop in proper WebRTC; this is non-negotiable for mobile.
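To make the streaming point concrete, here is a minimal asyncio sketch that measures time-to-first-audio. The `stream_tts` generator is a stand-in for a real streaming engine — Zonos, ElevenLabs, and similar expose comparable async chunk interfaces, but every name here is illustrative, not a real SDK call:

```python
import asyncio
import time

# Hypothetical streaming TTS generator (illustrative, not a real SDK).
async def stream_tts(text: str):
    for sentence in text.split(". "):
        await asyncio.sleep(0.1)   # simulated per-chunk synthesis time
        yield b"\x00" * 3200       # ~100ms of 16kHz 16-bit mono PCM

async def main():
    t0 = time.monotonic()
    first_audio_ms = None
    async for chunk in stream_tts("First sentence. Second one. Third one."):
        if first_audio_ms is None:
            first_audio_ms = (time.monotonic() - t0) * 1000
        # In production: write `chunk` to the WebRTC track immediately.
        # Buffering the full response before playback is what creates
        # the multi-second perceived-latency gap.
    print(f"time to first audio: {first_audio_ms:.0f} ms")

asyncio.run(main())
```

The user hears audio after the first chunk, not after the last one; that is the entire argument for streaming.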
Install in one command
```bash
# Install the entire pack
tokrepo install pack/voice-ai-stack

# Or pick the layer you need first
tokrepo install livekit-agents
tokrepo install moshi
tokrepo install zonos
```
The TokRepo CLI drops the agent scaffolding, room config, and SDK init code into your project. A LiveKit room hooked to OpenAI Realtime can be running locally in under 10 minutes from a clean checkout.
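For orientation, a minimal agent in the shape the scaffold produces might look like the following. It follows the LiveKit Agents Python quickstart pattern; exact module paths and class names vary across SDK versions, so treat it as a sketch rather than a drop-in file:

```python
from livekit import agents
from livekit.agents import Agent, AgentSession
from livekit.plugins import openai

async def entrypoint(ctx: agents.JobContext):
    await ctx.connect()  # join the LiveKit room this job was dispatched to
    session = AgentSession(
        # Speech-to-speech model: no separate STT/TTS plumbing needed
        llm=openai.realtime.RealtimeModel(voice="alloy"),
    )
    await session.start(
        room=ctx.room,
        agent=Agent(instructions="You are a concise, friendly voice assistant."),
    )

if __name__ == "__main__":
    agents.cli.run_app(agents.WorkerOptions(entrypoint_fnc=entrypoint))
```

Point it at a LiveKit server (cloud or self-hosted), set your OpenAI key, and the worker picks up rooms as they are created.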
Common pitfalls
- Building a cascade for a conversational use case. If users are chatting (not dictating commands), use speech-to-speech. The cascade architecture made sense in 2023; in 2026 it's a latency penalty without compensating benefits for chat.
- Skipping voice activity detection (VAD). Without VAD the agent either talks over the user (no end-of-utterance detection) or sits silent waiting for fixed timeouts. LiveKit Agents ships VAD wired in; use it.
- No barge-in handling. When the user starts speaking while the agent is talking, the agent must detect it within ~150ms and stop. Hard-coding "wait until the agent finishes" feels robotic. The engines in this pack all support proper barge-in, but it's disabled by default in some configurations.
- TTS prompts that don't match speech. "$1,234.56" reads aloud terribly. Pre-process numbers, dates, and abbreviations before sending text to TTS. The TokRepo voice-agent-patterns asset ships a normalizer (a minimal sketch follows this list).
- Forgetting to budget for first-turn latency. The first response in a session is always 200-400ms slower than steady state because model caches are still cold. Hide the gap with a "ready" sound or a connection animation.
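As a taste of what a TTS normalizer does, here is a minimal sketch. The regex, spoken-form mapping, and function name are ours for illustration — not the voice-agent-patterns asset's actual code:

```python
import re

def normalize_for_tts(text: str) -> str:
    # "$1,234.56" -> "1234 dollars and 56 cents"
    def money(m: re.Match) -> str:
        dollars = m.group(1).replace(",", "")
        cents = m.group(2)
        out = f"{dollars} dollars"
        if cents:
            out += f" and {cents} cents"
        return out

    text = re.sub(r"\$([\d,]+)(?:\.(\d{2}))?", money, text)
    # Abbreviations that most TTS engines mispronounce
    for abbr, spoken in {"Dr.": "Doctor", "e.g.": "for example",
                         "etc.": "et cetera"}.items():
        text = text.replace(abbr, spoken)
    return text

print(normalize_for_tts("Dr. Smith owes $1,234.56, e.g. for rent."))
# -> Doctor Smith owes 1234 dollars and 56 cents, for example for rent.
```

The shipped normalizer covers dates, units, and ordinals too; the principle is the same — expand anything written for the eye into something written for the ear.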
Common misconceptions
"Speech-to-speech can't do tool calls." Out of date — OpenAI Realtime supports function calling natively, and Moshi can be wrapped in a tool-routing agent. The 2024 limitation no longer holds.
"You need a GPU per concurrent call." For TTS-only at moderate quality, modern open-source TTS hits real-time on CPU. For speech-to-speech you need GPU for self-hosted Moshi, or you offload latency to OpenAI Realtime. LiveKit Agents handles connection multiplexing so one machine can broker many concurrent sessions even if the model lives elsewhere.
"Voice cloning is too risky to ship." Zonos and similar engines ship with consent-required watermarking flags. Used responsibly with explicit user consent (the user clones their own voice for accessibility, e.g.), it's a safe and high-value feature. The risk is unconsented cloning, which the engines themselves discourage.
Frequently asked questions
Is OpenAI Realtime free?
No — it's metered per audio minute (input and output), and pricing is several times higher than text-only API calls because audio tokens are denser. For prototyping the cost is negligible; for a deployed product handling thousands of minutes/day, do the math up front. Self-hosted Moshi has zero per-minute cost but requires GPU. Most teams run Realtime in production until volume justifies the GPU bill, then migrate to Moshi.
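"Do the math up front" can be a five-line script. The rates below are placeholders (check current Realtime audio pricing and your actual GPU cost before trusting the crossover point); the shape of the comparison is what matters:

```python
# Back-of-envelope: hosted per-minute cost vs a flat self-hosted GPU bill.
HOSTED_PER_MINUTE = 0.30   # $/audio-minute — assumption, check current pricing
GPU_PER_MONTH = 1200.0     # $/month for a GPU box running Moshi — assumption

def monthly_hosted_cost(minutes_per_day: float) -> float:
    return minutes_per_day * 30 * HOSTED_PER_MINUTE

for mpd in (50, 500, 5000):
    hosted = monthly_hosted_cost(mpd)
    print(f"{mpd:>5} min/day: hosted ${hosted:>8,.0f}/mo vs GPU ${GPU_PER_MONTH:,.0f}/mo")
```

Wherever the hosted line crosses your GPU bill is the volume at which the Realtime-to-Moshi migration starts paying for itself.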
How does Moshi compare to OpenAI Realtime?
Moshi is an open-source, self-hostable, full-duplex speech-to-speech model from Kyutai. OpenAI Realtime is hosted, closed-source, and somewhat higher quality on English output. The decision tree: data sovereignty or zero per-minute cost → Moshi; lowest-latency hosted option with no infra to run → OpenAI Realtime. They share the same architectural pattern, so a wrapper around either looks similar in your code.
Will this work with Cursor or Codex CLI?
Voice agents are server-side services, not editor extensions. You build them as standalone applications using LiveKit Agents plus Realtime or Moshi. Cursor and Codex CLI are useful for writing the code of those agents (the TokRepo install drops working scaffolds), but the runtime is its own service. The Codex CLI tool entry has agent-building examples that target the Realtime API.
What's the difference vs the LLM Observability pack?
Observability gives you traces of what happened — latency per turn, model errors, token cost. The Voice AI Stack pack is about building the runtime. You want both: install the voice stack to ship a voice agent, install observability to debug why turn 47 had a 2-second delay. LiveKit Agents emits standard OpenTelemetry traces that Langfuse and Phoenix can ingest directly.
Can I use my existing TTS with this?
Yes. The pack documents the contract LiveKit Agents expects (audio frames, end-of-utterance signals, barge-in events) and you can plug in ElevenLabs, Cartesia, Azure TTS, or any streaming-capable engine. Zonos is included as a strong open-source default. The voice-agent-patterns asset has a guide for swapping TTS engines without rewriting the agent loop.
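The contract boils down to three signals. A hypothetical Python `Protocol` (ours, for illustration — not the SDK's actual plugin interface, which lives in LiveKit Agents and the voice-agent-patterns guide) makes the shape concrete:

```python
from typing import AsyncIterator, Protocol

class StreamingTTS(Protocol):
    """Illustrative adapter contract for a swappable TTS engine."""

    def synthesize(self, text: str) -> AsyncIterator[bytes]:
        """Stream PCM audio frames as they are generated."""
        ...

    async def flush(self) -> None:
        """Signal end of utterance so the agent can close the turn."""
        ...

    async def cancel(self) -> None:
        """Stop synthesis immediately — called on barge-in."""
        ...
```

Any engine that can stream frames, mark end-of-utterance, and cancel mid-synthesis slots in without touching the agent loop.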