TOKREPO · ARSENAL
Stable

Voice AI Stack

Zonos, Moshi, OpenAI Realtime, LiveKit Agents — real-time voice agents and TTS that ship to production.

6 assets

What's in this pack

Voice AI is the area where the gap between "demo on a laptop" and "shipping to users" is widest. Latency, turn-taking, interruptions, and barge-in all have to work, and none of them work by default. This pack collects the six assets that teams actually shipping voice products in 2026 are running.

#  Asset                     Layer             Why it's here
1  OpenAI Realtime API       speech-to-speech  hosted, sub-300ms turn latency, no STT/TTS plumbing
2  Moshi                     speech-to-speech  open-source full-duplex; data stays local
3  Zonos                     TTS               high-quality open-source TTS with voice cloning
4  LiveKit Agents            infra             WebRTC + agent orchestration, the production substrate
5  Voice agent patterns      design            turn-taking, barge-in, end-of-utterance detection
6  Latency budget worksheet  ops               end-to-end <800ms checklist by component

Why this matters

A 1.5-second response feels broken in voice. A 600ms response feels human. The difference is architectural, not just compute — it comes from how you compose STT, the LLM, TTS, and the network layer.

Three architectural choices decide whether your voice agent feels alive:

  1. Speech-to-speech vs cascade. A traditional cascade (audio → STT → LLM → TTS → audio) has four sequential stages, each adding latency, and typically lands at 1.2-2.0s per turn. Speech-to-speech models (OpenAI Realtime, Moshi) cut that to 200-400ms by skipping the text intermediate. Pick speech-to-speech for conversational use cases; pick a cascade only when you need precise control over the LLM step (e.g. complex tool-calling chains where you want to inspect and edit the text prompt).
  2. Streaming vs non-streaming TTS. Non-streaming TTS waits for the full text before generating any audio. Streaming TTS starts emitting audio within roughly 100ms of receiving the first text tokens. For a 5-second response, that's 4-5 seconds of perceived-latency difference. Zonos and most production TTS engines support streaming; use it.
  3. WebRTC vs WebSocket. WebRTC handles packet loss, jitter, and adaptive bitrate. WebSocket doesn't. On real cellular networks the difference between a working call and a stuttering call is which transport you picked. LiveKit Agents wraps the agent loop in proper WebRTC; this is non-negotiable for mobile.
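The latency budget worksheet (asset #6) reduces the architecture decision to arithmetic: perceived latency is the sum of every sequential stage until the first audio frame plays. A minimal sketch, with placeholder component names and millisecond figures (illustrative, not measured benchmarks):

```python
# Illustrative per-component latency budget for one cascade turn.
# Figures are placeholders -- measure your own stack; they are not benchmarks.
CASCADE_MS = {
    "network_uplink": 60,
    "vad_end_of_utterance": 200,   # silence window before the turn counts as done
    "stt_final_transcript": 250,
    "llm_first_token": 350,
    "tts_first_audio": 150,        # streaming TTS: time to first audio chunk
    "network_downlink": 60,
}

def turn_latency(budget: dict[str, int]) -> int:
    """Perceived latency = sum of sequential stages until first audio plays."""
    return sum(budget.values())

def within_budget(budget: dict[str, int], target_ms: int = 800) -> bool:
    return turn_latency(budget) <= target_ms

total = turn_latency(CASCADE_MS)
print(f"{total} ms -> within 800 ms budget: {within_budget(CASCADE_MS)}")
```

With these placeholder figures the cascade totals 1070ms and blows the 800ms budget, which is exactly why a speech-to-speech model (one stage instead of four) or streaming TTS (shrinking tts_first_audio) changes the feel of the product.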

Install in one command

# Install the entire pack
tokrepo install pack/voice-ai-stack

# Or pick the layer you need first
tokrepo install livekit-agents
tokrepo install moshi
tokrepo install zonos

The TokRepo CLI drops the agent scaffolding, room config, and SDK init code into your project. A LiveKit room hooked to OpenAI Realtime can be running locally in under 10 minutes from a clean checkout.

Common pitfalls

  • Building a cascade for a conversational use case. If users are chatting (not dictating commands), use speech-to-speech. The cascade architecture made sense in 2023; in 2026 it's a latency penalty without compensating benefits for chat.
  • Skipping voice activity detection (VAD). Without VAD the agent either talks over the user (no end-of-utterance detection) or sits silent waiting for fixed timeouts. LiveKit Agents ships VAD wired in; use it.
  • No barge-in handling. When the user starts speaking while the agent is talking, the agent must detect it within ~150ms and stop. Hard-coding "wait until the agent finishes" feels robotic. The speech-to-speech engines and LiveKit Agents all support barge-in, but it's off by default in some configs.
  • Text that isn't normalized for speech. "$1,234.56" reads aloud terribly. Pre-process numbers, dates, and abbreviations before sending text to TTS. The TokRepo voice-agent-patterns asset ships a normalizer.
  • Forgetting to budget for first-turn latency. The first response in a session is always 200-400ms slower than steady state because caches are cold and models are still loading. Hide the gap with a "ready" sound or a connection animation.
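The voice-agent-patterns asset ships the real normalizer; as a rough sketch of the kind of pre-processing involved (the rules and the function name here are illustrative, not the asset's actual code):

```python
import re

def normalize_for_tts(text: str) -> str:
    """Expand patterns that TTS engines read badly into speakable words."""
    # Currency: "$1,234.56" -> "1234 dollars and 56 cents"
    def money(m: re.Match) -> str:
        dollars = m.group(1).replace(",", "")
        cents = m.group(2)
        spoken = f"{dollars} dollars"
        if cents:
            spoken += f" and {cents} cents"
        return spoken
    text = re.sub(r"\$([\d,]+)(?:\.(\d{2}))?", money, text)
    # Common abbreviations that read badly when spelled out letter by letter
    for abbr, spoken in {"e.g.": "for example", "etc.": "et cetera"}.items():
        text = text.replace(abbr, spoken)
    return text

print(normalize_for_tts("That costs $1,234.56, e.g. with tax."))
# -> "That costs 1234 dollars and 56 cents, for example with tax."
```

A production normalizer also handles dates, ordinals, units, and URLs; the point is that this runs between the LLM and the TTS engine, on the text stream, before any audio is generated.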

Common misconceptions

"Speech-to-speech can't do tool calls." Out of date — OpenAI Realtime supports function calling natively, and Moshi can be wrapped in a tool-routing agent. The 2024 limitation no longer holds.

"You need a GPU per concurrent call." For TTS-only at moderate quality, modern open-source TTS hits real-time on CPU. For speech-to-speech you need GPU for self-hosted Moshi, or you offload latency to OpenAI Realtime. LiveKit Agents handles connection multiplexing so one machine can broker many concurrent sessions even if the model lives elsewhere.

"Voice cloning is too risky to ship." Zonos and similar engines ship with consent-required watermarking flags. Used responsibly with explicit user consent (e.g. a user cloning their own voice for accessibility), it's a safe and high-value feature. The risk is unconsented cloning, which the engines themselves discourage.

INSTALL · ONE COMMAND
$ tokrepo install pack/voice-ai-stack
hand it to your agent — or paste it in your terminal
What's inside

6 assets in this pack

Script#01
Zonos — Multilingual TTS with Voice Cloning

Zonos is an open-weight TTS model trained on 200K+ hours of speech. 7.2K+ stars. Voice cloning, 5 languages, emotion control. Apache 2.0.

by Script Depot·144 views
$ tokrepo install zonos-multilingual-tts-voice-cloning-9b6992d2
Config#02
Moshi — Real-Time AI Voice Conversation Engine

Open-source real-time voice AI by Kyutai. Full-duplex speech conversation with 200ms latency, emotion recognition, and on-device processing. Apache 2.0 licensed.

by AI Open Source·137 views
$ tokrepo install moshi-real-time-ai-voice-conversation-engine-6172db11
Script#03
OpenAI Realtime Agents — Voice AI Agent Patterns

Advanced agentic patterns for voice AI built on OpenAI Realtime API. Chat-supervisor and sequential handoff patterns with WebRTC streaming. MIT, 6,800+ stars.

by OpenAI·139 views
$ tokrepo install openai-realtime-agents-voice-ai-agent-patterns-0d228731
Script#04
LiveKit Agents — Build Real-Time Voice AI Agents

Framework for building real-time voice AI agents. STT, LLM, TTS pipeline with sub-second latency. Supports OpenAI, Anthropic, Deepgram, ElevenLabs. 9.9K+ stars.

by Script Depot·98 views
$ tokrepo install livekit-agents-build-real-time-voice-ai-agents-804ee888
Skill#05
Remotion AI Voiceover Skill — ElevenLabs TTS

AI skill for adding ElevenLabs text-to-speech voiceover to Remotion videos. Auto-sizes composition duration to match generated audio.

by Skill Factory·110 views
$ tokrepo install remotion-ai-voiceover-skill-elevenlabs-tts-ff8cbccc
Script#06
ElevenLabs Python SDK — AI Text-to-Speech

Official ElevenLabs Python SDK for AI voice generation. Create realistic voiceovers with 30+ languages, voice cloning, and streaming support.

by Script Depot·106 views
$ tokrepo install elevenlabs-python-sdk-ai-text-speech-16d32da9
FAQ

Frequently asked questions

Is OpenAI Realtime free?

No — it's metered per audio minute (input and output), and pricing is several times higher than text-only API calls because audio consumes far more tokens than the equivalent text. For prototyping the cost is negligible; for a deployed product handling thousands of minutes/day, do the math up front. Self-hosted Moshi has zero per-minute cost but requires a GPU. Most teams run Realtime in production until volume justifies the GPU bill, then migrate to Moshi.
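"Do the math up front" might look like the sketch below. Every rate here is a placeholder, not OpenAI or cloud-GPU pricing — plug in the real numbers from the pricing pages before deciding:

```python
def monthly_api_cost(minutes_per_day: float, usd_per_minute: float) -> float:
    """Metered cost of a hosted speech-to-speech API (placeholder rate)."""
    return minutes_per_day * 30 * usd_per_minute

def monthly_selfhost_cost(gpu_usd_per_hour: float, gpus: int = 1) -> float:
    """Fixed cost of always-on GPU capacity for self-hosted Moshi."""
    return gpu_usd_per_hour * 24 * 30 * gpus

# Placeholder rates -- substitute real pricing before acting on the output.
api = monthly_api_cost(minutes_per_day=2000, usd_per_minute=0.10)
gpu = monthly_selfhost_cost(gpu_usd_per_hour=1.50)
print(f"hosted: ${api:,.0f}/mo vs self-hosted: ${gpu:,.0f}/mo")
```

The crossover point is where the metered line passes the fixed GPU line; below it, hosted is cheaper and has no ops burden.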

How does Moshi compare to OpenAI Realtime?

Moshi is open-source, self-hostable, full-duplex speech-to-speech from Kyutai. OpenAI Realtime is hosted, closed-source, and somewhat higher quality on English. The decision tree: data sovereignty or zero per-minute cost → Moshi; lowest-latency hosted with no infra → OpenAI Realtime. They share the same architectural pattern, so a wrapper around either looks similar in your code.

Will this work with Cursor or Codex CLI?

Voice agents are server-side services, not editor extensions. You build them as standalone applications using LiveKit Agents and Realtime/Moshi. Cursor and Codex CLI are useful for writing the agent code (the TokRepo install drops working scaffolds), but the runtime is its own service. The Codex CLI tool entry has agent-building examples that target the Realtime API.

What's the difference vs the LLM Observability pack?

Observability gives you traces of what happened — latency per turn, model errors, token cost. The Voice AI Stack pack is about building the runtime. You want both: install the voice stack to ship a voice agent, install observability to debug why turn 47 had a 2-second delay. LiveKit Agents emits standard OpenTelemetry traces that Langfuse and Phoenix can ingest directly.

Can I use my existing TTS with this?

Yes. The pack documents the contract LiveKit Agents expects (audio frames, end-of-utterance signals, barge-in events) and you can plug in ElevenLabs, Cartesia, Azure TTS, or any streaming-capable engine. Zonos is included as a strong open-source default. The voice-agent-patterns asset has a guide for swapping TTS engines without rewriting the agent loop.
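The exact contract is documented in the pack; as a minimal sketch of what a pluggable streaming-TTS interface can look like in Python (the names and method shapes below are illustrative, not LiveKit's actual API):

```python
from typing import AsyncIterator, Protocol, runtime_checkable

@runtime_checkable
class StreamingTTS(Protocol):
    """Anything that turns a text stream into an audio-frame stream."""

    def synthesize(self, text: AsyncIterator[str]) -> AsyncIterator[bytes]:
        """Yield raw audio frames as text chunks arrive (streaming)."""
        ...

    def cancel(self) -> None:
        """Barge-in hook: stop synthesis and flush queued audio immediately."""
        ...

class MyEngineAdapter:
    """Example adapter shape for wrapping a third-party TTS engine."""

    async def synthesize(self, text):
        async for chunk in text:
            yield chunk.encode()  # placeholder: a real engine returns audio bytes

    def cancel(self) -> None:
        pass  # a real adapter would tear down the engine's stream here

assert isinstance(MyEngineAdapter(), StreamingTTS)
```

An adapter that satisfies this kind of contract is what lets you swap ElevenLabs for Zonos (or Cartesia, or Azure TTS) without touching the agent loop: the loop only ever sees text in, audio frames out, and a cancel signal for barge-in.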

MORE FROM THE ARSENAL

12 packs · 80+ hand-picked assets

Browse every curated bundle on the home page

Back to all packs