# Cartesia Streaming WebSocket — Full-Duplex Voice Agent TTS

> Cartesia's streaming WebSocket pipelines LLM text chunks in and audio out simultaneously. Required for sub-second voice agent round-trips.

## Install

Install the `cartesia`, `openai`, and `sounddevice` packages, then save the example below as a script and run it.

## Quick Use

1. `ws = await cartesia.tts.websocket()` to open the connection early
2. As the LLM streams tokens, `await ws.send_text(token)` to feed Cartesia
3. Concurrently `async for chunk in ws.receive()` to play audio chunks

---

## Intro

Cartesia's streaming WebSocket TTS lets you pipeline streaming LLM text directly into Cartesia and play audio as it arrives. Don't wait for the LLM to finish; don't wait for Cartesia to finish — overlap both. This is how production voice agents hit sub-1.5s round-trips.

Best for: LiveKit Agents, Vapi, custom voice agent pipelines — anywhere TTFA and LLM TTFB would otherwise stack on top of each other.

Works with: the cartesia Python/JS SDK plus any async LLM streaming source.

Setup time: 15 minutes.

---

### Pipeline: LLM streaming → Cartesia streaming → speakers

```python
import asyncio
import os

import sounddevice as sd
from cartesia import AsyncCartesia
from openai import AsyncOpenAI

oai = AsyncOpenAI()
cartesia = AsyncCartesia(api_key=os.environ["CARTESIA_API_KEY"])


async def voice_response(user_text: str):
    # Open the Cartesia WebSocket first so it's ready when the first LLM chunk arrives
    ws = await cartesia.tts.websocket()

    async def feed_llm_to_tts():
        stream = await oai.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": user_text}],
            stream=True,
        )
        async for chunk in stream:
            text = chunk.choices[0].delta.content
            if text:
                await ws.send_text(text)
        await ws.flush()  # tell Cartesia: no more text

    async def play_audio():
        # One persistent output stream plays chunks back-to-back;
        # calling sd.play() per chunk would cut off the previous chunk.
        with sd.RawOutputStream(samplerate=22_050, channels=1, dtype="int16") as out:
            async for chunk in ws.receive():
                out.write(chunk.audio)

    await asyncio.gather(feed_llm_to_tts(), play_audio())
    await ws.close()


asyncio.run(voice_response("Tell me about state space models in one sentence."))
```

### Why pipelining matters

| Stage | Sequential | Pipelined |
|---|---|---|
| LLM first token | 300ms | 300ms |
| LLM finish (50 tokens) | +800ms | (overlapped) |
| Cartesia first audio | 75ms after final text | 75ms after first text |
| **Total time-to-first-audio** | **1,175ms** | **~375ms** |

### Handle interruptions cleanly

```python
async def voice_response_with_barge_in(user_text: str, interrupt_event: asyncio.Event):
    ws = await cartesia.tts.websocket()

    async def stream_audio():
        with sd.RawOutputStream(samplerate=22_050, channels=1, dtype="int16") as out:
            async for chunk in ws.receive():
                if interrupt_event.is_set():
                    await ws.cancel()  # tells Cartesia to stop generating
                    return
                out.write(chunk.audio)

    # ...feed LLM tokens to ws as before...
```

### Output format choices

```python
output_format={
    "container": "raw",       # raw bytes for direct playback; mp3 for storage
    "encoding": "pcm_s16le",  # PCM 16-bit little-endian
    "sample_rate": 22_050,    # 16k for phone audio, 22k for web, 44k for HQ
}
```

---

### FAQ

**Q: Why not just chunk the LLM output and call /tts.bytes per chunk?**
A: Per-call connection overhead dominates — each chunk costs ~50ms of TCP/TLS handshake even with caching. A WebSocket keeps one connection open, so you can stream natural sub-second chunks with no per-chunk setup cost.

**Q: What about word/sentence boundaries?**
A: Cartesia handles partial input gracefully — it waits internally for safe boundary points (mid-word vs. end-of-sentence). You can also force an explicit segment boundary with `flush()`.

**Q: How do I detect end-of-speech?**
A: After your `flush()`, Cartesia sends a final WebSocket message with `is_final=True`. Use it to drain the audio queue and signal your VAD that the agent has finished speaking.

---

## Source & Thanks

> Built by [Cartesia](https://github.com/cartesia-ai). Streaming docs at [docs.cartesia.ai/tts/realtime](https://docs.cartesia.ai).
>
> [cartesia-ai/cartesia-python](https://github.com/cartesia-ai/cartesia-python)
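---

### Appendix: a toy model of the pipelining win

The timing table above can be sanity-checked with a small asyncio sketch. Everything here is illustrative — a mock LLM emitting tokens at a fixed rate and a mock TTS with a fixed text-in-to-audio-out latency, no real Cartesia or OpenAI calls — but it shows why overlapping the two stages shrinks time-to-first-audio to roughly first-token latency plus TTS latency.

```python
import asyncio
import time

# Toy numbers, not measurements: tune to taste.
TOKENS = 10
TOKEN_GAP = 0.02    # seconds between mock LLM tokens
TTS_LATENCY = 0.05  # seconds from text arriving to first audio for that text


async def llm_tokens():
    # Mock LLM: yields one token every TOKEN_GAP seconds.
    for i in range(TOKENS):
        await asyncio.sleep(TOKEN_GAP)
        yield f"tok{i} "


async def sequential_ttfa() -> float:
    # Sequential: wait for the whole LLM response, then synthesize.
    # Time-to-first-audio stacks both stages.
    start = time.monotonic()
    _ = "".join([t async for t in llm_tokens()])
    await asyncio.sleep(TTS_LATENCY)  # first audio for the full text
    return time.monotonic() - start


async def pipelined_ttfa() -> float:
    # Pipelined: synthesize each token as it arrives.
    # First audio trails the FIRST token, not the last.
    start = time.monotonic()
    async for _ in llm_tokens():
        await asyncio.sleep(TTS_LATENCY)  # synthesize this chunk
        return time.monotonic() - start


async def main():
    seq = await sequential_ttfa()
    pipe = await pipelined_ttfa()
    print(f"sequential TTFA ~{seq * 1000:.0f}ms, pipelined TTFA ~{pipe * 1000:.0f}ms")
    return seq, pipe


seq, pipe = asyncio.run(main())
```

With these numbers, sequential TTFA is about `TOKENS * TOKEN_GAP + TTS_LATENCY` (~250ms) while pipelined TTFA is about `TOKEN_GAP + TTS_LATENCY` (~70ms) — the same shape as the 1,175ms vs ~375ms table above.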
---

Source: https://tokrepo.com/en/workflows/cartesia-streaming-websocket-full-duplex-voice-agent-tts
Author: Cartesia