Quick Use
- `ws = await cartesia.tts.websocket()` to open the connection early
- As the LLM streams tokens, `await ws.send_text(token)` to feed Cartesia
- Concurrently, `async for chunk in ws.receive()` to play audio chunks as they arrive
Intro
Cartesia's streaming WebSocket TTS lets you pipe streaming LLM output directly into Cartesia and play audio as it arrives. Don't wait for the LLM to finish, and don't wait for Cartesia to finish; overlap both. This is how production voice agents hit sub-1.5s round-trips. Best for: LiveKit Agents, Vapi, custom voice agent pipelines, and anywhere TTS time-to-first-audio (TTFA) and LLM time-to-first-byte (TTFB) would otherwise stack on top of each other. Works with: the `cartesia` Python/JS SDK and any async LLM streaming source. Setup time: 15 minutes.
Pipeline LLM streaming → Cartesia streaming → speakers
```python
import asyncio
import os

import sounddevice as sd
from cartesia import AsyncCartesia
from openai import AsyncOpenAI

oai = AsyncOpenAI()
cartesia = AsyncCartesia(api_key=os.environ["CARTESIA_API_KEY"])


async def voice_response(user_text: str):
    # Open the Cartesia WebSocket first so it's ready when the first LLM chunk arrives
    ws = await cartesia.tts.websocket()

    async def feed_llm_to_tts():
        stream = await oai.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": user_text}],
            stream=True,
        )
        async for chunk in stream:
            text = chunk.choices[0].delta.content
            if text:
                await ws.send_text(text)
        await ws.flush()  # tell Cartesia: no more text

    async def play_audio():
        # sd.play() restarts playback on every call, which would clip each chunk;
        # write chunks into one continuous output stream instead (mono pcm_s16le).
        with sd.RawOutputStream(samplerate=22_050, channels=1, dtype="int16") as out:
            async for chunk in ws.receive():
                # out.write() blocks until the device buffer accepts the data,
                # so run it in a thread to keep the event loop receiving chunks.
                await asyncio.to_thread(out.write, chunk.audio)

    await asyncio.gather(feed_llm_to_tts(), play_audio())
    await ws.close()


asyncio.run(voice_response("Tell me about state space models in one sentence."))
```

Why pipelining matters
| Stage | Sequential | Pipelined |
|---|---|---|
| LLM first token | 300ms | 300ms |
| LLM finish (50 tokens) | 800ms | (overlapped) |
| Cartesia first audio | 75ms after final text | 75ms after first text |
| Total time-to-first-audio | 1,175ms | ~375ms |
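To check these numbers against your own stack, timestamp the first token you send and the first audio chunk you get back. A minimal sketch reusing the `ws` object from the example above; `measure_ttfa` and the timing bookkeeping are ours, not part of the SDK:

```python
import time

async def measure_ttfa(ws, first_token: str) -> float:
    """Return time-to-first-audio in ms for a single short utterance."""
    t0 = time.monotonic()
    await ws.send_text(first_token)   # first LLM token reaches Cartesia
    await ws.flush()
    async for chunk in ws.receive():  # wait for the first audio chunk back
        return (time.monotonic() - t0) * 1_000
```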
Handle interruptions cleanly
```python
async def voice_response_with_barge_in(user_text: str, interrupt_event: asyncio.Event):
    ws = await cartesia.tts.websocket()

    async def stream_audio():
        with sd.RawOutputStream(samplerate=22_050, channels=1, dtype="int16") as out:
            async for chunk in ws.receive():
                if interrupt_event.is_set():
                    await ws.cancel()  # tells Cartesia to stop generating
                    return
                await asyncio.to_thread(out.write, chunk.audio)

    # ...feed LLM tokens to ws as before...
```
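To wire this up, whatever detects user speech (a VAD callback, a push-to-talk key, a LiveKit event) only needs to set the event. A minimal driver sketch; the 2-second timer below is a hypothetical stand-in for a real VAD:

```python
async def main():
    interrupt_event = asyncio.Event()
    # Hypothetical stand-in for a VAD: pretend the user barges in after 2s.
    asyncio.get_running_loop().call_later(2.0, interrupt_event.set)
    await voice_response_with_barge_in("Tell me a long story.", interrupt_event)

asyncio.run(main())
```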
Output format choices

```python
output_format={
    "container": "raw",       # raw bytes for direct playback; mp3 for storage
    "encoding": "pcm_s16le",  # PCM 16-bit little-endian
    "sample_rate": 22_050,    # 16k for phone audio, 22k for web, 44.1k for HQ
}
```
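With `pcm_s16le` every sample is 2 bytes, so a chunk's playback duration is simple arithmetic, which is handy for managing the playback queue. A sketch assuming mono audio at the 22,050 Hz rate chosen above:

```python
BYTES_PER_SAMPLE = 2   # pcm_s16le = 16-bit samples
SAMPLE_RATE = 22_050   # must match output_format["sample_rate"]

def chunk_duration_s(audio: bytes, channels: int = 1) -> float:
    return len(audio) / (BYTES_PER_SAMPLE * SAMPLE_RATE * channels)

# e.g. a 44,100-byte chunk of mono 22.05kHz PCM is exactly 1.0s of audio
```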
FAQ

Q: Why not just chunk the LLM output and call /tts/bytes per chunk?
A: Connection overhead per HTTP call dominates: each chunk costs ~50ms of TCP/TLS handshake even if cached, so a reply split into ten chunks pays ~500ms in handshakes alone. A WebSocket keeps one connection open, letting you stream natural sub-second chunks without per-chunk overhead.
Q: What about word/sentence boundaries?
A: Cartesia handles partial input gracefully; it buffers text internally and waits for safe boundary points (mid-word versus end-of-sentence) before synthesizing. You can also force explicit boundaries with `flush(continue=True)`.
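If you would rather control boundaries yourself, you can buffer LLM deltas and only send whole sentences. A sketch using a naive regex heuristic; the heuristic is our assumption, not Cartesia behavior:

```python
import re

SENTENCE_END = re.compile(r"[.!?][\"')\]]*\s")

async def feed_by_sentence(stream, ws):
    buf = ""
    async for chunk in stream:
        buf += chunk.choices[0].delta.content or ""
        # Ship each complete sentence as soon as it appears; keep the remainder.
        while (m := SENTENCE_END.search(buf)):
            await ws.send_text(buf[:m.end()])
            buf = buf[m.end():]
    if buf:
        await ws.send_text(buf)  # trailing partial sentence
    await ws.flush()
```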
Q: How do I detect end-of-speech?
A: Cartesia sends a final WebSocket message with `is_final=True` after your `flush()`. Use it to clean up the audio queue and signal your VAD that the agent has finished speaking.
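In the receive loop that might look like the sketch below; the defensive `getattr` access, `agent_done_speaking`, and `play_chunk` are our placeholders, not SDK API:

```python
async for chunk in ws.receive():
    if getattr(chunk, "is_final", False):
        agent_done_speaking.set()  # hypothetical event your turn-taking logic watches
        break
    play_chunk(chunk.audio)        # hypothetical: hand off to your playback queue
```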
Source & Thanks
Built by Cartesia. Streaming docs at docs.cartesia.ai/tts/realtime.