Quick Use
- `ws = await cartesia.tts.websocket()` to open the connection early
- As the LLM streams tokens, `await ws.send_text(token)` to feed Cartesia
- Concurrently, `async for chunk in ws.receive()` to play audio chunks as they arrive
Intro
Cartesia's streaming WebSocket TTS lets you pipe streaming LLM output directly into Cartesia and play audio as it arrives. Don't wait for the LLM to finish, and don't wait for Cartesia to finish; overlap both. This is how production voice agents hit sub-1.5s round-trips. Best for: LiveKit Agents, Vapi, custom voice agent pipelines, and anywhere TTS time-to-first-audio (TTFA) and LLM time-to-first-byte (TTFB) would otherwise stack on top of each other. Works with: the `cartesia` Python/JS SDK and any async LLM streaming source. Setup time: 15 minutes.
Pipeline LLM streaming → Cartesia streaming → speakers
```python
import asyncio
import os

import sounddevice as sd
from cartesia import AsyncCartesia
from openai import AsyncOpenAI

oai = AsyncOpenAI()
cartesia = AsyncCartesia(api_key=os.environ["CARTESIA_API_KEY"])


async def voice_response(user_text: str):
    # Open the Cartesia WebSocket first so it's ready when the first LLM chunk arrives
    ws = await cartesia.tts.websocket()

    async def feed_llm_to_tts():
        stream = await oai.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": user_text}],
            stream=True,
        )
        async for chunk in stream:
            text = chunk.choices[0].delta.content
            if text:
                await ws.send_text(text)
        await ws.flush()  # tell Cartesia: no more text

    async def play_audio():
        # sd.play() restarts playback on every call, which would clip each chunk;
        # write chunks into one continuous output stream instead (mono pcm_s16le).
        with sd.RawOutputStream(samplerate=22_050, channels=1, dtype="int16") as out:
            async for chunk in ws.receive():
                # out.write() blocks until the device buffer accepts the data,
                # so run it in a thread to keep the event loop receiving chunks.
                await asyncio.to_thread(out.write, chunk.audio)

    await asyncio.gather(feed_llm_to_tts(), play_audio())
    await ws.close()


asyncio.run(voice_response("Tell me about state space models in one sentence."))
```

Why pipelining matters
| Stage | Sequential | Pipelined |
|---|---|---|
| LLM first token | 300ms | 300ms |
| LLM finish (50 tokens) | 800ms | (overlapped) |
| Cartesia first audio | 75ms after final text | 75ms after first text |
| Total time-to-first-audio | 1,175ms | ~375ms |
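To check these numbers against your own stack, timestamp the first token you send and the first audio chunk you get back. A minimal sketch reusing the `ws` object from the example above; `measure_ttfa` and the timing bookkeeping are ours, not part of the SDK:

```python
import time

async def measure_ttfa(ws, first_token: str) -> float:
    """Return time-to-first-audio in ms for a single short utterance."""
    t0 = time.monotonic()
    await ws.send_text(first_token)   # first LLM token reaches Cartesia
    await ws.flush()
    async for chunk in ws.receive():  # wait for the first audio chunk back
        return (time.monotonic() - t0) * 1_000
```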
Handle interruptions cleanly
```python
async def voice_response_with_barge_in(user_text: str, interrupt_event: asyncio.Event):
    ws = await cartesia.tts.websocket()

    async def stream_audio():
        with sd.RawOutputStream(samplerate=22_050, channels=1, dtype="int16") as out:
            async for chunk in ws.receive():
                if interrupt_event.is_set():
                    await ws.cancel()  # tells Cartesia to stop generating
                    return
                await asyncio.to_thread(out.write, chunk.audio)

    # ...feed LLM tokens to ws as before...
```
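To wire this up, whatever detects user speech (a VAD callback, a push-to-talk key, a LiveKit event) only needs to set the event. A minimal driver sketch; the 2-second timer below is a hypothetical stand-in for a real VAD:

```python
async def main():
    interrupt_event = asyncio.Event()
    # Hypothetical stand-in for a VAD: pretend the user barges in after 2s.
    asyncio.get_running_loop().call_later(2.0, interrupt_event.set)
    await voice_response_with_barge_in("Tell me a long story.", interrupt_event)

asyncio.run(main())
```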
Output format choices

```python
output_format={
    "container": "raw",       # raw bytes for direct playback; mp3 for storage
    "encoding": "pcm_s16le",  # PCM 16-bit little-endian
    "sample_rate": 22_050,    # 16k for phone audio, 22k for web, 44.1k for HQ
}
```
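With `pcm_s16le` every sample is 2 bytes, so a chunk's playback duration is simple arithmetic, which is handy for managing the playback queue. A sketch assuming mono audio at the 22,050 Hz rate chosen above:

```python
BYTES_PER_SAMPLE = 2   # pcm_s16le = 16-bit samples
SAMPLE_RATE = 22_050   # must match output_format["sample_rate"]

def chunk_duration_s(audio: bytes, channels: int = 1) -> float:
    return len(audio) / (BYTES_PER_SAMPLE * SAMPLE_RATE * channels)

# e.g. a 44,100-byte chunk of mono 22.05kHz PCM is exactly 1.0s of audio
```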
FAQ

Q: Why not just chunk the LLM output and call /tts/bytes per chunk?
A: Connection overhead per HTTP call dominates: each chunk costs ~50ms of TCP/TLS handshake even if cached, so a reply split into ten chunks pays ~500ms in handshakes alone. A WebSocket keeps one connection open, letting you stream natural sub-second chunks without per-chunk overhead.
Q: What about word/sentence boundaries?
A: Cartesia handles partial input gracefully; it buffers text internally and waits for safe boundary points (mid-word versus end-of-sentence) before synthesizing. You can also force explicit boundaries with `flush(continue=True)`.
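If you would rather control boundaries yourself, you can buffer LLM deltas and only send whole sentences. A sketch using a naive regex heuristic; the heuristic is our assumption, not Cartesia behavior:

```python
import re

SENTENCE_END = re.compile(r"[.!?][\"')\]]*\s")

async def feed_by_sentence(stream, ws):
    buf = ""
    async for chunk in stream:
        buf += chunk.choices[0].delta.content or ""
        # Ship each complete sentence as soon as it appears; keep the remainder.
        while (m := SENTENCE_END.search(buf)):
            await ws.send_text(buf[:m.end()])
            buf = buf[m.end():]
    if buf:
        await ws.send_text(buf)  # trailing partial sentence
    await ws.flush()
```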
Q: How do I detect end-of-speech?
A: Cartesia sends a final WebSocket message with `is_final=True` after your `flush()`. Use it to clean up the audio queue and signal your VAD that the agent has finished speaking.
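In the receive loop that might look like the sketch below; the defensive `getattr` access, `agent_done_speaking`, and `play_chunk` are our placeholders, not SDK API:

```python
async for chunk in ws.receive():
    if getattr(chunk, "is_final", False):
        agent_done_speaking.set()  # hypothetical event your turn-taking logic watches
        break
    play_chunk(chunk.audio)        # hypothetical: hand off to your playback queue
```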
Source & Thanks
Built by Cartesia. Streaming docs at docs.cartesia.ai/tts/realtime.