Scripts · May 11, 2026 · 4 min read

Cartesia Streaming WebSocket — Full-Duplex Voice Agent TTS

Cartesia's streaming WebSocket pipelines LLM text chunks in and audio out simultaneously. Required for sub-second voice agent round-trips.

Agent-ready

This asset can be read and installed directly by agents.

TokRepo exposes a universal CLI command, an install contract, JSON metadata, an adapter-specific plan, and the raw content so agents can evaluate compatibility, risk, and next steps.

Stage only · 17/100
Agent surface: Any MCP/CLI agent
Type: Skill
Install: Stage only
Trust: New
Input: Asset

Universal CLI command
npx tokrepo install 70fc9dc0-7d62-45b2-b645-183d91cca020
Introduction

Cartesia's streaming WebSocket TTS lets you pipe streaming LLM text directly into Cartesia and play audio as it arrives. Don't wait for the LLM to finish, and don't wait for Cartesia to finish; overlap both. This is how production voice agents hit sub-1.5s round-trips. Best for: LiveKit Agents, Vapi, custom voice agent pipelines, and anywhere TTS time-to-first-audio (TTFA) would otherwise stack on top of LLM time-to-first-byte (TTFB). Works with: the cartesia Python/JS SDK plus any async LLM streaming source. Setup time: 15 minutes.


Pipeline LLM streaming → Cartesia streaming → speakers

import asyncio
import os

import sounddevice as sd
from cartesia import AsyncCartesia
from openai import AsyncOpenAI

oai = AsyncOpenAI()
cartesia = AsyncCartesia(api_key=os.environ["CARTESIA_API_KEY"])

async def voice_response(user_text: str):
    # Open the Cartesia WebSocket first so it's ready when the first LLM chunk arrives
    ws = await cartesia.tts.websocket()

    async def feed_llm_to_tts():
        stream = await oai.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": user_text}],
            stream=True,
        )
        async for chunk in stream:
            text = chunk.choices[0].delta.content
            if text:
                await ws.send_text(text)
        await ws.flush()   # tell Cartesia: no more text

    async def play_audio():
        # One persistent output stream keeps playback gapless; calling
        # sd.play() per chunk would cut off the previous chunk each time
        out = sd.RawOutputStream(samplerate=22_050, channels=1, dtype="int16")
        out.start()
        try:
            async for chunk in ws.receive():
                # write() blocks until the chunk is buffered, so run it
                # in a worker thread to keep the event loop free
                await asyncio.to_thread(out.write, chunk.audio)
        finally:
            out.stop()
            out.close()

    await asyncio.gather(feed_llm_to_tts(), play_audio())
    await ws.close()

asyncio.run(voice_response("Tell me about state space models in one sentence."))

Why pipelining matters

Stage                        Sequential                Pipelined
LLM first token              300 ms                    300 ms
LLM finish (50 tokens)       800 ms                    (overlapped)
Cartesia first audio         75 ms after final text    75 ms after first text
Total time-to-first-audio    1,175 ms                  ~375 ms
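
The arithmetic behind those totals (illustrative numbers from the table, not benchmarks):

llm_ttfb = 300  # ms until the first LLM token arrives
llm_gen  = 800  # ms more to stream the remaining ~50 tokens
tts_ttfa = 75   # ms from text in to first audio out

# Sequential: wait for the full completion before starting TTS
print(llm_ttfb + llm_gen + tts_ttfa)  # 1175 ms

# Pipelined: the first token goes straight to Cartesia
print(llm_ttfb + tts_ttfa)            # 375 ms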

Handle interruptions cleanly

async def voice_response_with_barge_in(user_text: str, interrupt_event: asyncio.Event):
    ws = await cartesia.tts.websocket()

    async def stream_audio():
        out = sd.RawOutputStream(samplerate=22_050, channels=1, dtype="int16")
        out.start()
        try:
            async for chunk in ws.receive():
                if interrupt_event.is_set():
                    await ws.cancel()    # tells Cartesia to stop generating
                    return
                await asyncio.to_thread(out.write, chunk.audio)
        finally:
            out.stop()
            out.close()

    # ...feed LLM tokens to ws as before...
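
The snippet leaves the interrupt trigger unspecified. A minimal sketch of one way to set it, assuming a hypothetical on_user_speech callback from your VAD/ASR layer (not part of the Cartesia SDK):

async def run_turn(user_text: str):
    interrupt_event = asyncio.Event()
    loop = asyncio.get_running_loop()

    def on_user_speech():
        # Hypothetical VAD callback; it may fire on another thread,
        # so marshal the set() onto the event loop
        loop.call_soon_threadsafe(interrupt_event.set)

    # register on_user_speech with your VAD here, then:
    await voice_response_with_barge_in(user_text, interrupt_event)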

Output format choices

output_format={
    "container": "raw",            # raw bytes for direct playback; mp3 for storage
    "encoding": "pcm_s16le",       # PCM 16-bit little-endian
    "sample_rate": 22_050,         # 16k for phone audio, 22k for web, 44k for HQ
}
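
A quick sanity check on what each rate costs on the wire for mono pcm_s16le (pure arithmetic, independent of any SDK):

# pcm_s16le is 2 bytes per sample, mono
for rate in (16_000, 22_050, 44_100):
    print(f"{rate} Hz -> {rate * 2 / 1000:.1f} kB/s")
# 16000 Hz -> 32.0 kB/s
# 22050 Hz -> 44.1 kB/s
# 44100 Hz -> 88.2 kB/s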

FAQ

Q: Why not just chunk the LLM output and call /tts.bytes per chunk? A: Per-request connection overhead dominates: each chunk pays roughly 50ms of TCP/TLS handshake and request setup. A WebSocket keeps one connection open, so you can stream natural sub-second chunks with no per-chunk overhead.

Q: What about word/sentence boundaries? A: Cartesia handles partial input gracefully; it buffers internally until it reaches a safe boundary point rather than cutting mid-word. You can also force segmentation at explicit boundaries with a flush that keeps the stream open (the continue flag; in Python the identifier is spelled continue_, since continue is a reserved word).

Q: How do I detect end-of-speech? A: Cartesia sends a final WebSocket message with is_final=True after your flush(). Use it to drain the audio queue and signal your VAD that the agent has finished speaking.
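
A sketch of a receive loop that watches for that final message, assuming chunks expose an is_final attribute as the FAQ describes (verify against your SDK version):

async def play_until_final(ws, enqueue_audio):
    # enqueue_audio is a hypothetical hook into your playback queue
    async for chunk in ws.receive():
        if getattr(chunk, "is_final", False):
            break            # Cartesia is done; safe to signal the VAD
        enqueue_audio(chunk.audio)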


Quick Use

  1. Open the connection early: ws = await cartesia.tts.websocket()
  2. As the LLM streams tokens, feed them to Cartesia: await ws.send_text(token)
  3. Concurrently play audio chunks: async for chunk in ws.receive()



Source & Thanks

Built by Cartesia. Streaming docs at docs.cartesia.ai/tts/realtime.

cartesia-ai/cartesia-python

🙏
