Scripts · May 11, 2026 · 4 min read

Cartesia Streaming WebSocket — Full-Duplex Voice Agent TTS

Cartesia's streaming WebSocket pipelines LLM text chunks in and audio out simultaneously. Required for sub-second voice agent round-trips.

Cartesia · Community
Agent-ready

This asset can be read and installed directly by agents.

TokRepo exposes a universal CLI command, an install contract, JSON metadata, an adapter-specific plan, and the raw content to help agents judge fit, risk, and next actions.

Stage only · 17/100
Agent surface: Any MCP/CLI agent
Type: Skill
Installation: Stage only
Trust: New
Entry point: Asset
Universal CLI command:
npx tokrepo install 70fc9dc0-7d62-45b2-b645-183d91cca020
Introduction

Cartesia's streaming WebSocket TTS lets you pipeline streaming LLM text directly into Cartesia and play audio as it arrives. Don't wait for the LLM to finish, and don't wait for Cartesia to finish: overlap both. This is how production voice agents hit sub-1.5s round-trips. Best for: LiveKit Agents, Vapi, custom voice agent pipelines, and anywhere time-to-first-audio (TTFA) and LLM time-to-first-byte (TTFB) stack on top of each other. Works with: the cartesia Python/JS SDK plus any async LLM streaming source. Setup time: 15 minutes.


Pipeline LLM streaming → Cartesia streaming → speakers

import asyncio
import os

from openai import AsyncOpenAI
from cartesia import AsyncCartesia
import sounddevice as sd
import numpy as np

oai = AsyncOpenAI()
cartesia = AsyncCartesia(api_key=os.environ["CARTESIA_API_KEY"])

async def voice_response(user_text: str):
    # Open Cartesia WebSocket first so it's ready when first LLM chunk arrives
    ws = await cartesia.tts.websocket()

    async def feed_llm_to_tts():
        stream = await oai.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": user_text}],
            stream=True,
        )
        async for chunk in stream:
            text = chunk.choices[0].delta.content
            if text:
                await ws.send_text(text)
        await ws.flush()   # tell Cartesia: no more text

    async def play_audio():
        # Note: each sd.play() call replaces the previous one; for gapless
        # playback, write chunks into a single sd.OutputStream instead.
        async for chunk in ws.receive():
            audio = np.frombuffer(chunk.audio, dtype=np.int16)
            sd.play(audio, 22_050, blocking=False)

    await asyncio.gather(feed_llm_to_tts(), play_audio())
    await ws.close()

asyncio.run(voice_response("Tell me about state space models in one sentence."))

Why pipelining matters

Stage                       Sequential              Pipelined
LLM first token             300ms                   300ms
LLM finish (50 tokens)      +800ms                  (overlapped)
Cartesia first audio        75ms after final text   75ms after first text
Total time-to-first-audio   1,175ms                 ~375ms
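The totals in the table follow from simple addition; a toy latency model (illustrative numbers only, matching the table above) makes the overlap explicit:

```python
def time_to_first_audio(llm_first_token_ms: int,
                        llm_finish_ms: int,
                        tts_first_audio_ms: int,
                        pipelined: bool) -> int:
    """Toy model: pipelined TTS starts on the first LLM token;
    sequential TTS starts only after the last LLM token."""
    if pipelined:
        return llm_first_token_ms + tts_first_audio_ms
    return llm_first_token_ms + llm_finish_ms + tts_first_audio_ms

sequential = time_to_first_audio(300, 800, 75, pipelined=False)  # 1175 ms
pipelined = time_to_first_audio(300, 800, 75, pipelined=True)    # 375 ms
```

The 800ms of LLM generation disappears from the user-facing latency entirely: it overlaps with audio that is already playing.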

Handle interruptions cleanly

async def voice_response_with_barge_in(user_text: str, interrupt_event: asyncio.Event):
    ws = await cartesia.tts.websocket()

    async def stream_audio():
        async for chunk in ws.receive():
            if interrupt_event.is_set():
                await ws.cancel()    # tells Cartesia to stop generating
                return
            sd.play(np.frombuffer(chunk.audio, dtype=np.int16), 22_050)

    # ...feed LLM tokens to ws as before...
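To see how the interrupt flag cuts playback short, here is a self-contained simulation with no Cartesia connection; `fake_audio_stream` is a stand-in for `ws.receive()`, and the timing values are illustrative:

```python
import asyncio

async def fake_audio_stream(n_chunks: int):
    # Stand-in for ws.receive(): yields one "audio chunk" every 10 ms.
    for i in range(n_chunks):
        await asyncio.sleep(0.01)
        yield i

async def play_until_interrupted(interrupt: asyncio.Event) -> list:
    played = []
    async for chunk in fake_audio_stream(50):
        if interrupt.is_set():      # barge-in: stop consuming immediately
            break
        played.append(chunk)
    return played

async def main() -> list:
    interrupt = asyncio.Event()
    playback = asyncio.create_task(play_until_interrupted(interrupt))
    await asyncio.sleep(0.06)       # user starts talking ~60 ms in
    interrupt.set()
    return await playback

played = asyncio.run(main())
```

In a real agent your VAD callback calls `interrupt_event.set()` when it detects user speech, and the playback task exits on the next chunk boundary.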

Output format choices

output_format={
    "container": "raw",            # raw bytes for direct playback; mp3 for storage
    "encoding": "pcm_s16le",       # PCM 16-bit little-endian
    "sample_rate": 22_050,         # 16k for phone audio, 22k for web, 44k for HQ
}
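With `container: "raw"` there is no file header, so if you also want to archive a reply you must wrap the PCM bytes yourself. A minimal sketch using the stdlib `wave` module, assuming mono 16-bit output; the synthetic sine wave stands in for bytes received from Cartesia:

```python
import wave
import numpy as np

SAMPLE_RATE = 22_050

# Stand-in for raw pcm_s16le bytes streamed back from Cartesia:
t = np.linspace(0, 0.5, int(SAMPLE_RATE * 0.5), endpoint=False)
pcm_bytes = (np.sin(2 * np.pi * 440 * t) * 32767).astype(np.int16).tobytes()

with wave.open("reply.wav", "wb") as wav:
    wav.setnchannels(1)          # assuming mono output
    wav.setsampwidth(2)          # pcm_s16le: 2 bytes per sample
    wav.setframerate(SAMPLE_RATE)
    wav.writeframes(pcm_bytes)
```

If you know a reply will only ever be stored, requesting `mp3` directly is simpler; raw PCM is the right choice when the same bytes feed a live playback path.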

FAQ

Q: Why not just chunk the LLM output and call /tts.bytes per chunk? A: Per-request overhead dominates: each HTTP call pays roughly 50ms of TCP/TLS handshake and request overhead unless the connection is reused. A WebSocket keeps one connection open, so you can stream natural sub-second chunks without that per-chunk cost.

Q: What about word/sentence boundaries? A: Cartesia handles partial input gracefully — it waits internally for safe boundary points (mid-word vs end-of-sentence). You can also force segmentation with flush(continue=True) for explicit boundaries.
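If you prefer to control segmentation yourself, a common pattern is to buffer LLM tokens until a sentence boundary and only then send. A minimal buffering helper, hypothetical and not part of the SDK:

```python
SENTENCE_ENDS = (".", "!", "?", "\n")

def buffer_sentences(tokens):
    """Group a stream of LLM token strings into complete sentences."""
    buf = ""
    for tok in tokens:
        buf += tok
        if buf.rstrip().endswith(SENTENCE_ENDS):
            yield buf
            buf = ""
    if buf.strip():                # flush any trailing partial sentence
        yield buf
```

Each yielded sentence can then go through `ws.send_text(...)` followed by `flush(continue=True)` to mark the explicit boundary.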

Q: How do I detect end-of-speech? A: Cartesia sends a final WebSocket message with is_final=True after your flush(). Use this to clean up audio queue and signal your VAD that the agent finished speaking.
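The drain logic itself is transport-independent. Treating each WebSocket message as a dict with `audio` and `is_final` keys (an assumed shape, for illustration only), end-of-speech handling reduces to:

```python
def drain_until_final(messages):
    """Collect audio payloads until the server marks the stream finished."""
    chunks = []
    finished = False
    for msg in messages:
        if msg.get("is_final"):
            finished = True          # agent is done speaking; signal VAD
            break
        chunks.append(msg["audio"])
    return b"".join(chunks), finished

audio, done = drain_until_final([
    {"audio": b"\x01\x02"},
    {"audio": b"\x03\x04"},
    {"is_final": True},
])
```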


Quick Use

  1. ws = await cartesia.tts.websocket() to open connection early
  2. As LLM streams tokens, await ws.send_text(token) to feed Cartesia
  3. Concurrently async for chunk in ws.receive() to play audio chunks


Source & Thanks

Built by Cartesia. Streaming docs at docs.cartesia.ai/tts/realtime.

cartesia-ai/cartesia-python

🙏
