# Cartesia Streaming WebSocket — Full-Duplex Voice Agent TTS

> Cartesia's streaming WebSocket pipelines LLM text chunks in and audio out simultaneously. Required for sub-second voice agent round-trips.

## Install

Install the `cartesia`, `openai`, and `sounddevice` packages, then save the example below as a script and run it.

## Quick Use

1. `ws = await cartesia.tts.websocket()` to open the connection early
2. As the LLM streams tokens, `await ws.send_text(token)` to feed Cartesia
3. Concurrently `async for chunk in ws.receive()` to play audio chunks

---

## Intro

Cartesia's streaming WebSocket TTS lets you pipeline streaming LLM text directly into Cartesia and play audio as it arrives. Don't wait for the LLM to finish; don't wait for Cartesia to finish — overlap both. This is how production voice agents hit sub-1.5s round-trips.

Best for: LiveKit Agents, Vapi, custom voice agent pipelines — anywhere TTFA and LLM TTFB would otherwise stack on top of each other.

Works with: the cartesia Python/JS SDK plus any async LLM streaming source.

Setup time: 15 minutes.

---

### Pipeline: LLM streaming → Cartesia streaming → speakers

```python
import asyncio
import os

import sounddevice as sd
from cartesia import AsyncCartesia
from openai import AsyncOpenAI

oai = AsyncOpenAI()
cartesia = AsyncCartesia(api_key=os.environ["CARTESIA_API_KEY"])


async def voice_response(user_text: str):
    # Open the Cartesia WebSocket first so it's ready when the first LLM chunk arrives
    ws = await cartesia.tts.websocket()

    async def feed_llm_to_tts():
        stream = await oai.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": user_text}],
            stream=True,
        )
        async for chunk in stream:
            text = chunk.choices[0].delta.content
            if text:
                await ws.send_text(text)
        await ws.flush()  # tell Cartesia: no more text

    async def play_audio():
        # One persistent output stream plays chunks back-to-back;
        # calling sd.play() per chunk would cut off the previous chunk.
        with sd.RawOutputStream(samplerate=22_050, channels=1, dtype="int16") as out:
            async for chunk in ws.receive():
                out.write(chunk.audio)

    await asyncio.gather(feed_llm_to_tts(), play_audio())
    await ws.close()


asyncio.run(voice_response("Tell me about state space models in one sentence."))
```

### Why pipelining matters

| Stage | Sequential | Pipelined |
|---|---|---|
| LLM first token | 300ms | 300ms |
| LLM finish (50 tokens) | +800ms | (overlapped) |
| Cartesia first audio | 75ms after final text | 75ms after first text |
| **Total time-to-first-audio** | **1,175ms** | **~375ms** |

### Handle interruptions cleanly

```python
async def voice_response_with_barge_in(user_text: str, interrupt_event: asyncio.Event):
    ws = await cartesia.tts.websocket()

    async def stream_audio():
        with sd.RawOutputStream(samplerate=22_050, channels=1, dtype="int16") as out:
            async for chunk in ws.receive():
                if interrupt_event.is_set():
                    await ws.cancel()  # tells Cartesia to stop generating
                    return
                out.write(chunk.audio)

    # ...feed LLM tokens to ws as before...
```

### Output format choices

```python
output_format={
    "container": "raw",       # raw bytes for direct playback; mp3 for storage
    "encoding": "pcm_s16le",  # PCM 16-bit little-endian
    "sample_rate": 22_050,    # 16k for phone audio, 22k for web, 44k for HQ
}
```

---

### FAQ

**Q: Why not just chunk the LLM output and call /tts.bytes per chunk?**
A: Per-call connection overhead dominates — each chunk costs ~50ms of TCP/TLS handshake even with caching. A WebSocket keeps one connection open, so you can stream natural sub-second chunks with no per-chunk setup cost.

**Q: What about word/sentence boundaries?**
A: Cartesia handles partial input gracefully — it waits internally for safe boundary points (mid-word vs. end-of-sentence). You can also force an explicit segment boundary with `flush()`.

**Q: How do I detect end-of-speech?**
A: After your `flush()`, Cartesia sends a final WebSocket message with `is_final=True`. Use it to drain the audio queue and signal your VAD that the agent has finished speaking.

---

## Source & Thanks

> Built by [Cartesia](https://github.com/cartesia-ai). Streaming docs at [docs.cartesia.ai/tts/realtime](https://docs.cartesia.ai).
>
> [cartesia-ai/cartesia-python](https://github.com/cartesia-ai/cartesia-python)
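---

### Appendix: a toy model of the pipelining win

The timing table above can be sanity-checked with a small asyncio sketch. Everything here is illustrative — a mock LLM emitting tokens at a fixed rate and a mock TTS with a fixed text-in-to-audio-out latency, no real Cartesia or OpenAI calls — but it shows why overlapping the two stages shrinks time-to-first-audio to roughly first-token latency plus TTS latency.

```python
import asyncio
import time

# Toy numbers, not measurements: tune to taste.
TOKENS = 10
TOKEN_GAP = 0.02    # seconds between mock LLM tokens
TTS_LATENCY = 0.05  # seconds from text arriving to first audio for that text


async def llm_tokens():
    # Mock LLM: yields one token every TOKEN_GAP seconds.
    for i in range(TOKENS):
        await asyncio.sleep(TOKEN_GAP)
        yield f"tok{i} "


async def sequential_ttfa() -> float:
    # Sequential: wait for the whole LLM response, then synthesize.
    # Time-to-first-audio stacks both stages.
    start = time.monotonic()
    _ = "".join([t async for t in llm_tokens()])
    await asyncio.sleep(TTS_LATENCY)  # first audio for the full text
    return time.monotonic() - start


async def pipelined_ttfa() -> float:
    # Pipelined: synthesize each token as it arrives.
    # First audio trails the FIRST token, not the last.
    start = time.monotonic()
    async for _ in llm_tokens():
        await asyncio.sleep(TTS_LATENCY)  # synthesize this chunk
        return time.monotonic() - start


async def main():
    seq = await sequential_ttfa()
    pipe = await pipelined_ttfa()
    print(f"sequential TTFA ~{seq * 1000:.0f}ms, pipelined TTFA ~{pipe * 1000:.0f}ms")
    return seq, pipe


seq, pipe = asyncio.run(main())
```

With these numbers, sequential TTFA is about `TOKENS * TOKEN_GAP + TTS_LATENCY` (~250ms) while pipelined TTFA is about `TOKEN_GAP + TTS_LATENCY` (~70ms) — the same shape as the 1,175ms vs ~375ms table above.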
---

Source: https://tokrepo.com/en/workflows/cartesia-streaming-websocket-full-duplex-voice-agent-tts
Author: Cartesia