# Cartesia Sonic TTS — 75ms Time-to-First-Audio

> Cartesia Sonic is a state-space-model TTS with 75ms time-to-first-audio. 100+ voices, 5-second cloning, streaming WebSocket. The lowest-latency commercial TTS.

## Install

Copy the content below into your project:

## Quick Use

1. `pip install cartesia`, then get a CARTESIA_API_KEY at play.cartesia.ai
2. `client.tts.bytes(model_id='sonic-2', voice_id=ID, transcript=TEXT)` for batch synthesis
3. `client.tts.websocket()` for sub-75ms streaming voice-agent latency

---

## Intro

Cartesia Sonic is a production TTS built on state-space models rather than transformers: 75ms time-to-first-audio, the lowest of any commercial TTS. It offers 100+ pre-built voices, instant voice cloning from a 5-second sample, a streaming WebSocket API, 15 languages, and controllable speed and emotion.

Best for: voice agents where TTS latency dominates the round-trip budget, real-time games, fast-response IVRs, multilingual customer support.

Works with: the official Python SDK, REST, and WebSocket; LiveKit / Vapi plugins built in.

Setup time: 5 minutes.

---

### Basic synthesis (single audio buffer)

```python
import os

from cartesia import Cartesia

client = Cartesia(api_key=os.environ["CARTESIA_API_KEY"])

audio = client.tts.bytes(
    model_id="sonic-2",
    voice_id="a0e99841-438c-4a64-b679-ae501e7d6091",  # "Helpful Woman"
    transcript="Welcome back to TokRepo. You have three new asset notifications.",
    output_format={"container": "mp3", "sample_rate": 44_100},
    language="en",
)

with open("welcome.mp3", "wb") as f:
    f.write(audio)
```

### Streaming WebSocket (lowest latency)

```python
import asyncio
import os

import numpy as np
import sounddevice as sd
from cartesia import AsyncCartesia

client = AsyncCartesia(api_key=os.environ["CARTESIA_API_KEY"])

async def stream_tts(text: str):
    ws = await client.tts.websocket()
    async for chunk in ws.send(
        model_id="sonic-2",
        voice_id="a0e99841-438c-4a64-b679-ae501e7d6091",
        transcript=text,
        output_format={"container": "raw", "encoding": "pcm_s16le", "sample_rate": 22_050},
    ):
        audio = np.frombuffer(chunk.audio, dtype=np.int16)
        sd.play(audio, 22_050)  # play each chunk as it arrives
    await ws.close()

asyncio.run(stream_tts("Hi there! What can I help with today?"))
```

### Voice control (speed + emotion)

```python
audio = client.tts.bytes(
    model_id="sonic-2",
    voice_id="a0e99841-438c-4a64-b679-ae501e7d6091",
    transcript="Thank you for your patience — we'll have an answer for you soon.",
    voice={"__experimental_controls": {"speed": "slow", "emotion": ["positivity:high", "curiosity"]}},
    output_format={"container": "mp3"},
)
```

### Latency vs others (May 2026, p50)

| Provider | Time to first audio |
|---|---|
| **Cartesia Sonic** | **75ms** |
| Deepgram Aura | ~250ms |
| ElevenLabs Turbo v2.5 | ~280ms |
| OpenAI TTS-1 | ~400ms |
| Google Cloud TTS | ~500ms |

### Cost (May 2026)

- Pay-as-you-go: $0.025 / 1,000 characters
- Free tier: 10,000 characters/month
- Pro tier: 100,000 characters/month for $5

---

### FAQ

**Q: Why is Cartesia so much faster than transformer TTS?**
A: State-space models have inference cost that is linear in sequence length (transformers are quadratic). At short prompts the difference is small; for long-form generation Cartesia streams audio with constant time per frame. The 75ms TTFA is the architectural payoff.

**Q: How good is voice cloning from 5 seconds?**
A: Surprisingly good for English — recognizable timbre, accent, and pace.
Non-English source samples need ~10s for similar quality. For high-fidelity character voices, use a 30-second source clip via the Voice Design endpoint.

**Q: Cartesia vs ElevenLabs for production?**
A: Cartesia wins on latency by 200+ms, which is non-negotiable for voice agents. ElevenLabs wins on naturalness for long-form narration and on language coverage (32 languages vs 15). For chat-style voice agents → Cartesia. For audiobooks → ElevenLabs.

---

## Source & Thanks

> Built by [Cartesia](https://github.com/cartesia-ai). Docs at [docs.cartesia.ai](https://docs.cartesia.ai).
>
> [cartesia-ai/cartesia-python](https://github.com/cartesia-ai/cartesia-python) — official SDK

---

Source: https://tokrepo.com/en/workflows/cartesia-sonic-tts-75ms-time-to-first-audio
Author: Cartesia
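
### Appendix: estimating monthly TTS spend

As a quick sanity check on the Cost section above, here is a back-of-envelope estimate of monthly pay-as-you-go spend for a voice agent. The traffic numbers and the function name are hypothetical, and applying the free tier as a simple 10,000-character credit is an assumption (real billing may differ); only the $0.025 / 1,000-character rate comes from the pricing table.

```python
def monthly_tts_cost(chars_per_reply: int, replies_per_day: int,
                     price_per_1k_chars: float = 0.025,  # rate from the Cost section
                     free_chars: int = 10_000) -> float:
    """Estimate monthly pay-as-you-go TTS spend in USD.

    Assumes a 30-day month and treats the free tier as a flat
    character credit deducted before billing (an assumption).
    """
    monthly_chars = chars_per_reply * replies_per_day * 30
    billable = max(0, monthly_chars - free_chars)
    return billable / 1_000 * price_per_1k_chars

# Hypothetical agent: 200-character replies, 500 replies/day
# 3,000,000 chars/month; 2,990,000 billable -> $74.75
print(round(monthly_tts_cost(200, 500), 2))
```

At this volume the Pro tier ($5 for 100,000 characters/month) is exhausted in the first day, so pay-as-you-go dominates the budget.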