## Quick Use

- `pip install cartesia` and get a `CARTESIA_API_KEY` at play.cartesia.ai
- `client.tts.bytes(model_id="sonic-2", voice_id=ID, transcript=TEXT)` for batch
- `client.tts.websocket()` for sub-75ms streaming voice-agent latency
## Intro
Cartesia Sonic is a production TTS built on state-space models (not transformers) — 75ms time-to-first-audio, the lowest of any commercial TTS. 100+ pre-built voices, instant voice cloning from a 5-second sample, streaming WebSocket API, 15 languages, controllable speed and emotion. Best for: voice agents where TTS latency dominates round-trip budget, real-time games, fast-response IVRs, multilingual customer support. Works with: official Python SDK, REST, WebSocket; LiveKit / Vapi plugin built in. Setup time: 5 minutes.
## Basic synthesis (single audio buffer)

```python
import os

from cartesia import Cartesia

client = Cartesia(api_key=os.environ["CARTESIA_API_KEY"])

audio = client.tts.bytes(
    model_id="sonic-2",
    voice_id="a0e99841-438c-4a64-b679-ae501e7d6091",  # "Helpful Woman"
    transcript="Welcome back to TokRepo. You have three new asset notifications.",
    output_format={"container": "mp3", "sample_rate": 44_100},
    language="en",
)

with open("welcome.mp3", "wb") as f:
    f.write(audio)
```

## Streaming WebSocket (lowest latency)
```python
import asyncio

import sounddevice as sd

async def stream_tts(text: str):
    # Reuses the `client` from the basic-synthesis example above.
    ws = await client.tts.websocket()
    # Write PCM frames into a single output stream so chunks play
    # back-to-back; calling sd.play() per chunk would cut each one off.
    with sd.RawOutputStream(samplerate=22_050, channels=1, dtype="int16") as stream:
        async for chunk in ws.send(
            model_id="sonic-2",
            voice_id="a0e99841-438c-4a64-b679-ae501e7d6091",
            transcript=text,
            output_format={"container": "raw", "encoding": "pcm_s16le", "sample_rate": 22_050},
        ):
            stream.write(chunk.audio)  # play each frame as it arrives
    await ws.close()

asyncio.run(stream_tts("Hi there! What can I help with today?"))
```

## Voice control (speed + emotion)
```python
audio = client.tts.bytes(
    model_id="sonic-2",
    voice_id="a0e99841-438c-4a64-b679-ae501e7d6091",
    transcript="Thank you for your patience — we'll have an answer for you soon.",
    voice={"__experimental_controls": {"speed": "slow", "emotion": ["positivity:high", "curiosity"]}},
    output_format={"container": "mp3"},
)
```

## Latency vs others (May 2026, p50)
| Provider | Time to first audio |
|---|---|
| Cartesia Sonic | 75ms |
| Deepgram Aura | ~250ms |
| ElevenLabs Turbo v2.5 | ~280ms |
| OpenAI TTS-1 | ~400ms |
| Google Cloud TTS | ~500ms |
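What those p50 numbers mean for a full agent turn can be sketched with some arithmetic. The STT and LLM figures below are made-up placeholders, not measurements; only the TTFA values come from the table above.

```python
# Time-to-first-audio (ms) from the comparison table above.
TTFA_MS = {
    "cartesia_sonic": 75,
    "deepgram_aura": 250,
    "elevenlabs_turbo_v2_5": 280,
    "openai_tts_1": 400,
    "google_cloud_tts": 500,
}

def round_trip_ms(stt_ms: int, llm_ms: int, tts: str) -> int:
    """Time from end of user speech to first agent audio."""
    return stt_ms + llm_ms + TTFA_MS[tts]

# With hypothetical 150 ms STT and 350 ms LLM time-to-first-token:
baseline = round_trip_ms(150, 350, "openai_tts_1")      # 900 ms
with_sonic = round_trip_ms(150, 350, "cartesia_sonic")  # 575 ms
print(f"TTS swap saves {baseline - with_sonic} ms per turn")  # 325 ms
```

The point the table makes: when STT and LLM latency are fixed, the TTS engine is the one component of the budget you can cut by hundreds of milliseconds with a drop-in swap.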
## Cost (May 2026)
- Pay-as-you-go: $0.025 / 1,000 characters
- Free tier: 10,000 characters/month
- Pro tier: 100,000 chars/month for $5
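A quick sketch of what the pay-as-you-go rate works out to at volume. Pro-tier overage pricing isn't stated above, and whether the free allowance stacks with paid usage isn't either, so this only computes the published per-character rate; the traffic figures are hypothetical.

```python
PAYG_PER_1K = 0.025  # $ per 1,000 characters, pay-as-you-go
FREE_CHARS = 10_000  # free tier allowance per month
PRO_CHARS, PRO_PRICE = 100_000, 5.00  # Pro tier allowance and price

def payg_cost(chars: int) -> float:
    """Pay-as-you-go dollars for a month of `chars` synthesized characters."""
    return round(chars / 1_000 * PAYG_PER_1K, 2)

# Hypothetical agent: ~40 characters per spoken reply, 10,000 replies/month.
monthly_chars = 40 * 10_000  # 400,000 characters
print(payg_cost(monthly_chars))  # -> 10.0 (dollars)
```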
## FAQ
Q: Why is Cartesia so much faster than transformer TTS? A: State-space models have linear inference cost vs sequence length (transformers are quadratic). At short prompts the difference is small; at long-form generation Cartesia generates audio in true streaming with constant time-per-frame. The 75ms TTFA is the architectural payoff.
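The scaling claim in that answer can be illustrated with a toy operation count. The constants are arbitrary and this is not Cartesia's actual compute model; only the growth rates matter.

```python
def transformer_ops(n: int) -> int:
    # Self-attention: each new token attends to all prior tokens,
    # so total work is 1 + 2 + ... + n, i.e. O(n^2).
    return sum(i for i in range(1, n + 1))

def ssm_ops(n: int, state_cost: int = 1) -> int:
    # State-space model: fixed-size recurrent state update per token, O(n).
    return n * state_cost

for n in (100, 1_000, 10_000):
    print(n, transformer_ops(n) / ssm_ops(n))  # ratio grows linearly with n
```

Per-frame cost staying constant is what makes true streaming with a flat time-per-frame possible, which is the architectural payoff the answer describes.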
Q: How good is voice cloning from 5 seconds? A: Surprisingly good for English — recognizable timbre, accent, pace. Non-English source samples need ~10s for similar quality. For high-fidelity character voices, use a 30-second source clip via the Voice Design endpoint.
Q: Cartesia vs ElevenLabs for production? A: Cartesia wins on latency by 200+ms — non-negotiable for voice agents. ElevenLabs wins on naturalness for long-form narration and on language coverage (32 vs 15). For chat-style voice agents → Cartesia. For audiobooks → ElevenLabs.
## Source & Thanks
Built by Cartesia. Docs at docs.cartesia.ai.
cartesia-ai/cartesia-python — official SDK