# Cartesia Sonic TTS — 75ms Time-to-First-Audio

> Cartesia Sonic is a state-space-model TTS with 75ms time-to-first-audio. 100+ voices, 5-second cloning, streaming WebSocket. The lowest-latency commercial TTS.

## Install

Copy the content below into your project:

## Quick Use

1. `pip install cartesia`, then get a CARTESIA_API_KEY at play.cartesia.ai
2. `client.tts.bytes(model_id='sonic-2', voice_id=ID, transcript=TEXT)` for batch synthesis
3. `client.tts.websocket()` for sub-75ms streaming voice-agent latency

---

## Intro

Cartesia Sonic is a production TTS built on state-space models rather than transformers: 75ms time-to-first-audio, the lowest of any commercial TTS. It offers 100+ pre-built voices, instant voice cloning from a 5-second sample, a streaming WebSocket API, 15 languages, and controllable speed and emotion.

Best for: voice agents where TTS latency dominates the round-trip budget, real-time games, fast-response IVRs, multilingual customer support.

Works with: the official Python SDK, REST, and WebSocket; LiveKit / Vapi plugins built in.

Setup time: 5 minutes.

---

### Basic synthesis (single audio buffer)

```python
import os

from cartesia import Cartesia

client = Cartesia(api_key=os.environ["CARTESIA_API_KEY"])

audio = client.tts.bytes(
    model_id="sonic-2",
    voice_id="a0e99841-438c-4a64-b679-ae501e7d6091",  # "Helpful Woman"
    transcript="Welcome back to TokRepo. You have three new asset notifications.",
    output_format={"container": "mp3", "sample_rate": 44_100},
    language="en",
)

with open("welcome.mp3", "wb") as f:
    f.write(audio)
```

### Streaming WebSocket (lowest latency)

```python
import asyncio
import os

import numpy as np
import sounddevice as sd
from cartesia import AsyncCartesia

client = AsyncCartesia(api_key=os.environ["CARTESIA_API_KEY"])

async def stream_tts(text: str):
    ws = await client.tts.websocket()
    async for chunk in ws.send(
        model_id="sonic-2",
        voice_id="a0e99841-438c-4a64-b679-ae501e7d6091",
        transcript=text,
        output_format={"container": "raw", "encoding": "pcm_s16le", "sample_rate": 22_050},
    ):
        audio = np.frombuffer(chunk.audio, dtype=np.int16)
        sd.play(audio, 22_050)  # play each chunk as it arrives
    await ws.close()

asyncio.run(stream_tts("Hi there! What can I help with today?"))
```

### Voice control (speed + emotion)

```python
audio = client.tts.bytes(
    model_id="sonic-2",
    voice_id="a0e99841-438c-4a64-b679-ae501e7d6091",
    transcript="Thank you for your patience — we'll have an answer for you soon.",
    voice={"__experimental_controls": {"speed": "slow", "emotion": ["positivity:high", "curiosity"]}},
    output_format={"container": "mp3"},
)
```

### Latency vs others (May 2026, p50)

| Provider | Time to first audio |
|---|---|
| **Cartesia Sonic** | **75ms** |
| Deepgram Aura | ~250ms |
| ElevenLabs Turbo v2.5 | ~280ms |
| OpenAI TTS-1 | ~400ms |
| Google Cloud TTS | ~500ms |

### Cost (May 2026)

- Pay-as-you-go: $0.025 / 1,000 characters
- Free tier: 10,000 characters/month
- Pro tier: 100,000 characters/month for $5

---

### FAQ

**Q: Why is Cartesia so much faster than transformer TTS?**
A: State-space models have inference cost that is linear in sequence length (transformers are quadratic). At short prompts the difference is small; for long-form generation Cartesia streams audio with constant time per frame. The 75ms TTFA is the architectural payoff.

**Q: How good is voice cloning from 5 seconds?**
A: Surprisingly good for English — recognizable timbre, accent, and pace.
Non-English source samples need ~10s for similar quality. For high-fidelity character voices, use a 30-second source clip via the Voice Design endpoint.

**Q: Cartesia vs ElevenLabs for production?**
A: Cartesia wins on latency by 200+ms, which is non-negotiable for voice agents. ElevenLabs wins on naturalness for long-form narration and on language coverage (32 languages vs 15). For chat-style voice agents → Cartesia. For audiobooks → ElevenLabs.

---

## Source & Thanks

> Built by [Cartesia](https://github.com/cartesia-ai). Docs at [docs.cartesia.ai](https://docs.cartesia.ai).
>
> [cartesia-ai/cartesia-python](https://github.com/cartesia-ai/cartesia-python) — official SDK

---

Source: https://tokrepo.com/en/workflows/cartesia-sonic-tts-75ms-time-to-first-audio
Author: Cartesia
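
### Appendix: estimating monthly TTS spend

As a quick sanity check on the Cost section above, here is a back-of-envelope estimate of monthly pay-as-you-go spend for a voice agent. The traffic numbers and the function name are hypothetical, and applying the free tier as a simple 10,000-character credit is an assumption (real billing may differ); only the $0.025 / 1,000-character rate comes from the pricing table.

```python
def monthly_tts_cost(chars_per_reply: int, replies_per_day: int,
                     price_per_1k_chars: float = 0.025,  # rate from the Cost section
                     free_chars: int = 10_000) -> float:
    """Estimate monthly pay-as-you-go TTS spend in USD.

    Assumes a 30-day month and treats the free tier as a flat
    character credit deducted before billing (an assumption).
    """
    monthly_chars = chars_per_reply * replies_per_day * 30
    billable = max(0, monthly_chars - free_chars)
    return billable / 1_000 * price_per_1k_chars

# Hypothetical agent: 200-character replies, 500 replies/day
# 3,000,000 chars/month; 2,990,000 billable -> $74.75
print(round(monthly_tts_cost(200, 500), 2))
```

At this volume the Pro tier ($5 for 100,000 characters/month) is exhausted in the first day, so pay-as-you-go dominates the budget.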