Knowledge · May 11, 2026 · 4 min read

Cartesia Sonic TTS — 75ms Time-to-First-Audio

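Cartesia Sonic is a text-to-speech (TTS) service built on state-space models, with a stated 75ms time-to-first-audio, the lowest among the providers compared on this page. It offers 100+ voices, voice cloning from a 5-second sample, and a streaming WebSocket API.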

Agent ready

Safe staging for this asset

This asset is staged first. The copied prompt tells the agent to inspect the staged files and ask before activating scripts, MCP config, or global config.

Stage only · 27/100 · Policy: stage
Agent surface: Any MCP/CLI agent
Kind: Knowledge
Install: Stage only
Trust: Community
Entrypoint: Asset
Safe staging command
npx -y tokrepo@latest install 48e00964-c223-46ba-a45e-3ef76fbce082 --target codex

Stages files first; activation requires review of the staged README and plan.

Intro

Cartesia Sonic is a production TTS built on state-space models rather than transformers. Its stated time-to-first-audio is 75ms, which this page reports as the lowest of any commercial TTS. It ships 100+ pre-built voices, instant voice cloning from a 5-second sample, a streaming WebSocket API, 15 languages, and controllable speed and emotion.

Best for: voice agents where TTS latency dominates the round-trip budget, real-time games, fast-response IVRs, and multilingual customer support.
Works with: the official Python SDK, REST, and WebSocket; LiveKit and Vapi plugins are built in.
Setup time: 5 minutes.


Basic synthesis (single audio buffer)

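import os

from cartesia import Cartesia

# Read the API key from the environment instead of hard-coding it.
client = Cartesia(api_key=os.environ["CARTESIA_API_KEY"])

# One-shot synthesis: the full utterance comes back as a single MP3 buffer.
audio = client.tts.bytes(
    model_id="sonic-2",
    voice_id="a0e99841-438c-4a64-b679-ae501e7d6091",   # example voice ID; look up its name in the voice library
    transcript="Welcome back to TokRepo. You have three new asset notifications.",
    output_format={"container": "mp3", "sample_rate": 44_100},
    language="en",
)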

with open("welcome.mp3", "wb") as f:
    f.write(audio)
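The intro lists REST alongside the SDK. As a minimal sketch of what the same one-shot request might look like over plain HTTP, the helper below builds the request but does not send it. The endpoint URL, the `X-API-Key` and `Cartesia-Version` headers, the version date, and the nested `voice` object are assumptions based on Cartesia's public REST reference rather than on this page, and `build_tts_request` is a hypothetical helper name. Confirm each of them against the current API reference before use.

```python
import json
import urllib.request

API_URL = "https://api.cartesia.ai/tts/bytes"   # assumed endpoint
API_VERSION = "2025-04-16"                       # assumed version date; use the one in the docs


def build_tts_request(api_key: str, transcript: str, voice_id: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for one-shot synthesis."""
    payload = {
        "model_id": "sonic-2",
        "transcript": transcript,
        "voice": {"mode": "id", "id": voice_id},   # assumed REST form of voice_id
        "output_format": {"container": "mp3", "sample_rate": 44100},
        "language": "en",
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "X-API-Key": api_key,
            "Cartesia-Version": API_VERSION,
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_tts_request(
    "sk-demo",                                     # placeholder key, not a real credential
    "Welcome back to TokRepo.",
    "a0e99841-438c-4a64-b679-ae501e7d6091",
)
print(req.get_method(), req.full_url)
```

Sending it would be a matter of `urllib.request.urlopen(req).read()` with a real key; the response body would then be the audio bytes, under the same assumptions.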

Streaming WebSocket (lowest latency)

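import asyncio
import os

import numpy as np
import sounddevice as sd
from cartesia import AsyncCartesia

SAMPLE_RATE = 22_050

async def stream_tts(text: str) -> None:
    # `async for` needs the async client; the sync `Cartesia` client yields a plain generator.
    client = AsyncCartesia(api_key=os.environ["CARTESIA_API_KEY"])
    ws = await client.tts.websocket()
    # Keep one output stream open for the whole utterance. Calling sd.play()
    # per chunk would cut off the previous chunk each time.
    with sd.OutputStream(samplerate=SAMPLE_RATE, channels=1, dtype="int16") as speaker:
        # Awaiting send() to obtain the chunk generator matches recent SDK releases;
        # check the SDK reference if your installed version differs.
        chunks = await ws.send(
            model_id="sonic-2",
            voice_id="a0e99841-438c-4a64-b679-ae501e7d6091",
            transcript=text,
            output_format={"container": "raw", "encoding": "pcm_s16le", "sample_rate": SAMPLE_RATE},
        )
        async for chunk in chunks:
            # Raw bytes -> int16 samples, written to the speaker as they arrive.
            speaker.write(np.frombuffer(chunk.audio, dtype=np.int16))
    await ws.close()

asyncio.run(stream_tts("Hi there! What can I help with today?"))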
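On a server there is usually no audio device to play through. As a minimal sketch, the raw `pcm_s16le` chunks from the stream can be collected into a WAV file with the standard library instead. `save_pcm_chunks` is a hypothetical helper, and mono audio is an assumption, since the `output_format` above does not state a channel count. The silent byte strings stand in for real streamed chunks.

```python
import os
import tempfile
import wave
from typing import Iterable


def save_pcm_chunks(chunks: Iterable[bytes], path: str, sample_rate: int = 22_050) -> int:
    """Write 16-bit little-endian PCM chunks to a WAV file and return the frame count."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)             # assumed mono
        wav.setsampwidth(2)             # 16-bit samples are 2 bytes each
        wav.setframerate(sample_rate)   # must match the requested sample_rate
        for chunk in chunks:
            wav.writeframes(chunk)
    with wave.open(path, "rb") as wav:
        return wav.getnframes()


# Stand-in for streamed audio: two chunks of silence, 100 samples each.
fake_chunks = [b"\x00\x00" * 100, b"\x00\x00" * 100]
out_path = os.path.join(tempfile.gettempdir(), "stream_demo.wav")
frames = save_pcm_chunks(fake_chunks, out_path)
print(frames)  # 200
```

In the streaming loop, each `chunk.audio` would be passed in place of the silent chunks.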

Voice control (speed + emotion)

audio = client.tts.bytes(
    model_id="sonic-2",
    voice_id="a0e99841-438c-4a64-b679-ae501e7d6091",
    transcript="Thank you for your patience — we'll have an answer for you soon.",
    # Experimental controls: a speed preset plus emotion tags, each optionally suffixed with ":level".
    voice={"__experimental_controls": {"speed": "slow", "emotion": ["positivity:high", "curiosity"]}},
    output_format={"container": "mp3"},
)
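The control payload above mixes two tag forms: a bare emotion name (`"curiosity"`) and a name with a level (`"positivity:high"`). A small sketch of building that dictionary programmatically follows. `emotion_tag` and `voice_controls` are hypothetical helpers that only reproduce the shape shown on this page; they do not validate which speeds, emotions, or levels the service accepts.

```python
from typing import Dict, List, Optional


def emotion_tag(name: str, level: Optional[str] = None) -> str:
    """Format an emotion as 'name' or 'name:level', the two forms used above."""
    return f"{name}:{level}" if level else name


def voice_controls(speed: str, emotions: List[str]) -> Dict[str, dict]:
    """Wrap speed and emotion tags in the experimental-controls structure."""
    return {"__experimental_controls": {"speed": speed, "emotion": emotions}}


controls = voice_controls(
    "slow",
    [emotion_tag("positivity", "high"), emotion_tag("curiosity")],
)
print(controls)
```

The result can be passed as the `voice=` argument in the call above.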

Latency vs others (May 2026, p50)

| Provider | Time to first audio |
|---|---|
| Cartesia Sonic | 75ms |
| Deepgram Aura | ~250ms |
| ElevenLabs Turbo v2.5 | ~280ms |
| OpenAI TTS-1 | ~400ms |
| Google Cloud TTS | ~500ms |
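The page does not say how these figures were measured, so it is worth timing time-to-first-audio from your own client. The sketch below times any iterator of audio chunks: the delay until the first chunk, the total duration, and the bytes received. `measure_stream` and `fake_tts_stream` are hypothetical names, and the fake stream only simulates delay with `time.sleep`. A client-side measurement includes network and connection setup, so expect it to be higher than a provider's server-side number.

```python
import time
from typing import Iterable, Iterator, Optional, Tuple


def measure_stream(chunks: Iterable[bytes]) -> Tuple[Optional[float], float, int]:
    """Consume a chunk stream; return (seconds to first chunk, total seconds, total bytes)."""
    start = time.perf_counter()
    first_chunk_at: Optional[float] = None
    total_bytes = 0
    for chunk in chunks:
        if first_chunk_at is None:
            first_chunk_at = time.perf_counter() - start
        total_bytes += len(chunk)
    return first_chunk_at, time.perf_counter() - start, total_bytes


def fake_tts_stream() -> Iterator[bytes]:
    """Stand-in for a real TTS stream: two 1 KiB chunks, each after a 50 ms delay."""
    for _ in range(2):
        time.sleep(0.05)
        yield b"\x00" * 1024


ttfa, total, n_bytes = measure_stream(fake_tts_stream())
print(f"first chunk after {ttfa * 1000:.0f} ms, {n_bytes} bytes in {total * 1000:.0f} ms")
```

To time a real provider, pass its chunk iterator in place of `fake_tts_stream()` and repeat the run several times to get a median.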

Cost (May 2026)

  • Pay-as-you-go: $0.025 / 1,000 characters
  • Free tier: 10,000 characters/month
  • Pro tier: 100,000 characters/month for $5
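As a worked check on the listed prices, the sketch below computes the pay-as-you-go cost for a given volume and the effective per-1,000-character rate of the Pro tier. It takes the figures above at face value and assumes the free allowance is not deducted first; the function names are hypothetical.

```python
PAYG_USD_PER_1K = 0.025    # pay-as-you-go price per 1,000 characters, from the list above
PRO_USD_PER_MONTH = 5.00   # Pro tier monthly price, from the list above
PRO_CHARS = 100_000        # Pro tier monthly character allowance


def payg_cost(chars: int) -> float:
    """Pay-as-you-go cost in USD for `chars` characters."""
    return round(chars / 1000 * PAYG_USD_PER_1K, 4)


def pro_rate_per_1k() -> float:
    """Effective Pro price per 1,000 characters if the full allowance is used."""
    return round(PRO_USD_PER_MONTH / (PRO_CHARS / 1000), 4)


print(payg_cost(100_000))   # 2.5
print(pro_rate_per_1k())    # 0.05
```

On these numbers, 100,000 characters cost $2.50 pay-as-you-go against $5 on Pro, so check the pricing page for what else the Pro tier includes before choosing between them.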

FAQ

Q: Why is Cartesia so much faster than transformer TTS? A: State-space models carry a fixed-size state from one frame to the next, so inference cost grows linearly with sequence length, whereas transformer self-attention grows quadratically. On short prompts the difference is small; in long-form generation Cartesia streams audio with constant time per frame. The 75ms TTFA is presented as the payoff of this architecture.

Q: How good is voice cloning from 5 seconds? A: Surprisingly good for English: the clone keeps a recognizable timbre, accent, and pace. Non-English source samples need ~10s for similar quality. For high-fidelity character voices, use a 30-second source clip via the Voice Design endpoint.

Q: Cartesia vs ElevenLabs for production? A: Cartesia is about 200ms faster to first audio (75ms vs ~280ms in the latency table), which matters most for voice agents. ElevenLabs wins on naturalness for long-form narration and on language coverage (32 vs 15 languages). For chat-style voice agents, choose Cartesia; for audiobooks, choose ElevenLabs.
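The first FAQ answer rests on the idea that a state-space model does a fixed amount of work per output frame. The sketch below shows that property with a generic textbook recurrence, a diagonal linear state-space model with h_t = a·h_(t-1) + b·x_t and y_t = c·h_t. It is an illustration only, not Cartesia's architecture, and the coefficients are arbitrary.

```python
from typing import Iterable, Iterator, List


def ssm_stream(inputs: Iterable[float], a: List[float], b: List[float], c: List[float]) -> Iterator[float]:
    """Diagonal linear SSM that yields one output per input as it arrives.

    The state has a fixed size (len(a)), so each step costs the same
    regardless of how many inputs came before it.
    """
    state = [0.0] * len(a)
    for x in inputs:
        # h_t = a * h_(t-1) + b * x_t, element-wise over the state
        state = [a_i * h_i + b_i * x for a_i, h_i, b_i in zip(a, state, b)]
        # y_t = c . h_t
        yield sum(c_i * h_i for c_i, h_i in zip(c, state))


outputs = list(ssm_stream([1.0, 0.0, 0.0], a=[0.5, 0.25], b=[1.0, 1.0], c=[1.0, 1.0]))
print(outputs)  # [2.0, 0.75, 0.3125]
```

Because the generator yields after every input, the first output is available after a single step, which is the streaming behaviour the answer describes. Attention-based decoding, by contrast, has to attend over a context that grows with every generated frame.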


Quick Use

  1. Run pip install cartesia, get an API key at play.cartesia.ai, and export it as CARTESIA_API_KEY
  2. Call client.tts.bytes(model_id='sonic-2', voice_id=ID, transcript=TEXT) for batch synthesis
  3. Call client.tts.websocket() for low-latency streaming (about 75ms time-to-first-audio) in voice agents

Source & Thanks

Built by Cartesia. Docs at docs.cartesia.ai.

cartesia-ai/cartesia-python — official SDK
