Knowledge · May 11, 2026 · 4 min read

Cartesia Sonic TTS — 75ms Time-to-First-Audio

Cartesia Sonic is a state-space-model TTS with 75ms time-to-first-audio, 100+ voices, 5-second voice cloning, and a streaming WebSocket API. It is the lowest-latency commercial TTS.

Agent-ready

This asset can be read and installed directly by agents.

TokRepo exposes a universal CLI command, an install contract, JSON metadata, an adapter-specific plan, and the raw content so agents can evaluate compatibility, risk, and next steps.

Stage only · 15/100
Agent surface: Any MCP/CLI agent
Type: Knowledge
Install: Stage only
Trust: New
Input: Asset
Universal CLI command: npx tokrepo install 48e00964-c223-46ba-a45e-3ef76fbce082
Introduction

Cartesia Sonic is a production TTS built on state-space models (not transformers) — 75ms time-to-first-audio, the lowest of any commercial TTS. 100+ pre-built voices, instant voice cloning from a 5-second sample, streaming WebSocket API, 15 languages, controllable speed and emotion. Best for: voice agents where TTS latency dominates round-trip budget, real-time games, fast-response IVRs, multilingual customer support. Works with: official Python SDK, REST, WebSocket; LiveKit / Vapi plugin built in. Setup time: 5 minutes.


Basic synthesis (single audio buffer)

import os

from cartesia import Cartesia

client = Cartesia(api_key=os.environ["CARTESIA_API_KEY"])

audio = client.tts.bytes(
    model_id="sonic-2",
    voice_id="a0e99841-438c-4a64-b679-ae501e7d6091",   # "Helpful Woman"
    transcript="Welcome back to TokRepo. You have three new asset notifications.",
    output_format={"container": "mp3", "sample_rate": 44_100},
    language="en",
)

with open("welcome.mp3", "wb") as f:
    f.write(audio)

Streaming WebSocket (lowest latency)

import asyncio
import os

import sounddevice as sd
from cartesia import AsyncCartesia

client = AsyncCartesia(api_key=os.environ["CARTESIA_API_KEY"])

async def stream_tts(text: str):
    ws = await client.tts.websocket()
    # Keep one output stream open so chunks play back-to-back without gaps
    with sd.RawOutputStream(samplerate=22_050, channels=1, dtype="int16") as speaker:
        async for chunk in ws.send(
            model_id="sonic-2",
            voice_id="a0e99841-438c-4a64-b679-ae501e7d6091",
            transcript=text,
            output_format={"container": "raw", "encoding": "pcm_s16le", "sample_rate": 22_050},
        ):
            speaker.write(chunk.audio)   # play as it arrives
    await ws.close()

asyncio.run(stream_tts("Hi there! What can I help with today?"))
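When you want to post-process the stream (VAD, resampling, level metering) instead of playing it, the raw pcm_s16le chunks decode with nothing but the standard library. A minimal sketch; the helper name and the synthetic two-sample chunk are illustrative, not part of the SDK:

```python
import struct

def pcm16_to_float(chunk: bytes) -> list[float]:
    """Decode a raw pcm_s16le chunk (little-endian 16-bit) to floats in [-1, 1]."""
    n = len(chunk) // 2
    samples = struct.unpack(f"<{n}h", chunk)
    return [s / 32768.0 for s in samples]

# Synthetic chunk: maximum positive and maximum negative amplitude
chunk = struct.pack("<2h", 32767, -32768)
floats = pcm16_to_float(chunk)   # ≈ [0.99997, -1.0]
```

The same decode works on every chunk yielded by the WebSocket loop above, since the stream requests `encoding="pcm_s16le"`.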

Voice control (speed + emotion)

audio = client.tts.bytes(
    model_id="sonic-2",
    transcript="Thank you for your patience — we'll have an answer for you soon.",
    voice={
        "mode": "id",
        "id": "a0e99841-438c-4a64-b679-ae501e7d6091",
        "__experimental_controls": {"speed": "slow", "emotion": ["positivity:high", "curiosity"]},
    },
    output_format={"container": "mp3", "sample_rate": 44_100},
)

Latency vs others (May 2026, p50)

Provider                   Time to first audio (p50)
Cartesia Sonic             75ms
Deepgram Aura              ~250ms
ElevenLabs Turbo v2.5      ~280ms
OpenAI TTS-1               ~400ms
Google Cloud TTS           ~500ms
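In a serial voice-agent pipeline, TTS time-to-first-audio adds directly to the response budget. A toy calculation: the STT and LLM figures are assumptions for illustration, and only the TTS numbers come from the table above.

```python
def round_trip_ms(stt: float, llm_first_token: float, tts_first_audio: float) -> float:
    """Time from end of user speech to first agent audio, serial pipeline."""
    return stt + llm_first_token + tts_first_audio

# Assumed: 150ms streaming STT finalization, 350ms LLM time-to-first-token
with_sonic = round_trip_ms(stt=150, llm_first_token=350, tts_first_audio=75)    # 575 ms
with_slower_tts = round_trip_ms(stt=150, llm_first_token=350, tts_first_audio=280)  # 780 ms
```

Under these assumptions the TTS swap alone moves the agent from clearly-noticeable delay toward the sub-600ms range users read as conversational.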

Cost (May 2026)

  • Pay-as-you-go: $0.025 / 1,000 characters
  • Free tier: 10,000 characters/month
  • Pro tier: 100,000 characters/month for $5
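A quick way to sanity-check spend at the pay-as-you-go rate above. A sketch that assumes the 10,000 free characters simply offset the first 10,000 characters each month:

```python
def monthly_cost_usd(chars_per_month: int, free_chars: int = 10_000,
                     rate_per_1k: float = 0.025) -> float:
    """Pay-as-you-go cost after the free allowance, per the pricing above."""
    billable = max(0, chars_per_month - free_chars)
    return billable / 1_000 * rate_per_1k

# Example: an agent speaking ~500 chars per call, 2,000 calls/month = 1,000,000 chars
cost = monthly_cost_usd(1_000_000)   # (1_000_000 - 10_000) / 1_000 * 0.025 = $24.75
```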

FAQ

Q: Why is Cartesia so much faster than transformer TTS? A: State-space models have inference cost linear in sequence length (transformers are quadratic). At short prompts the difference is small; for long-form generation Cartesia streams audio with constant time per frame. The 75ms TTFA is the architectural payoff.
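The linear-vs-quadratic contrast can be illustrated with a toy step count (a cost model, not a profiler trace): each new transformer frame attends over all previous frames, while an SSM does constant work per frame against a fixed-size recurrent state.

```python
def transformer_steps(n: int) -> int:
    # Frame k attends over all k previous frames: 1 + 2 + ... + n
    return n * (n + 1) // 2

def ssm_steps(n: int) -> int:
    # Constant work per frame against a fixed-size state
    return n

# At 10x the sequence length, transformer work grows ~100x; SSM work grows 10x
ratio_tf = transformer_steps(10_000) / transformer_steps(1_000)   # ≈ 99.9
ratio_ssm = ssm_steps(10_000) / ssm_steps(1_000)                  # 10.0
```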

Q: How good is voice cloning from 5 seconds? A: Surprisingly good for English — recognizable timbre, accent, pace. Non-English source samples need ~10s for similar quality. For high-fidelity character voices, use a 30-second source clip via the Voice Design endpoint.

Q: Cartesia vs ElevenLabs for production? A: Cartesia wins on latency by 200+ms — non-negotiable for voice agents. ElevenLabs wins on naturalness for long-form narration and on language coverage (32 vs 15). For chat-style voice agents → Cartesia. For audiobooks → ElevenLabs.


Quick Use

  1. pip install cartesia and get CARTESIA_API_KEY at play.cartesia.ai
  2. client.tts.bytes(model_id='sonic-2', voice_id=ID, transcript=TEXT) for batch
  3. client.tts.websocket() for sub-75ms streaming voice agent latency

Source & Thanks

Built by Cartesia. Docs at docs.cartesia.ai.

cartesia-ai/cartesia-python — official SDK

🙏

