Knowledge · May 11, 2026 · 4 min read

Cartesia Sonic TTS — 75ms Time-to-First-Audio

Cartesia Sonic is a state-space-model TTS with 75ms time-to-first-audio, 100+ voices, 5-second voice cloning, and a streaming WebSocket API. It is the lowest-latency commercial TTS.

Agent-ready

This asset can be read and installed directly by agents.

TokRepo exposes a universal CLI command, an install contract, JSON metadata, an adapter-specific plan, and the raw content so agents can evaluate compatibility, risk, and next steps.

Stage only · 15/100
Agent surface: Any MCP/CLI agent
Type: Knowledge
Install: Stage only
Trust: New
Input: Asset
Universal CLI command: npx tokrepo install 48e00964-c223-46ba-a45e-3ef76fbce082
Introduction

Cartesia Sonic is a production TTS built on state-space models (not transformers) — 75ms time-to-first-audio, the lowest of any commercial TTS. 100+ pre-built voices, instant voice cloning from a 5-second sample, streaming WebSocket API, 15 languages, controllable speed and emotion. Best for: voice agents where TTS latency dominates round-trip budget, real-time games, fast-response IVRs, multilingual customer support. Works with: official Python SDK, REST, WebSocket; LiveKit / Vapi plugin built in. Setup time: 5 minutes.


Basic synthesis (single audio buffer)

import os

from cartesia import Cartesia

client = Cartesia(api_key=os.environ["CARTESIA_API_KEY"])

audio = client.tts.bytes(
    model_id="sonic-2",
    voice_id="a0e99841-438c-4a64-b679-ae501e7d6091",   # "Helpful Woman"
    transcript="Welcome back to TokRepo. You have three new asset notifications.",
    output_format={"container": "mp3", "sample_rate": 44_100},
    language="en",
)

with open("welcome.mp3", "wb") as f:
    f.write(audio)

Streaming WebSocket (lowest latency)

import asyncio
import os

import sounddevice as sd
from cartesia import AsyncCartesia

client = AsyncCartesia(api_key=os.environ["CARTESIA_API_KEY"])

async def stream_tts(text: str):
    ws = await client.tts.websocket()
    # Keep one output stream open so chunks play back-to-back without gaps
    with sd.RawOutputStream(samplerate=22_050, channels=1, dtype="int16") as speaker:
        async for chunk in ws.send(
            model_id="sonic-2",
            voice_id="a0e99841-438c-4a64-b679-ae501e7d6091",
            transcript=text,
            output_format={"container": "raw", "encoding": "pcm_s16le", "sample_rate": 22_050},
        ):
            speaker.write(chunk.audio)   # play as it arrives
    await ws.close()

asyncio.run(stream_tts("Hi there! What can I help with today?"))
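When you want to post-process the stream (VAD, resampling, level metering) instead of playing it, the raw pcm_s16le chunks decode with nothing but the standard library. A minimal sketch; the helper name and the synthetic two-sample chunk are illustrative, not part of the SDK:

```python
import struct

def pcm16_to_float(chunk: bytes) -> list[float]:
    """Decode a raw pcm_s16le chunk (little-endian 16-bit) to floats in [-1, 1]."""
    n = len(chunk) // 2
    samples = struct.unpack(f"<{n}h", chunk)
    return [s / 32768.0 for s in samples]

# Synthetic chunk: maximum positive and maximum negative amplitude
chunk = struct.pack("<2h", 32767, -32768)
floats = pcm16_to_float(chunk)   # ≈ [0.99997, -1.0]
```

The same decode works on every chunk yielded by the WebSocket loop above, since the stream requests `encoding="pcm_s16le"`.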

Voice control (speed + emotion)

audio = client.tts.bytes(
    model_id="sonic-2",
    transcript="Thank you for your patience — we'll have an answer for you soon.",
    voice={
        "mode": "id",
        "id": "a0e99841-438c-4a64-b679-ae501e7d6091",
        "__experimental_controls": {"speed": "slow", "emotion": ["positivity:high", "curiosity"]},
    },
    output_format={"container": "mp3", "sample_rate": 44_100},
)

Latency vs others (May 2026, p50)

Provider                   Time to first audio (p50)
Cartesia Sonic             75ms
Deepgram Aura              ~250ms
ElevenLabs Turbo v2.5      ~280ms
OpenAI TTS-1               ~400ms
Google Cloud TTS           ~500ms
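In a serial voice-agent pipeline, TTS time-to-first-audio adds directly to the response budget. A toy calculation: the STT and LLM figures are assumptions for illustration, and only the TTS numbers come from the table above.

```python
def round_trip_ms(stt: float, llm_first_token: float, tts_first_audio: float) -> float:
    """Time from end of user speech to first agent audio, serial pipeline."""
    return stt + llm_first_token + tts_first_audio

# Assumed: 150ms streaming STT finalization, 350ms LLM time-to-first-token
with_sonic = round_trip_ms(stt=150, llm_first_token=350, tts_first_audio=75)    # 575 ms
with_slower_tts = round_trip_ms(stt=150, llm_first_token=350, tts_first_audio=280)  # 780 ms
```

Under these assumptions the TTS swap alone moves the agent from clearly-noticeable delay toward the sub-600ms range users read as conversational.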

Cost (May 2026)

  • Pay-as-you-go: $0.025 / 1,000 characters
  • Free tier: 10,000 characters/month
  • Pro tier: 100,000 characters/month for $5
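A quick way to sanity-check spend at the pay-as-you-go rate above. A sketch that assumes the 10,000 free characters simply offset the first 10,000 characters each month:

```python
def monthly_cost_usd(chars_per_month: int, free_chars: int = 10_000,
                     rate_per_1k: float = 0.025) -> float:
    """Pay-as-you-go cost after the free allowance, per the pricing above."""
    billable = max(0, chars_per_month - free_chars)
    return billable / 1_000 * rate_per_1k

# Example: an agent speaking ~500 chars per call, 2,000 calls/month = 1,000,000 chars
cost = monthly_cost_usd(1_000_000)   # (1_000_000 - 10_000) / 1_000 * 0.025 = $24.75
```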

FAQ

Q: Why is Cartesia so much faster than transformer TTS? A: State-space models have inference cost linear in sequence length (transformers are quadratic). At short prompts the difference is small; for long-form generation Cartesia streams audio with constant time per frame. The 75ms TTFA is the architectural payoff.
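The linear-vs-quadratic contrast can be illustrated with a toy step count (a cost model, not a profiler trace): each new transformer frame attends over all previous frames, while an SSM does constant work per frame against a fixed-size recurrent state.

```python
def transformer_steps(n: int) -> int:
    # Frame k attends over all k previous frames: 1 + 2 + ... + n
    return n * (n + 1) // 2

def ssm_steps(n: int) -> int:
    # Constant work per frame against a fixed-size state
    return n

# At 10x the sequence length, transformer work grows ~100x; SSM work grows 10x
ratio_tf = transformer_steps(10_000) / transformer_steps(1_000)   # ≈ 99.9
ratio_ssm = ssm_steps(10_000) / ssm_steps(1_000)                  # 10.0
```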

Q: How good is voice cloning from 5 seconds? A: Surprisingly good for English — recognizable timbre, accent, pace. Non-English source samples need ~10s for similar quality. For high-fidelity character voices, use a 30-second source clip via the Voice Design endpoint.

Q: Cartesia vs ElevenLabs for production? A: Cartesia wins on latency by 200+ms — non-negotiable for voice agents. ElevenLabs wins on naturalness for long-form narration and on language coverage (32 vs 15). For chat-style voice agents → Cartesia. For audiobooks → ElevenLabs.


Quick Use

  1. pip install cartesia and get CARTESIA_API_KEY at play.cartesia.ai
  2. client.tts.bytes(model_id='sonic-2', voice_id=ID, transcript=TEXT) for batch
  3. client.tts.websocket() for sub-75ms streaming voice agent latency

Source & Thanks

Built by Cartesia. Docs at docs.cartesia.ai.

cartesia-ai/cartesia-python — official SDK

🙏

