Knowledge · May 11, 2026 · 4 min read

Cartesia Sonic TTS — 75ms Time-to-First-Audio

Cartesia Sonic is a state-space-model TTS with 75ms time-to-first-audio. 100+ voices, 5s cloning, streaming WebSocket. Lowest-latency TTS.

Cartesia · Community
Agent-ready

This asset can be read and installed directly by agents.

TokRepo exposes a universal CLI command, an install contract, JSON metadata, an adapter-specific plan, and the raw content to help agents judge fit, risk, and next actions.

Stage only · 15/100
Agent surface: Any MCP/CLI agent
Type: Knowledge
Installation: Stage only
Trust: New
Entry point: Asset
Universal CLI command: npx tokrepo install 48e00964-c223-46ba-a45e-3ef76fbce082
Introduction

Cartesia Sonic is a production TTS built on state-space models (not transformers) — 75ms time-to-first-audio, the lowest of any commercial TTS. 100+ pre-built voices, instant voice cloning from a 5-second sample, streaming WebSocket API, 15 languages, controllable speed and emotion. Best for: voice agents where TTS latency dominates round-trip budget, real-time games, fast-response IVRs, multilingual customer support. Works with: official Python SDK, REST, WebSocket; LiveKit / Vapi plugin built in. Setup time: 5 minutes.


Basic synthesis (single audio buffer)

import os

from cartesia import Cartesia

client = Cartesia(api_key=os.environ["CARTESIA_API_KEY"])

audio = client.tts.bytes(
    model_id="sonic-2",
    voice_id="a0e99841-438c-4a64-b679-ae501e7d6091",   # "Helpful Woman"
    transcript="Welcome back to TokRepo. You have three new asset notifications.",
    output_format={"container": "mp3", "sample_rate": 44_100},
    language="en",
)

with open("welcome.mp3", "wb") as f:
    f.write(audio)

Streaming WebSocket (lowest latency)

import asyncio
import os

import numpy as np
import sounddevice as sd
from cartesia import AsyncCartesia

client = AsyncCartesia(api_key=os.environ["CARTESIA_API_KEY"])

async def stream_tts(text: str):
    ws = await client.tts.websocket()
    # Write each chunk into one open output stream so playback is gapless;
    # calling sd.play() per chunk would cut off the previous chunk.
    with sd.OutputStream(samplerate=22_050, channels=1, dtype="int16") as stream:
        async for chunk in ws.send(
            model_id="sonic-2",
            voice_id="a0e99841-438c-4a64-b679-ae501e7d6091",
            transcript=text,
            output_format={"container": "raw", "encoding": "pcm_s16le", "sample_rate": 22_050},
        ):
            stream.write(np.frombuffer(chunk.audio, dtype=np.int16))
    await ws.close()

asyncio.run(stream_tts("Hi there! What can I help with today?"))

Voice control (speed + emotion)

audio = client.tts.bytes(
    model_id="sonic-2",
    voice_id="a0e99841-438c-4a64-b679-ae501e7d6091",
    transcript="Thank you for your patience — we'll have an answer for you soon.",
    voice={"__experimental_controls": {"speed": "slow", "emotion": ["positivity:high", "curiosity"]}},
    output_format={"container": "mp3", "sample_rate": 44_100},
)

Latency vs others (May 2026, p50)

Provider                   Time to first audio
Cartesia Sonic             75 ms
Deepgram Aura              ~250 ms
ElevenLabs Turbo v2.5      ~280 ms
OpenAI TTS-1               ~400 ms
Google Cloud TTS           ~500 ms

Cost (May 2026)

  • Pay-as-you-go: $0.025 / 1,000 characters
  • Free tier: 10,000 characters/month
  • Pro tier: 100,000 characters/month for $5
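A quick back-of-envelope for the pay-as-you-go rate above. This sketch assumes the 10,000-character free tier offsets metered usage each month; check Cartesia's billing docs for how the tiers actually combine.

```python
RATE_PER_1K_CHARS = 0.025  # USD per 1,000 characters, pay-as-you-go

def monthly_cost(chars_per_month: int, free_tier: int = 10_000) -> float:
    """Estimated USD cost after subtracting the free-tier allowance."""
    billable = max(0, chars_per_month - free_tier)
    return billable * RATE_PER_1K_CHARS / 1_000

print(monthly_cost(500_000))  # ~ $12.25 for half a million characters
print(monthly_cost(5_000))    # under the free tier, so $0
```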

FAQ

Q: Why is Cartesia so much faster than transformer TTS? A: State-space models have linear inference cost vs sequence length (transformers are quadratic). At short prompts the difference is small; at long-form generation Cartesia generates audio in true streaming with constant time-per-frame. The 75ms TTFA is the architectural payoff.
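The linear-vs-quadratic scaling above can be made concrete with a toy op count. This is purely illustrative (ignoring constants and model size): attention cost grows with every token already generated, while an SSM does one fixed-size state update per token.

```python
def tts_step_costs(n: int) -> tuple[int, int]:
    """Rough op counts for generating n tokens: quadratic attention vs linear SSM."""
    attention_ops = n * n  # each new token attends to all previous tokens
    ssm_ops = n            # one constant-size state update per token
    return attention_ops, ssm_ops

for n in (100, 1_000, 10_000):
    att, ssm = tts_step_costs(n)
    print(f"n={n:>6}: attention ~{att:>12,} ops, SSM ~{ssm:>7,} ops")
```

The gap widens by another factor of n at each row, which is why the advantage shows up most in long-form, streaming generation.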

Q: How good is voice cloning from 5 seconds? A: Surprisingly good for English — recognizable timbre, accent, pace. Non-English source samples need ~10s for similar quality. For high-fidelity character voices, use a 30-second source clip via the Voice Design endpoint.

Q: Cartesia vs ElevenLabs for production? A: Cartesia wins on latency by 200+ms — non-negotiable for voice agents. ElevenLabs wins on naturalness for long-form narration and on language coverage (32 vs 15). For chat-style voice agents → Cartesia. For audiobooks → ElevenLabs.


Quick Use

  1. pip install cartesia and get CARTESIA_API_KEY at play.cartesia.ai
  2. client.tts.bytes(model_id='sonic-2', voice_id=ID, transcript=TEXT) for batch
  3. client.tts.websocket() for sub-75ms streaming voice agent latency

Source & Thanks

Built by Cartesia. Docs at docs.cartesia.ai.

cartesia-ai/cartesia-python — official SDK

