Scripts · May 11, 2026 · 5 min read

Deepgram Aura TTS — Text-to-Speech for Voice Agents

Deepgram Aura TTS produces natural English speech with ~250ms time-to-first-audio. Streaming WebSocket and REST APIs, 12 voices, tuned for conversational agents rather than narration.

Agent ready

This asset can be read and installed directly by agents.

TokRepo exposes a universal CLI command, install contract, metadata JSON, adapter-aware plan, and raw content links so agents can judge fit, risk, and next actions.

Stage only · 17/100
Agent surface
Any MCP/CLI agent
Kind
Skill
Install
Stage only
Trust
Trust: New
Entrypoint
Asset
Universal CLI install command
npx tokrepo install 12787f56-eff3-402d-a1f7-1eb2ce567400
Intro

Aura is Deepgram's TTS — purpose-built for conversational voice agents rather than long-form narration. 250ms time-to-first-audio, 12 English voices tuned for natural turn-taking, streaming WebSocket and REST APIs. Pairs natively with Deepgram STT for a low-friction single-vendor voice stack. Best for: customer support voice agents, IVR replacement, voice copilots where 'sounds like a real person on a phone' beats 'audiobook quality'. Works with: Deepgram SDKs, REST, WebSocket, Voice Agent API. Setup time: 5 minutes.


Single audio buffer (REST)

import os

import requests

resp = requests.post(
    "https://api.deepgram.com/v1/speak",
    headers={"Authorization": f"Token {os.environ['DEEPGRAM_API_KEY']}", "Content-Type": "application/json"},
    params={"model": "aura-2-luna-en", "encoding": "mp3"},
    json={"text": "Welcome back to TokRepo. You have three new asset notifications."},
)
resp.raise_for_status()  # surface auth/quota errors instead of writing an error body to disk
with open("welcome.mp3", "wb") as f:
    f.write(resp.content)
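For longer utterances you may not want to hold the whole MP3 in memory before playback or saving. A minimal sketch that spools the response to disk as chunks arrive; `stream=True` and `iter_content` are plain `requests` features, not Deepgram-specific, and `synthesize_streaming` is a hypothetical helper name:

```python
import os

import requests

def write_chunks(chunks, path):
    """Write an iterable of byte chunks to a file; return total bytes written."""
    written = 0
    with open(path, "wb") as f:
        for chunk in chunks:
            f.write(chunk)
            written += len(chunk)
    return written

def synthesize_streaming(text, path, model="aura-2-luna-en"):
    """POST to /v1/speak and spool the MP3 to disk as the body arrives."""
    resp = requests.post(
        "https://api.deepgram.com/v1/speak",
        headers={"Authorization": f"Token {os.environ['DEEPGRAM_API_KEY']}"},
        params={"model": model, "encoding": "mp3"},
        json={"text": text},
        stream=True,  # iterate the response body instead of buffering it whole
    )
    resp.raise_for_status()
    return write_chunks(resp.iter_content(chunk_size=4096), path)
```

This trades a few milliseconds of per-chunk overhead for a flat memory profile, which matters once responses run past a few sentences.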

Streaming WebSocket

import asyncio
import os

import numpy as np
import sounddevice as sd
from deepgram import DeepgramClient

dg = DeepgramClient(os.environ["DEEPGRAM_API_KEY"])

async def stream():
    ws = dg.speak.websocket.v("1")
    await ws.start({
        "model": "aura-2-luna-en",
        "encoding": "linear16",
        "sample_rate": 24000,
    })

    ws.on("AudioData", lambda data: sd.play(np.frombuffer(data, dtype=np.int16), 24000, blocking=False))

    await ws.send_text("Hi there! How can I help you today?")
    await ws.flush()
    await ws.wait_for_complete()
    await ws.finish()

asyncio.run(stream())
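When the text comes from a streaming LLM rather than a fixed string, a common pattern is to buffer tokens until a sentence boundary and flush each complete sentence through `send_text`, so Aura can start speaking before the full reply exists. A sketch of the buffering side, assuming the `ws` methods shown above and a stand-in async token source; the naive period/question-mark split is illustrative and will mis-handle abbreviations like "Dr.":

```python
SENTENCE_ENDS = (".", "!", "?")

def drain_sentences(buffer: str):
    """Split complete sentences off a token buffer.

    Returns (list_of_complete_sentences, remaining_partial_text)."""
    sentences, start = [], 0
    for i, ch in enumerate(buffer):
        if ch in SENTENCE_ENDS:
            sentences.append(buffer[start:i + 1].strip())
            start = i + 1
    return sentences, buffer[start:]

async def speak_llm_stream(ws, token_stream):
    """Feed an async LLM token stream to the TTS websocket sentence-by-sentence."""
    pending = ""
    async for token in token_stream:
        pending += token
        ready, pending = drain_sentences(pending)
        for sentence in ready:
            await ws.send_text(sentence)  # synthesis starts before the reply is done
    if pending.strip():
        await ws.send_text(pending)  # flush any trailing fragment
    await ws.flush()
```

Sentence-level flushing keeps perceived latency close to the model's first-sentence time rather than its full-response time.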

Voice catalog (Aura 2)

Voice ID            Description
aura-2-luna-en      Warm American female, default
aura-2-stella-en    Bright American female, podcast energy
aura-2-orion-en     Deep American male, authoritative
aura-2-arcas-en     Mid-30s American male, conversational
aura-2-asteria-en   Calm British female
aura-2-hera-en      Professional American female, customer-service
aura-2-helios-en    Warm British male
aura-2-perseus-en   American male, neutral

Spanish, French, German, and Portuguese voices added throughout 2026 — check developers.deepgram.com/docs/text-to-speech for current language list.
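Switching voices is just a matter of the `model` parameter. A small filter helper over the catalog above; the gender/accent tags are my own reading of the voice descriptions, not metadata returned by the API:

```python
# Catalog from the table above, tagged by the traits in each description.
AURA_VOICES = {
    "aura-2-luna-en":    {"gender": "female", "accent": "american"},
    "aura-2-stella-en":  {"gender": "female", "accent": "american"},
    "aura-2-orion-en":   {"gender": "male",   "accent": "american"},
    "aura-2-arcas-en":   {"gender": "male",   "accent": "american"},
    "aura-2-asteria-en": {"gender": "female", "accent": "british"},
    "aura-2-hera-en":    {"gender": "female", "accent": "american"},
    "aura-2-helios-en":  {"gender": "male",   "accent": "british"},
    "aura-2-perseus-en": {"gender": "male",   "accent": "american"},
}

def find_voices(gender=None, accent=None):
    """Return voice IDs matching the given filters (None matches anything)."""
    return [
        vid for vid, tags in AURA_VOICES.items()
        if (gender is None or tags["gender"] == gender)
        and (accent is None or tags["accent"] == accent)
    ]
```

The returned ID drops straight into the `model` query parameter of either the REST or WebSocket examples.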

Aura vs ElevenLabs vs Cartesia

Metric               Aura               ElevenLabs        Cartesia
Time to first audio  ~250ms             ~280ms            ~75ms
English naturalness  High               Highest           High
Long-form narration  Fair               Excellent         Good
Conversational fit   Excellent          Excellent         Excellent
Languages            EN (more in 2026)  32                15
Cost                 $0.015/min         $0.015–0.18/min   $0.025/1k chars

Pricing

  • Aura TTS: $0.015/min equivalent (~$0.030/1k characters)
  • Free tier: $200 credit at signup
  • Voice Agent API bundles STT+LLM+TTS at one per-minute rate
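At the listed rate, a back-of-the-envelope spend check for an agent deployment; the rate is copied from the bullet above, and the call volumes are made-up inputs, so verify against current Deepgram pricing before budgeting:

```python
AURA_PER_MIN = 0.015  # $/minute of synthesized audio, per the pricing above

def monthly_tts_cost(calls_per_day, tts_minutes_per_call, days=30):
    """Estimate monthly Aura TTS spend in dollars."""
    return calls_per_day * tts_minutes_per_call * days * AURA_PER_MIN

# e.g. 500 calls/day with ~1.5 min of agent speech per call:
# 500 * 1.5 * 30 * 0.015 = $337.50/month
```

Note only the agent's speaking time is billed here; caller speech is STT, priced separately (or bundled via the Voice Agent API).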

FAQ

Q: Why use Aura over ElevenLabs? A: Single vendor (one bill, one SLA) when paired with Deepgram STT. Faster TTFA than ElevenLabs Turbo. Voice library is smaller — pick ElevenLabs for character voice diversity or 32-language coverage.

Q: Does Aura support SSML? A: Limited SSML support — pauses, emphasis, basic prosody. Full SSML, such as phoneme tags, isn't supported. For complex prosodic control, ElevenLabs or Cartesia have richer markup.

Q: Voice cloning? A: Not yet on Aura — voices are curated. ElevenLabs and Cartesia both offer cloning. If branded custom voice is critical, those are the platforms. If catalog voices suffice, Aura's quality + latency wins for agents.


Quick Use

  1. POST /v1/speak?model=aura-2-luna-en with JSON {text} for batch
  2. WebSocket dg.speak.websocket.v('1') for streaming voice agents
  3. Pair with Deepgram STT or use Voice Agent API for full stack


Source & Thanks

Built by Deepgram. Aura TTS docs at developers.deepgram.com/docs/text-to-speech.

deepgram/deepgram-python-sdk

🙏
