Skills · May 11, 2026 · 5 min read

Deepgram Voice Agent API — Unified STT+LLM+TTS

Deepgram Voice Agent API bundles STT + your LLM + Aura TTS into one WebSocket. Full-duplex voice. Turn detection and barge-in configurable.

Agent-ready

This asset can be read and installed directly by agents.

TokRepo exposes a universal CLI command, an install contract, JSON metadata, an adapter-specific plan, and raw content so agents can evaluate compatibility, risk, and next steps.

Install: Stage only · 17/100
Agent surface: Any MCP/CLI agent
Type: Skill
Installation: Stage only
Trust: New
Input: Asset

Universal CLI command
npx tokrepo install 848be675-d14d-45b0-a887-9c440d433ee7
Introduction

The Deepgram Voice Agent API bundles Deepgram STT (Nova-3), your LLM of choice (Anthropic, OpenAI, Groq, AWS Bedrock), and Deepgram Aura TTS into one WebSocket connection. Send mic audio in, receive agent audio out — turn detection, barge-in, function calling all handled. Best for: voice agents that don't need component-level swap, fast launch, single-vendor billing. Works with: any WebSocket-capable platform; Python, JS, Go SDKs. Setup time: 15 minutes.


Set up the WebSocket agent

import asyncio, json, os
from deepgram import DeepgramClient

dg = DeepgramClient(os.environ["DEEPGRAM_API_KEY"])

agent_config = {
    "type": "SettingsConfiguration",
    "audio": {
        "input":  {"encoding": "linear16", "sample_rate": 16000},
        "output": {"encoding": "linear16", "sample_rate": 24000, "container": "none"},
    },
    "agent": {
        "listen": {"model": "nova-3"},
        "speak":  {"model": "aura-2-luna-en"},
        "think": {
            "provider": {"type": "anthropic"},
            "model":     "claude-3-5-sonnet-20241022",
            "instructions": "You are a friendly customer support agent for TokRepo. Keep replies under 2 sentences.",
        },
    },
}

async def run_agent():
    agent = dg.agent.websocket.v("1")
    await agent.start(agent_config)

    async def on_audio_output(data: bytes):
        # Play this on speakers (or send to WebRTC peer)
        await play_audio(data)

    agent.on("AudioOutput", on_audio_output)
    agent.on("ConversationText", lambda role, content: print(f"{role}: {content}"))
    agent.on("UserStartedSpeaking", lambda: print("user speaking — barge-in"))

    # Feed mic audio
    async for chunk in mic_audio():
        await agent.send(chunk)

asyncio.run(run_agent())
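The snippet above assumes two helper coroutines, `mic_audio()` and `play_audio()`, that you supply yourself. A minimal stand-in for `mic_audio()` streams raw linear16 PCM from a file in ~100 ms chunks; the file path is hypothetical, and a real app would capture from a microphone device instead:

```python
import asyncio

# 16 kHz * 2 bytes per sample, 100 ms per chunk — matches the
# "linear16" / 16000 Hz input config above.
CHUNK_BYTES = 16000 * 2 // 10

async def mic_audio(path="sample_16k.raw"):
    """Yield raw PCM chunks, paced roughly at real time."""
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK_BYTES):
            yield chunk
            await asyncio.sleep(0.1)
```

The pacing sleep matters: sending a whole file at once defeats turn detection, which relies on audio arriving at conversational speed.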

Function calling

agent_config["agent"]["think"]["functions"] = [{
    "name": "lookup_order",
    "description": "Look up an order by ID",
    "parameters": {
        "type": "object",
        "properties": {"order_id": {"type": "string"}},
        "required": ["order_id"],
    },
}]

agent.on("FunctionCallRequest", lambda fn_call: handle_function(fn_call))
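A `handle_function` implementation needs to run the lookup and return a response the agent can speak from. The sketch below is a hypothetical handler: the exact field names on the request and response messages (`function_name`, `function_call_id`, `input`, `FunctionCallResponse`) are assumptions, so check the current Voice Agent docs before relying on them, and the order lookup itself is a stub:

```python
import json

def handle_function(fn_call: dict) -> dict:
    """Execute a requested function and build the response message.

    Field names here are assumptions about the message schema —
    verify them against the Voice Agent docs for your API version.
    """
    if fn_call["function_name"] == "lookup_order":
        order_id = json.loads(fn_call["input"])["order_id"]
        result = {"order_id": order_id, "status": "shipped"}  # stub lookup
    else:
        result = {"error": f"unknown function {fn_call['function_name']}"}
    return {
        "type": "FunctionCallResponse",
        "function_call_id": fn_call["function_call_id"],
        "output": json.dumps(result),
    }
```

Keep handlers fast: the caller is waiting on the line while the function runs, so anything slower than a few hundred milliseconds should be acknowledged verbally first.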

LLM provider options

Provider | Notes
openai | gpt-4o, gpt-4o-mini
anthropic | claude-3-5-sonnet, haiku
groq | Llama 3.3 70B at 280 tok/s — lowest latency
aws_bedrock | Bedrock-hosted models (good for regulated AWS shops)
custom | Any OpenAI-compatible endpoint
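Switching LLMs is a config-only change to the `think` block. A small helper can keep presets side by side; the preset keys mirror the `agent_config["agent"]["think"]` structure used earlier, and the specific model identifiers are assumptions that may have newer versions by the time you read this:

```python
# Hypothetical presets — model IDs are examples and may be outdated.
THINK_PRESETS = {
    "openai":    {"provider": {"type": "openai"},    "model": "gpt-4o-mini"},
    "anthropic": {"provider": {"type": "anthropic"}, "model": "claude-3-5-sonnet-20241022"},
    "groq":      {"provider": {"type": "groq"},      "model": "llama-3.3-70b-versatile"},
}

def with_think(config: dict, preset: str, instructions: str) -> dict:
    """Return a copy of config with the think block swapped out."""
    new_think = {**THINK_PRESETS[preset], "instructions": instructions}
    return {**config, "agent": {**config["agent"], "think": new_think}}
```

Because the swap is non-destructive, you can A/B different providers against the same listen/speak settings without touching the rest of the session setup.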

Aura TTS voice cheat sheet

Voice ID | Best for
aura-2-luna-en | Default — warm American female
aura-2-stella-en | Energetic, podcast-style
aura-2-asteria-en | Calm British female
aura-2-orion-en | Authoritative American male

Voice Agent vs DIY pipeline

Need | Choose
Ship fast, single vendor | Voice Agent API
Use any TTS / STT / LLM mix | DIY (LiveKit Agents)
Need ultra-low TTS latency | DIY with Cartesia TTS
Need open-weight LLM at low cost | DIY with Groq Llama 3.3

FAQ

Q: How is this different from ElevenLabs ConvAI? A: Both are managed voice agent APIs. Deepgram leans on its in-house STT strength and lets you pick the LLM; ElevenLabs leans on its TTS strength. If STT quality matters more (call centers, noisy audio) → Deepgram. If voice naturalness matters more (consumer brand) → ElevenLabs.

Q: Turn detection — how good? A: Deepgram uses VAD + utterance-end signals (default 1000ms silence threshold). Tune endpointing for snappier (300ms) or more patient (2000ms) cutoffs. Aggressive endpointing risks chopping speech; conservative wastes time.
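The endpointing trade-off above can be expressed as a small config helper. Where the endpointing knob lives in the settings message is an assumption here (shown on the `listen` block); confirm the exact field name and units in the Voice Agent docs:

```python
def endpointing_ms(style: str) -> int:
    """Map a conversational style to a silence threshold in milliseconds.

    snappy (300 ms) risks cutting off slow speakers; patient (2000 ms)
    adds dead air before the agent replies; default is 1000 ms.
    """
    return {"snappy": 300, "default": 1000, "patient": 2000}[style]

# Hypothetical placement of the knob — verify against the docs.
listen_config = {"model": "nova-3", "endpointing": endpointing_ms("snappy")}
```

A reasonable approach is to start at the default, then tighten in 200 ms steps while monitoring how often users get interrupted mid-sentence.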

Q: Pricing model? A: Bundled per-minute pricing, billed on conversation duration; roughly $0.08/min on standard configurations. Cheaper than DIY at low volume; DIY wins at high volume, where you can optimize per-component costs.
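A back-of-envelope comparison makes the crossover concrete. The bundled rate is the ~$0.08/min quoted above; the DIY per-minute figure is a hypothetical placeholder, since real DIY costs depend on your STT/LLM/TTS mix and volume discounts:

```python
def monthly_cost(minutes: float, per_min: float) -> float:
    """Monthly spend for a given usage and per-minute rate."""
    return minutes * per_min

BUNDLED_RATE = 0.08  # $/min, from the FAQ above
DIY_RATE = 0.05      # $/min — hypothetical, illustration only

bundled = monthly_cost(10_000, BUNDLED_RATE)  # $800 at 10k min/month
diy = monthly_cost(10_000, DIY_RATE)          # $500 at 10k min/month
```

The gap is small at hobby volumes but compounds quickly: at 100k min/month the same hypothetical rates diverge by $3,000/month, which is when per-component optimization starts paying for its engineering cost.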


Quick Use

  1. pip install deepgram-sdk
  2. Build agent_config with listen/think/speak sections
  3. dg.agent.websocket.v('1').start(agent_config) + send mic audio + play AudioOutput


Source & Thanks

Built by Deepgram. Voice Agent docs at developers.deepgram.com/docs/voice-agent.

deepgram/deepgram-python-sdk

🙏
