Skills · May 11, 2026 · 5 min read

Deepgram Voice Agent API — Unified STT+LLM+TTS

Deepgram Voice Agent API bundles STT + your LLM + Aura TTS into one WebSocket. Full-duplex voice. Turn detection and barge-in configurable.

Deepgram
Deepgram · Community
Agent-ready

This asset can be read and installed directly by agents.

TokRepo exposes a universal CLI command, an install contract, the JSON metadata, a per-adapter plan, and the raw content to help agents judge fit, risk, and next actions.

Stage only · 17/100
Agent surface: any MCP/CLI agent
Type: Skill
Installation: Stage only
Trust: New
Entry point: Asset
Universal CLI command:
npx tokrepo install 848be675-d14d-45b0-a887-9c440d433ee7
Introduction

The Deepgram Voice Agent API bundles Deepgram STT (Nova-3), your LLM of choice (Anthropic, OpenAI, Groq, AWS Bedrock), and Deepgram Aura TTS into one WebSocket connection. Send mic audio in, receive agent audio out — turn detection, barge-in, and function calling are all handled. Best for: voice agents that don't need component-level swapping, want a fast launch, and prefer single-vendor billing. Works with: any WebSocket-capable platform; Python, JS, and Go SDKs. Setup time: 15 minutes.


Set up the WebSocket agent

import asyncio, os
from deepgram import DeepgramClient

dg = DeepgramClient(os.environ["DEEPGRAM_API_KEY"])

agent_config = {
    "type": "SettingsConfiguration",
    "audio": {
        "input":  {"encoding": "linear16", "sample_rate": 16000},
        "output": {"encoding": "linear16", "sample_rate": 24000, "container": "none"},
    },
    "agent": {
        "listen": {"model": "nova-3"},
        "speak":  {"model": "aura-2-luna-en"},
        "think": {
            "provider": {"type": "anthropic"},
            "model":     "claude-3-5-sonnet-20241022",
            "instructions": "You are a friendly customer support agent for TokRepo. Keep replies under 2 sentences.",
        },
    },
}

async def run_agent():
    agent = dg.agent.websocket.v("1")
    await agent.start(agent_config)

    async def on_audio_output(data: bytes):
        # Play this on speakers (or send to WebRTC peer)
        await play_audio(data)

    agent.on("AudioOutput", on_audio_output)
    agent.on("ConversationText", lambda role, content: print(f"{role}: {content}"))
    agent.on("UserStartedSpeaking", lambda: print("user speaking — barge-in"))

    # Feed mic audio (mic_audio() is an app-specific async generator of raw PCM chunks)
    async for chunk in mic_audio():
        await agent.send(chunk)

asyncio.run(run_agent())
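The loop above assumes `mic_audio()` and `play_audio()` helpers that the SDK does not provide. A minimal hardware-free sketch (silence in, buffer out), keeping the frame math consistent with `agent_config`, might look like this; a real app would swap in sounddevice/pyaudio for capture and a speaker or WebRTC track for playback:

```python
# Hypothetical stand-ins for mic_audio() / play_audio() so the agent loop
# can be exercised without real audio hardware.
import asyncio

CHUNK_MS = 20                 # send 20 ms frames, a common real-time size
SAMPLE_RATE = 16_000          # matches agent_config audio.input
BYTES_PER_SAMPLE = 2          # linear16 = 16-bit PCM
CHUNK_BYTES = SAMPLE_RATE * BYTES_PER_SAMPLE * CHUNK_MS // 1000

async def mic_audio(n_chunks: int = 5):
    """Yield raw PCM frames; here, silence instead of a live microphone."""
    for _ in range(n_chunks):
        yield b"\x00" * CHUNK_BYTES
        await asyncio.sleep(CHUNK_MS / 1000)  # pace like a real mic

playback_buffer = bytearray()

async def play_audio(data: bytes):
    """Collect agent audio; a real app writes this to the sound device."""
    playback_buffer.extend(data)
```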

Function calling

agent_config["agent"]["think"]["functions"] = [{
    "name": "lookup_order",
    "description": "Look up an order by ID",
    "parameters": {
        "type": "object",
        "properties": {"order_id": {"type": "string"}},
        "required": ["order_id"],
    },
}]

agent.on("FunctionCallRequest", lambda fn_call: handle_function(fn_call))

LLM provider options

| Provider | Notes |
| --- | --- |
| openai | gpt-4o, gpt-4o-mini |
| anthropic | claude-3-5-sonnet, haiku |
| groq | Llama 3.3 70B at 280 tok/s — lowest latency |
| aws_bedrock | Bedrock-hosted models (good for regulated AWS shops) |
| custom | Any OpenAI-compatible endpoint |
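Swapping the `think` provider is a one-dict change in `agent_config`. A minimal sketch using the groq row above (the model id is an example; confirm current names in Deepgram's docs):

```python
# Start from a minimal config and retarget the LLM; only the "think"
# section changes, listen/speak stay as configured.
agent_config = {
    "agent": {
        "think": {
            "provider": {"type": "anthropic"},
            "model": "claude-3-5-sonnet-20241022",
            "instructions": "You are a concise support agent.",
        }
    }
}

agent_config["agent"]["think"].update(
    provider={"type": "groq"},
    model="llama-3.3-70b-versatile",  # example id for Llama 3.3 70B
)
```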

Aura TTS voice cheat sheet

| Voice ID | Best for |
| --- | --- |
| aura-2-luna-en | Default — warm American female |
| aura-2-stella-en | Energetic, podcast-style |
| aura-2-asteria-en | Calm British female |
| aura-2-orion-en | Authoritative American male |

Voice Agent vs DIY pipeline

| Need | Choose |
| --- | --- |
| Ship fast, single vendor | Voice Agent API |
| Use any TTS / STT / LLM mix | DIY (LiveKit Agents) |
| Need ultra-low TTS latency | DIY with Cartesia TTS |
| Need open-weight LLM at low cost | DIY with Groq Llama 3.3 |

FAQ

Q: How is this different from ElevenLabs ConvAI? A: Both are managed voice agent APIs. Deepgram leans on its in-house STT strength and lets you pick the LLM; ElevenLabs leans on its TTS strength. If STT quality matters more (call centers, noisy audio) → Deepgram. If voice naturalness matters more (consumer brand) → ElevenLabs.

Q: Turn detection — how good? A: Deepgram uses VAD + utterance-end signals (default 1000ms silence threshold). Tune endpointing for snappier (300ms) or more patient (2000ms) cutoffs. Aggressive endpointing risks chopping speech; conservative wastes time.
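As a sketch, the silence threshold could be tuned in the listen section. The `endpointing` key (milliseconds) mirrors Deepgram's STT API; whether the agent's settings message accepts it in this exact shape is an assumption to verify against the Voice Agent docs:

```python
# Hypothetical: snappier end-of-turn detection (300 ms instead of the
# 1000 ms default).
listen_config = {"model": "nova-3", "endpointing": 300}

# Patient variant for callers who pause mid-thought:
patient_listen = {**listen_config, "endpointing": 2000}
```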

Q: Pricing model? A: Bundled per-minute pricing, billed on conversation duration, roughly $0.08/min on standard configurations. Cheaper than DIY at low volume; DIY wins at high volume, where you can optimize per-component costs.
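A back-of-envelope break-even check based on the quoted ~$0.08/min figure; the DIY per-minute cost and fixed overhead below are placeholder assumptions, not quoted prices:

```python
BUNDLED_PER_MIN = 0.08     # quoted standard-config rate
DIY_PER_MIN = 0.05         # placeholder: assumed optimized per-component cost
DIY_FIXED_MONTHLY = 600.0  # placeholder: infra + engineering overhead

def monthly_cost(minutes: float, per_min: float, fixed: float = 0.0) -> float:
    return minutes * per_min + fixed

# Under these assumptions the bundle wins below roughly 20,000 min/month:
break_even = DIY_FIXED_MONTHLY / (BUNDLED_PER_MIN - DIY_PER_MIN)
```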


Quick Use

  1. pip install deepgram-sdk
  2. Build agent_config with listen/think/speak sections
  3. dg.agent.websocket.v('1').start(agent_config) + send mic audio + play AudioOutput

Source & Thanks

Built by Deepgram. Voice Agent docs at developers.deepgram.com/docs/voice-agent.

deepgram/deepgram-python-sdk

🙏
