# Deepgram Voice Agent API — Unified STT+LLM+TTS

> Deepgram Voice Agent API bundles STT + your LLM + Aura TTS into one WebSocket. Full-duplex voice. Turn detection and barge-in configurable.

## Install

Save the content below to `.claude/skills/` or append to your `CLAUDE.md`.

## Quick Use

1. `pip install deepgram-sdk`
2. Build `agent_config` with listen/think/speak sections
3. `dg.agent.websocket.v('1').start(agent_config)` + send mic audio + play `AudioOutput`

---

## Intro

The Deepgram Voice Agent API bundles Deepgram STT (Nova-3), your LLM of choice (Anthropic, OpenAI, Groq, AWS Bedrock), and Deepgram Aura TTS into one WebSocket connection. Send mic audio in, receive agent audio out — turn detection, barge-in, and function calling are all handled.

Best for: voice agents that don't need component-level swaps, fast launch, single-vendor billing.

Works with: any WebSocket-capable platform; Python, JS, and Go SDKs.

Setup time: 15 minutes.

---

### Set up the WebSocket agent

```python
import asyncio
import os

from deepgram import DeepgramClient

dg = DeepgramClient(os.environ["DEEPGRAM_API_KEY"])

agent_config = {
    "type": "SettingsConfiguration",
    "audio": {
        "input": {"encoding": "linear16", "sample_rate": 16000},
        "output": {"encoding": "linear16", "sample_rate": 24000, "container": "none"},
    },
    "agent": {
        "listen": {"model": "nova-3"},
        "speak": {"model": "aura-2-luna-en"},
        "think": {
            "provider": {"type": "anthropic"},
            "model": "claude-3-5-sonnet-20241022",
            "instructions": "You are a friendly customer support agent for TokRepo. Keep replies under 2 sentences.",
        },
    },
}

async def run_agent():
    agent = dg.agent.websocket.v("1")
    await agent.start(agent_config)

    async def on_audio_output(data: bytes):
        # Play this on speakers (or send to a WebRTC peer)
        await play_audio(data)

    agent.on("AudioOutput", on_audio_output)
    agent.on("ConversationText", lambda role, content: print(f"{role}: {content}"))
    agent.on("UserStartedSpeaking", lambda: print("user speaking — barge-in"))

    # Feed mic audio to the agent
    async for chunk in mic_audio():
        await agent.send(chunk)

asyncio.run(run_agent())
```
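The snippet above calls `mic_audio()` and `play_audio()` without defining them; they are placeholders for your app's audio I/O. A minimal sketch of both, assuming the third-party `sounddevice` library; the helper names come from the snippet above, and the rates and chunk size simply mirror `agent_config`:

```python
import asyncio

import sounddevice as sd  # third-party: pip install sounddevice

IN_RATE, OUT_RATE = 16000, 24000  # must match agent_config["audio"]

async def mic_audio(chunk_ms: int = 50):
    """Yield raw linear16 mic chunks at 16 kHz."""
    frames = IN_RATE * chunk_ms // 1000
    loop = asyncio.get_running_loop()
    with sd.RawInputStream(samplerate=IN_RATE, channels=1, dtype="int16") as stream:
        while True:
            # stream.read blocks, so run it on a worker thread to keep
            # the event loop (and the agent websocket) responsive.
            data, _overflowed = await loop.run_in_executor(None, stream.read, frames)
            yield bytes(data)

# One persistent speaker stream at the agent's 24 kHz output rate.
_speaker = sd.RawOutputStream(samplerate=OUT_RATE, channels=1, dtype="int16")
_speaker.start()

async def play_audio(data: bytes):
    """Write one linear16 chunk from an AudioOutput event to the speakers."""
    _speaker.write(data)
```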
### Function calling

```python
agent_config["agent"]["think"]["functions"] = [{
    "name": "lookup_order",
    "description": "Look up an order by ID",
    "parameters": {
        "type": "object",
        "properties": {"order_id": {"type": "string"}},
        "required": ["order_id"],
    },
}]

agent.on("FunctionCallRequest", handle_function)
```
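`handle_function` is left undefined above. A minimal sketch of one way to complete the loop, with loud caveats: the `FunctionCallResponse` message shape and field names (`function_name`, `function_call_id`, `input`, `output`) are assumptions to verify against the Voice Agent docs; `fetch_order` is a hypothetical backend call; and it assumes the same `agent.send` used for audio also accepts JSON text frames:

```python
import json

async def handle_function(fn_call):
    """Run the requested tool locally, then return the result to the agent.

    fn_call is treated as a dict here; the SDK may hand you a typed object
    with the same fields. Field names are assumptions; verify them against
    the Voice Agent reference.
    """
    if fn_call["function_name"] == "lookup_order":
        order_id = fn_call["input"]["order_id"]
        result = fetch_order(order_id)  # hypothetical: your own backend lookup
    else:
        result = {"error": f"unknown function: {fn_call['function_name']}"}

    # Echo the call ID back so the LLM can match the result to its request
    # and speak the answer to the caller.
    await agent.send(json.dumps({
        "type": "FunctionCallResponse",
        "function_call_id": fn_call["function_call_id"],
        "output": json.dumps(result),
    }))
```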
### LLM provider options

| Provider | Notes |
|---|---|
| `openai` | gpt-4o, gpt-4o-mini |
| `anthropic` | claude-3-5-sonnet, haiku |
| `groq` | Llama 3.3 70B at 280 tok/s — lowest latency |
| `aws_bedrock` | Bedrock-hosted models (good for regulated AWS shops) |
| `custom` | Any OpenAI-compatible endpoint |

### Aura TTS voice cheat sheet

| Voice ID | Best for |
|---|---|
| `aura-2-luna-en` | Default — warm American female |
| `aura-2-stella-en` | Energetic, podcast-style |
| `aura-2-asteria-en` | Calm British female |
| `aura-2-orion-en` | Authoritative American male |

### Voice Agent vs DIY pipeline

| Need | Choose |
|---|---|
| Ship fast, single vendor | **Voice Agent API** |
| Use any TTS / STT / LLM mix | DIY (LiveKit Agents) |
| Need ultra-low TTS latency | DIY with Cartesia TTS |
| Need open-weight LLM at low cost | DIY with Groq Llama 3.3 |

---

### FAQ

**Q: How is this different from ElevenLabs ConvAI?**
A: Both are managed voice agent APIs. Deepgram leans on its in-house STT strength and lets you pick the LLM; ElevenLabs leans on its TTS strength. If STT quality matters more (call centers, noisy audio) → Deepgram. If voice naturalness matters more (consumer brands) → ElevenLabs.

**Q: Turn detection — how good?**
A: Deepgram uses VAD plus utterance-end signals (default 1000 ms silence threshold). Tune `endpointing` for snappier (300 ms) or more patient (2000 ms) cutoffs; a config sketch appears at the end of this page. Aggressive endpointing risks chopping off speech mid-utterance; conservative endpointing wastes time after the user finishes.

**Q: Pricing model?**
A: Bundled per-minute pricing, billed on conversation duration. Roughly $0.08/min on standard configurations, so 10,000 minutes runs about $800. Cheaper than DIY at low volume; DIY wins at high volume, where you can optimize per-component costs.

---

## Source & Thanks

> Built by [Deepgram](https://github.com/deepgram). Voice Agent docs at [developers.deepgram.com/docs/voice-agent](https://developers.deepgram.com).
>
> [deepgram/deepgram-python-sdk](https://github.com/deepgram/deepgram-python-sdk)

---

Source: https://tokrepo.com/en/workflows/deepgram-voice-agent-api-unified-stt-llm-tts
Author: Deepgram
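### Appendix: tuning `endpointing`

A sketch of the turn-detection tuning mentioned in the FAQ. One loud assumption: that the Voice Agent's `listen` block accepts an `endpointing` value in milliseconds the way Deepgram's plain STT streaming API does. The key placement below is unverified, so check the Voice Agent settings reference before relying on it.

```python
# Snappier turn-taking: treat 300 ms of silence as end of the user's turn.
# ASSUMPTION: "endpointing" inside the listen block is modeled on the plain
# STT streaming API's parameter; verify against the Voice Agent docs.
agent_config["agent"]["listen"]["endpointing"] = 300

# More patient: wait a full 2 seconds before treating silence as end-of-turn.
# agent_config["agent"]["listen"]["endpointing"] = 2000
```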