Quick Use
- `pip install deepgram-sdk`
- Build `agent_config` with listen/think/speak sections
- `dg.agent.websocket.v('1').start(agent_config)` + send mic audio + play AudioOutput
Intro
The Deepgram Voice Agent API bundles Deepgram STT (Nova-3), your LLM of choice (Anthropic, OpenAI, Groq, AWS Bedrock), and Deepgram Aura TTS into one WebSocket connection. Send mic audio in, receive agent audio out; turn detection, barge-in, and function calling are all handled for you. Best for: voice agents that don't need component-level swapping, fast launches, single-vendor billing. Works with: any WebSocket-capable platform; Python, JS, and Go SDKs. Setup time: 15 minutes.
Set up the WebSocket agent
```python
import asyncio
import os

from deepgram import DeepgramClient

dg = DeepgramClient(os.environ["DEEPGRAM_API_KEY"])

agent_config = {
    "type": "SettingsConfiguration",
    "audio": {
        "input": {"encoding": "linear16", "sample_rate": 16000},
        "output": {"encoding": "linear16", "sample_rate": 24000, "container": "none"},
    },
    "agent": {
        "listen": {"model": "nova-3"},
        "speak": {"model": "aura-2-luna-en"},
        "think": {
            "provider": {"type": "anthropic"},
            "model": "claude-3-5-sonnet-20241022",
            "instructions": "You are a friendly customer support agent for TokRepo. Keep replies under 2 sentences.",
        },
    },
}

async def run_agent():
    agent = dg.agent.websocket.v("1")
    await agent.start(agent_config)

    async def on_audio_output(data: bytes):
        # Play this on speakers (or forward to a WebRTC peer)
        await play_audio(data)

    agent.on("AudioOutput", on_audio_output)
    agent.on("ConversationText", lambda role, content: print(f"{role}: {content}"))
    agent.on("UserStartedSpeaking", lambda: print("user speaking: barge-in"))

    # Feed mic audio; mic_audio() and play_audio() are app-specific (see sketch below)
    async for chunk in mic_audio():
        await agent.send(chunk)

asyncio.run(run_agent())
```
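The snippet above leaves `mic_audio()` and `play_audio()` to you. Here is a minimal sketch using the sounddevice library (the library choice and 100 ms chunking are assumptions; any audio I/O stack works), matching the 16 kHz linear16 input and 24 kHz output declared in `agent_config`:

```python
import asyncio
import sounddevice as sd  # assumption: pip install sounddevice

async def mic_audio(chunk_ms: int = 100):
    """Yield raw linear16 mic chunks at 16 kHz (matches agent_config input)."""
    loop = asyncio.get_running_loop()
    queue: asyncio.Queue = asyncio.Queue()

    def callback(indata, frames, time_info, status):
        # Runs on the audio thread; copy the buffer and hand it to the event loop
        loop.call_soon_threadsafe(queue.put_nowait, bytes(indata))

    blocksize = int(16000 * chunk_ms / 1000)
    with sd.RawInputStream(samplerate=16000, channels=1, dtype="int16",
                           blocksize=blocksize, callback=callback):
        while True:
            yield await queue.get()

# One long-lived output stream at 24 kHz (matches agent_config output)
_speaker = sd.RawOutputStream(samplerate=24000, channels=1, dtype="int16")
_speaker.start()

async def play_audio(data: bytes):
    # RawOutputStream.write blocks, so push it off the event loop
    await asyncio.to_thread(_speaker.write, data)
```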
agent_config["agent"]["think"]["functions"] = [{
"name": "lookup_order",
"description": "Look up an order by ID",
"parameters": {
"type": "object",
"properties": {"order_id": {"type": "string"}},
"required": ["order_id"],
},
}]
agent.on("FunctionCallRequest", lambda fn_call: handle_function(fn_call))LLM provider options
LLM provider options

| Provider | Notes |
|---|---|
| openai | gpt-4o, gpt-4o-mini |
| anthropic | claude-3-5-sonnet, haiku |
| groq | Llama 3.3 70B at 280 tok/s, lowest latency |
| aws_bedrock | Bedrock-hosted models (good for regulated AWS shops) |
| custom | Any OpenAI-compatible endpoint |
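Swapping providers is a one-field change in the think section. For example, to move from Anthropic to Groq (the model ID `llama-3.3-70b-versatile` is an assumption; verify against Groq's current model list):

```python
agent_config["agent"]["think"] = {
    "provider": {"type": "groq"},
    "model": "llama-3.3-70b-versatile",  # assumption: check Groq's model list
    "instructions": "You are a friendly customer support agent for TokRepo. Keep replies under 2 sentences.",
}
```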
Aura TTS voice cheat sheet

| Voice ID | Best for |
|---|---|
| aura-2-luna-en | Default: warm American female |
| aura-2-stella-en | Energetic, podcast-style |
| aura-2-asteria-en | Calm British female |
| aura-2-orion-en | Authoritative American male |
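Voices swap the same way, with no other config changes:

```python
agent_config["agent"]["speak"]["model"] = "aura-2-stella-en"  # energetic, podcast-style
```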
Voice Agent vs DIY pipeline
| Need | Choose |
|---|---|
| Ship fast, single vendor | Voice Agent API |
| Use any TTS / STT / LLM mix | DIY (LiveKit Agents) |
| Need ultra-low TTS latency | DIY with Cartesia TTS |
| Need open-weight LLM at low cost | DIY with Groq Llama 3.3 |
FAQ
Q: How is this different from ElevenLabs ConvAI?
A: Both are managed voice agent APIs. Deepgram leans on its in-house STT strength and lets you pick the LLM; ElevenLabs leans on its TTS strength. If STT quality matters more (call centers, noisy audio) → Deepgram. If voice naturalness matters more (consumer brand) → ElevenLabs.
Q: Turn detection — how good?
A: Deepgram uses VAD + utterance-end signals (default 1000ms silence threshold). Tune endpointing for snappier (300ms) or more patient (2000ms) cutoffs. Aggressive endpointing risks chopping speech; conservative wastes time.
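If you want snappier or more patient turns, endpointing is tuned in the listen config. A sketch, assuming the agent passes through Deepgram's standard endpointing parameter (the key name and placement are assumptions; verify against the Voice Agent docs):

```python
# Assumption: key name/placement unverified; check the Voice Agent docs
agent_config["agent"]["listen"]["endpointing"] = 300  # ms of silence that ends a turn
```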
Q: Pricing model?
A: Bundled per-minute pricing billed on conversation duration, roughly $0.08/min on standard configurations (so 10,000 conversation minutes comes to about $800/month). Cheaper than DIY at low volume; DIY wins at high volume, where you can optimize per-component costs.
Source & Thanks
Built by Deepgram. Voice Agent docs at developers.deepgram.com/docs/voice-agent.