Quick Use
- Get a GROQ_API_KEY at console.groq.com
- Transcribe: client.audio.transcriptions.create(model='whisper-large-v3', file=open(path, 'rb'))
- For real-time voice agents, use whisper-large-v3-turbo
Intro
Whisper-large-v3 hosted on Groq's LPU runs at ~166× realtime — a 60-second clip transcribes in roughly 400ms. The endpoint is OpenAI-compatible (audio.transcriptions.create) so any code targeting OpenAI's whisper-1 swaps over with one URL change. Best for: voice agents where round-trip latency must stay under 1 second, real-time meeting transcription, voice-controlled agentic flows. Works with: openai-python, openai-node, livekit-agents, vapi, Deepgram-style pipelines. Setup time: 5 minutes.
Basic transcription
```python
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.groq.com/openai/v1",
    api_key=os.environ["GROQ_API_KEY"],
)

with open("meeting.m4a", "rb") as f:
    transcript = client.audio.transcriptions.create(
        model="whisper-large-v3",
        file=f,
        response_format="verbose_json",  # gives word timestamps
        timestamp_granularities=["word"],
    )

print(transcript.text)
print(transcript.words[:5])  # [{word, start, end}, ...]
```
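Word timestamps map directly onto captions. A minimal sketch, assuming the verbose_json word objects shown above (attributes word, start, end), that groups words into SRT cues:

```python
# Minimal sketch: group verbose_json word timestamps into SRT caption cues.
def to_srt(words, max_words_per_cue=8):
    def fmt(t: float) -> str:
        h, rem = divmod(t, 3600)
        m, s = divmod(rem, 60)
        # SRT uses HH:MM:SS,mmm with a comma before milliseconds
        return f"{int(h):02d}:{int(m):02d}:{s:06.3f}".replace(".", ",")

    cues = []
    for i in range(0, len(words), max_words_per_cue):
        chunk = words[i : i + max_words_per_cue]
        text = " ".join(w.word for w in chunk)
        cues.append(f"{len(cues) + 1}\n{fmt(chunk[0].start)} --> {fmt(chunk[-1].end)}\n{text}\n")
    return "\n".join(cues)

print(to_srt(transcript.words))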
Translation (any language → English)
```python
with open("japanese-clip.mp3", "rb") as f:
    translation = client.audio.translations.create(
        model="whisper-large-v3",
        file=f,
    )

print(translation.text)  # English output
```
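Back on the transcription endpoint, two optional parameters from the OpenAI audio spec are worth knowing: language pins the source language (ISO-639-1) and skips auto-detection, and prompt biases spelling toward domain terms. A short sketch; the filename and prompt text here are illustrative:

```python
# Optional: pin the source language and prime domain vocabulary.
with open("standup.wav", "rb") as f:  # hypothetical file
    transcript = client.audio.transcriptions.create(
        model="whisper-large-v3",
        file=f,
        language="en",                        # ISO-639-1; skips auto-detection
        prompt="Groq, LPU, Whisper, LiveKit", # proper nouns to spell correctly
    )
```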
Streaming voice agent loop (LiveKit-style)
```python
import asyncio
from io import BytesIO

async def transcribe_chunk(audio_bytes: bytes) -> str:
    f = BytesIO(audio_bytes)
    f.name = "chunk.wav"  # the client infers the audio format from this filename
    # The sync client call would block the event loop, so run it in a worker thread
    r = await asyncio.to_thread(
        client.audio.transcriptions.create,
        model="whisper-large-v3-turbo",  # ~216× realtime, slightly less accurate
        file=f,
    )
    return r.text

# Pipe VAD-segmented audio chunks to this function for live transcription
```
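A hypothetical driver for the function above, assuming your VAD stage pushes segment bytes onto an asyncio.Queue; the queue and the None sentinel are conventions assumed here, not part of any library:

```python
# Hypothetical consumer: drain VAD-segmented chunks from an asyncio.Queue
async def stt_loop(chunks: asyncio.Queue):
    while True:
        audio = await chunks.get()   # bytes for one speech segment
        if audio is None:            # sentinel: stream ended
            break
        text = await transcribe_chunk(audio)
        print(f"user said: {text}")  # hand off to the LLM stage here
```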
Performance characteristics
| Metric | Value |
|---|---|
| Whisper-large-v3 speed | ~166× realtime |
| Whisper-large-v3-turbo speed | ~216× realtime |
| Max file size | 25 MB |
| Supported formats | mp3, mp4, mpeg, mpga, m4a, wav, webm, flac, ogg |
| Languages | 99 (full Whisper coverage) |
| Pricing | $0.111 / hour of audio (large-v3), $0.04 / hour (turbo) |
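The 25 MB cap matters for long recordings. One workaround is client-side chunking before upload; a sketch using pydub, which is an assumption here (not part of this API, and it needs ffmpeg installed):

```python
# Sketch: split audio that exceeds the 25 MB cap into 10-minute chunks.
from io import BytesIO
from pydub import AudioSegment  # assumption: pydub + ffmpeg available

def transcribe_long(path: str, chunk_ms: int = 10 * 60 * 1000) -> str:
    audio = AudioSegment.from_file(path)   # pydub segments index in milliseconds
    parts = []
    for start in range(0, len(audio), chunk_ms):
        f = BytesIO()
        f.name = "chunk.mp3"
        audio[start : start + chunk_ms].export(f, format="mp3")
        f.seek(0)  # rewind so the upload reads from the beginning
        r = client.audio.transcriptions.create(model="whisper-large-v3", file=f)
        parts.append(r.text)
    return " ".join(parts)
```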
Voice-agent latency budget
| Stage | Typical | Voice-friendly |
|---|---|---|
| VAD segment | 50–200ms | 100ms |
| Whisper STT (Groq) | 300–500ms | 400ms |
| LLM (Groq Llama 3.3) | 200–800ms | 500ms |
| TTS (Cartesia / ElevenLabs) | 200–500ms | 350ms |
| Total round-trip | 750–2,000ms | ~1,350ms |
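To check that your own STT stage fits the ~400ms line above, wrap the call with a timer. A sketch; chunk_bytes stands in for one VAD segment from your pipeline:

```python
# Sketch: measure wall-clock time for the STT stage (includes network).
import time
from io import BytesIO

def timed_transcribe(audio_bytes: bytes) -> tuple[str, float]:
    f = BytesIO(audio_bytes)
    f.name = "probe.wav"
    t0 = time.perf_counter()
    r = client.audio.transcriptions.create(model="whisper-large-v3-turbo", file=f)
    return r.text, (time.perf_counter() - t0) * 1000  # latency in ms

text, ms = timed_transcribe(chunk_bytes)  # chunk_bytes: your VAD segment
print(f"STT stage: {ms:.0f}ms")
```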
FAQ
Q: Whisper-large-v3 vs turbo on Groq?
A: v3 is more accurate, especially on accents and noise. Turbo trims decoder layers for a ~30% speed gain at a ~5% WER increase on hard audio. For real-time voice → turbo. For meeting archives → v3.
Q: Can I get word-level timestamps?
A: Yes — response_format='verbose_json' and timestamp_granularities=['word']. Returns each word with start/end seconds. Useful for caption alignment, agent memory anchoring, scrub-to-word UI.
Q: How does this compare to Deepgram Nova / AssemblyAI?
A: Deepgram Nova is purpose-built for streaming and faster there (sub-300ms partial results). Whisper on Groq is more accurate on multilingual and accented speech. Pick Deepgram for English call centers, Groq Whisper for global voice apps.
Source & Thanks
Built by Groq. Whisper docs at console.groq.com/docs/speech-to-text.
Whisper weights MIT-licensed, hosted by Groq.