# Groq Whisper — Sub-Second Speech-to-Text for Voice Agents

> Whisper-large-v3 on Groq runs at ~166× realtime — a 60-second clip transcribes in under 400ms. OpenAI-compatible `audio.transcriptions` endpoint for voice agents.

## Install

Save as a script file and run:

## Quick Use

1. Get a `GROQ_API_KEY` at console.groq.com
2. `client.audio.transcriptions.create(model='whisper-large-v3', file=open(path, 'rb'))`
3. For real-time voice agents, use `whisper-large-v3-turbo`

---

## Intro

Whisper-large-v3 hosted on Groq's LPU runs at ~166× realtime — a 60-second clip transcribes in roughly 400ms. The endpoint is OpenAI-compatible (`audio.transcriptions.create`), so any code targeting OpenAI's whisper-1 swaps over with a single base-URL change.

Best for: voice agents where round-trip latency must stay under 1 second, real-time meeting transcription, voice-controlled agentic flows.

Works with: openai-python, openai-node, livekit-agents, vapi, deepgram-style pipelines.

Setup time: 5 minutes.

---

### Basic transcription

```python
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://api.groq.com/openai/v1",
    api_key=os.environ["GROQ_API_KEY"],
)

with open("meeting.m4a", "rb") as f:
    transcript = client.audio.transcriptions.create(
        model="whisper-large-v3",
        file=f,
        response_format="verbose_json",  # gives word timestamps
        timestamp_granularities=["word"],
    )

print(transcript.text)
print(transcript.words[:5])  # [{word, start, end}]
```

### Translation (any language → English)

```python
translation = client.audio.translations.create(
    model="whisper-large-v3",
    file=open("japanese-clip.mp3", "rb"),
)
print(translation.text)  # English output
```

### Streaming voice agent loop (LiveKit-style)

```python
import asyncio
from io import BytesIO


async def transcribe_chunk(audio_bytes: bytes) -> str:
    f = BytesIO(audio_bytes)
    f.name = "chunk.wav"  # the SDK infers the audio format from the filename
    # Run the blocking SDK call off the event loop so the agent stays responsive
    r = await asyncio.to_thread(
        client.audio.transcriptions.create,
        model="whisper-large-v3-turbo",  # ~216× realtime, slightly less accurate
        file=f,
    )
    return r.text

# Pipe VAD-segmented audio chunks to this function for live transcription
```

### Performance characteristics

| Metric | Value |
|---|---|
| Whisper-large-v3 speed | ~166× realtime |
| Whisper-large-v3-turbo speed | ~216× realtime |
| Max file size | 25 MB |
| Supported formats | mp3, mp4, mpeg, mpga, m4a, wav, webm, flac, ogg |
| Languages | 99 (full Whisper coverage) |
| Pricing | $0.111 / hour of audio (large-v3), $0.04 / hour (turbo) |

### Voice-agent latency budget

| Stage | Typical | Voice-friendly |
|---|---|---|
| VAD segment | 50–200ms | 100ms |
| Whisper STT (Groq) | 300–500ms | 400ms |
| LLM (Groq Llama 3.3) | 200–800ms | 500ms |
| TTS (Cartesia / ElevenLabs) | 200–500ms | 350ms |
| **Total round-trip** | | **~1,350ms** |

---

### FAQ

**Q: Whisper-large-v3 vs turbo on Groq?**
A: v3 is more accurate, especially on accents and background noise. Turbo trims decoding layers for a ~30% speed gain at roughly 5% higher WER on hard audio. For real-time voice, pick turbo; for meeting archives, pick v3.

**Q: Can I get word-level timestamps?**
A: Yes — set `response_format='verbose_json'` and `timestamp_granularities=['word']`. Each word is returned with start/end times in seconds. Useful for caption alignment, agent memory anchoring, and scrub-to-word UIs.

**Q: How does this compare to Deepgram Nova / AssemblyAI?**
A: Deepgram Nova is purpose-built for streaming and faster there (sub-300ms partial results). Whisper on Groq is more accurate on multilingual and accented speech. Pick Deepgram for English call centers, Groq Whisper for global voice apps.

---

## Source & Thanks

> Built by [Groq](https://github.com/groq). Whisper docs at [console.groq.com/docs/speech-text](https://console.groq.com/docs/speech-text).
>
> Whisper weights MIT-licensed, hosted by Groq.
---

Source: https://tokrepo.com/en/workflows/groq-whisper-sub-second-speech-to-text-for-voice-agents

Author: Groq
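---

### Appendix: sketching the VAD stage

The streaming loop above assumes audio has already been VAD-segmented before it reaches `transcribe_chunk`. As a rough illustration of what that stage does, here is a minimal energy-based segmenter for raw 16-bit little-endian PCM. The thresholds (`ENERGY_THRESHOLD`, `HANG_FRAMES`) are hypothetical values chosen for the sketch; a production agent would use a trained VAD such as Silero VAD or WebRTC VAD instead.

```python
import struct

SAMPLE_RATE = 16_000
FRAME_MS = 30                                   # analysis frame length
FRAME_SAMPLES = SAMPLE_RATE * FRAME_MS // 1000  # 480 samples per frame
ENERGY_THRESHOLD = 500                          # hypothetical RMS "speech" floor
HANG_FRAMES = 10                                # silent frames before closing a segment


def rms(frame: bytes) -> float:
    """Root-mean-square energy of a 16-bit little-endian PCM frame."""
    samples = struct.unpack(f"<{len(frame) // 2}h", frame)
    return (sum(s * s for s in samples) / max(len(samples), 1)) ** 0.5


def vad_segments(pcm: bytes):
    """Yield speech segments (as bytes) from a raw PCM stream."""
    frame_bytes = FRAME_SAMPLES * 2  # 2 bytes per 16-bit sample
    in_speech, silent, start = False, 0, 0
    for off in range(0, len(pcm) - frame_bytes + 1, frame_bytes):
        loud = rms(pcm[off:off + frame_bytes]) > ENERGY_THRESHOLD
        if loud and not in_speech:
            in_speech, start, silent = True, off, 0
        elif in_speech:
            silent = 0 if loud else silent + 1
            if silent >= HANG_FRAMES:
                yield pcm[start:off]  # one chunk, ready for transcribe_chunk
                in_speech = False
    if in_speech:
        yield pcm[start:]  # flush a segment still open at end of stream


# Synthetic check: 0.5s of silence, 1s of a loud constant tone, 1s of silence
silence_half = b"\x00\x00" * (SAMPLE_RATE // 2)
silence_full = b"\x00\x00" * SAMPLE_RATE
loud = struct.pack("<h", 8000) * SAMPLE_RATE
segments = list(vad_segments(silence_half + loud + silence_full))
```

On the synthetic stream this yields a single segment containing the loud region plus the short hangover tail; each yielded chunk is what the voice-agent loop would wrap in a WAV header and hand to the transcription call.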