# Groq Whisper — Sub-Second Speech-to-Text for Voice Agents

> Whisper-large-v3 on Groq runs at ~166× realtime — a 60-second clip transcribes in under 400ms. OpenAI-compatible `audio.transcriptions` endpoint for voice agents.

## Install

Save as a script file and run:

## Quick Use

1. Get a `GROQ_API_KEY` at console.groq.com
2. `client.audio.transcriptions.create(model='whisper-large-v3', file=open(path, 'rb'))`
3. For real-time voice agents, use `whisper-large-v3-turbo`

---

## Intro

Whisper-large-v3 hosted on Groq's LPU runs at ~166× realtime — a 60-second clip transcribes in roughly 400ms. The endpoint is OpenAI-compatible (`audio.transcriptions.create`), so any code targeting OpenAI's whisper-1 swaps over with a single base-URL change.

Best for: voice agents where round-trip latency must stay under 1 second, real-time meeting transcription, voice-controlled agentic flows.

Works with: openai-python, openai-node, livekit-agents, vapi, deepgram-style pipelines.

Setup time: 5 minutes.

---

### Basic transcription

```python
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://api.groq.com/openai/v1",
    api_key=os.environ["GROQ_API_KEY"],
)

with open("meeting.m4a", "rb") as f:
    transcript = client.audio.transcriptions.create(
        model="whisper-large-v3",
        file=f,
        response_format="verbose_json",  # gives word timestamps
        timestamp_granularities=["word"],
    )

print(transcript.text)
print(transcript.words[:5])  # [{word, start, end}]
```

### Translation (any language → English)

```python
translation = client.audio.translations.create(
    model="whisper-large-v3",
    file=open("japanese-clip.mp3", "rb"),
)
print(translation.text)  # English output
```

### Streaming voice agent loop (LiveKit-style)

```python
import asyncio
from io import BytesIO


async def transcribe_chunk(audio_bytes: bytes) -> str:
    f = BytesIO(audio_bytes)
    f.name = "chunk.wav"  # the SDK infers the audio format from the filename
    # Run the blocking SDK call off the event loop so the agent stays responsive
    r = await asyncio.to_thread(
        client.audio.transcriptions.create,
        model="whisper-large-v3-turbo",  # ~216× realtime, slightly less accurate
        file=f,
    )
    return r.text

# Pipe VAD-segmented audio chunks to this function for live transcription
```

### Performance characteristics

| Metric | Value |
|---|---|
| Whisper-large-v3 speed | ~166× realtime |
| Whisper-large-v3-turbo speed | ~216× realtime |
| Max file size | 25 MB |
| Supported formats | mp3, mp4, mpeg, mpga, m4a, wav, webm, flac, ogg |
| Languages | 99 (full Whisper coverage) |
| Pricing | $0.111 / hour of audio (large-v3), $0.04 / hour (turbo) |

### Voice-agent latency budget

| Stage | Typical | Voice-friendly |
|---|---|---|
| VAD segment | 50–200ms | 100ms |
| Whisper STT (Groq) | 300–500ms | 400ms |
| LLM (Groq Llama 3.3) | 200–800ms | 500ms |
| TTS (Cartesia / ElevenLabs) | 200–500ms | 350ms |
| **Total round-trip** | | **~1,350ms** |

---

### FAQ

**Q: Whisper-large-v3 vs turbo on Groq?**
A: v3 is more accurate, especially on accents and background noise. Turbo trims decoding layers for a ~30% speed gain at roughly 5% higher WER on hard audio. For real-time voice, pick turbo; for meeting archives, pick v3.

**Q: Can I get word-level timestamps?**
A: Yes — set `response_format='verbose_json'` and `timestamp_granularities=['word']`. Each word is returned with start/end times in seconds. Useful for caption alignment, agent memory anchoring, and scrub-to-word UIs.

**Q: How does this compare to Deepgram Nova / AssemblyAI?**
A: Deepgram Nova is purpose-built for streaming and faster there (sub-300ms partial results). Whisper on Groq is more accurate on multilingual and accented speech. Pick Deepgram for English call centers, Groq Whisper for global voice apps.

---

## Source & Thanks

> Built by [Groq](https://github.com/groq). Whisper docs at [console.groq.com/docs/speech-text](https://console.groq.com/docs/speech-text).
>
> Whisper weights MIT-licensed, hosted by Groq.
---

Source: https://tokrepo.com/en/workflows/groq-whisper-sub-second-speech-to-text-for-voice-agents

Author: Groq
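---

### Appendix: sketching the VAD stage

The streaming loop above assumes audio has already been VAD-segmented before it reaches `transcribe_chunk`. As a rough illustration of what that stage does, here is a minimal energy-based segmenter for raw 16-bit little-endian PCM. The thresholds (`ENERGY_THRESHOLD`, `HANG_FRAMES`) are hypothetical values chosen for the sketch; a production agent would use a trained VAD such as Silero VAD or WebRTC VAD instead.

```python
import struct

SAMPLE_RATE = 16_000
FRAME_MS = 30                                   # analysis frame length
FRAME_SAMPLES = SAMPLE_RATE * FRAME_MS // 1000  # 480 samples per frame
ENERGY_THRESHOLD = 500                          # hypothetical RMS "speech" floor
HANG_FRAMES = 10                                # silent frames before closing a segment


def rms(frame: bytes) -> float:
    """Root-mean-square energy of a 16-bit little-endian PCM frame."""
    samples = struct.unpack(f"<{len(frame) // 2}h", frame)
    return (sum(s * s for s in samples) / max(len(samples), 1)) ** 0.5


def vad_segments(pcm: bytes):
    """Yield speech segments (as bytes) from a raw PCM stream."""
    frame_bytes = FRAME_SAMPLES * 2  # 2 bytes per 16-bit sample
    in_speech, silent, start = False, 0, 0
    for off in range(0, len(pcm) - frame_bytes + 1, frame_bytes):
        loud = rms(pcm[off:off + frame_bytes]) > ENERGY_THRESHOLD
        if loud and not in_speech:
            in_speech, start, silent = True, off, 0
        elif in_speech:
            silent = 0 if loud else silent + 1
            if silent >= HANG_FRAMES:
                yield pcm[start:off]  # one chunk, ready for transcribe_chunk
                in_speech = False
    if in_speech:
        yield pcm[start:]  # flush a segment still open at end of stream


# Synthetic check: 0.5s of silence, 1s of a loud constant tone, 1s of silence
silence_half = b"\x00\x00" * (SAMPLE_RATE // 2)
silence_full = b"\x00\x00" * SAMPLE_RATE
loud = struct.pack("<h", 8000) * SAMPLE_RATE
segments = list(vad_segments(silence_half + loud + silence_full))
```

On the synthetic stream this yields a single segment containing the loud region plus the short hangover tail; each yielded chunk is what the voice-agent loop would wrap in a WAV header and hand to the transcription call.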