Scripts · May 8, 2026 · 5 min read

Groq Whisper — Sub-Second Speech-to-Text for Voice Agents

Whisper-large-v3 on Groq runs at ~166× realtime — a 60-second clip transcribes in roughly 400ms. An OpenAI-compatible audio.transcriptions endpoint for voice agents.

Agent ready

This asset can be read and installed directly by agents

TokRepo exposes a universal CLI command, install contract, metadata JSON, adapter-aware plan, and raw content links so agents can judge fit, risk, and next actions.

Stage only · 17/100
Agent surface: Any MCP/CLI agent
Kind: Skill
Install: Stage only
Trust: New
Entrypoint: Asset
Universal CLI install command
npx tokrepo install 34b19e7a-a7a9-4869-9339-edbd8a20144f
Intro

Whisper-large-v3 hosted on Groq's LPU runs at ~166× realtime — a 60-second clip transcribes in roughly 400ms. The endpoint is OpenAI-compatible (audio.transcriptions.create), so any code targeting OpenAI's whisper-1 swaps over with one URL change.

Best for: voice agents where round-trip latency must stay under 1 second, real-time meeting transcription, voice-controlled agentic flows.
Works with: openai-python, openai-node, livekit-agents, vapi, deepgram-style pipelines.
Setup time: 5 minutes.


Basic transcription

import os

from openai import OpenAI

client = OpenAI(
    base_url="https://api.groq.com/openai/v1",
    api_key=os.environ["GROQ_API_KEY"],
)

with open("meeting.m4a", "rb") as f:
    transcript = client.audio.transcriptions.create(
        model="whisper-large-v3",
        file=f,
        response_format="verbose_json",  # gives word timestamps
        timestamp_granularities=["word"],
    )

print(transcript.text)
print(transcript.words[:5])  # [{word, start, end}]
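
Word timestamps make caption alignment nearly free. Here is a minimal sketch that folds transcript.words into SRT cues; it assumes each entry exposes word/start/end attributes as openai-python returns them (switch to dict access if your client returns plain JSON):

def to_srt(words, per_cue: int = 7) -> str:
    # Format seconds as an SRT timestamp: HH:MM:SS,mmm
    def ts(sec: float) -> str:
        ms = int(sec * 1000)
        h, rem = divmod(ms, 3_600_000)
        m, rem = divmod(rem, 60_000)
        s, ms = divmod(rem, 1_000)
        return f"{h:02}:{m:02}:{s:02},{ms:03}"

    cues = []
    for i in range(0, len(words), per_cue):
        group = words[i:i + per_cue]
        cues.append(
            f"{i // per_cue + 1}\n"
            f"{ts(group[0].start)} --> {ts(group[-1].end)}\n"
            + " ".join(w.word for w in group) + "\n"
        )
    return "\n".join(cues)

print(to_srt(transcript.words))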

Translation (any language → English)

with open("japanese-clip.mp3", "rb") as f:
    translation = client.audio.translations.create(
        model="whisper-large-v3",
        file=f,
    )
print(translation.text)  # English output
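
To keep the transcript in the source language instead, stay on the transcriptions endpoint and pass a language hint. A sketch, assuming Groq passes through OpenAI's optional language and prompt parameters:

with open("japanese-clip.mp3", "rb") as f:
    transcript = client.audio.transcriptions.create(
        model="whisper-large-v3",
        file=f,
        language="ja",                 # ISO-639-1 hint; skips language auto-detection
        prompt="Names: Sato, Kaizen",  # biases spelling of domain terms (assumption: passed through)
    )
print(transcript.text)  # Japanese text, not translated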

Streaming voice agent loop (LiveKit-style)

import asyncio
from io import BytesIO

async def transcribe_chunk(audio_bytes: bytes) -> str:
    f = BytesIO(audio_bytes)
    f.name = "chunk.wav"  # the SDK infers the audio format from this filename
    # Run the blocking SDK call in a worker thread so it doesn't stall the event loop.
    r = await asyncio.to_thread(
        client.audio.transcriptions.create,
        model="whisper-large-v3-turbo",  # ~216× realtime, slightly less accurate
        file=f,
    )
    return r.text

# Pipe VAD-segmented audio chunks to this function for live transcription
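
A minimal consumer sketch: vad_segments below is a hypothetical async iterator standing in for your VAD stage (Silero, webrtcvad, or LiveKit's built-in), yielding one complete utterance of WAV bytes at a time:

from typing import AsyncIterator

async def live_transcribe(vad_segments: AsyncIterator[bytes]) -> None:
    # vad_segments is hypothetical: plug in your VAD's utterance stream here
    async for utterance in vad_segments:
        text = await transcribe_chunk(utterance)
        if text.strip():
            print(f">> {text}")

# asyncio.run(live_transcribe(my_vad_stream()))  # my_vad_stream is your VAD source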

Performance characteristics

Metric                         Value
Whisper-large-v3 speed         ~166× realtime
Whisper-large-v3-turbo speed   ~216× realtime
Max file size                  25 MB
Supported formats              mp3, mp4, mpeg, mpga, m4a, wav, webm, flac, ogg
Languages                      99 (full Whisper coverage)
Pricing                        $0.111 / hour of audio (large-v3), $0.04 / hour (turbo)
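
The 25 MB cap means long recordings need client-side chunking. A rough sketch using pydub (an assumption; any segmenter works). Fixed windows can split words mid-utterance, so prefer silence-based cuts in production:

from io import BytesIO
from pydub import AudioSegment  # assumption: pip install pydub, needs ffmpeg on PATH

def transcribe_long(path: str, window_min: int = 10) -> str:
    audio = AudioSegment.from_file(path)
    step = window_min * 60 * 1000  # pydub slices by milliseconds
    parts = []
    for start in range(0, len(audio), step):
        buf = BytesIO()
        audio[start:start + step].export(buf, format="mp3")  # ~10 MB per 10 min, under the cap
        buf.name = "chunk.mp3"
        buf.seek(0)
        r = client.audio.transcriptions.create(model="whisper-large-v3", file=buf)
        parts.append(r.text)
    return " ".join(parts)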

Voice-agent latency budget

Stage                         Typical      Voice-friendly
VAD segment                   50–200ms     100ms
Whisper STT (Groq)            300–500ms    400ms
LLM (Groq Llama 3.3)          200–800ms    500ms
TTS (Cartesia / ElevenLabs)   200–500ms    350ms
Total round-trip                           ~1,350ms
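
A small harness to see where your own budget goes, using time.perf_counter around each stage; llm_reply and synthesize are hypothetical placeholders for your LLM and TTS calls:

import time
from contextlib import contextmanager

timings: dict[str, float] = {}

@contextmanager
def stage(name: str):
    # Record the wall-clock duration of the enclosed block, in milliseconds.
    t0 = time.perf_counter()
    yield
    timings[name] = (time.perf_counter() - t0) * 1000

with stage("stt"):
    text = asyncio.run(transcribe_chunk(wav_bytes))  # wav_bytes: one VAD utterance
with stage("llm"):
    reply = llm_reply(text)     # hypothetical: your Groq Llama 3.3 call
with stage("tts"):
    speech = synthesize(reply)  # hypothetical: your Cartesia / ElevenLabs call

print(timings, f"total={sum(timings.values()):.0f}ms")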

FAQ

Q: Whisper-large-v3 vs turbo on Groq?
A: v3 is more accurate, especially on accents and noisy audio. Turbo uses a slimmed-down decoder for a ~30% speed gain at a ~5% WER increase on hard audio. For real-time voice, use turbo; for meeting archives, use v3.

Q: Can I get word-level timestamps?
A: Yes — set response_format="verbose_json" and timestamp_granularities=["word"]. Each word comes back with start/end seconds, which is useful for caption alignment, agent memory anchoring, and scrub-to-word UI (see the SRT sketch above).

Q: How does this compare to Deepgram Nova / AssemblyAI?
A: Deepgram Nova is purpose-built for streaming and is faster there (sub-300ms partial results). Whisper on Groq is more accurate on multilingual and accented speech. Pick Deepgram for English call centers, Groq Whisper for global voice apps.


Quick Use

  1. Get GROQ_API_KEY at console.groq.com
  2. client.audio.transcriptions.create(model='whisper-large-v3', file=open(path,'rb'))
  3. For real-time voice agents use whisper-large-v3-turbo

Source & Thanks

Built by Groq. Whisper docs at console.groq.com/docs/speech-to-text.

Whisper weights MIT-licensed, hosted by Groq.

🙏

