Scripts · May 8, 2026 · 5 min read

Groq Whisper — Sub-Second Speech-to-Text for Voice Agents

Whisper-large-v3 on Groq runs 166× realtime — 60-sec clip in <400ms. OpenAI-compat audio.transcriptions endpoint for voice agents.

Groq · Community
Agent-ready

This asset can be read and installed directly by agents.

TokRepo exposes a universal CLI command, an install contract, JSON metadata, a per-adapter plan, and the raw content to help agents judge fit, risk, and next actions.

Stage only · 17/100
Agent surface: any MCP/CLI agent
Type: Skill
Installation: Stage only
Trust: New
Entry point: Asset
Universal CLI command: npx tokrepo install 34b19e7a-a7a9-4869-9339-edbd8a20144f
Introduction

Whisper-large-v3 hosted on Groq's LPU runs at ~166× realtime — a 60-second clip transcribes in roughly 400ms. The endpoint is OpenAI-compatible (audio.transcriptions.create) so any code targeting OpenAI's whisper-1 swaps over with one URL change. Best for: voice agents where round-trip latency must stay under 1 second, real-time meeting transcription, voice-controlled agentic flows. Works with: openai-python, openai-node, livekit-agents, vapi, deepgram-style pipelines. Setup time: 5 minutes.


Basic transcription

import os

from openai import OpenAI

client = OpenAI(
    base_url="https://api.groq.com/openai/v1",
    api_key=os.environ["GROQ_API_KEY"],
)

with open("meeting.m4a", "rb") as f:
    transcript = client.audio.transcriptions.create(
        model="whisper-large-v3",
        file=f,
        response_format="verbose_json",  # gives word timestamps
        timestamp_granularities=["word"],
    )

print(transcript.text)
print(transcript.words[:5])  # [{word, start, end}]

Translation (any language → English)

with open("japanese-clip.mp3", "rb") as f:
    translation = client.audio.translations.create(
        model="whisper-large-v3",
        file=f,
    )
print(translation.text)  # English output

Streaming voice agent loop (LiveKit-style)

import asyncio
from io import BytesIO

async def transcribe_chunk(audio_bytes: bytes) -> str:
    f = BytesIO(audio_bytes)
    f.name = "chunk.wav"  # the SDK infers the audio format from the filename
    # Run the blocking SDK call off the event loop so the agent stays responsive
    r = await asyncio.to_thread(
        client.audio.transcriptions.create,
        model="whisper-large-v3-turbo",  # ~216× realtime, slightly less accurate
        file=f,
    )
    return r.text

# Pipe VAD-segmented audio chunks to this function for live transcription
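A minimal energy-based segmenter can stand in for a real VAD while prototyping this loop. The sketch below assumes 16-bit mono PCM input and an illustrative RMS threshold; a production agent should use a proper VAD (e.g. Silero or webrtcvad) instead:

```python
import math
import struct

def rms(frame: bytes) -> float:
    """RMS energy of a frame of 16-bit little-endian PCM samples."""
    samples = struct.unpack(f"<{len(frame) // 2}h", frame)
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def segment_speech(pcm: bytes, frame_bytes: int = 640, threshold: float = 500.0) -> list[bytes]:
    """Group consecutive above-threshold frames into speech chunks.

    frame_bytes=640 is 20 ms of 16 kHz mono 16-bit audio; the threshold is
    an arbitrary starting point to tune against real microphone input.
    """
    chunks, current = [], bytearray()
    for i in range(0, len(pcm), frame_bytes):
        frame = pcm[i : i + frame_bytes]
        if rms(frame) >= threshold:
            current.extend(frame)
        elif current:
            chunks.append(bytes(current))
            current = bytearray()
    if current:
        chunks.append(bytes(current))
    return chunks
```

Each chunk returned here is what you would hand to `transcribe_chunk` above.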

Performance characteristics

Metric                        Value
Whisper-large-v3 speed        ~166× realtime
Whisper-large-v3-turbo speed  ~216× realtime
Max file size                 25 MB
Supported formats             mp3, mp4, mpeg, mpga, m4a, wav, webm, flac, ogg
Languages                     99 (full Whisper coverage)
Pricing                       $0.111 / hour of audio (large-v3), $0.04 / hour (turbo)
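Files above the 25 MB cap must be split client-side before upload. A minimal pre-flight sketch (the constant and helper names are illustrative, not part of the Groq API; actual splitting would need an audio tool such as ffmpeg):

```python
import os

MAX_UPLOAD_BYTES = 25 * 1024 * 1024  # Groq's documented per-file cap

def needs_chunking(path: str) -> bool:
    """True if the file exceeds the 25 MB upload limit."""
    return os.path.getsize(path) > MAX_UPLOAD_BYTES

def chunk_count(total_bytes: int, chunk_bytes: int = MAX_UPLOAD_BYTES) -> int:
    """How many <= 25 MB pieces a file must be split into (ceiling division)."""
    return -(-total_bytes // chunk_bytes)
```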

Voice-agent latency budget

Stage                        Typical      Voice-friendly
VAD segment                  50–200 ms    100 ms
Whisper STT (Groq)           300–500 ms   400 ms
LLM (Groq Llama 3.3)         200–800 ms   500 ms
TTS (Cartesia / ElevenLabs)  200–500 ms   350 ms
Total round-trip                          ~1,350 ms
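One way to keep a pipeline honest against this budget is to time each stage and flag overruns. A minimal sketch (the stage keys and limits mirror the table above; the helper names are made up for illustration):

```python
import time
from contextlib import contextmanager

# Per-stage allowance in milliseconds, matching the voice-friendly column
BUDGET_MS = {"vad": 100, "stt": 400, "llm": 500, "tts": 350}

timings: dict[str, float] = {}

@contextmanager
def timed(stage: str):
    """Record the wall-clock duration of a pipeline stage in milliseconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = (time.perf_counter() - start) * 1000

def over_budget() -> list[str]:
    """Names of stages that exceeded their per-stage allowance."""
    return [s for s, ms in timings.items() if ms > BUDGET_MS.get(s, 0)]
```

Wrap each stage call (`with timed("stt"): ...`) and check `over_budget()` per turn.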

FAQ

Q: Whisper-large-v3 vs turbo on Groq? A: v3 is more accurate, especially on accents and noise. Turbo trims decoding layers for ~30% speed gain at ~5% WER increase on hard audio. For real-time voice → turbo. For meeting archives → v3.

Q: Can I get word-level timestamps? A: Yes — response_format='verbose_json' and timestamp_granularities=['word']. Returns each word with start/end seconds. Useful for caption alignment, agent memory anchoring, scrub-to-word UI.
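For caption alignment, those word entries can be folded into SRT cues. A minimal sketch assuming the {word, start, end} shape described above (the grouping size is arbitrary):

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02}:{m:02}:{s:02},{ms:03}"

def words_to_srt(words: list[dict], group: int = 7) -> str:
    """Fold word entries into numbered SRT cues of `group` words each."""
    cues = []
    for i in range(0, len(words), group):
        batch = words[i : i + group]
        text = " ".join(w["word"] for w in batch)
        cues.append(
            f"{i // group + 1}\n"
            f"{srt_timestamp(batch[0]['start'])} --> {srt_timestamp(batch[-1]['end'])}\n"
            f"{text}\n"
        )
    return "\n".join(cues)
```

Feeding `transcript.words` through `words_to_srt` yields a ready-to-save .srt file.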

Q: How does this compare to Deepgram Nova / AssemblyAI? A: Deepgram Nova is purpose-built and faster on streaming (sub-300ms partial results). Whisper on Groq is more accurate on multilingual and accented speech. Pick Deepgram for English call centers, Groq Whisper for global voice apps.


Quick Use

  1. Get GROQ_API_KEY at console.groq.com
  2. client.audio.transcriptions.create(model='whisper-large-v3', file=open(path,'rb'))
  3. For real-time voice agents use whisper-large-v3-turbo

Source & Thanks

Built by Groq. Whisper docs at console.groq.com/docs/speech-text.

Whisper weights MIT-licensed, hosted by Groq.

🙏

