Scripts · May 8, 2026 · 5 min read

Groq Whisper — Sub-Second Speech-to-Text for Voice Agents

Whisper-large-v3 on Groq runs 166× realtime — 60-sec clip in <400ms. OpenAI-compat audio.transcriptions endpoint for voice agents.

Groq · Community
Agent-ready

This asset can be read and installed directly by agents.

TokRepo exposes a universal CLI command, an install contract, JSON metadata, a per-adapter plan, and the raw content to help agents judge fit, risk, and next actions.

Stage only · 17/100
Agent surface: any MCP/CLI agent
Type: Skill
Installation: Stage only
Trust: New
Entry point: Asset
Universal CLI command: npx tokrepo install 34b19e7a-a7a9-4869-9339-edbd8a20144f
Introduction

Whisper-large-v3 hosted on Groq's LPU runs at ~166× realtime — a 60-second clip transcribes in roughly 400ms. The endpoint is OpenAI-compatible (audio.transcriptions.create) so any code targeting OpenAI's whisper-1 swaps over with one URL change. Best for: voice agents where round-trip latency must stay under 1 second, real-time meeting transcription, voice-controlled agentic flows. Works with: openai-python, openai-node, livekit-agents, vapi, deepgram-style pipelines. Setup time: 5 minutes.


Basic transcription

import os

from openai import OpenAI

client = OpenAI(
    base_url="https://api.groq.com/openai/v1",
    api_key=os.environ["GROQ_API_KEY"],
)

with open("meeting.m4a", "rb") as f:
    transcript = client.audio.transcriptions.create(
        model="whisper-large-v3",
        file=f,
        response_format="verbose_json",  # gives word timestamps
        timestamp_granularities=["word"],
    )

print(transcript.text)
print(transcript.words[:5])  # [{word, start, end}]

Translation (any language → English)

with open("japanese-clip.mp3", "rb") as f:
    translation = client.audio.translations.create(
        model="whisper-large-v3",
        file=f,
    )
print(translation.text)  # English output

Streaming voice agent loop (LiveKit-style)

import asyncio
from io import BytesIO

async def transcribe_chunk(audio_bytes: bytes) -> str:
    f = BytesIO(audio_bytes)
    f.name = "chunk.wav"  # the SDK infers the audio format from the filename
    # Run the blocking SDK call off the event loop so the agent stays responsive
    r = await asyncio.to_thread(
        client.audio.transcriptions.create,
        model="whisper-large-v3-turbo",  # ~216× realtime, slightly less accurate
        file=f,
    )
    return r.text

# Pipe VAD-segmented audio chunks to this function for live transcription
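A minimal energy-based segmenter can stand in for a real VAD while prototyping this loop. The sketch below assumes 16-bit mono PCM input and an illustrative RMS threshold; a production agent should use a proper VAD (e.g. Silero or webrtcvad) instead:

```python
import math
import struct

def rms(frame: bytes) -> float:
    """RMS energy of a frame of 16-bit little-endian PCM samples."""
    samples = struct.unpack(f"<{len(frame) // 2}h", frame)
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def segment_speech(pcm: bytes, frame_bytes: int = 640, threshold: float = 500.0) -> list[bytes]:
    """Group consecutive above-threshold frames into speech chunks.

    frame_bytes=640 is 20 ms of 16 kHz mono 16-bit audio; the threshold is
    an arbitrary starting point to tune against real microphone input.
    """
    chunks, current = [], bytearray()
    for i in range(0, len(pcm), frame_bytes):
        frame = pcm[i : i + frame_bytes]
        if rms(frame) >= threshold:
            current.extend(frame)
        elif current:
            chunks.append(bytes(current))
            current = bytearray()
    if current:
        chunks.append(bytes(current))
    return chunks
```

Each chunk returned here is what you would hand to `transcribe_chunk` above.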

Performance characteristics

Metric                        Value
Whisper-large-v3 speed        ~166× realtime
Whisper-large-v3-turbo speed  ~216× realtime
Max file size                 25 MB
Supported formats             mp3, mp4, mpeg, mpga, m4a, wav, webm, flac, ogg
Languages                     99 (full Whisper coverage)
Pricing                       $0.111 / hour of audio (large-v3), $0.04 / hour (turbo)
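Files above the 25 MB cap must be split client-side before upload. A minimal pre-flight sketch (the constant and helper names are illustrative, not part of the Groq API; actual splitting would need an audio tool such as ffmpeg):

```python
import os

MAX_UPLOAD_BYTES = 25 * 1024 * 1024  # Groq's documented per-file cap

def needs_chunking(path: str) -> bool:
    """True if the file exceeds the 25 MB upload limit."""
    return os.path.getsize(path) > MAX_UPLOAD_BYTES

def chunk_count(total_bytes: int, chunk_bytes: int = MAX_UPLOAD_BYTES) -> int:
    """How many <= 25 MB pieces a file must be split into (ceiling division)."""
    return -(-total_bytes // chunk_bytes)
```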

Voice-agent latency budget

Stage                        Typical      Voice-friendly
VAD segment                  50–200 ms    100 ms
Whisper STT (Groq)           300–500 ms   400 ms
LLM (Groq Llama 3.3)         200–800 ms   500 ms
TTS (Cartesia / ElevenLabs)  200–500 ms   350 ms
Total round-trip                          ~1,350 ms
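One way to keep a pipeline honest against this budget is to time each stage and flag overruns. A minimal sketch (the stage keys and limits mirror the table above; the helper names are made up for illustration):

```python
import time
from contextlib import contextmanager

# Per-stage allowance in milliseconds, matching the voice-friendly column
BUDGET_MS = {"vad": 100, "stt": 400, "llm": 500, "tts": 350}

timings: dict[str, float] = {}

@contextmanager
def timed(stage: str):
    """Record the wall-clock duration of a pipeline stage in milliseconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = (time.perf_counter() - start) * 1000

def over_budget() -> list[str]:
    """Names of stages that exceeded their per-stage allowance."""
    return [s for s, ms in timings.items() if ms > BUDGET_MS.get(s, 0)]
```

Wrap each stage call (`with timed("stt"): ...`) and check `over_budget()` per turn.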

FAQ

Q: Whisper-large-v3 vs turbo on Groq? A: v3 is more accurate, especially on accents and noise. Turbo trims decoding layers for ~30% speed gain at ~5% WER increase on hard audio. For real-time voice → turbo. For meeting archives → v3.

Q: Can I get word-level timestamps? A: Yes — response_format='verbose_json' and timestamp_granularities=['word']. Returns each word with start/end seconds. Useful for caption alignment, agent memory anchoring, scrub-to-word UI.
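For caption alignment, those word entries can be folded into SRT cues. A minimal sketch assuming the {word, start, end} shape described above (the grouping size is arbitrary):

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02}:{m:02}:{s:02},{ms:03}"

def words_to_srt(words: list[dict], group: int = 7) -> str:
    """Fold word entries into numbered SRT cues of `group` words each."""
    cues = []
    for i in range(0, len(words), group):
        batch = words[i : i + group]
        text = " ".join(w["word"] for w in batch)
        cues.append(
            f"{i // group + 1}\n"
            f"{srt_timestamp(batch[0]['start'])} --> {srt_timestamp(batch[-1]['end'])}\n"
            f"{text}\n"
        )
    return "\n".join(cues)
```

Feeding `transcript.words` through `words_to_srt` yields a ready-to-save .srt file.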

Q: How does this compare to Deepgram Nova / AssemblyAI? A: Deepgram Nova is purpose-built and faster on streaming (sub-300ms partial results). Whisper on Groq is more accurate on multilingual and accented speech. Pick Deepgram for English call centers, Groq Whisper for global voice apps.


Quick Use

  1. Get GROQ_API_KEY at console.groq.com
  2. client.audio.transcriptions.create(model='whisper-large-v3', file=open(path,'rb'))
  3. For real-time voice agents use whisper-large-v3-turbo

Source & Thanks

Built by Groq. Whisper docs at console.groq.com/docs/speech-text.

Whisper weights MIT-licensed, hosted by Groq.

🙏

