How do I install Groq Whisper — Sub-Second Speech-to-Text for Voice Agents?

Visit the asset page on TokRepo and click "Copy for agent" to get the installation instructions. Most assets can be installed with a single command.

Groq Whisper — Sub-Second Speech-to-Text for Voice Agents

Name: Groq Whisper — Sub-Second Speech-to-Text for Voice Agents
Author: Groq

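Introduction

Whisper-large-v3 hosted on Groq's LPU runs at ~166× real time, so a 60-second clip is transcribed in about 400ms. The endpoint is OpenAI-compatible (audio.transcriptions.create): any code that calls OpenAI whisper-1 can switch over by pointing the client at Groq's base URL and supplying a Groq API key and model name. It suits voice agents that need round-trip latency under 1 second, real-time meeting transcription, and voice-controlled agentic workflows. It works with openai-python, openai-node, livekit-agents, vapi, and deepgram-style pipelines. Setup takes about 5 minutes.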

Basic transcription

import os

from openai import OpenAI

client = OpenAI(
    base_url="https://api.groq.com/openai/v1",
    api_key=os.environ["GROQ_API_KEY"],
)

with open("meeting.m4a", "rb") as f:
    transcript = client.audio.transcriptions.create(
        model="whisper-large-v3",
        file=f,
        response_format="verbose_json",  # required for word-level timestamps
        timestamp_granularities=["word"],
    )

print(transcript.text)
print(transcript.words[:5])  # [{word, start, end}]

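Translation (any language → English)

# Open the file in a with-block so the handle is closed after the upload
with open("japanese-clip.mp3", "rb") as f:
    translation = client.audio.translations.create(
        model="whisper-large-v3",
        file=f,
    )
print(translation.text)  # English output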

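Streaming voice agent loop (LiveKit-style)

import asyncio
from io import BytesIO

async def transcribe_chunk(audio_bytes: bytes) -> str:
    f = BytesIO(audio_bytes)
    f.name = "chunk.wav"  # the filename extension tells the API the audio format
    # The client above is synchronous, so run the call in a worker thread
    # to avoid blocking the event loop.
    r = await asyncio.to_thread(
        client.audio.transcriptions.create,
        model="whisper-large-v3-turbo",   # ~216× real time, slightly lower accuracy
        file=f,
    )
    return r.text

# Feed VAD-segmented audio chunks into this function for real-time transcription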
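
A VAD stage often emits raw PCM samples rather than a complete audio file, while the upload above is named chunk.wav and so should carry a WAV header. The following is a minimal sketch of that wrapping step using only the standard library. It assumes 16-bit little-endian mono PCM at 16 kHz; the function name pcm_to_wav_bytes and those defaults are illustrative and not part of any Groq or LiveKit API.

```python
import io
import wave


def pcm_to_wav_bytes(pcm: bytes, sample_rate: int = 16000, channels: int = 1) -> bytes:
    """Wrap raw 16-bit little-endian PCM samples in a WAV container."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(channels)
        w.setsampwidth(2)           # 2 bytes per sample = 16-bit audio
        w.setframerate(sample_rate)
        w.writeframes(pcm)
    return buf.getvalue()


# One second of silence at 16 kHz mono, standing in for a VAD segment.
wav_bytes = pcm_to_wav_bytes(b"\x00\x00" * 16000)
```

The resulting bytes can be passed to a function like transcribe_chunk as the audio payload. If the VAD in use already yields complete WAV or other container data, this step is unnecessary.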

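Performance metrics

Metric	Value
Whisper-large-v3 speed	~166× real time
Whisper-large-v3-turbo speed	~216× real time
Max file size	25 MB
Supported formats	mp3, mp4, mpeg, mpga, m4a, wav, webm, flac, ogg
Languages	99 (full Whisper coverage)
Price	$0.111 per audio hour (large-v3), $0.04 per audio hour (turbo)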
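
The figures in the table translate into per-clip estimates with simple arithmetic. The sketch below shows one way to compute them and to pre-check a file before upload. The helper names (estimate, can_upload) are hypothetical, the constants are copied from the table, and reading "25 MB" as 25 MiB is an assumption.

```python
from pathlib import Path

# Figures copied from the table above; treat them as approximate.
SPEED_FACTOR = {"whisper-large-v3": 166, "whisper-large-v3-turbo": 216}
PRICE_PER_HOUR = {"whisper-large-v3": 0.111, "whisper-large-v3-turbo": 0.04}
MAX_BYTES = 25 * 1024 * 1024  # assumption: "25 MB" is read as 25 MiB
FORMATS = {"mp3", "mp4", "mpeg", "mpga", "m4a", "wav", "webm", "flac", "ogg"}


def estimate(duration_s: float, model: str) -> tuple[float, float]:
    """Return (processing time in ms, cost in USD) for a clip of duration_s seconds."""
    ms = duration_s / SPEED_FACTOR[model] * 1000
    usd = duration_s / 3600 * PRICE_PER_HOUR[model]
    return ms, usd


def can_upload(filename: str, size_bytes: int) -> bool:
    """Check the file extension and size against the limits in the table."""
    ext = Path(filename).suffix.lstrip(".").lower()
    return ext in FORMATS and size_bytes <= MAX_BYTES


ms, usd = estimate(60, "whisper-large-v3")
print(f"{ms:.0f} ms, ${usd:.5f}")
```

The time estimate covers model processing only. Upload time and network round trips come on top, so real end-to-end latency will be higher.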

Voice agent latency budget

Stage	Typical	Voice-friendly target
VAD segmentation	50–200ms	100ms
Whisper STT (Groq)	300–500ms	400ms
LLM (Groq Llama 3.3)	200–800ms	500ms
TTS (Cartesia / ElevenLabs)	200–500ms	350ms
Total round trip		~1,350ms
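
A budget like this is easiest to enforce when it lives in code next to the measurements. The sketch below sums the voice-friendly targets and flags stages that exceed them. The stage keys and function names are hypothetical, and the measured values in the usage line are made up for illustration.

```python
# Voice-friendly targets from the table above, in milliseconds.
BUDGET_MS = {"vad": 100, "stt": 400, "llm": 500, "tts": 350}


def total_ms(stages: dict[str, int]) -> int:
    """Sum the per-stage latencies into a round-trip figure."""
    return sum(stages.values())


def over_budget(measured: dict[str, int], budget: dict[str, int] = BUDGET_MS) -> list[str]:
    """Return the budgeted stages whose measured latency exceeds the target."""
    return [name for name, ms in measured.items() if name in budget and ms > budget[name]]


print(total_ms(BUDGET_MS))  # 1350

# Illustrative measurements, not real benchmark data.
slow = over_budget({"vad": 80, "stt": 450, "llm": 500, "tts": 600})
```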

FAQ

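Q: Whisper-large-v3 vs turbo on Groq? A: v3 is more accurate, especially on accents and noisy audio. Turbo trims decoder layers in exchange for ~30% more speed, and its WER is ~5% higher on difficult audio. Use turbo for real-time voice and v3 for meeting archives.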

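Q: Can I get word-level timestamps? A: Yes: set response_format='verbose_json' and timestamp_granularities=['word']. Each word comes back with start/end times in seconds. Use them for subtitle alignment, anchoring agent memory, and click-a-word-to-seek UIs.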
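
As an illustration of the subtitle use case, the sketch below groups word timestamps into SRT cues. The Word dataclass is a hypothetical stand-in that mirrors the word/start/end fields described above; with a real response you would pass the entries of transcript.words instead, assuming they expose those same attributes. The seven-words-per-cue default is an arbitrary choice.

```python
from dataclasses import dataclass


@dataclass
class Word:
    # Hypothetical stand-in mirroring the word/start/end fields of each entry.
    word: str
    start: float
    end: float


def srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def words_to_srt(words: list[Word], max_words: int = 7) -> str:
    """Group consecutive words into numbered SRT cues of at most max_words words."""
    cues = []
    for i in range(0, len(words), max_words):
        group = words[i:i + max_words]
        text = " ".join(w.word.strip() for w in group)
        span = f"{srt_time(group[0].start)} --> {srt_time(group[-1].end)}"
        cues.append(f"{len(cues) + 1}\n{span}\n{text}\n")
    return "\n".join(cues)


sample = [Word("hello", 0.0, 0.4), Word("world", 0.5, 0.9), Word("again", 1.0, 1.4)]
srt = words_to_srt(sample, max_words=2)
```

Splitting on a fixed word count keeps the example short. A production subtitle pipeline would more likely break cues on pauses or punctuation.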

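Q: How does it compare with Deepgram Nova / AssemblyAI? A: Deepgram Nova is purpose-built for streaming, with partial results arriving in <300ms. Whisper on Groq is more accurate on multilingual and accented speech. Choose Deepgram for English call centers and Groq Whisper for global voice applications.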

Related assets

Whisper — OpenAI Speech-to-Text

whisper.cpp — Local Speech-to-Text in Pure C/C++

Faster Whisper — 4x Faster Speech-to-Text

LiveKit Agents — Python Framework for Voice AI