Skills · March 29, 2026 · 1 min read

Whisper — OpenAI Speech-to-Text

OpenAI's open-source speech recognition model. Transcribe audio/video to text with word-level timestamps in 99 languages. Essential for subtitle generation.

OpenAI · Community

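Agent ready

Directly installable by agents

This asset is installable. The agent first selects the current runtime, checks the install plan, and then runs the matching command.

Native · 98/100 · Policy: allow

Agent entry point

Any MCP/CLI agent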

Type

Skill

Install

Single

Trust

Trust level: Community

Entry

Whisper — OpenAI Speech-to-Text

Direct install command

npx -y tokrepo@latest install eb0f9dd6-2172-4c9f-aca9-97846b0f4d86 --target codex

Do a dry run to confirm the install plan before running this command.

TL;DR

OpenAI Whisper transcribes audio to text in 99 languages with word-level timestamps, running locally without API calls.

§01

What it is

Whisper is OpenAI's open-source speech recognition model. It transcribes audio and video files to text with high accuracy across 99 languages. The model runs locally, requires no API key, and produces output in plain text, SRT subtitles, VTT, JSON, or TSV formats.

Whisper targets developers, content creators, and researchers who need reliable transcription without sending audio to cloud services. Multiple model sizes (tiny to large) trade accuracy for speed.

§02

How it saves time or tokens

Manual transcription takes 4-6x the audio duration. Whisper transcribes a 1-hour podcast in minutes on a modern GPU or 15-30 minutes on CPU. The output includes segment-level timestamps by default and word-level timestamps when word_timestamps is enabled, making it directly usable for subtitle generation.
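To make the subtitle claim concrete, here is a minimal sketch of turning timestamped segments into SRT text. It assumes each segment is a dict with 'start', 'end', and 'text' keys, as in the result['segments'] list shown in the Example section. The helper names format_srt_time and segments_to_srt are hypothetical and not part of Whisper. The CLI's --output_format srt already does this for you, so the sketch is mainly useful when post-processing results from the Python API.

```python
def format_srt_time(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments) -> str:
    """Build SRT text from dicts that carry 'start', 'end', and 'text' keys."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        start = format_srt_time(seg["start"])
        end = format_srt_time(seg["end"])
        # Each SRT cue is: counter, time range, text, then a blank line.
        blocks.append(f"{index}\n{start} --> {end}\n{seg['text'].strip()}\n")
    return "\n".join(blocks)
```

With a real transcription, segments_to_srt(result['segments']) would give text that can be written to a .srt file.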

For AI workflows, Whisper converts audio content into text that LLMs can process. Meeting recordings, podcast episodes, and lecture videos become searchable, summarizable text.

§03

How to use

Install Whisper:

pip install openai-whisper

Transcribe audio from the command line:

whisper audio.mp3 --model medium --language en --output_format srt
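If you need to run the same command over many files, one option is to drive the CLI from a small script. The sketch below assumes the whisper executable is on your PATH. The helper names build_whisper_command and transcribe_folder are hypothetical, and the --output_dir flag is added here to control where the subtitle files land.

```python
import subprocess
from pathlib import Path


def build_whisper_command(audio_path, model="medium", language="en",
                          output_format="srt", output_dir="."):
    """Return the argument list for one whisper CLI invocation."""
    return [
        "whisper", str(audio_path),
        "--model", model,
        "--language", language,
        "--output_format", output_format,
        "--output_dir", str(output_dir),
    ]


def transcribe_folder(folder, pattern="*.mp3", **options):
    """Run the whisper CLI once per matching file (assumes it is installed)."""
    for audio_path in sorted(Path(folder).glob(pattern)):
        subprocess.run(build_whisper_command(audio_path, **options), check=True)
```

Running one process per file reloads the model each time. For large batches, loading the model once through the Python API is likely to be faster.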

Use the Python API for programmatic access:

import whisper

model = whisper.load_model('medium')
result = model.transcribe('audio.mp3')
print(result['text'])

§04

Example

import whisper

model = whisper.load_model('medium')

# Transcribe with word-level timestamps
result = model.transcribe(
    'meeting.mp3',
    word_timestamps=True,
    language='en'
)

# Access segments with timestamps
for segment in result['segments']:
    print(f"[{segment['start']:.1f}s - {segment['end']:.1f}s] {segment['text']}")

# Output:
# [0.0s - 3.2s] Welcome to the quarterly review.
# [3.2s - 7.8s] Let us start with the revenue numbers.
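The example above prints segment-level timestamps only. When word_timestamps=True is set, each segment should also carry a 'words' list whose entries have 'word', 'start', and 'end' keys. The following sketch flattens those entries. The helper name iter_words is hypothetical, and the sample dict is hand-written for illustration, not real model output.

```python
def iter_words(result):
    """Yield (word, start, end) tuples from a transcription result dict."""
    for segment in result.get("segments", []):
        # 'words' is only present when word_timestamps=True was requested.
        for word in segment.get("words", []):
            yield word["word"].strip(), word["start"], word["end"]


# Hand-written sample in the shape described above (not real output).
sample_result = {
    "segments": [
        {
            "start": 0.0,
            "end": 1.1,
            "text": " Welcome to",
            "words": [
                {"word": " Welcome", "start": 0.0, "end": 0.6},
                {"word": " to", "start": 0.6, "end": 1.1},
            ],
        }
    ]
}

for word, start, end in iter_words(sample_result):
    print(f"{start:.1f}s - {end:.1f}s  {word}")
```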

§05

Related on TokRepo

AI Tools for Voice -- Speech synthesis and recognition tools
AI Tools for Content -- Content creation and processing tools

§06

Common pitfalls

The large model requires 10GB+ of VRAM. Use the medium or small model if GPU memory is limited. CPU inference works but is significantly slower.
Whisper hallucinates on silent audio segments, sometimes generating repetitive or nonsensical text. Pre-process audio to trim silence.
Non-English transcription accuracy varies by language. Languages with less training data produce lower quality output.
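Besides trimming silence before transcription, you can filter suspicious segments afterwards. The sketch below relies on the 'no_speech_prob' value that Whisper attaches to each segment. The helper name drop_likely_silence is hypothetical, and the 0.5 cutoff is an arbitrary assumption that you would need to tune on your own audio.

```python
def drop_likely_silence(segments, max_no_speech_prob=0.5):
    """Keep only segments whose no-speech probability is at or below the cutoff.

    The default cutoff is an illustrative assumption, not a recommended value.
    """
    kept = []
    for seg in segments:
        # Segments without the key are kept rather than silently discarded.
        if seg.get("no_speech_prob", 0.0) <= max_no_speech_prob:
            kept.append(seg)
    return kept
```

This is a coarse filter: it can also drop quiet but genuine speech, so it is worth reviewing what it removes before relying on it.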

FAQ

Which Whisper model should I use?

Use 'tiny' or 'base' for quick drafts (fastest, lowest accuracy). Use 'medium' for a good balance of speed and quality. Use 'large' for the highest accuracy, especially for non-English languages or noisy audio.
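If GPU memory is the deciding factor, the choice can be sketched as a simple lookup. The 10 GB figure for 'large' comes from the pitfalls section. The other figures are approximate assumptions based on commonly quoted requirements, and the helper name pick_model is hypothetical.

```python
# (model name, approximate VRAM needed in GB), largest first.
# Only the 'large' figure comes from this page; the rest are rough assumptions.
APPROX_VRAM_GB = [
    ("large", 10),
    ("medium", 5),
    ("small", 2),
    ("base", 1),
    ("tiny", 1),
]


def pick_model(available_vram_gb: float) -> str:
    """Return the largest model whose approximate VRAM need fits the budget."""
    for name, needed_gb in APPROX_VRAM_GB:
        if available_vram_gb >= needed_gb:
            return name
    # Fall back to the smallest model, for example for CPU-only machines.
    return "tiny"
```

The returned name can be passed straight to whisper.load_model. Memory is only one constraint, so a smaller model may still be the better choice when speed matters more than accuracy.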

Does Whisper require a GPU?

No, but a GPU dramatically improves speed. CPU inference works for all model sizes but is typically several times slower (roughly 5-10x, depending on hardware and model size). A CUDA-compatible NVIDIA GPU is recommended for production use.
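A common pattern is to pick the device at runtime and fall back to CPU. The sketch below assumes PyTorch is installed alongside Whisper, which the pip package normally pulls in. The helper names transcribe_options and transcribe_file are hypothetical. Half precision is switched off on CPU because it is generally not supported there.

```python
def transcribe_options(cuda_available: bool) -> dict:
    """Choose device and precision based on whether a CUDA GPU is available."""
    return {
        "device": "cuda" if cuda_available else "cpu",
        # fp16 is only useful on GPU; use full precision on CPU.
        "fp16": cuda_available,
    }


def transcribe_file(path, model_name="medium"):
    """Load a model on the best available device and transcribe one file."""
    # Imported here so the option helper above works without these packages.
    import torch
    import whisper

    opts = transcribe_options(torch.cuda.is_available())
    model = whisper.load_model(model_name, device=opts["device"])
    return model.transcribe(path, fp16=opts["fp16"])
```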

Can Whisper do real-time transcription?

Whisper is designed for batch transcription of recorded audio. Near-real-time streaming is possible with community reimplementations such as faster-whisper or whisper.cpp, which are optimized for lower latency.

Is Whisper free to use?

Yes. The model weights and code are open source under the MIT license. Running Whisper locally costs nothing beyond your own hardware. OpenAI also offers a paid Whisper API for cloud-based transcription.

What audio formats does Whisper support?

Whisper supports any format that ffmpeg can decode: MP3, WAV, M4A, FLAC, OGG, MP4, MKV, and more. ffmpeg is a required dependency.
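Because ffmpeg is a hard requirement, it can help to check for it before starting a long job. This is a minimal sketch using only the standard library. The helper names ffmpeg_available and require_ffmpeg are hypothetical.

```python
import shutil


def ffmpeg_available() -> bool:
    """Return True if an ffmpeg executable can be found on PATH."""
    return shutil.which("ffmpeg") is not None


def require_ffmpeg() -> None:
    """Raise a clear error early instead of failing in the middle of a job."""
    if not ffmpeg_available():
        raise RuntimeError(
            "ffmpeg was not found on PATH; install it before running Whisper."
        )
```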

Source and acknowledgements

Created by OpenAI. Licensed under MIT. whisper — ⭐ 75,000+

Related assets

Faster Whisper — 4x Faster Speech-to-Text

whisper.cpp — Local Speech-to-Text in Pure C/C++

Groq Whisper — Sub-Second Speech-to-Text for Voice Agents

SenseVoice — Multilingual Speech Understanding Model