Whisper — OpenAI Speech-to-Text
OpenAI's open-source speech recognition model. Transcribe audio/video to text with word-level timestamps in 99 languages. Essential for subtitle generation.
What it is
Whisper is OpenAI's open-source speech recognition model. It transcribes audio and video files to text with high accuracy across 99 languages. The model runs locally, requires no API key, and produces output in plain text, SRT subtitles, VTT, JSON, or TSV formats.
Whisper targets developers, content creators, and researchers who need reliable transcription without sending audio to cloud services. Multiple model sizes (tiny through large) let you trade accuracy against speed.
How it saves time or tokens
Manual transcription takes 4-6x the audio duration. Whisper transcribes a 1-hour podcast in minutes on a modern GPU or 15-30 minutes on CPU. The output includes segment-level timestamps by default (word-level with a flag), making it directly usable for subtitle generation.
For AI workflows, Whisper converts audio content into text that LLMs can process. Meeting recordings, podcast episodes, and lecture videos become searchable, summarizable text.
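As a sketch of that workflow, the loop below batch-transcribes a folder of recordings into plain-text files ready for indexing or LLM summarization (the recordings/ and transcripts/ directory names are illustrative, not part of Whisper's API):
import pathlib
import whisper

# Load the model once and reuse it across all files
model = whisper.load_model('medium')

out_dir = pathlib.Path('transcripts')
out_dir.mkdir(exist_ok=True)

for audio_path in pathlib.Path('recordings').glob('*.mp3'):
    result = model.transcribe(str(audio_path))
    (out_dir / f'{audio_path.stem}.txt').write_text(result['text'])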
How to use
- Install Whisper:
pip install openai-whisper
- Transcribe audio from the command line:
whisper audio.mp3 --model medium --language en --output_format srt
- Use the Python API for programmatic access:
import whisper
model = whisper.load_model('medium')
result = model.transcribe('audio.mp3')
print(result['text'])
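For subtitle output from Python rather than the CLI, a minimal sketch that converts the segments Whisper returns into an SRT file by hand (whisper.utils also ships ready-made writers; the file names here are illustrative):
import whisper

def to_srt_time(seconds):
    # SRT timestamps use the form HH:MM:SS,mmm
    ms = int(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f'{h:02d}:{m:02d}:{s:02d},{ms:03d}'

model = whisper.load_model('medium')
result = model.transcribe('audio.mp3')

with open('audio.srt', 'w') as f:
    for i, segment in enumerate(result['segments'], start=1):
        f.write(f"{i}\n")
        f.write(f"{to_srt_time(segment['start'])} --> {to_srt_time(segment['end'])}\n")
        f.write(f"{segment['text'].strip()}\n\n")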
Example
import whisper
model = whisper.load_model('medium')
# Transcribe with word-level timestamps
result = model.transcribe(
    'meeting.mp3',
    word_timestamps=True,
    language='en'
)

# Access segments with timestamps
for segment in result['segments']:
    print(f"[{segment['start']:.1f}s - {segment['end']:.1f}s] {segment['text']}")
# Output:
# [0.0s - 3.2s] Welcome to the quarterly review.
# [3.2s - 7.8s] Let us start with the revenue numbers.
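Because word_timestamps=True was passed, each segment also carries a words list; a short follow-up on the same result (key names are those emitted by openai-whisper):
# Each entry in segment['words'] has 'word', 'start', and 'end' keys
for segment in result['segments']:
    for word in segment['words']:
        print(f"{word['start']:.2f}s  {word['word'].strip()}")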
Related on TokRepo
- AI Tools for Voice -- Speech synthesis and recognition tools
- AI Tools for Content -- Content creation and processing tools
Common pitfalls
- The large model requires 10GB+ of VRAM. Use the medium or small model if GPU memory is limited. CPU inference works but is significantly slower.
- Whisper hallucinates on silent audio segments, sometimes generating repetitive or nonsensical text. Pre-process audio to trim silence; see the ffmpeg sketch after this list.
- Non-English transcription accuracy varies by language. Languages with less training data produce lower quality output.
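One way to strip long silences before transcription is ffmpeg's silenceremove filter (the -50dB threshold and 2-second minimum are illustrative starting points to tune per recording):
ffmpeg -i input.mp3 -af silenceremove=stop_periods=-1:stop_threshold=-50dB:stop_duration=2 trimmed.mp3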
Frequently Asked Questions
Which model size should I use?
Use 'tiny' or 'base' for quick drafts (fastest, lowest accuracy). Use 'medium' for a good balance of speed and quality. Use 'large' for the highest accuracy, especially for non-English languages or noisy audio.
Do I need a GPU to run Whisper?
No, but a GPU dramatically improves speed. CPU inference works for all model sizes but is 5-10x slower. A CUDA-compatible NVIDIA GPU is recommended for production use.
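In the Python API, the device is explicit; a minimal sketch that uses the GPU when one is available (torch is already a Whisper dependency):
import torch
import whisper

# Use the GPU if PyTorch can see one, otherwise fall back to CPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = whisper.load_model('medium', device=device)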
Can Whisper transcribe in real time?
Whisper is designed for batch transcription of recorded audio. Real-time streaming is possible with community forks like faster-whisper or whisper.cpp, which optimize for lower latency.
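faster-whisper keeps a batch API close to the original; a minimal sketch, assuming a recent faster-whisper release (compute_type='int8' is an illustrative choice for CPU-friendly inference):
from faster_whisper import WhisperModel

model = WhisperModel('medium', compute_type='int8')
segments, info = model.transcribe('audio.mp3')
for seg in segments:
    print(f'[{seg.start:.1f}s - {seg.end:.1f}s] {seg.text}')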
Is Whisper free to use?
Yes. The model weights and code are open-source under the MIT license. Running Whisper locally is completely free. OpenAI also offers a paid Whisper API for cloud-based transcription.
Which audio and video formats does Whisper support?
Whisper supports any format that ffmpeg can decode: MP3, WAV, M4A, FLAC, OGG, MP4, MKV, and more. ffmpeg is a required dependency.
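ffmpeg comes from the system package manager rather than pip; typical install commands:
# Ubuntu/Debian
sudo apt install ffmpeg
# macOS with Homebrew
brew install ffmpeg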