## Models
| Model | Parameters | Speed | Accuracy | VRAM |
|---|---|---|---|---|
| tiny | 39M | ~10x realtime | Fair | ~1 GB |
| base | 74M | ~7x realtime | Good | ~1 GB |
| small | 244M | ~4x realtime | Better | ~2 GB |
| medium | 769M | ~2x realtime | Great | ~5 GB |
| large-v3 | 1.5B | ~1x realtime | Best | ~10 GB |
## Python API

```python
import whisper

model = whisper.load_model("medium")
result = model.transcribe("audio.mp3", word_timestamps=True)
for segment in result["segments"]:
    print(f"[{segment['start']:.1f}s - {segment['end']:.1f}s] {segment['text']}")
```
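
With `word_timestamps=True`, recent versions of the `openai-whisper` package also attach a `"words"` list to each segment, with per-word `"word"`, `"start"`, and `"end"` fields (check your installed version). A sketch of flattening that structure, using an illustrative stand-in for a real `transcribe()` result:

```python
def words_with_times(result: dict) -> list[tuple[str, float, float]]:
    """Flatten a transcribe() result into (word, start, end) tuples."""
    return [
        (w["word"].strip(), w["start"], w["end"])
        for seg in result["segments"]
        for w in seg.get("words", [])  # "words" is present with word_timestamps=True
    ]

# Illustrative stand-in mimicking the shape of a transcribe() result:
sample = {"segments": [{"words": [
    {"word": " Hello", "start": 0.0, "end": 0.42},
    {"word": " world.", "start": 0.42, "end": 0.88},
]}]}
print(words_with_times(sample))  # [('Hello', 0.0, 0.42), ('world.', 0.42, 0.88)]
```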
## Output Formats

```shell
whisper audio.mp3 --output_format srt    # SubRip subtitles
whisper audio.mp3 --output_format vtt    # WebVTT subtitles
whisper audio.mp3 --output_format json   # Detailed JSON with word timestamps
whisper audio.mp3 --output_format txt    # Plain text
```
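
The CLI writes these files for you, but it can help to see what the SRT conversion amounts to. This sketch builds SubRip text from the `segments` structure returned by the Python API; the helper names are hypothetical, and the timestamp math follows the standard `HH:MM:SS,mmm` SRT format.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments: list[dict]) -> str:
    """Render transcribe()-style segments as SubRip (.srt) text."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(blocks)

print(segments_to_srt([{"start": 0.0, "end": 2.5, "text": " Hello world."}]))
```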
## FAQ

**Q: What is Whisper?**
A: OpenAI's open-source speech recognition model, which transcribes audio to text in 99 languages and can produce word-level timestamps. Its GitHub repository has 75,000+ stars.

**Q: Is Whisper free?**
A: Yes. Whisper is MIT-licensed and runs locally on your machine, with no API costs.

**Q: What languages does Whisper support?**
A: 99 languages, including English, Chinese, Spanish, French, German, Japanese, Korean, Arabic, and more.