WhisperX — 70x Faster Speech Recognition
WhisperX provides 70x realtime speech recognition with word-level timestamps and speaker diarization. 21K+ GitHub stars. Batched inference, under 8GB VRAM. BSD-2-Clause.
What it is
WhisperX is a speech recognition system that accelerates OpenAI's Whisper model to 70x realtime speed through batched inference. It adds word-level timestamps via forced alignment and speaker diarization to identify who said what. The project requires under 8GB of VRAM and is licensed under BSD-2-Clause.
Researchers, podcast producers, and developers building transcription pipelines will find WhisperX useful when standard Whisper is too slow or when per-word timing and speaker labels are required.
How it saves time or tokens
Standard Whisper processes audio sequentially, making long recordings slow to transcribe. WhisperX batches audio segments and processes them in parallel, cutting transcription time drastically. The word-level alignment and diarization run as post-processing steps, so you get richer output without re-running the model.
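To put the 70x figure in concrete terms, here is rough back-of-the-envelope arithmetic (a sketch only; real throughput depends on GPU, model size, and batch size):

```python
# Rough arithmetic for the claimed 70x realtime speedup.
# Actual throughput varies with hardware and settings.
audio_seconds = 3600   # one hour of audio
speedup = 70           # claimed realtime factor

processing_seconds = audio_seconds / speedup
print(round(processing_seconds, 1))  # ~51.4 seconds for an hour of audio
```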
How to use
- Install WhisperX with pip and ensure you have a CUDA-capable GPU with at least 8GB VRAM.
- Run the CLI command with your audio file and desired output format.
- Optionally enable speaker diarization by providing a HuggingFace token for the pyannote models.
Example
# Basic transcription with word-level timestamps
whisperx audio.mp3 --model large-v2 --output_dir ./output
# With speaker diarization
whisperx audio.mp3 --model large-v2 --diarize \
--hf_token YOUR_HF_TOKEN --output_dir ./output
# Specify language and output format
whisperx audio.mp3 --model large-v2 --language en \
--output_format srt --output_dir ./output
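WhisperX writes the aligned transcript to the output directory in the chosen format, and the segment data is easy to post-process if you need a custom format. A minimal sketch of converting a simplified segment list into SRT subtitles (the `start`, `end`, and `text` field names here are an assumption, not an official schema):

```python
# Convert a simplified WhisperX-style segment list into SRT subtitles.
# The segment fields (start, end, text) are assumed for illustration.

def to_srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments: list[dict]) -> str:
    """Render numbered SRT blocks, one per segment."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n"
            f"{to_srt_time(seg['start'])} --> {to_srt_time(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(blocks)

segments = [
    {"start": 0.0, "end": 2.5, "text": "Hello and welcome."},
    {"start": 2.5, "end": 5.0, "text": "Let's get started."},
]
print(segments_to_srt(segments))
```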
Related on TokRepo
- AI tools for voice — Other voice and speech processing tools for AI workflows.
- AI tools for automation — Automate transcription pipelines and audio processing tasks.
Common pitfalls
- Running without a CUDA GPU. WhisperX's speed advantage comes from GPU batching; CPU-only mode is significantly slower.
- Forgetting the HuggingFace token for diarization. The pyannote speaker diarization models require authentication through HuggingFace.
- Using the wrong model size for your VRAM. The large-v2 model needs close to 8GB; smaller GPUs should use the medium or small variants.
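Once diarization runs, each segment carries a speaker label, and grouping the transcript by speaker is straightforward. A minimal sketch, assuming a simplified segment structure with a `speaker` key (labels like `SPEAKER_00` follow pyannote's convention, but the exact schema here is an assumption):

```python
from collections import defaultdict

# Group transcript text by speaker label from diarized segments.
# The segment structure is a simplified assumption, not an official schema.
def group_by_speaker(segments: list[dict]) -> dict[str, str]:
    lines = defaultdict(list)
    for seg in segments:
        lines[seg.get("speaker", "UNKNOWN")].append(seg["text"].strip())
    return {speaker: " ".join(texts) for speaker, texts in lines.items()}

segments = [
    {"speaker": "SPEAKER_00", "text": "How was the demo?"},
    {"speaker": "SPEAKER_01", "text": "It went well."},
    {"speaker": "SPEAKER_00", "text": "Great to hear."},
]
print(group_by_speaker(segments))
```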
Frequently Asked Questions
How fast is WhisperX compared to standard Whisper?
WhisperX achieves 70x realtime speed through batched inference. A one-hour audio file that takes roughly an hour with standard Whisper can be transcribed in under a minute with WhisperX on a capable GPU.
How does WhisperX produce word-level timestamps?
After transcription, WhisperX uses forced alignment (via wav2vec2) to map each word to its exact start and end time in the audio. This is more precise than Whisper's default segment-level timestamps.
Does WhisperX support languages other than English?
Yes. WhisperX inherits Whisper's multilingual support. You can specify the language with the --language flag or let it auto-detect. Forced alignment models are available for many languages.
What hardware does WhisperX require?
A CUDA-capable GPU with at least 8GB VRAM is recommended for the large-v2 model. Smaller models (medium, small, base) work on GPUs with less VRAM. CPU-only mode works but loses the speed advantage.
Can WhisperX transcribe audio in real time?
WhisperX is designed for batch processing of recorded audio files. It is not optimized for real-time streaming. For live transcription, consider streaming-focused alternatives.
Citations (3)
- WhisperX GitHub — WhisperX achieves 70x realtime with batched inference
- OpenAI Whisper GitHub — OpenAI Whisper model for speech recognition
- HuggingFace pyannote — Pyannote speaker diarization models
Source & Thanks
Created by Max Bain. Licensed under BSD-2-Clause. m-bain/whisperX — 21,000+ GitHub stars