Scripts · Mar 31, 2026 · 2 min read

WhisperX — 70x Faster Speech Recognition

WhisperX delivers 70x realtime speech recognition with word-level timestamps and speaker diarization, using batched inference in under 8GB of VRAM. 21K+ GitHub stars; BSD-2-Clause licensed.

TL;DR
WhisperX runs Whisper at 70x realtime speed with word-level timestamps and speaker diarization.
§01

What it is

WhisperX is a speech recognition system that accelerates OpenAI's Whisper model to 70x realtime speed through batched inference. It adds word-level timestamps via forced alignment, plus speaker diarization to identify who said what. Even the large-v2 model runs in under 8GB of VRAM, and the project is licensed under BSD-2-Clause.

Researchers, podcast producers, and developers building transcription pipelines will find WhisperX useful when standard Whisper is too slow or when per-word timing and speaker labels are required.
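
A minimal Python sketch of the pipeline, with function names taken from the project README (whisperx.load_model, whisperx.align, and friends; exact signatures can shift between releases):

import whisperx

device = "cuda"

# 1. Transcribe with batched inference -- this is where the 70x speedup comes from
model = whisperx.load_model("large-v2", device, compute_type="float16")
audio = whisperx.load_audio("audio.mp3")
result = model.transcribe(audio, batch_size=16)

# 2. Forced alignment (wav2vec2) maps each word to its start/end time
align_model, metadata = whisperx.load_align_model(
    language_code=result["language"], device=device
)
result = whisperx.align(result["segments"], align_model, metadata, audio, device)

print(result["segments"])  # segments now carry per-word timestamps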

§02

How it saves time or tokens

Standard Whisper processes audio sequentially, making long recordings slow to transcribe. WhisperX batches audio segments and processes them in parallel, cutting transcription time drastically. The word-level alignment and diarization run as post-processing steps, so you get richer output without re-running the model.
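
Continuing the sketch above: diarization attaches speaker labels to the transcript you already have, so the ASR model never runs a second pass. Class and function names follow the README (newer releases namespace it as whisperx.diarize.DiarizationPipeline); YOUR_HF_TOKEN is a placeholder:

# Post-processing only -- no second transcription pass
diarize_model = whisperx.DiarizationPipeline(use_auth_token="YOUR_HF_TOKEN", device=device)
diarize_segments = diarize_model(audio)
result = whisperx.assign_word_speakers(diarize_segments, result)
# each segment and word now carries a "speaker" label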

§03

How to use

  1. Install WhisperX with pip (a minimal install sketch follows this list) and ensure you have a CUDA-capable GPU with at least 8GB VRAM.
  2. Run the CLI command with your audio file and desired output format.
  3. Optionally enable speaker diarization by providing a HuggingFace token for the pyannote models.
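
A minimal install, assuming a recent Python and a working CUDA setup (package name as published on PyPI):

pip install whisperx

# diarization needs a HuggingFace token that has accepted the pyannote model terms
export HF_TOKEN=your_token_here   # or pass --hf_token on the command line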
§04

Example

# Basic transcription with word-level timestamps
whisperx audio.mp3 --model large-v2 --output_dir ./output

# With speaker diarization
whisperx audio.mp3 --model large-v2 --diarize \
  --hf_token YOUR_HF_TOKEN --output_dir ./output

# Specify language and output format
whisperx audio.mp3 --model large-v2 --language en \
  --output_format srt --output_dir ./output
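
If the defaults don't fit your GPU, batch size and precision are the usual knobs. The flag names below come from the CLI help; confirm against whisperx --help on your installed version:

# Trade speed for VRAM: smaller batches and int8 weights fit tighter GPUs
whisperx audio.mp3 --model large-v2 --batch_size 8 \
  --compute_type int8 --output_dir ./output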
§05

Common pitfalls

  • Running without a CUDA GPU. WhisperX's speed advantage comes from GPU batching; CPU-only mode is significantly slower.
  • Forgetting the HuggingFace token for diarization. The pyannote speaker diarization models require authentication through HuggingFace.
  • Using the wrong model size for your VRAM. The large-v2 model needs close to 8GB; smaller GPUs should use the medium or small variants (see the low-VRAM example after this list).
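
For example, a hedged low-VRAM invocation (exact memory needs vary by GPU and release):

# medium + int8 keeps memory well below the large-v2 float16 footprint
whisperx audio.mp3 --model medium --compute_type int8 --output_dir ./output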

Frequently Asked Questions

How much faster is WhisperX compared to standard Whisper?

WhisperX reaches 70x realtime speed through batched inference. At that rate, a one-hour audio file transcribes in under a minute on a capable GPU, where standard Whisper's sequential decoding takes many times longer.

What is word-level timestamp alignment?

After transcription, WhisperX uses forced alignment (via wav2vec2) to map each word to its exact start and end time in the audio. This is more precise than Whisper's default segment-level timestamps.
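
Roughly what an aligned segment looks like (key names follow the README's output format; the values here are made up for illustration):

{
    "start": 12.34,
    "end": 15.10,
    "text": "hello and welcome back",
    "words": [
        {"word": "hello", "start": 12.34, "end": 12.61, "score": 0.98},
        # ... one entry per word
    ],
}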

Does WhisperX support multiple languages?

Yes. WhisperX inherits Whisper's multilingual support. You can specify the language with the --language flag or let it auto-detect. Forced alignment models are available for many languages.

What GPU do I need to run WhisperX?

A CUDA-capable GPU with at least 8GB VRAM is recommended for the large-v2 model. Smaller models (medium, small, base) work on GPUs with less VRAM. CPU-only mode works but loses the speed advantage.
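
A hedged CPU-only invocation (it works, but expect nothing near 70x realtime; int8 keeps compute and memory manageable on CPU):

whisperx audio.mp3 --model small --device cpu \
  --compute_type int8 --output_dir ./output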

Can I use WhisperX for live streaming audio?

WhisperX is designed for batch processing of recorded audio files. It is not optimized for real-time streaming. For live transcription, consider streaming-focused alternatives.


Source & Thanks

Created by Max Bain. Licensed under BSD-2-Clause. m-bain/whisperX — 21,000+ GitHub stars
