whisper.cpp — Local Speech-to-Text in Pure C/C++
High-performance port of OpenAI Whisper in C/C++. No Python, no GPU required. Runs on CPU, Apple Silicon, CUDA, and even Raspberry Pi. Real-time transcription.
What it is
whisper.cpp is a high-performance C/C++ port of OpenAI's Whisper speech recognition model by Georgi Gerganov (creator of llama.cpp). It runs entirely locally with zero dependencies: no Python, no PyTorch, no internet connection needed.
The key advantage: it runs efficiently on CPU. Apple Silicon gets 4-8x speedup via Core ML and Metal. NVIDIA GPUs work via CUDA. Even a Raspberry Pi can transcribe audio. Real-time streaming transcription works on modern laptops.
How it saves time or tokens
whisper.cpp provides speech-to-text without cloud API costs or latency. Traditional Whisper requires Python, PyTorch, and ideally a GPU. whisper.cpp runs on any hardware with a single binary. For privacy-sensitive applications, all processing stays on-device. The tiny model (75 MB) transcribes at 32x real-time on CPU, making it practical for batch processing of audio archives.
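To make the batch-processing arithmetic concrete, here is a small sketch. The speed factors are the approximate CPU figures from the model table on this page; treat them as ballpark numbers, not guarantees:

```python
# Rough estimate of wall-clock compute time for transcribing an audio
# archive, using approximate "x real-time" CPU speed factors.
# Actual throughput depends heavily on your hardware.

SPEED_FACTOR = {  # model name -> approx. multiple of real-time on CPU
    "tiny": 32, "base": 16, "small": 6, "medium": 2, "large": 1,
}

def transcription_hours(audio_hours: float, model: str) -> float:
    """Estimated hours of compute to transcribe `audio_hours` of audio."""
    return audio_hours / SPEED_FACTOR[model]

# 100 hours of recordings with the tiny model: ~3.1 hours of CPU time
print(round(transcription_hours(100, "tiny"), 1))
```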
How to use
- Clone, build, and download a model:
```sh
git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp
cmake -B build
cmake --build build --config Release
bash models/download-ggml-model.sh base.en
```
- Transcribe an audio file:
```sh
./build/bin/whisper-cli -m models/ggml-base.en.bin -f samples/jfk.wav
```
- Real-time microphone transcription:
```sh
./build/bin/whisper-stream -m models/ggml-base.en.bin
# Speak into your microphone -- text appears in real time
```
Example
Model size comparison for different use cases:
| Model | Disk | RAM | Speed (CPU) | Quality |
|--------|---------|---------|----------------|------------------|
| tiny | 75 MB | ~390 MB | ~32x real-time | Good for drafts |
| base | 142 MB | ~500 MB | ~16x real-time | Solid accuracy |
| small | 466 MB | ~1 GB | ~6x real-time | Good quality |
| medium | 1.5 GB | ~2.6 GB | ~2x real-time | High quality |
| large | 2.9 GB | ~4.7 GB | ~1x real-time | Best quality |
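One practical use of the table is picking the largest model that fits a memory budget. A minimal sketch using the approximate RAM figures above (these are estimates, not hard limits):

```python
# Pick the largest whisper.cpp model whose approximate RAM footprint
# fits within a given budget. Figures mirror the table above.

RAM_MB = {"tiny": 390, "base": 500, "small": 1000, "medium": 2600, "large": 4700}
ORDER = ["tiny", "base", "small", "medium", "large"]

def pick_model(budget_mb: int) -> str:
    """Largest model fitting within budget_mb; falls back to tiny."""
    best = "tiny"
    for name in ORDER:
        if RAM_MB[name] <= budget_mb:
            best = name
    return best

print(pick_model(2000))  # a 2 GB budget selects "small"
```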
```sh
# Output formats
./build/bin/whisper-cli -m models/ggml-base.en.bin -f audio.wav -otxt  # Plain text
./build/bin/whisper-cli -m models/ggml-base.en.bin -f audio.wav -osrt  # SRT subtitles
./build/bin/whisper-cli -m models/ggml-base.en.bin -f audio.wav -ovtt  # VTT subtitles
./build/bin/whisper-cli -m models/ggml-base.en.bin -f audio.wav -oj    # JSON with timestamps
```
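The JSON output is easy to post-process. Below is a sketch that converts segments into SRT cues. It assumes the JSON carries a `transcription` array whose entries have millisecond `offsets` and a `text` field, which matches recent whisper.cpp releases; verify the schema for your version before relying on it:

```python
import json

def ms_to_srt(ms: int) -> str:
    """Format milliseconds as an SRT timestamp: HH:MM:SS,mmm."""
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def json_to_srt(doc: str) -> str:
    """Convert whisper.cpp-style JSON (assumed schema) to SRT text."""
    segments = json.loads(doc)["transcription"]
    cues = []
    for i, seg in enumerate(segments, start=1):
        start = ms_to_srt(seg["offsets"]["from"])
        end = ms_to_srt(seg["offsets"]["to"])
        cues.append(f"{i}\n{start} --> {end}\n{seg['text'].strip()}\n")
    return "\n".join(cues)

sample = '{"transcription": [{"offsets": {"from": 0, "to": 2500}, "text": " Hello world"}]}'
print(json_to_srt(sample))
```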
Related on TokRepo
- AI tools for voice — More speech and voice tools on TokRepo.
- Local LLM tools — Browse local AI inference tools.
Common pitfalls
- Using the large model on hardware without a GPU leads to very slow transcription. Start with base or small for CPU-only setups.
- Audio files must be 16kHz 16-bit mono WAV. Convert other formats with ffmpeg before processing.
- Real-time streaming requires SDL2 for audio capture: configure the build with cmake -B build -DWHISPER_SDL2=ON to get the whisper-stream binary, and make sure your microphone input is set up correctly.
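Because the wrong sample rate or channel count is the most common source of silent failures, a quick pre-flight check can help. This is an illustrative helper using only the Python standard library, not part of whisper.cpp, and it assumes plain PCM WAV input:

```python
import wave

def is_whisper_ready(path: str) -> bool:
    """True if the WAV file is 16 kHz, 16-bit, mono PCM."""
    with wave.open(path, "rb") as w:
        return (w.getframerate() == 16000
                and w.getsampwidth() == 2   # 16-bit samples
                and w.getnchannels() == 1)  # mono
```

If the check fails, convert the file with ffmpeg before passing it to whisper-cli.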
Frequently Asked Questions
Does whisper.cpp require a GPU?
No. whisper.cpp runs on CPU by default. GPU acceleration via CUDA (NVIDIA), Metal (Apple), and Core ML (Apple) is optional and provides significant speedups. Even a Raspberry Pi can run the tiny model.
How does whisper.cpp compare to the original Python Whisper?
whisper.cpp provides the same transcription quality (it uses the same model weights) but runs without Python dependencies. It is faster on CPU and uses less memory. The tradeoff is that it requires manual compilation.
What audio formats does whisper.cpp accept?
whisper.cpp requires 16kHz 16-bit mono WAV input. Convert other formats using ffmpeg: ffmpeg -i input.mp3 -ar 16000 -ac 1 -c:a pcm_s16le output.wav.
Can whisper.cpp transcribe in real time?
Yes. The whisper-stream binary captures audio from your microphone and transcribes it in real time. This works with the tiny and base models on modern hardware.
What output formats does whisper.cpp support?
whisper.cpp outputs plain text, SRT subtitles, VTT subtitles, JSON with timestamps, and CSV. Choose the format with the -otxt, -osrt, -ovtt, -oj, or -ocsv flags.
Source & Thanks
- GitHub: ggerganov/whisper.cpp — 37,000+ stars, MIT License
- By Georgi Gerganov (also creator of llama.cpp)