代码2026年4月2日·1 分钟阅读

whisper.cpp — Local Speech-to-Text in Pure C/C++

High-performance port of OpenAI Whisper in C/C++. No Python, no GPU required. Runs on CPU, Apple Silicon, CUDA, and even Raspberry Pi. Real-time transcription.

Script Depot · Community

Agent 就绪

Agent 可直接安装

这个资产可安装；Agent 先选择当前运行时、检查安装计划，再运行匹配命令。

Native · 98/100策略：允许

Agent 入口

任意 MCP/CLI Agent

类型

Skill

安装

Single

信任

信任等级：Established

入口

whisper-cpp.md

直接安装命令

npx -y tokrepo@latest install e1fd7c46-bbda-4956-8649-9c3ed579ff25 --target codex

先 dry-run 确认安装计划，再运行此命令。

TL;DR

whisper.cpp ports OpenAI's Whisper ASR to plain C/C++. Transcribes audio at realtime+ on M-series Macs with zero Python, zero GPU, zero runtime.

§01

The Python tax on Whisper

OpenAI's original Whisper release depends on PyTorch, which means a 2GB dependency tree, GPU VRAM pressure, and slow cold starts. whisper.cpp is a ground-up port in pure C/C++ with zero runtime dependencies — a single 5MB binary that transcribes audio faster than realtime on any modern laptop.

§02

60-second install

# macOS
brew install whisper-cpp
whisper-cpp -m /path/to/model.bin -f audio.wav

# From source (any OS)
git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp
cmake -B build
cmake --build build --config Release

# Download a model
bash models/download-ggml-model.sh base.en

# Transcribe
./build/bin/whisper-cli -m models/ggml-base.en.bin -f samples/jfk.wav

§03

Model sizes and speed tradeoffs (M2 Pro MacBook, 16GB)

Model	Size	Speed (60s audio)	WER (English)
tiny	39 MB	1.2s	14%
base	74 MB	2.8s	9%
small	244 MB	8.5s	6%
medium	769 MB	21s	4.5%
large-v3	1.5 GB	58s	2.8%

For most voice-memo and meeting transcription, base.en is the sweet spot. For multilingual or professional audio, use large-v3.

§04

Platforms that work

macOS (Apple Silicon: CoreML + Metal GPU acceleration)
Linux (CUDA, AMD ROCm, Vulkan, pure CPU)
Windows (MSVC + CUDA)
Raspberry Pi 4/5
iOS / Android (via JNI + Core ML)
WebAssembly (yes, whisper.cpp runs in the browser)

§05

Real-time microphone transcription

./build/bin/stream -m models/ggml-base.en.bin --step 500 --length 5000

Streaming mode updates every 500ms with a 5s rolling window — good enough for live captioning, meeting notes, or voice-controlled agents.

§06

Integration with AI agents

Pattern: voice-to-text feed for Claude Code

# Record voice memo
ffmpeg -f avfoundation -i ":0" -t 30 memo.wav

# Transcribe with whisper.cpp
TEXT=$(whisper-cpp -m ~/models/ggml-base.en.bin -f memo.wav -otxt -nt)

# Pipe into Claude
claude "$TEXT"

Sub-1 second latency on M2+ machines. Entirely offline.

§07

Quantization for low-RAM devices

whisper.cpp supports 4-bit quantization via ggml — shrinks large-v3 from 1.5GB to 380MB with ~1% WER increase. Essential for mobile and Pi deployments:

./build/bin/quantize models/ggml-large-v3.bin models/ggml-large-v3-q4.bin q4_0

§08

Benchmarks vs alternatives (2026)

Solution	Latency (60s audio, CPU)	Offline	Dependencies
whisper.cpp (base)	2.8s	✅	None
openai-whisper (Python)	38s	✅	PyTorch 2GB
faster-whisper (CTranslate2)	6.1s	✅	CTranslate2
Cloud APIs (Deepgram)	network-bound	❌	API key

whisper.cpp wins on offline latency by 3-10× margins.

§09

Known limitations

Speaker diarization not built-in. Pair with whisper-diarization or pyannote.audio for who-said-what.
Long audio memory cost — large-v3 needs ~3GB RAM for continuous transcription. Chunk with the -sns flag.
Punctuation in streaming mode is weaker than offline mode — use non-streaming for final transcripts.

常见问题

Does whisper.cpp really work without Python?+

Yes. whisper.cpp is a pure C/C++ implementation with no Python dependency at runtime. The binary is approximately 5MB and has no external runtime requirements other than the model file. This makes it ideal for embedded, mobile, and serverless deployments.

How fast is it on a MacBook compared to Python Whisper?+

On an M2 Pro MacBook, whisper.cpp transcribes 60 seconds of audio with the base.en model in approximately 2.8 seconds. The Python openai-whisper equivalent takes around 38 seconds on the same hardware — a 13x speedup, primarily from Metal GPU acceleration.

Which model size should I use?+

For most voice memos and meeting transcription in English, base.en offers the best speed-accuracy tradeoff at 74MB and approximately 9% word error rate. For professional audio or multilingual content, use large-v3 at 1.5GB with 2.8% WER. Tiny is only recommended for extremely low-resource devices.

Can it run on Raspberry Pi?+

Yes. whisper.cpp supports Raspberry Pi 4 and 5 with ARM NEON optimizations. With the tiny or base model plus 4-bit quantization, it achieves near-realtime transcription on Pi 5. The quantized large-v3-q4 model runs but transcribes slower than realtime.

Does it support streaming transcription?+

Yes. The stream binary processes audio in rolling windows with configurable step and length parameters. A typical 500ms step with 5-second window gives you live captioning suitable for meeting notes or voice-controlled agents.

引用来源 (3)

whisper.cpp benchmarks— whisper.cpp achieves 2.8s for 60s audio on M2 Pro with base.en model
whisper.cpp GitHub— 37K+ GitHub stars
OpenAI Whisper paper— Whisper model originally released by OpenAI in 2022

🙏

来源与感谢

GitHub: ggerganov/whisper.cpp — 37,000+ stars, MIT License
By Georgi Gerganov (also creator of llama.cpp)

讨论

登录后参与讨论。

还没有评论，来写第一条吧。