代码2026年4月2日·1 分钟阅读

whisper.cpp — Local Speech-to-Text in Pure C/C++

High-performance port of OpenAI Whisper in C/C++. No Python, no GPU required. Runs on CPU, Apple Silicon, CUDA, and even Raspberry Pi. Real-time transcription.

Agent 就绪

Agent 可直接安装

这个资产可安装;Agent 先选择当前运行时、检查安装计划,再运行匹配命令。

Native · 98/100策略:允许
Agent 入口
任意 MCP/CLI Agent
类型
Skill
安装
Single
信任
信任等级:Established
入口
whisper-cpp.md
直接安装命令
npx -y tokrepo@latest install e1fd7c46-bbda-4956-8649-9c3ed579ff25 --target codex

先 dry-run 确认安装计划,再运行此命令。

TL;DR
whisper.cpp ports OpenAI's Whisper ASR to plain C/C++. Transcribes audio at realtime+ on M-series Macs with zero Python, zero GPU, zero runtime.
§01

The Python tax on Whisper

OpenAI's original Whisper release depends on PyTorch, which means a 2GB dependency tree, GPU VRAM pressure, and slow cold starts. whisper.cpp is a ground-up port in pure C/C++ with zero runtime dependencies — a single 5MB binary that transcribes audio faster than realtime on any modern laptop.

§02

60-second install

# macOS
brew install whisper-cpp
whisper-cpp -m /path/to/model.bin -f audio.wav

# From source (any OS)
git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp
cmake -B build
cmake --build build --config Release

# Download a model
bash models/download-ggml-model.sh base.en

# Transcribe
./build/bin/whisper-cli -m models/ggml-base.en.bin -f samples/jfk.wav
§03

Model sizes and speed tradeoffs (M2 Pro MacBook, 16GB)

ModelSizeSpeed (60s audio)WER (English)
tiny39 MB1.2s14%
base74 MB2.8s9%
small244 MB8.5s6%
medium769 MB21s4.5%
large-v31.5 GB58s2.8%

For most voice-memo and meeting transcription, base.en is the sweet spot. For multilingual or professional audio, use large-v3.

§04

Platforms that work

  • macOS (Apple Silicon: CoreML + Metal GPU acceleration)
  • Linux (CUDA, AMD ROCm, Vulkan, pure CPU)
  • Windows (MSVC + CUDA)
  • Raspberry Pi 4/5
  • iOS / Android (via JNI + Core ML)
  • WebAssembly (yes, whisper.cpp runs in the browser)
§05

Real-time microphone transcription

./build/bin/stream -m models/ggml-base.en.bin --step 500 --length 5000

Streaming mode updates every 500ms with a 5s rolling window — good enough for live captioning, meeting notes, or voice-controlled agents.

§06

Integration with AI agents

Pattern: voice-to-text feed for Claude Code

# Record voice memo
ffmpeg -f avfoundation -i ":0" -t 30 memo.wav

# Transcribe with whisper.cpp
TEXT=$(whisper-cpp -m ~/models/ggml-base.en.bin -f memo.wav -otxt -nt)

# Pipe into Claude
claude "$TEXT"

Sub-1 second latency on M2+ machines. Entirely offline.

§07

Quantization for low-RAM devices

whisper.cpp supports 4-bit quantization via ggml — shrinks large-v3 from 1.5GB to 380MB with ~1% WER increase. Essential for mobile and Pi deployments:

./build/bin/quantize models/ggml-large-v3.bin models/ggml-large-v3-q4.bin q4_0
§08

Benchmarks vs alternatives (2026)

SolutionLatency (60s audio, CPU)OfflineDependencies
whisper.cpp (base)2.8sNone
openai-whisper (Python)38sPyTorch 2GB
faster-whisper (CTranslate2)6.1sCTranslate2
Cloud APIs (Deepgram)network-boundAPI key

whisper.cpp wins on offline latency by 3-10× margins.

§09

Known limitations

  1. Speaker diarization not built-in. Pair with whisper-diarization or pyannote.audio for who-said-what.
  2. Long audio memory costlarge-v3 needs ~3GB RAM for continuous transcription. Chunk with the -sns flag.
  3. Punctuation in streaming mode is weaker than offline mode — use non-streaming for final transcripts.

常见问题

Does whisper.cpp really work without Python?+

Yes. whisper.cpp is a pure C/C++ implementation with no Python dependency at runtime. The binary is approximately 5MB and has no external runtime requirements other than the model file. This makes it ideal for embedded, mobile, and serverless deployments.

How fast is it on a MacBook compared to Python Whisper?+

On an M2 Pro MacBook, whisper.cpp transcribes 60 seconds of audio with the base.en model in approximately 2.8 seconds. The Python openai-whisper equivalent takes around 38 seconds on the same hardware — a 13x speedup, primarily from Metal GPU acceleration.

Which model size should I use?+

For most voice memos and meeting transcription in English, base.en offers the best speed-accuracy tradeoff at 74MB and approximately 9% word error rate. For professional audio or multilingual content, use large-v3 at 1.5GB with 2.8% WER. Tiny is only recommended for extremely low-resource devices.

Can it run on Raspberry Pi?+

Yes. whisper.cpp supports Raspberry Pi 4 and 5 with ARM NEON optimizations. With the tiny or base model plus 4-bit quantization, it achieves near-realtime transcription on Pi 5. The quantized large-v3-q4 model runs but transcribes slower than realtime.

Does it support streaming transcription?+

Yes. The stream binary processes audio in rolling windows with configurable step and length parameters. A typical 500ms step with 5-second window gives you live captioning suitable for meeting notes or voice-controlled agents.

引用来源 (3)
🙏

来源与感谢

  • GitHub: ggerganov/whisper.cpp — 37,000+ stars, MIT License
  • By Georgi Gerganov (also creator of llama.cpp)

讨论

登录后参与讨论。
还没有评论,来写第一条吧。

相关资产