Introduction
Vosk is an offline speech recognition toolkit that runs entirely on-device without sending audio to the cloud. It wraps the Kaldi ASR engine into a developer-friendly API available in Python, Java, C#, Node.js, and more, enabling low-latency transcription on everything from a Raspberry Pi to a production server.
What Vosk Does
- Transcribes audio to text in 20+ languages without internet
- Provides real-time streaming recognition with partial results
- Supports speaker identification alongside transcription
- Runs on ARM devices including Raspberry Pi and Android
- Offers lightweight models as small as 50 MB for embedded use
Architecture Overview
Vosk uses Kaldi's finite-state transducer decoding pipeline compiled into a shared library. Language and acoustic models are bundled into downloadable packages. The KaldiRecognizer class processes audio incrementally and emits JSON results containing the transcribed text and, when word-level output is enabled, per-word confidence scores and timestamps.
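As a rough illustration of the JSON output described above, the sketch below parses a result string with word-level details. The sample payload is made up for illustration and the helper name summarize_result is hypothetical; the field names (text, result, word, start, end, conf) are assumed to match what the recognizer returns once word output has been enabled.

```python
import json

# Sample shaped like a recognizer result with word details enabled.
# The values are invented for illustration only.
SAMPLE_RESULT = json.dumps({
    "result": [
        {"conf": 1.0, "start": 0.84, "end": 1.11, "word": "one"},
        {"conf": 0.87, "start": 1.11, "end": 1.53, "word": "zero"},
    ],
    "text": "one zero",
})


def summarize_result(result_json):
    """Return the transcript, per-word timings, and the mean word confidence."""
    data = json.loads(result_json)
    words = data.get("result", [])  # absent when word details are disabled
    timings = [(w["word"], w["start"], w["end"]) for w in words]
    mean_conf = sum(w["conf"] for w in words) / len(words) if words else None
    return data.get("text", ""), timings, mean_conf


text, timings, mean_conf = summarize_result(SAMPLE_RESULT)
print(text)
for word, start, end in timings:
    print(f"{start:6.2f}-{end:6.2f}  {word}")
```

Treating the word list as optional keeps the same helper usable for results produced without word-level output.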
Self-Hosting & Configuration
- Install via pip, npm, NuGet, or Maven depending on your stack
- Download a pre-trained model from the Vosk model repository
- Point the Model constructor to the extracted model directory
- Set sample rate to match your audio source (typically 16000 Hz)
- Deploy vosk-server for WebSocket-based real-time transcription
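The local setup steps above can be sketched in Python roughly as follows. This is a minimal sketch, assuming the vosk package is installed and a model has been extracted to a local directory; the helper names check_wav and transcribe_wav are hypothetical, and the chunk size is an arbitrary choice.

```python
import json
import wave


def check_wav(wf):
    """Raise ValueError unless the file is mono, 16-bit, uncompressed PCM."""
    if wf.getnchannels() != 1 or wf.getsampwidth() != 2 or wf.getcomptype() != "NONE":
        raise ValueError("audio must be mono 16-bit PCM WAV")
    return wf.getframerate()


def transcribe_wav(wav_path, model_dir, chunk_frames=4000):
    """Transcribe one WAV file and return the recognized text."""
    # Imported lazily so check_wav stays usable without vosk installed.
    from vosk import Model, KaldiRecognizer

    with wave.open(wav_path, "rb") as wf:
        rate = check_wav(wf)
        model = Model(model_dir)            # path to the extracted model directory
        rec = KaldiRecognizer(model, rate)  # sample rate must match the audio
        pieces = []
        while True:
            data = wf.readframes(chunk_frames)
            if not data:
                break
            if rec.AcceptWaveform(data):    # true once an utterance is complete
                pieces.append(json.loads(rec.Result())["text"])
        # Flush any audio still buffered at end of file.
        pieces.append(json.loads(rec.FinalResult())["text"])
    return " ".join(p for p in pieces if p)
```

A call such as transcribe_wav("speech.wav", "model") would then return the transcript as a single string; both paths are placeholders for your own recording and model directory.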
Key Features
- Fully offline operation with no cloud dependency
- Small-footprint models for constrained hardware (50-300 MB)
- Word-level timestamps and confidence scores in JSON output
- Speaker identification through an optional speaker model that returns a voice embedding with each result
- WebSocket server mode for scalable deployments
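For the WebSocket server mode listed above, a client might look like the sketch below. It assumes a vosk-server instance is already listening on ws://localhost:2700 and that the third-party websockets package is installed. The message shapes (a config message with the sample rate, binary audio chunks, then an eof message) follow my understanding of the sample client shipped with vosk-server and should be checked against the server version you deploy. The function names are hypothetical.

```python
import json
import wave


def config_message(sample_rate):
    """First message: tells the server the sample rate of the audio that follows."""
    return json.dumps({"config": {"sample_rate": sample_rate}})


def eof_message():
    """Last message: asks the server to flush and return the final result."""
    return json.dumps({"eof": 1})


async def transcribe_remote(wav_path, uri="ws://localhost:2700"):
    """Stream a mono 16-bit WAV file to the server and return the final text."""
    import websockets  # third-party client library, assumed to be installed

    with wave.open(wav_path, "rb") as wf:
        rate = wf.getframerate()
        async with websockets.connect(uri) as ws:
            await ws.send(config_message(rate))
            while True:
                data = wf.readframes(rate // 5)  # about 0.2 s of audio per message
                if not data:
                    break
                await ws.send(data)
                await ws.recv()                  # partial or utterance result, ignored here
            await ws.send(eof_message())
            return json.loads(await ws.recv()).get("text", "")
```

The coroutine can be driven with asyncio.run(transcribe_remote("speech.wav")), where the file name is a placeholder. Reading one reply per chunk keeps the client in step with the server.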
Comparison with Similar Tools
- Whisper: generally more accurate but needs more compute; Vosk is better suited to edge devices
- DeepSpeech: discontinued; Vosk is actively maintained and covers more languages
- Google Speech-to-Text: cloud-only and paid; Vosk runs offline and is free
- whisper.cpp: efficient Whisper port, but lacks Vosk's streaming partial-result API
FAQ
Q: Does Vosk require a GPU? A: No. Vosk runs on CPU and is optimized for low-power devices.
Q: What audio formats does Vosk accept? A: Raw PCM audio (mono, 16-bit) at the sample rate passed to the recognizer. Use ffmpeg to convert other formats.
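One possible way to do that conversion from Python is to let ffmpeg decode to standard output and read the raw samples in chunks. This is a sketch that assumes ffmpeg is on the PATH; the helper names and the chunk size are hypothetical.

```python
import subprocess


def ffmpeg_pcm_command(src_path, sample_rate=16000):
    """Build an ffmpeg command that writes mono 16-bit PCM to stdout."""
    return [
        "ffmpeg", "-loglevel", "quiet",
        "-i", src_path,
        "-ar", str(sample_rate),  # resample to the recognizer's rate
        "-ac", "1",               # downmix to mono
        "-f", "s16le",            # raw signed 16-bit little-endian samples
        "-",                      # write to stdout
    ]


def read_pcm_chunks(src_path, sample_rate=16000, chunk_bytes=8000):
    """Yield raw PCM chunks decoded by ffmpeg (requires ffmpeg on PATH)."""
    with subprocess.Popen(ffmpeg_pcm_command(src_path, sample_rate),
                          stdout=subprocess.PIPE) as proc:
        while True:
            data = proc.stdout.read(chunk_bytes)
            if not data:
                break
            yield data
```

Each chunk yielded by read_pcm_chunks can be passed straight to the recognizer, provided the recognizer was created with the same sample rate.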
Q: Can I train a custom model? A: Yes. Vosk models are standard Kaldi models that can be trained with the Kaldi toolkit.
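Q: How does streaming work? A: Call AcceptWaveform in a loop with audio chunks. While it returns false, PartialResult gives the current hypothesis; when it returns true, Result gives the completed utterance. Call FinalResult once the input ends to flush the remaining audio.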
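The streaming loop can be sketched as a small generator. The recognizer argument is assumed to be a KaldiRecognizer (or any object exposing the same AcceptWaveform, PartialResult, Result, and FinalResult methods), and the name stream_results is hypothetical.

```python
import json


def stream_results(recognizer, chunks):
    """Yield ("partial" or "final", text) pairs as audio chunks are consumed."""
    for chunk in chunks:
        if recognizer.AcceptWaveform(chunk):
            # An utterance boundary was detected, so this result is settled.
            yield "final", json.loads(recognizer.Result())["text"]
        else:
            # Still mid-utterance: the hypothesis may change with more audio.
            yield "partial", json.loads(recognizer.PartialResult())["partial"]
    # Flush whatever audio is still buffered once the input ends.
    yield "final", json.loads(recognizer.FinalResult())["text"]
```

Taking the recognizer as a parameter keeps the loop independent of where the audio comes from, so the same function can serve a microphone, a file, or a network stream.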