# Vosk — Offline Speech Recognition API for Any Platform

> Vosk provides offline speech recognition for Android, iOS, Raspberry Pi, and servers with support for 20+ languages, all without an internet connection.

## Install

```bash
pip install vosk
```

## Quick Use

```bash
python -c "
from vosk import Model, KaldiRecognizer
import wave

model = Model(model_name='vosk-model-small-en-us-0.15')
wf = wave.open('test.wav', 'rb')
rec = KaldiRecognizer(model, wf.getframerate())
while True:
    data = wf.readframes(4000)
    if len(data) == 0:
        break
    rec.AcceptWaveform(data)
print(rec.FinalResult())
"
```

## Introduction

Vosk is an offline speech recognition toolkit that runs entirely on-device without sending audio to the cloud. It wraps the Kaldi ASR engine in a developer-friendly API available for Python, Java, C#, Node.js, and more, enabling low-latency transcription on everything from a Raspberry Pi to a production server.

## What Vosk Does

- Transcribes audio to text in 20+ languages without internet
- Provides real-time streaming recognition with partial results
- Supports speaker identification alongside transcription
- Runs on ARM devices including Raspberry Pi and Android
- Offers lightweight models as small as 50 MB for embedded use

## Architecture Overview

Vosk uses Kaldi's finite-state transducer decoding pipeline compiled into a shared library. Language and acoustic models are bundled into downloadable packages. The KaldiRecognizer class processes audio frames incrementally and emits JSON results with transcribed text, confidence scores, and word-level timestamps.
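The JSON results described above can be consumed with plain standard-library code. A minimal sketch (the helper name `words_from_result` and the sample payload are illustrative; the field layout of `text`, `result`, and per-word `word`/`start`/`end`/`conf` entries follows the output KaldiRecognizer produces when word-level output is enabled with `SetWords(True)`):

```python
import json

def words_from_result(result_json: str):
    """Extract (word, start, end, confidence) tuples from a Vosk result.

    The "result" array is only present when the recognizer was
    configured with rec.SetWords(True); otherwise only "text" is set.
    """
    payload = json.loads(result_json)
    return [
        (w["word"], w["start"], w["end"], w["conf"])
        for w in payload.get("result", [])
    ]

# A hand-written payload in the shape Vosk emits for a final result:
sample = json.dumps({
    "result": [
        {"word": "hello", "start": 0.30, "end": 0.62, "conf": 0.98},
        {"word": "world", "start": 0.70, "end": 1.10, "conf": 0.95},
    ],
    "text": "hello world",
})
print(words_from_result(sample))
```

In a real pipeline the string passed in would come from `rec.Result()` or `rec.FinalResult()` rather than a hand-built sample.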
## Self-Hosting & Configuration

- Install via pip, npm, NuGet, or Maven depending on your stack
- Download a pre-trained model from the Vosk model repository
- Point the Model constructor at the extracted model directory
- Set the sample rate to match your audio source (typically 16000 Hz)
- Deploy vosk-server for WebSocket-based real-time transcription

## Key Features

- Fully offline operation with no cloud dependency
- Small-footprint models for constrained hardware (50-300 MB)
- Word-level timestamps and confidence scores in JSON output
- Speaker diarization to identify who is speaking
- WebSocket server mode for scalable deployments

## Comparison with Similar Tools

- **Whisper** — higher accuracy but requires more compute; Vosk excels on edge devices
- **DeepSpeech** — discontinued; Vosk is actively maintained with broader language support
- **Google Speech-to-Text** — cloud-only and paid; Vosk runs offline and free
- **whisper.cpp** — efficient Whisper port but lacks Vosk's streaming partial-result API

## FAQ

**Q: Does Vosk require a GPU?**
A: No. Vosk runs on CPU and is optimized for low-power devices.

**Q: What audio formats does Vosk accept?**
A: Raw PCM audio (mono, 16-bit). Use ffmpeg to convert other formats.

**Q: Can I train a custom model?**
A: Yes. Vosk models are standard Kaldi models that can be trained with the Kaldi toolkit.

**Q: How does streaming work?**
A: Call AcceptWaveform in a loop with audio chunks; partial results arrive immediately.

## Sources

- https://github.com/alphacep/vosk-api
- https://alphacephei.com/vosk/
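The streaming loop described in the FAQ can be sketched as a small driver function. This is a sketch, not part of the Vosk API: the name `transcribe_stream` and the `read_chunk` callback are my own, while `AcceptWaveform`, `Result`, `PartialResult`, and `FinalResult` are the real KaldiRecognizer methods. `AcceptWaveform` returns true when the endpointer closes an utterance, at which point `Result()` holds a finished segment; otherwise `PartialResult()` holds the running hypothesis.

```python
import json

def transcribe_stream(rec, read_chunk, chunk_frames=4000):
    """Drive a KaldiRecognizer-style object over a chunked audio source.

    `rec` must expose AcceptWaveform/Result/PartialResult/FinalResult;
    `read_chunk(n)` returns up to n frames of raw PCM and b"" at end of
    stream. Yields the final text of each utterance the endpointer
    closes, then the trailing segment.
    """
    while True:
        data = read_chunk(chunk_frames)
        if len(data) == 0:
            break
        if rec.AcceptWaveform(data):
            # Endpoint detected: a finished utterance is available.
            yield json.loads(rec.Result())["text"]
        else:
            # No endpoint yet: a partial hypothesis is available
            # immediately, e.g. for live captioning.
            _ = json.loads(rec.PartialResult())["partial"]
    # Flush whatever audio remains after the last chunk.
    yield json.loads(rec.FinalResult())["text"]
```

With a real model this would be driven much like the Quick Use snippet, e.g. `list(transcribe_stream(KaldiRecognizer(model, wf.getframerate()), wf.readframes))`.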