# Vosk — Offline Speech Recognition API for Any Platform

> Vosk provides offline speech recognition for Android, iOS, Raspberry Pi, and servers with support for 20+ languages, all without an internet connection.

## Install

```bash
pip install vosk
```

## Quick Use

```bash
python -c "
from vosk import Model, KaldiRecognizer
import wave

model = Model(model_name='vosk-model-small-en-us-0.15')
wf = wave.open('test.wav', 'rb')
rec = KaldiRecognizer(model, wf.getframerate())
while True:
    data = wf.readframes(4000)
    if len(data) == 0:
        break
    rec.AcceptWaveform(data)
print(rec.FinalResult())
"
```

## Introduction

Vosk is an offline speech recognition toolkit that runs entirely on-device without sending audio to the cloud. It wraps the Kaldi ASR engine in a developer-friendly API available for Python, Java, C#, Node.js, and more, enabling low-latency transcription on everything from a Raspberry Pi to a production server.

## What Vosk Does

- Transcribes audio to text in 20+ languages without internet
- Provides real-time streaming recognition with partial results
- Supports speaker identification alongside transcription
- Runs on ARM devices including Raspberry Pi and Android
- Offers lightweight models as small as 50 MB for embedded use

## Architecture Overview

Vosk uses Kaldi's finite-state transducer decoding pipeline compiled into a shared library. Language and acoustic models are bundled into downloadable packages. The KaldiRecognizer class processes audio frames incrementally and emits JSON results with transcribed text, confidence scores, and word-level timestamps.
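The JSON results described above can be consumed with plain standard-library code. A minimal sketch (the helper name `words_from_result` and the sample payload are illustrative; the field layout of `text`, `result`, and per-word `word`/`start`/`end`/`conf` entries follows the output KaldiRecognizer produces when word-level output is enabled with `SetWords(True)`):

```python
import json

def words_from_result(result_json: str):
    """Extract (word, start, end, confidence) tuples from a Vosk result.

    The "result" array is only present when the recognizer was
    configured with rec.SetWords(True); otherwise only "text" is set.
    """
    payload = json.loads(result_json)
    return [
        (w["word"], w["start"], w["end"], w["conf"])
        for w in payload.get("result", [])
    ]

# A hand-written payload in the shape Vosk emits for a final result:
sample = json.dumps({
    "result": [
        {"word": "hello", "start": 0.30, "end": 0.62, "conf": 0.98},
        {"word": "world", "start": 0.70, "end": 1.10, "conf": 0.95},
    ],
    "text": "hello world",
})
print(words_from_result(sample))
```

In a real pipeline the string passed in would come from `rec.Result()` or `rec.FinalResult()` rather than a hand-built sample.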
## Self-Hosting & Configuration

- Install via pip, npm, NuGet, or Maven depending on your stack
- Download a pre-trained model from the Vosk model repository
- Point the Model constructor at the extracted model directory
- Set the sample rate to match your audio source (typically 16000 Hz)
- Deploy vosk-server for WebSocket-based real-time transcription

## Key Features

- Fully offline operation with no cloud dependency
- Small-footprint models for constrained hardware (50-300 MB)
- Word-level timestamps and confidence scores in JSON output
- Speaker diarization to identify who is speaking
- WebSocket server mode for scalable deployments

## Comparison with Similar Tools

- **Whisper** — higher accuracy but requires more compute; Vosk excels on edge devices
- **DeepSpeech** — discontinued; Vosk is actively maintained with broader language support
- **Google Speech-to-Text** — cloud-only and paid; Vosk runs offline and free
- **whisper.cpp** — efficient Whisper port but lacks Vosk's streaming partial-result API

## FAQ

**Q: Does Vosk require a GPU?**
A: No. Vosk runs on CPU and is optimized for low-power devices.

**Q: What audio formats does Vosk accept?**
A: Raw PCM audio (mono, 16-bit). Use ffmpeg to convert other formats.

**Q: Can I train a custom model?**
A: Yes. Vosk models are standard Kaldi models that can be trained with the Kaldi toolkit.

**Q: How does streaming work?**
A: Call AcceptWaveform in a loop with audio chunks; partial results arrive immediately.

## Sources

- https://github.com/alphacep/vosk-api
- https://alphacephei.com/vosk/
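The streaming loop described in the FAQ can be sketched as a small driver function. This is a sketch, not part of the Vosk API: the name `transcribe_stream` and the `read_chunk` callback are my own, while `AcceptWaveform`, `Result`, `PartialResult`, and `FinalResult` are the real KaldiRecognizer methods. `AcceptWaveform` returns true when the endpointer closes an utterance, at which point `Result()` holds a finished segment; otherwise `PartialResult()` holds the running hypothesis.

```python
import json

def transcribe_stream(rec, read_chunk, chunk_frames=4000):
    """Drive a KaldiRecognizer-style object over a chunked audio source.

    `rec` must expose AcceptWaveform/Result/PartialResult/FinalResult;
    `read_chunk(n)` returns up to n frames of raw PCM and b"" at end of
    stream. Yields the final text of each utterance the endpointer
    closes, then the trailing segment.
    """
    while True:
        data = read_chunk(chunk_frames)
        if len(data) == 0:
            break
        if rec.AcceptWaveform(data):
            # Endpoint detected: a finished utterance is available.
            yield json.loads(rec.Result())["text"]
        else:
            # No endpoint yet: a partial hypothesis is available
            # immediately, e.g. for live captioning.
            _ = json.loads(rec.PartialResult())["partial"]
    # Flush whatever audio remains after the last chunk.
    yield json.loads(rec.FinalResult())["text"]
```

With a real model this would be driven much like the Quick Use snippet, e.g. `list(transcribe_stream(KaldiRecognizer(model, wf.getframerate()), wf.readframes))`.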