Scripts · May 12, 2026 · 2 min read

Vosk — Offline Speech Recognition API for Any Platform

Vosk provides offline speech recognition for Android, iOS, Raspberry Pi, and servers with support for 20+ languages, all without an internet connection.

Introduction

Vosk is an offline speech recognition toolkit that runs entirely on-device without sending audio to the cloud. It wraps the Kaldi ASR engine into a developer-friendly API available in Python, Java, C#, Node.js, and more, enabling low-latency transcription on everything from a Raspberry Pi to a production server.

What Vosk Does

  • Transcribes audio to text in 20+ languages without internet
  • Provides real-time streaming recognition with partial results
  • Supports speaker identification alongside transcription
  • Runs on ARM devices including Raspberry Pi and Android
  • Offers lightweight models as small as 50 MB for embedded use

Architecture Overview

Vosk compiles Kaldi's finite-state transducer decoding pipeline into a shared library that each language binding calls into. Acoustic and language models ship together as downloadable packages. The KaldiRecognizer class processes audio incrementally and emits JSON results containing the transcribed text; per-word confidence scores and timestamps are included when word-level output is enabled with SetWords.
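
To make the JSON output concrete, here is a minimal sketch of reading a final result. The payload follows the shape Vosk emits when word-level output is enabled, but the values are made up for illustration, and summarize_result is a hypothetical helper, not part of the Vosk API.

```python
import json

# Illustrative payload only: the keys match a Vosk final result with
# word-level output enabled, the values are invented for this example.
SAMPLE_RESULT = """
{
  "result": [
    {"conf": 1.0, "start": 0.12, "end": 0.45, "word": "hello"},
    {"conf": 0.87, "start": 0.45, "end": 0.9, "word": "world"}
  ],
  "text": "hello world"
}
"""


def summarize_result(raw_json):
    """Return (text, words), where words is a list of (word, start, end, conf)."""
    payload = json.loads(raw_json)
    words = [
        (w["word"], w["start"], w["end"], w["conf"])
        # "result" is absent when word-level output is off, so default to [].
        for w in payload.get("result", [])
    ]
    return payload.get("text", ""), words
```

In a real program the string would come from the recognizer's Result or FinalResult call rather than from a constant.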

Self-Hosting & Configuration

  • Install via pip, npm, NuGet, or Maven depending on your stack
  • Download a pre-trained model from the Vosk model repository
  • Point the Model constructor to the extracted model directory
  • Pass the recognizer a sample rate that matches your audio source (typically 16000 Hz)
  • Deploy vosk-server for WebSocket-based real-time transcription
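
The model and sample-rate steps above can be sketched in Python roughly as follows. This assumes the vosk package is installed and a model has already been extracted; check_wav_format and transcribe_wav are hypothetical helper names, and the chunk size of 4000 frames is an arbitrary choice.

```python
import json
import wave


def check_wav_format(wf):
    """Return the sample rate, or raise ValueError unless wf is mono 16-bit PCM."""
    if wf.getnchannels() != 1 or wf.getsampwidth() != 2 or wf.getcomptype() != "NONE":
        raise ValueError("expected a mono 16-bit PCM WAV file")
    return wf.getframerate()


def transcribe_wav(model_dir, wav_path, chunk_frames=4000):
    """Transcribe one WAV file using the model extracted at model_dir."""
    # Imported lazily so check_wav_format stays usable without Vosk installed.
    from vosk import Model, KaldiRecognizer

    with wave.open(wav_path, "rb") as wf:
        sample_rate = check_wav_format(wf)
        # The recognizer is created with the file's own sample rate.
        recognizer = KaldiRecognizer(Model(model_dir), sample_rate)
        pieces = []
        while True:
            data = wf.readframes(chunk_frames)
            if not data:
                break
            if recognizer.AcceptWaveform(data):
                pieces.append(json.loads(recognizer.Result())["text"])
        # FinalResult flushes whatever audio is still buffered.
        pieces.append(json.loads(recognizer.FinalResult())["text"])
    return " ".join(p for p in pieces if p)
```

A call might look like transcribe_wav("path/to/extracted-model", "speech.wav"), where both paths are placeholders for your own files.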

Key Features

  • Fully offline operation with no cloud dependency
  • Small-footprint models for constrained hardware (50-300 MB)
  • Word-level timestamps and confidence scores in JSON output
  • Speaker identification through an optional speaker model, to help tell who is speaking
  • WebSocket server mode for scalable deployments
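
For the WebSocket server mode, a client might look like the sketch below. It assumes a vosk-server instance is already listening on ws://localhost:2700 and that the third-party websockets package is installed; the message shapes follow the sample clients shipped with vosk-server, and config_message and transcribe_over_websocket are hypothetical helper names.

```python
import json
import wave


def config_message(sample_rate):
    """Build the JSON text message that tells the server the audio sample rate."""
    return json.dumps({"config": {"sample_rate": sample_rate}})


# Sent after the last audio chunk to request the final result.
EOF_MESSAGE = json.dumps({"eof": 1})


async def transcribe_over_websocket(wav_path, uri="ws://localhost:2700"):
    """Stream a WAV file to the server and return the decoded JSON replies."""
    import websockets  # third-party; imported lazily (assumed installed)

    replies = []
    with wave.open(wav_path, "rb") as wf:
        async with websockets.connect(uri) as ws:
            await ws.send(config_message(wf.getframerate()))
            frames_per_chunk = int(wf.getframerate() * 0.2)  # about 0.2 s of audio
            while True:
                data = wf.readframes(frames_per_chunk)
                if not data:
                    break
                await ws.send(data)  # audio goes out as binary frames
                replies.append(json.loads(await ws.recv()))
            await ws.send(EOF_MESSAGE)
            replies.append(json.loads(await ws.recv()))
    return replies
```

The coroutine can be driven with asyncio.run(transcribe_over_websocket("speech.wav")), where the file name is a placeholder.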

Comparison with Similar Tools

  • Whisper: higher accuracy but requires more compute; Vosk excels on edge devices
  • DeepSpeech: discontinued; Vosk is actively maintained and supports more languages
  • Google Speech-to-Text: cloud-only and paid; Vosk runs offline and is free
  • whisper.cpp: an efficient Whisper port, but without Vosk's streaming partial-result API

FAQ

Q: Does Vosk require a GPU? A: No. Vosk runs on CPU and is optimized for low-power devices.

Q: What audio formats does Vosk accept? A: Raw PCM audio (mono, 16-bit). Use ffmpeg to convert other formats.
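
As a rough illustration of the ffmpeg conversion, the sketch below builds a command that decodes any input ffmpeg understands into raw mono 16-bit PCM on stdout. It assumes ffmpeg is on the PATH; ffmpeg_pcm_command and decode_to_pcm are hypothetical helper names, and 16000 Hz is only a default.

```python
import subprocess


def ffmpeg_pcm_command(source_path, sample_rate=16000):
    """Build an ffmpeg command that writes mono 16-bit PCM to stdout."""
    return [
        "ffmpeg", "-nostdin", "-loglevel", "error",
        "-i", source_path,
        "-ac", "1",               # downmix to one channel
        "-ar", str(sample_rate),  # resample to the recognizer's rate
        "-f", "s16le",            # raw signed 16-bit little-endian, no header
        "-",                      # write to stdout instead of a file
    ]


def decode_to_pcm(source_path, sample_rate=16000):
    """Run ffmpeg and return the raw PCM bytes (raises if ffmpeg fails)."""
    completed = subprocess.run(
        ffmpeg_pcm_command(source_path, sample_rate),
        stdout=subprocess.PIPE,
        check=True,
    )
    return completed.stdout
```

The returned bytes can be sliced into chunks and passed to the recognizer, provided the recognizer was created with the same sample rate.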

Q: Can I train a custom model? A: Yes. Vosk models are standard Kaldi models that can be trained with the Kaldi toolkit.

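Q: How does streaming work? A: Call AcceptWaveform in a loop with audio chunks. While it returns false, PartialResult gives the in-progress hypothesis; when it returns true, Result gives the finalized utterance. Call FinalResult once the stream ends.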
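
A minimal sketch of that loop is shown below. stream_transcripts is a hypothetical helper; it expects an object with the KaldiRecognizer methods AcceptWaveform, PartialResult, Result, and FinalResult, plus any iterable of raw PCM byte chunks.

```python
import json


def stream_transcripts(recognizer, chunks):
    """Yield ("partial", text) and ("final", text) events for an audio stream."""
    for chunk in chunks:
        if recognizer.AcceptWaveform(chunk):
            # True: an utterance boundary was detected, so the result is final.
            yield "final", json.loads(recognizer.Result())["text"]
        else:
            # False: still mid-utterance, so ask for the running hypothesis.
            yield "partial", json.loads(recognizer.PartialResult())["partial"]
    # Flush whatever is still buffered once the stream ends.
    yield "final", json.loads(recognizer.FinalResult())["text"]
```

Writing the loop as a generator keeps the recognizer logic separate from the audio source, so the same function can be fed from a file, a microphone callback queue, or a network socket.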
