Skills · May 12, 2026 · 2 min read

Vosk — Offline Speech Recognition API for Any Platform

Vosk provides offline speech recognition for Android, iOS, Raspberry Pi, and servers with support for 20+ languages, all without an internet connection.

Agent-ready

This asset can be read and installed directly by agents

TokRepo exposes a universal CLI command, an installation contract, JSON metadata, an adapter-specific plan, and the raw content to help agents assess fit, risk, and next steps.

Native · 98/100 · Policy: allow
Agent surface: Any MCP/CLI agent
Type: Skill
Installation: Single
Trust: Established
Entry point: Vosk Speech Recognition
Universal CLI command: npx tokrepo install 15ab68d2-4ddc-11f1-9bc6-00163e2b0d79

Introduction

Vosk is an offline speech recognition toolkit that runs entirely on-device without sending audio to the cloud. It wraps the Kaldi ASR engine into a developer-friendly API available in Python, Java, C#, Node.js, and more, enabling low-latency transcription on everything from a Raspberry Pi to a production server.

What Vosk Does

  • Transcribes audio to text in 20+ languages without internet
  • Provides real-time streaming recognition with partial results
  • Supports speaker identification alongside transcription
  • Runs on ARM devices including Raspberry Pi and Android
  • Offers lightweight models as small as 50 MB for embedded use

Architecture Overview

Vosk compiles Kaldi's finite-state transducer (FST) decoding pipeline into a shared library. Language and acoustic models ship as downloadable packages. The KaldiRecognizer class processes audio frames incrementally and emits JSON results containing the transcribed text; per-word confidence scores and timestamps are included when word output is enabled with SetWords(True).
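
As an illustration of the JSON output described above, the sketch below parses one result string into typed records. The sample payload is hand-written to mirror the shape Vosk emits when word output is enabled (a "text" field plus a "result" list with "word", "start", "end", and "conf"); the Word class and parse_result helper are hypothetical names introduced here, not part of the Vosk API.

```python
from __future__ import annotations

import json
from dataclasses import dataclass


@dataclass
class Word:
    """One recognized word with its timing (seconds) and confidence."""
    text: str
    start: float
    end: float
    conf: float


def parse_result(raw: str) -> tuple[str, list[Word]]:
    """Split a Vosk result JSON string into (full text, per-word records).

    The "result" list is only present when word output is enabled,
    so a missing key yields an empty word list.
    """
    data = json.loads(raw)
    words = [
        Word(w["word"], w["start"], w["end"], w["conf"])
        for w in data.get("result", [])
    ]
    return data.get("text", ""), words


# Hand-written sample in the assumed result shape (not real recognizer output).
SAMPLE = json.dumps({
    "result": [
        {"conf": 1.0, "start": 0.87, "end": 1.11, "word": "one"},
        {"conf": 0.92, "start": 1.11, "end": 1.53, "word": "zero"},
    ],
    "text": "one zero",
})

text, words = parse_result(SAMPLE)
for w in words:
    print(f"{w.start:6.2f}-{w.end:6.2f}  {w.conf:.2f}  {w.text}")
```

Keeping the parsing separate from recognition makes it easy to unit-test downstream logic such as subtitle generation without loading a model.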

Self-Hosting & Configuration

  • Install via pip, npm, NuGet, or Maven depending on your stack
  • Download a pre-trained model from the Vosk model repository
  • Point the Model constructor to the extracted model directory
  • Set sample rate to match your audio source (typically 16000 Hz)
  • Deploy vosk-server for WebSocket-based real-time transcription
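
The setup steps above can be sketched in Python roughly as follows, assuming the vosk package is installed (pip install vosk) and a model has been downloaded and extracted. The transcribe_wav helper and the 4000-frame read size are illustrative choices, not requirements of the library.

```python
import json
import wave


def transcribe_wav(wav_path: str, model_dir: str) -> str:
    """Transcribe a 16-bit mono PCM WAV file with a local Vosk model."""
    with wave.open(wav_path, "rb") as wf:
        # Reject anything that is not uncompressed 16-bit mono audio.
        if (wf.getnchannels() != 1 or wf.getsampwidth() != 2
                or wf.getcomptype() != "NONE"):
            raise ValueError("expected a 16-bit mono PCM WAV file")

        # Imported here so the format check above runs even where
        # vosk is not installed.
        from vosk import KaldiRecognizer, Model

        model = Model(model_dir)  # path to the extracted model directory
        # The recognizer's sample rate must match the audio source.
        rec = KaldiRecognizer(model, wf.getframerate())

        pieces = []
        while True:
            data = wf.readframes(4000)
            if not data:
                break
            # True means an utterance boundary was reached.
            if rec.AcceptWaveform(data):
                pieces.append(json.loads(rec.Result())["text"])
        # Flush whatever audio is still buffered.
        pieces.append(json.loads(rec.FinalResult())["text"])

    return " ".join(p for p in pieces if p)
```

A call such as transcribe_wav("speech.wav", "path/to/model") would return the recognized text; both arguments here are placeholder paths.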

Key Features

  • Fully offline operation with no cloud dependency
  • Small-footprint models for constrained hardware (50-300 MB)
  • Word-level timestamps and confidence scores in JSON output
  • Speaker identification through an optional speaker model that adds a speaker vector to each result
  • WebSocket server mode for scalable deployments
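
For the speaker feature listed above, one common way to decide whether two utterances come from the same voice is to compare their speaker vectors by cosine distance. The sketch below assumes the recognizer was configured with a Vosk speaker model so that each result carries a list of floats (the "spk" field); the same_speaker helper and its 0.5 threshold are hypothetical and would need tuning on real data.

```python
import math
from typing import Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity; 0 means the vectors point the same way."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


def same_speaker(a: Sequence[float], b: Sequence[float],
                 threshold: float = 0.5) -> bool:
    """Treat two speaker vectors as one speaker if they are close enough.

    The default threshold is an illustrative assumption, not a value
    recommended by Vosk.
    """
    return cosine_distance(a, b) < threshold
```

Grouping utterances this way is a do-it-yourself approximation of diarization; the recognizer itself only supplies the vectors.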

Comparison with Similar Tools

  • Whisper: higher accuracy but needs more compute; Vosk is better suited to edge devices
  • DeepSpeech: discontinued; Vosk is actively maintained and supports more languages
  • Google Speech-to-Text: cloud-only and paid; Vosk runs offline and is free to use
  • whisper.cpp: an efficient Whisper port, but it lacks Vosk's streaming partial-result API

FAQ

Q: Does Vosk require a GPU? A: No. Vosk runs on CPU and is optimized for low-power devices.

Q: What audio formats does Vosk accept? A: Raw 16-bit mono PCM at the sample rate passed to the recognizer. Use ffmpeg to convert other formats.
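
A possible way to do that conversion from Python is to let ffmpeg decode any input to raw PCM on standard output and read it in chunks. This sketch assumes an ffmpeg binary is on the PATH; ffmpeg_pcm_command and pcm_chunks are hypothetical helper names, and the 8000-byte chunk size is arbitrary.

```python
import subprocess
from typing import Iterator, List


def ffmpeg_pcm_command(source: str, sample_rate: int = 16000) -> List[str]:
    """Build an ffmpeg command that writes 16-bit mono PCM to stdout."""
    return [
        "ffmpeg", "-loglevel", "quiet",
        "-i", source,             # any format ffmpeg can decode
        "-ar", str(sample_rate),  # resample to the recognizer's rate
        "-ac", "1",               # downmix to mono
        "-f", "s16le",            # raw signed 16-bit little-endian samples
        "-",                      # write to stdout instead of a file
    ]


def pcm_chunks(source: str, sample_rate: int = 16000,
               chunk_bytes: int = 8000) -> Iterator[bytes]:
    """Yield raw PCM chunks decoded by ffmpeg, ready for a recognizer."""
    cmd = ffmpeg_pcm_command(source, sample_rate)
    with subprocess.Popen(cmd, stdout=subprocess.PIPE) as proc:
        while True:
            data = proc.stdout.read(chunk_bytes)
            if not data:
                break
            yield data
```

Each chunk yielded by pcm_chunks can be passed straight to a recognizer created with the same sample rate.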

Q: Can I train a custom model? A: Yes. Vosk models are standard Kaldi models that can be trained with the Kaldi toolkit.

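Q: How does streaming work? A: Call AcceptWaveform in a loop with audio chunks. While it returns false, PartialResult gives the in-progress hypothesis; when it returns true, an utterance has ended and Result returns the finalized text. Call FinalResult after the last chunk to flush any remaining audio.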
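
The streaming loop can be sketched as a small generator. It takes any object with the KaldiRecognizer methods AcceptWaveform, PartialResult, Result, and FinalResult, so it works with a real recognizer or a test double; stream_results and the "partial"/"final" labels are names chosen for this example.

```python
import json
from typing import Iterable, Iterator, Tuple


def stream_results(recognizer, chunks: Iterable[bytes]) -> Iterator[Tuple[str, str]]:
    """Feed audio chunks to a recognizer and yield (kind, text) events.

    kind is "partial" for an in-progress hypothesis and "final" for a
    completed utterance. Empty strings are skipped.
    """
    for chunk in chunks:
        if recognizer.AcceptWaveform(chunk):
            # An utterance boundary was detected: the text is now stable.
            text = json.loads(recognizer.Result()).get("text", "")
            if text:
                yield ("final", text)
        else:
            # Still inside an utterance: report the current best guess.
            partial = json.loads(recognizer.PartialResult()).get("partial", "")
            if partial:
                yield ("partial", partial)

    # Flush audio that never reached an utterance boundary.
    text = json.loads(recognizer.FinalResult()).get("text", "")
    if text:
        yield ("final", text)
```

A caller would typically overwrite the displayed line on each "partial" event and commit it on "final", which is what gives live captions their updating feel.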
