Configs · May 19, 2026 · 3 min read

SenseVoice — Multilingual Speech Understanding Model

SenseVoice is an open-source speech foundation model by Alibaba's FunAudioLLM team that performs automatic speech recognition, language identification, speech emotion recognition, and audio event detection in a single model. It supports 50+ languages and runs significantly faster than Whisper.

Agent-ready

This asset can be read and installed directly by agents.

TokRepo exposes a universal CLI command, an install contract, JSON metadata, an adapter-specific plan, and the raw content to help agents assess fit, risk, and next actions.

Native · 98/100 · Policy: allow
Agent surface: Any MCP/CLI agent
Type: Skill
Installation: Single
Trust: Established
Entry point: SenseVoice Overview
Universal CLI command: npx tokrepo install fe36c7c0-537e-11f1-9bc6-00163e2b0d79

Introduction

SenseVoice goes beyond speech-to-text by combining ASR with speech emotion recognition, spoken language identification, and audio event detection in a single forward pass. Trained on over 400,000 hours of data, it achieves high accuracy across 50+ languages with inference speeds far exceeding Whisper.

What SenseVoice Does

  • Transcribes speech in 50+ languages with high accuracy
  • Detects the spoken language automatically from audio input
  • Recognizes speaker emotions (happy, sad, angry, neutral, etc.) from voice
  • Identifies non-speech audio events like applause, laughter, music, and crying
  • Provides all four capabilities simultaneously in a single inference call

Architecture Overview

SenseVoice uses an encoder-only Transformer architecture with multi-task prediction heads. The shared audio encoder processes mel-spectrogram features through a stack of Conformer blocks. Task-specific output heads branch from the shared representation to produce ASR tokens, language labels, emotion labels, and audio event labels. The SenseVoice-Small variant has a parameter count comparable to Whisper-Small but achieves significantly lower latency through non-autoregressive decoding.
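To make the latency argument concrete, here is a minimal sketch of one common non-autoregressive decoding scheme, CTC-style greedy decoding. The text above does not say which scheme SenseVoice uses, so treat CTC as an assumption made for illustration; the function name, the toy vocabulary, and the frame scores are all hypothetical.

```python
from typing import List, Sequence

BLANK = 0  # index reserved for the CTC blank symbol (assumed convention)


def ctc_greedy_decode(frame_scores: Sequence[Sequence[float]],
                      vocab: Sequence[str]) -> str:
    """Pick the best symbol per frame, merge repeats, then drop blanks."""
    # Every frame is scored independently, so no step has to wait for a
    # previously generated token (unlike autoregressive decoding).
    best = [max(range(len(scores)), key=scores.__getitem__)
            for scores in frame_scores]

    tokens: List[str] = []
    prev = BLANK
    for idx in best:
        # Emit a symbol only when it differs from the previous frame
        # and is not the blank.
        if idx != prev and idx != BLANK:
            tokens.append(vocab[idx])
        prev = idx
    return "".join(tokens)


# Toy example: 3-symbol vocabulary and 4 frames of made-up scores.
vocab = ["<blank>", "h", "i"]
frames = [
    [0.10, 0.80, 0.10],  # "h"
    [0.10, 0.70, 0.20],  # "h" again, merged with the previous frame
    [0.90, 0.05, 0.05],  # blank
    [0.20, 0.10, 0.70],  # "i"
]
decoded = ctc_greedy_decode(frames, vocab)
print(decoded)
```

Because the per-frame argmax has no dependency between positions, the whole sequence can be produced from one encoder pass, which is the property the paragraph above credits for the lower latency.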

Self-Hosting & Configuration

  • Install via FunASR: pip install funasr (Python 3.8+)
  • Models download automatically from ModelScope or Hugging Face on first use
  • Available in two sizes: SenseVoice-Small (fast, lightweight) and SenseVoice-Large (higher accuracy)
  • Set language='auto' for automatic language detection or specify a language code
  • Deploy in production using FunASR's gRPC/WebSocket server for concurrent requests
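The setup steps above can be sketched as a small Python helper. This is an illustration under assumptions, not a definitive recipe: the model id `iic/SenseVoiceSmall`, the `use_itn` option, and the shape of the returned result are taken from typical FunASR usage and should be checked against the model card and your installed FunASR version. The helper names `build_generate_kwargs` and `transcribe` are hypothetical.

```python
from typing import Any, Dict

# Assumed ModelScope model id; confirm it on the model card before use.
MODEL_ID = "iic/SenseVoiceSmall"


def build_generate_kwargs(language: str = "auto",
                          use_itn: bool = True) -> Dict[str, Any]:
    """Collect inference options; language='auto' enables language detection."""
    return {"language": language, "use_itn": use_itn, "cache": {}}


def transcribe(audio_path: str, language: str = "auto",
               device: str = "cpu") -> str:
    """Load the model and return the raw (tagged) transcript for one file."""
    # Imported lazily so the kwargs helper works without FunASR installed.
    from funasr import AutoModel

    # The model is downloaded on first use. Loading it on every call is
    # wasteful; a real service would create it once and reuse it.
    model = AutoModel(model=MODEL_ID, device=device)
    result = model.generate(input=audio_path,
                            **build_generate_kwargs(language))
    # Assumed result shape: a list of dicts with a "text" field.
    return result[0]["text"]


options = build_generate_kwargs("en")
print(options["language"])
```

Calling `transcribe("meeting.wav")` (a hypothetical file) would run detection with `language="auto"`; passing a language code instead skips detection.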

Key Features

  • Unified model handles ASR, language ID, emotion, and audio events without separate pipelines
  • SenseVoice-Small inference is reported to be about 5x faster than Whisper-Small and 15x faster than Whisper-Large
  • Supports rich transcription with emotion and event tags embedded in output
  • Works well on noisy audio and multi-speaker scenarios
  • Fine-tunable on domain-specific data using FunASR training scripts

Comparison with Similar Tools

  • Whisper (OpenAI) — strong multilingual ASR but autoregressive and slower; SenseVoice adds emotion and event detection
  • Faster Whisper — accelerated Whisper inference; SenseVoice is natively faster due to non-autoregressive architecture
  • FunASR Paraformer — non-autoregressive ASR; SenseVoice adds multi-task understanding beyond transcription
  • wav2vec 2.0 — self-supervised speech representation; SenseVoice is a complete end-to-end recognition system
  • WhisperX — adds word-level timestamps to Whisper; SenseVoice provides emotion and event detection instead

FAQ

Q: How does SenseVoice compare to Whisper in accuracy? A: SenseVoice matches or exceeds Whisper on standard benchmarks for supported languages, while running significantly faster.

Q: Can I use SenseVoice for real-time applications? A: Yes. SenseVoice-Small is fast enough for real-time transcription, and FunASR's server supports streaming WebSocket connections.

Q: What format does the emotion output take? A: Emotion labels are returned as tags (e.g., <|HAPPY|>, <|SAD|>) alongside the transcription text.
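For downstream code it is often handy to separate those tags from the plain text. The sketch below assumes tags are written as `<|NAME|>` and appear inline in the transcript string. That format, the sample string, and the helper names are assumptions for illustration, so verify them against real model output.

```python
import re
from typing import List, Optional, Tuple

# Assumed tag syntax: <|NAME|>, e.g. <|en|>, <|HAPPY|>, <|Speech|>.
TAG = re.compile(r"<\|([^|]*)\|>")

# Emotion names taken from the list earlier on this page.
EMOTIONS = {"HAPPY", "SAD", "ANGRY", "NEUTRAL"}


def split_rich_transcript(raw: str) -> Tuple[List[str], str]:
    """Return (tag names in order of appearance, text with tags removed)."""
    tags = TAG.findall(raw)
    text = TAG.sub("", raw).strip()
    return tags, text


def first_emotion(tags: List[str]) -> Optional[str]:
    """Return the first tag that names a known emotion, if any."""
    for tag in tags:
        if tag.upper() in EMOTIONS:
            return tag.upper()
    return None


# Made-up sample in the assumed format.
sample = "<|en|><|HAPPY|><|Speech|>thanks for calling"
tags, text = split_rich_transcript(sample)
print(tags, text, first_emotion(tags))
```

Keeping the tag list in order preserves whatever convention the model uses for language, emotion, and event positions without hard-coding it.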

Q: Is commercial use permitted? A: SenseVoice models are released under permissive licenses. Check the specific model card on ModelScope or Hugging Face for license details.
