Configs · May 19, 2026 · 3 min read

SenseVoice — Multilingual Speech Understanding Model

SenseVoice is an open-source speech foundation model by Alibaba's FunAudioLLM team that performs automatic speech recognition, language identification, speech emotion recognition, and audio event detection in a single model. It supports 50+ languages and runs significantly faster than Whisper.

Universal CLI install command
npx tokrepo install fe36c7c0-537e-11f1-9bc6-00163e2b0d79

Introduction

SenseVoice goes beyond speech-to-text by combining ASR with speech emotion recognition, spoken language identification, and audio event detection in a single forward pass. Trained on over 400,000 hours of data, it covers 50+ languages, and its Small variant is reported to run about 5x faster than Whisper-Small and 15x faster than Whisper-Large.

What SenseVoice Does

  • Transcribes speech in 50+ languages with high accuracy
  • Detects the spoken language automatically from audio input
  • Recognizes speaker emotions (happy, sad, angry, neutral, etc.) from voice
  • Identifies non-speech audio events like applause, laughter, music, and crying
  • Provides all four capabilities simultaneously in a single inference call

Architecture Overview

SenseVoice-Small is an encoder-only model with non-autoregressive decoding. A shared audio encoder processes log-mel filterbank features through a stack of memory-equipped self-attention (SAN-M) blocks, and the ASR tokens, language labels, emotion labels, and audio event labels are all predicted from that shared representation. Its parameter count is comparable to Whisper-Small, but because it emits the whole transcript in one pass instead of token by token, its latency is significantly lower. SenseVoice-Large, by contrast, is described as an autoregressive encoder-decoder model that trades speed for accuracy.
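To illustrate why non-autoregressive decoding lowers latency, here is a minimal sketch of greedy CTC-style decoding, where each frame's label is chosen independently and the sequence is then collapsed. CTC is assumed here as a representative non-autoregressive scheme; the toy vocabulary, the blank index, and the scores are made up for illustration and are not SenseVoice's actual output layer.

```python
from itertools import groupby

BLANK = 0  # assumed index of the CTC blank symbol


def ctc_greedy_decode(frame_scores, vocab):
    """Pick the best label per frame, merge repeats, then drop blanks."""
    # Every frame is scored independently, so there is no token-by-token
    # feedback loop as in autoregressive decoding.
    best = [max(range(len(scores)), key=scores.__getitem__) for scores in frame_scores]
    collapsed = [label for label, _ in groupby(best)]
    return "".join(vocab[label] for label in collapsed if label != BLANK)


# Toy example: 6 frames over a hypothetical 4-symbol vocabulary.
vocab = ["", "h", "i", "!"]
frame_scores = [
    [0.10, 0.80, 0.05, 0.05],  # h
    [0.10, 0.70, 0.10, 0.10],  # h (repeat, merged)
    [0.90, 0.05, 0.03, 0.02],  # blank
    [0.10, 0.10, 0.70, 0.10],  # i
    [0.20, 0.10, 0.60, 0.10],  # i (repeat, merged)
    [0.10, 0.10, 0.10, 0.70],  # !
]
print(ctc_greedy_decode(frame_scores, vocab))  # prints "hi!"
```

An autoregressive decoder would instead need one model step per output token, which is where most of the latency gap comes from.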

Self-Hosting & Configuration

  • Install via FunASR: pip install funasr (Python 3.8+)
  • Models download automatically from ModelScope or Hugging Face on first use
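  • Two sizes exist: SenseVoice-Small (fast, lightweight) and SenseVoice-Large (higher accuracy); check ModelScope or Hugging Face for which checkpoints are currently published
  • Set language='auto' for automatic language detection, or pass a language code such as 'zh', 'en', 'yue', 'ja', or 'ko'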
  • Deploy in production using FunASR's gRPC/WebSocket server for concurrent requests
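The configuration steps above can be sketched in Python roughly as follows. This assumes FunASR's `AutoModel` interface and the ModelScope model ID `iic/SenseVoiceSmall`; the helper names `build_generate_kwargs` and `transcribe`, and the language subset, are hypothetical and chosen for illustration. Verify argument names against the FunASR version you install.

```python
# Assumed subset of language codes; check the model card for the full list.
SUPPORTED_LANGUAGES = {"auto", "zh", "en", "yue", "ja", "ko"}


def build_generate_kwargs(audio_path, language="auto", use_itn=True):
    """Assemble keyword arguments for AutoModel.generate() (hypothetical helper)."""
    if language not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language code: {language!r}")
    return {
        "input": audio_path,
        "cache": {},
        "language": language,  # "auto" lets the model detect the language
        "use_itn": use_itn,    # inverse text normalization (punctuation, numerals)
    }


def transcribe(audio_path, language="auto", device="cpu"):
    """Load SenseVoice-Small through FunASR and return cleaned-up text."""
    # Imported lazily so this module loads even where funasr is not installed.
    from funasr import AutoModel
    from funasr.utils.postprocess_utils import rich_transcription_postprocess

    # The model is downloaded on first use, which can take a while.
    model = AutoModel(model="iic/SenseVoiceSmall", device=device)
    result = model.generate(**build_generate_kwargs(audio_path, language))
    return rich_transcription_postprocess(result[0]["text"])
```

Calling `transcribe("meeting.wav")` would then download the model if needed and return the transcript; pass `device="cuda:0"` to run on a GPU. For repeated calls, load the model once and reuse it instead of constructing it per request.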

Key Features

  • Unified model handles ASR, language ID, emotion, and audio events without separate pipelines
  • SenseVoice-Small is reported to run about 5x faster than Whisper-Small and 15x faster than Whisper-Large
  • Supports rich transcription with emotion and event tags embedded in output
  • Works well on noisy audio and multi-speaker scenarios
  • Fine-tunable on domain-specific data using FunASR training scripts

Comparison with Similar Tools

  • Whisper (OpenAI) — strong multilingual ASR but autoregressive and slower; SenseVoice adds emotion and event detection
  • Faster Whisper — accelerated Whisper inference; SenseVoice is natively faster due to non-autoregressive architecture
  • FunASR Paraformer — non-autoregressive ASR; SenseVoice adds multi-task understanding beyond transcription
  • wav2vec 2.0 — self-supervised speech representation; SenseVoice is a complete end-to-end recognition system
  • WhisperX — adds word-level timestamps to Whisper; SenseVoice provides emotion and event detection instead

FAQ

Q: How does SenseVoice compare to Whisper in accuracy? A: SenseVoice matches or exceeds Whisper on standard benchmarks for supported languages, while running significantly faster.

Q: Can I use SenseVoice for real-time applications? A: Yes. SenseVoice-Small is fast enough for real-time transcription, and FunASR's server supports streaming WebSocket connections.
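One way to check real-time suitability on your own hardware is to measure the real-time factor (RTF): processing time divided by audio duration, where values below 1 mean the model keeps up with incoming audio. The sketch below uses a stand-in `fake_transcribe` function (hypothetical) so it runs without a model; swap in your actual transcription call.

```python
import time


def real_time_factor(transcribe_fn, audio_path, audio_seconds):
    """Return processing time divided by audio duration."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    start = time.perf_counter()
    transcribe_fn(audio_path)
    elapsed = time.perf_counter() - start
    return elapsed / audio_seconds


def fake_transcribe(path):
    """Stand-in for a real model call; sleeps to simulate inference."""
    time.sleep(0.05)
    return "placeholder transcript"


rtf = real_time_factor(fake_transcribe, "clip.wav", audio_seconds=10.0)
verdict = "real-time capable" if rtf < 1 else "slower than real time"
print(f"RTF = {rtf:.3f} ({verdict})")
```

Measure with audio representative of production traffic, and exclude the first call, which typically includes model loading.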

Q: What format does the emotion output take? A: Emotion labels are returned as tags (e.g., <|HAPPY|>, <|SAD|>) alongside the transcription text.
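If you need the tags as structured data instead of inline text, a small parser is usually enough. The sketch below assumes tags of the form `<|LABEL|>` at the start of the transcript; the emotion label set and the sample string are illustrative assumptions, so consult the model card for the actual labels.

```python
import re

TAG_PATTERN = re.compile(r"<\|([^|]*)\|>")

# Assumed label set for illustration; the real model may emit more labels.
EMOTIONS = {"HAPPY", "SAD", "ANGRY", "NEUTRAL"}


def parse_rich_transcript(raw):
    """Split a tagged transcript into its tags, emotion, and plain text."""
    tags = TAG_PATTERN.findall(raw)
    text = TAG_PATTERN.sub("", raw).strip()
    emotion = next((tag for tag in tags if tag in EMOTIONS), None)
    return {"tags": tags, "emotion": emotion, "text": text}


# Hypothetical raw output carrying language, emotion, event, and ITN tags.
sample = "<|en|><|HAPPY|><|Speech|><|withitn|>Thanks, that was great."
parsed = parse_rich_transcript(sample)
print(parsed["emotion"], "|", parsed["text"])
```

Keeping the raw tag list alongside the extracted emotion makes it easy to add handling for language or audio event tags later.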

Q: Is commercial use permitted? A: SenseVoice models are released under permissive licenses. Check the specific model card on ModelScope or Hugging Face for license details.
