Configs · May 19, 2026 · 1 min read

SenseVoice — Multilingual Speech Understanding Model

SenseVoice is an open-source speech foundation model by Alibaba's FunAudioLLM team that performs automatic speech recognition, language identification, speech emotion recognition, and audio event detection in a single model. It supports 50+ languages and runs significantly faster than Whisper.

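Agent ready

This asset can be read and installed directly by an Agent

TokRepo provides a universal CLI command, an install contract, metadata JSON, adapter-specific install plans, and raw content links, so an Agent can judge fit, risk, and next steps.

Native · 98/100 · Policy: allowed
Agent entry: any MCP/CLI Agent
Type: Skill
Install: Single
Trust level: Established
Entry: SenseVoice Overview
Universal CLI install command:
npx tokrepo install fe36c7c0-537e-11f1-9bc6-00163e2b0d79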

Introduction

SenseVoice goes beyond speech-to-text by combining ASR with speech emotion recognition, spoken language identification, and audio event detection in a single forward pass. Trained on over 400,000 hours of data, it achieves high accuracy across 50+ languages with inference speeds far exceeding Whisper.

What SenseVoice Does

  • Transcribes speech in 50+ languages with high accuracy
  • Detects the spoken language automatically from audio input
  • Recognizes speaker emotions (happy, sad, angry, neutral, etc.) from voice
  • Identifies non-speech audio events like applause, laughter, music, and crying
  • Provides all four capabilities simultaneously in a single inference call

Architecture Overview

SenseVoice uses an encoder-only Transformer architecture with multi-task prediction heads. The shared audio encoder processes mel-spectrogram features through a stack of Conformer blocks. Task-specific output heads branch from the shared representation to produce ASR tokens, language labels, emotion labels, and audio event labels. The SenseVoice-Small variant has a parameter count comparable to Whisper-Small but achieves significantly lower latency through non-autoregressive decoding.
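
To make the non-autoregressive idea concrete, here is a minimal sketch of greedy decoding over per-frame scores, assuming a CTC-style output layer (an assumption; the text above only says "non-autoregressive"). The toy vocabulary, the blank index, and the function name are illustrative and not taken from SenseVoice. Each frame is scored independently, so there is no token-by-token loop that feeds earlier outputs back into the model.

```python
from typing import List, Optional, Sequence

BLANK_ID = 0  # assumed index of the CTC blank symbol (illustrative)


def ctc_greedy_decode(
    frame_scores: Sequence[Sequence[float]], blank_id: int = BLANK_ID
) -> List[int]:
    """Pick the best token per frame, merge adjacent repeats, drop blanks."""
    # Frames are independent of each other, so this argmax could run in parallel.
    best_per_frame = [
        max(range(len(scores)), key=scores.__getitem__) for scores in frame_scores
    ]

    decoded: List[int] = []
    previous: Optional[int] = None
    for token in best_per_frame:
        # Keep a token only when it starts a new run and is not the blank.
        if token != previous and token != blank_id:
            decoded.append(token)
        previous = token
    return decoded


# Toy example: three-symbol vocabulary and five audio frames.
VOCAB = {0: "<blank>", 1: "h", 2: "i"}
frames = [
    [0.10, 0.80, 0.10],  # "h"
    [0.20, 0.70, 0.10],  # "h" again, merged with the previous frame
    [0.90, 0.05, 0.05],  # blank
    [0.10, 0.10, 0.80],  # "i"
    [0.10, 0.20, 0.70],  # "i" again, merged
]
tokens = ctc_greedy_decode(frames)
text = "".join(VOCAB[t] for t in tokens)
print(text)
```

An autoregressive decoder such as Whisper's would instead generate one token at a time, each conditioned on the tokens before it, which is where most of the latency difference described above comes from.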

Self-Hosting & Configuration

  • Install via FunASR: pip install funasr (Python 3.8+)
  • Models download automatically from ModelScope or Hugging Face on first use
  • Available in two sizes: SenseVoice-Small (fast, lightweight) and SenseVoice-Large (higher accuracy)
  • Set language='auto' for automatic language detection or specify a language code
  • Deploy in production using FunASR's gRPC/WebSocket server for concurrent requests
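
The steps above can be sketched in Python roughly as follows. This is a hedged example rather than the project's reference code: the model id "iic/SenseVoiceSmall", the set of language codes, the `use_itn` option, and the helper names are assumptions, so check the model card and the FunASR documentation for your installed version. The `funasr` import is deferred so the option-building helper can be used and tested without the package installed.

```python
from typing import Any, Dict

# Language codes assumed for illustration; the model card lists the real set.
ASSUMED_LANGUAGE_CODES = {"auto", "zh", "en", "yue", "ja", "ko"}


def build_generate_kwargs(language: str = "auto") -> Dict[str, Any]:
    """Build decoding options, rejecting language codes outside the assumed set."""
    if language not in ASSUMED_LANGUAGE_CODES:
        raise ValueError(f"unsupported language code: {language!r}")
    # use_itn asks for inverse text normalization (punctuation, numerals).
    return {"language": language, "use_itn": True}


def load_model(model_id: str = "iic/SenseVoiceSmall", device: str = "cpu"):
    """Load the model once; weights are downloaded on first use."""
    from funasr import AutoModel  # deferred: requires `pip install funasr`

    return AutoModel(model=model_id, device=device)


def transcribe(model: Any, audio_path: str, language: str = "auto") -> str:
    """Run one inference call and return the raw text, tags included."""
    results = model.generate(
        input=audio_path, cache={}, **build_generate_kwargs(language)
    )
    return results[0]["text"]


if __name__ == "__main__":
    # "example.wav" is a placeholder path, not a file shipped with the model.
    model = load_model()
    print(transcribe(model, "example.wav", language="auto"))
```

Loading the model once and reusing it across calls avoids paying the download and initialization cost on every request.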

Key Features

  • Unified model handles ASR, language ID, emotion, and audio events without separate pipelines
  • SenseVoice-Small inference is reported to be about 5x faster than Whisper-Small and 15x faster than Whisper-Large
  • Supports rich transcription with emotion and event tags embedded in output
  • Works well on noisy audio and multi-speaker scenarios
  • Fine-tunable on domain-specific data using FunASR training scripts

Comparison with Similar Tools

  • Whisper (OpenAI) — strong multilingual ASR but autoregressive and slower; SenseVoice adds emotion and event detection
  • Faster Whisper — accelerated Whisper inference; SenseVoice is natively faster due to non-autoregressive architecture
  • FunASR Paraformer — non-autoregressive ASR; SenseVoice adds multi-task understanding beyond transcription
  • wav2vec 2.0 — self-supervised speech representation; SenseVoice is a complete end-to-end recognition system
  • WhisperX — adds word-level timestamps to Whisper; SenseVoice provides emotion and event detection instead

FAQ

Q: How does SenseVoice compare to Whisper in accuracy? A: SenseVoice matches or exceeds Whisper on standard benchmarks for supported languages, while running significantly faster.

Q: Can I use SenseVoice for real-time applications? A: Yes. SenseVoice-Small is fast enough for real-time transcription, and FunASR's server supports streaming WebSocket connections.

Q: What format does the emotion output take? A: Emotion labels are returned as tags (e.g., <|HAPPY|>, <|SAD|>) embedded in the output alongside the transcription text.
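
As a small illustration of consuming such output, the sketch below separates tags from the transcript. It assumes tags take the form `<|NAME|>` and that emotion names are upper-case labels like the ones listed earlier; the sample string, the emotion set, and the function name are made up for this example and should be checked against real model output.

```python
import re
from typing import List, Optional, Tuple

# Assumed tag syntax: <|NAME|>. Verify against actual model output.
TAG_PATTERN = re.compile(r"<\|([^|<>]+)\|>")

# Emotion names assumed from the labels mentioned in this article.
ASSUMED_EMOTIONS = {"HAPPY", "SAD", "ANGRY", "NEUTRAL"}


def split_tags(raw: str) -> Tuple[List[str], str]:
    """Return the tag names in order and the transcript with tags removed."""
    tags = TAG_PATTERN.findall(raw)
    text = TAG_PATTERN.sub("", raw).strip()
    return tags, text


def first_emotion(tags: List[str]) -> Optional[str]:
    """Return the first tag that is a known emotion label, if any."""
    return next((tag for tag in tags if tag in ASSUMED_EMOTIONS), None)


# Hypothetical raw output, written by hand for this example.
sample = "<|en|><|HAPPY|><|Speech|>thanks for calling"
tags, text = split_tags(sample)
emotion = first_emotion(tags)
print(tags, emotion, text)
```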

Q: Is commercial use permitted? A: SenseVoice models are released under permissive licenses. Check the specific model card on ModelScope or Hugging Face for license details.
