Configs · May 19, 2026 · 3 min read

FunASR — End-to-End Speech Recognition Toolkit

FunASR is an open-source speech recognition toolkit by Alibaba DAMO Academy supporting ASR, voice activity detection, punctuation restoration, and text normalization. It ships pretrained models for 50+ languages and provides production-ready server deployment with streaming support.

Universal CLI command
npx tokrepo install 9e95d508-537e-11f1-9bc6-00163e2b0d79

Introduction

FunASR provides a complete pipeline for automatic speech recognition, from audio input to formatted text output. It bundles pretrained models (Paraformer, SenseVoice, Whisper-compatible) with Python APIs and a deployable gRPC/WebSocket server.
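
As a quick orientation, the sketch below shows what a call through the Python API might look like. It assumes the `AutoModel` class and the model aliases `paraformer-zh`, `fsmn-vad`, and `ct-punc` as they appear in the project's README at the time of writing. `build_pipeline_config` and `transcribe` are hypothetical helper names introduced here, and `funasr` is imported lazily so the configuration can be inspected without the package installed.

```python
from typing import Any, Dict


def build_pipeline_config(hotword: str = "") -> Dict[str, Any]:
    """Collect the arguments for AutoModel(...) and model.generate(...).

    Hypothetical helper: the model aliases are assumptions taken from the
    FunASR README and may change between releases.
    """
    generate_kwargs: Dict[str, Any] = {
        # Assumed meaning: dynamic batching budget, in seconds of audio.
        "batch_size_s": 300,
    }
    if hotword:
        # Optional hotword to bias recognition toward a domain term.
        generate_kwargs["hotword"] = hotword
    return {
        "model_kwargs": {
            "model": "paraformer-zh",  # ASR model
            "vad_model": "fsmn-vad",   # voice activity detection
            "punc_model": "ct-punc",   # punctuation restoration
        },
        "generate_kwargs": generate_kwargs,
    }


def transcribe(audio_path: str, hotword: str = "") -> str:
    """Transcribe one audio file; requires `pip install funasr`."""
    from funasr import AutoModel  # lazy import, only needed at call time

    cfg = build_pipeline_config(hotword)
    model = AutoModel(**cfg["model_kwargs"])
    results = model.generate(input=audio_path, **cfg["generate_kwargs"])
    # Assumption: generate() returns a list of dicts with a "text" field.
    return results[0]["text"]
```

Calling `transcribe("meeting.wav", hotword="FunASR")` (with a hypothetical file name) would build the VAD, ASR, and punctuation stages in one object and return the punctuated transcript.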

What FunASR Does

  • Performs speech-to-text transcription for 50+ languages with pretrained models
  • Detects voice activity to segment audio into speech and silence regions
  • Restores punctuation and performs inverse text normalization on transcriptions
  • Supports both offline (batch) and online (streaming) recognition modes
  • Provides a runtime server for production deployment with GPU acceleration
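
To make the voice activity detection step concrete, here is a minimal sketch of segmenting audio into speech and silence regions. It is not FunASR's VAD model, which is a neural network. It is a plain energy threshold used only to illustrate the idea, and the function names, frame length, and threshold are assumptions chosen for the example.

```python
import math
from typing import List, Tuple


def frame_energies(samples: List[float], frame_len: int) -> List[float]:
    """Root-mean-square energy of each complete, non-overlapping frame."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(math.sqrt(sum(x * x for x in frame) / frame_len))
    return energies


def detect_speech(
    samples: List[float], frame_len: int = 160, threshold: float = 0.1
) -> List[Tuple[int, int]]:
    """Return (start_sample, end_sample) spans whose frame RMS >= threshold."""
    segments: List[Tuple[int, int]] = []
    seg_start = None
    energies = frame_energies(samples, frame_len)
    for i, energy in enumerate(energies):
        if energy >= threshold and seg_start is None:
            seg_start = i * frame_len              # speech begins
        elif energy < threshold and seg_start is not None:
            segments.append((seg_start, i * frame_len))  # speech ends
            seg_start = None
    if seg_start is not None:
        # Audio ended while still inside a speech region.
        segments.append((seg_start, len(energies) * frame_len))
    return segments
```

Each returned span can then be passed to the recognizer on its own, which is the role the VAD stage plays in the pipeline: long recordings are cut at silences before transcription.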

Architecture Overview

FunASR's core is built on PyTorch and wraps multiple ASR architectures (Paraformer, Conformer, Transformer, Whisper) behind a unified AutoModel interface. The Paraformer model is non-autoregressive: a predictor module estimates the number of output tokens, which lets the decoder emit all of them in a single parallel pass instead of one token at a time. The runtime server is written in C++ and runs ONNX-exported models on ONNX Runtime for low-latency inference; it is exposed over gRPC or WebSocket, with WebSocket used for streaming audio.
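
The predictor's role can be illustrated with a small sketch. The Paraformer paper describes the predictor as a continuous integrate-and-fire (CIF) mechanism: each encoder frame contributes a weight, the weights are accumulated, and a token boundary "fires" whenever the running sum crosses a threshold. The code below is a simplified illustration under that assumption, not FunASR's implementation, and the function name and weights are made up for the example.

```python
from typing import List


def integrate_and_fire(weights: List[float], threshold: float = 1.0) -> List[int]:
    """Return the frame indices at which a token boundary fires.

    Assumes each weight lies in [0, 1] (e.g. a sigmoid output), so at most
    one boundary can fire per frame.
    """
    fired: List[int] = []
    accumulated = 0.0
    for frame_index, weight in enumerate(weights):
        accumulated += weight
        if accumulated >= threshold:
            fired.append(frame_index)
            accumulated -= threshold  # carry the remainder into the next token
    return fired


# Hypothetical per-frame weights for a 7-frame utterance.
weights = [0.5, 0.5, 0.25, 0.25, 0.5, 0.75, 0.25]
boundaries = integrate_and_fire(weights)
token_count = len(boundaries)
```

Because the number of tokens is known before decoding starts, the decoder can be given that many slots and fill them all in one parallel pass, which is what makes the architecture non-autoregressive.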

Self-Hosting & Configuration

  • Install via pip: pip install funasr (Python 3.8+)
  • Models download automatically from ModelScope or Hugging Face on first use
  • Deploy the production server using the Docker image: funasr-runtime-sdk-gpu
  • Configure the server via command-line flags for model paths, ports, and thread count
  • Stream audio to the server over WebSocket for real-time transcription
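
The streaming step above can be sketched as a sequence of messages: a JSON control message to open the session, binary audio chunks, and a JSON message to close it. The field names `mode`, `wav_name`, and `is_speaking` are assumptions modelled on the project's sample clients and should be checked against the runtime's protocol documentation. The helper names and the 60 ms chunk size are hypothetical choices for this example.

```python
import json
from typing import Iterator, List, Union


def pcm_chunks(
    pcm: bytes, sample_rate: int = 16000, chunk_ms: int = 60, sample_width: int = 2
) -> Iterator[bytes]:
    """Split raw PCM audio into fixed-duration chunks (the last may be shorter)."""
    step = sample_rate * chunk_ms // 1000 * sample_width  # bytes per chunk
    for offset in range(0, len(pcm), step):
        yield pcm[offset:offset + step]


def build_stream_messages(pcm: bytes, name: str = "demo") -> List[Union[str, bytes]]:
    """Build the ordered messages for one streaming session.

    Strings are meant to be sent as WebSocket text frames, bytes as binary
    frames. Field names are assumptions, not a verified protocol definition.
    """
    start = json.dumps({"mode": "online", "wav_name": name, "is_speaking": True})
    end = json.dumps({"is_speaking": False})
    return [start, *pcm_chunks(pcm), end]


# One second of 16 kHz, 16-bit mono silence as placeholder audio.
messages = build_stream_messages(bytes(16000 * 2))
```

Keeping the framing separate from the network code makes it possible to test the chunking without a running server; a WebSocket client would then send each message in order and read partial results as they arrive.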

Key Features

  • Paraformer achieves fast non-autoregressive decoding with high accuracy on Chinese and English
  • Streaming mode delivers partial results with low latency for live captioning
  • Supports hotword boosting to improve recognition of domain-specific terms
  • Includes speaker diarization to distinguish who is speaking
  • Production C++ runtime with ONNX optimization for enterprise deployment

Comparison with Similar Tools

  • Whisper (OpenAI): strong multilingual ASR; FunASR offers faster non-autoregressive models and a production server
  • whisper.cpp: C++ Whisper inference; FunASR provides a broader toolkit with VAD, punctuation, and diarization
  • Faster Whisper: CTranslate2-based speedup of Whisper, which still decodes autoregressively; FunASR's Paraformer is non-autoregressive, which can reduce latency further
  • Vosk: offline speech recognition; FunASR supports both streaming and batch with a wider model zoo
  • DeepSpeech: Mozilla's end-to-end ASR (archived); FunASR is actively maintained with newer architectures

FAQ

Q: Which languages are supported? A: FunASR ships models covering 50+ languages, with particular strength in Chinese (including 7 dialects and 26 accents), English, Japanese, and Korean.

Q: Can I fine-tune models on my own data? A: Yes. FunASR provides training scripts and recipes for fine-tuning any supported model on custom datasets.

Q: What is the recommended deployment for production? A: Use the Docker-based runtime server with GPU support. It handles concurrent WebSocket connections and delivers optimized throughput via ONNX Runtime.

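Q: How does Paraformer compare to Whisper in speed? A: Whisper decodes autoregressively, generating one token per decoder step, so decoding time grows with the length of the transcript. Paraformer predicts all tokens in a single parallel pass, which makes it significantly faster, especially on long audio segments.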
