What is FunASR — End-to-End Speech Recognition Toolkit?

FunASR is an open-source speech recognition toolkit by Alibaba DAMO Academy supporting ASR, voice activity detection, punctuation restoration, and text normalization. It ships pretrained models for 50+ languages and provides production-ready server deployment with streaming support.

Is FunASR — End-to-End Speech Recognition Toolkit free to use?

Yes. FunASR — End-to-End Speech Recognition Toolkit is freely available on TokRepo. Check the Source & Thanks section on the asset page for the specific open-source license.

How do I install FunASR — End-to-End Speech Recognition Toolkit?

Visit the asset page on TokRepo and click "Copy for agent" to get the installation instructions. Most assets can be installed with a single command.

FunASR — End-to-End Speech Recognition Toolkit

Introduction

FunASR provides a complete pipeline for automatic speech recognition, from audio input to formatted text output. It bundles pretrained models (Paraformer, SenseVoice, and Whisper-compatible checkpoints) with a Python API and a deployable gRPC/WebSocket server.

What FunASR Does

Performs speech-to-text transcription for 50+ languages with pretrained models
Detects voice activity to segment audio into speech and silence regions
Restores punctuation and performs inverse text normalization on transcriptions
Supports both offline (batch) and online (streaming) recognition modes
Provides a runtime server for production deployment with GPU acceleration
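To make the voice activity detection step concrete, here is a minimal sketch of energy-threshold segmentation in plain Python. It is an illustration of the idea of splitting audio into speech and silence regions, not FunASR's own detector, which is a trained neural model. The function name, frame size, and threshold are all hypothetical choices made for this example.

```python
from typing import List, Tuple


def segment_speech(
    samples: List[float],
    sample_rate: int,
    frame_ms: int = 20,
    threshold: float = 0.01,
) -> List[Tuple[float, float]]:
    """Return (start_s, end_s) spans whose mean frame energy exceeds threshold.

    Toy energy-based VAD: hypothetical helper, not part of FunASR.
    """
    frame_len = sample_rate * frame_ms // 1000
    n_frames = len(samples) // frame_len
    segments: List[Tuple[float, float]] = []
    start = None  # start time of the speech span currently open, if any

    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        energy = sum(x * x for x in frame) / frame_len
        t = i * frame_ms / 1000  # start time of this frame in seconds
        if energy > threshold and start is None:
            start = t  # silence -> speech transition
        elif energy <= threshold and start is not None:
            segments.append((start, t))  # speech -> silence transition
            start = None

    if start is not None:  # audio ended while still in speech
        segments.append((start, n_frames * frame_ms / 1000))
    return segments
```

Only the detected spans would then be passed on to the recognizer, which is what keeps long recordings with pauses cheap to transcribe.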

Architecture Overview

FunASR's core is built on PyTorch and wraps multiple ASR architectures (Paraformer, Conformer, Transformer, Whisper) behind a unified AutoModel interface. Paraformer is non-autoregressive: a predictor module estimates the number of output tokens, so the decoder can emit all of them in a single parallel pass instead of one token per step. The runtime server is a C++ service that loads ONNX-exported models with ONNX Runtime for low-latency inference and accepts streaming audio over WebSocket or gRPC.
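The token-count predictor can be pictured with a toy integrate-and-fire loop. This sketch assumes the predictor assigns each acoustic frame a weight and that a token boundary is emitted whenever the running sum reaches 1.0. That mechanism is an assumption for illustration; the text above only states that the predictor estimates token count. The function name and the weights are made up.

```python
from typing import List, Tuple


def integrate_and_fire(
    weights: List[float], threshold: float = 1.0
) -> Tuple[int, List[int]]:
    """Accumulate per-frame weights and fire once each time the sum reaches threshold.

    Returns the estimated token count and the frame indices where a token fired.
    Assumes every weight is below threshold, so a frame fires at most once.
    """
    accumulated = 0.0
    fire_frames: List[int] = []
    for frame_index, weight in enumerate(weights):
        accumulated += weight
        if accumulated >= threshold:
            fire_frames.append(frame_index)
            accumulated -= threshold  # carry the remainder into the next token
    return len(fire_frames), fire_frames


# Hypothetical weights for six acoustic frames.
token_count, boundaries = integrate_and_fire([0.5, 0.5, 0.25, 0.75, 0.5, 0.5])
```

Once the count is known up front, a decoder can fill all token positions at once, which is the property that removes the step-by-step loop of autoregressive models.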

Self-Hosting & Configuration

Install via pip: pip install funasr (Python 3.8+)
Models download automatically from ModelScope or Hugging Face on first use
Deploy the production server using the Docker image: funasr-runtime-sdk-gpu
Configure the server via command-line flags for model paths, ports, and thread count
Stream audio to the server over WebSocket for real-time transcription
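After installation, offline transcription through the AutoModel interface might look like the sketch below. The model aliases ("paraformer-zh", "fsmn-vad", "ct-punc") and the shape of the returned result are assumptions based on the project's published examples, so check them against the installed version. The two helper functions are hypothetical wrappers written for this example.

```python
from typing import Any, Dict, Optional


def build_model_kwargs(
    asr: str = "paraformer-zh",
    vad: Optional[str] = "fsmn-vad",
    punc: Optional[str] = "ct-punc",
) -> Dict[str, Any]:
    """Assemble AutoModel keyword arguments, omitting optional stages set to None."""
    kwargs: Dict[str, Any] = {"model": asr}
    if vad:
        kwargs["vad_model"] = vad  # voice activity detection stage
    if punc:
        kwargs["punc_model"] = punc  # punctuation restoration stage
    return kwargs


def transcribe(wav_path: str, **model_kwargs: Any) -> str:
    """Load the pipeline and return the transcript of one audio file."""
    # Imported lazily so build_model_kwargs works without funasr installed.
    from funasr import AutoModel

    model = AutoModel(**model_kwargs)  # downloads models on first use
    result = model.generate(input=wav_path)
    # Assumed result shape: a list with one dict per input, holding a "text" key.
    return result[0]["text"]
```

A call such as transcribe("meeting.wav", **build_model_kwargs()) would run the full pipeline, where meeting.wav is a placeholder file name.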

Key Features

Paraformer achieves fast non-autoregressive decoding with high accuracy on Chinese and English
Streaming mode delivers partial results with low latency for live captioning
Supports hotword boosting to improve recognition of domain-specific terms
Includes speaker diarization to distinguish who is speaking
Production C++ runtime with ONNX optimization for enterprise deployment
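Streaming mode can be sketched as feeding fixed-size chunks to a model and collecting each partial result. The chunking helper below is self-contained; the generate call, its cache, is_final, and chunk_size parameters, and the 960-sample unit (60 ms at 16 kHz) are assumptions modeled on the project's streaming examples and should be verified before use. Both function names are hypothetical.

```python
from typing import Any, Iterator, List, Sequence, Tuple


def iter_chunks(
    samples: Sequence[float], chunk_stride: int
) -> Iterator[Tuple[Sequence[float], bool]]:
    """Yield (chunk, is_final) pairs that cover the samples in order."""
    total = max(1, (len(samples) + chunk_stride - 1) // chunk_stride)
    for i in range(total):
        chunk = samples[i * chunk_stride:(i + 1) * chunk_stride]
        yield chunk, i == total - 1


def stream_transcribe(
    model: Any, samples: Sequence[float], chunk_size: Tuple[int, int, int] = (0, 10, 5)
) -> List[str]:
    """Feed audio chunk by chunk and return the partial transcripts in order."""
    chunk_stride = chunk_size[1] * 960  # 10 units of 60 ms = 600 ms at 16 kHz
    cache: dict = {}  # carries model state between chunks
    partials: List[str] = []
    for chunk, is_final in iter_chunks(samples, chunk_stride):
        result = model.generate(
            input=chunk, cache=cache, is_final=is_final, chunk_size=list(chunk_size)
        )
        if result and result[0].get("text"):
            partials.append(result[0]["text"])
    return partials
```

Marking the last chunk with is_final lets the model flush any buffered audio, so the tail of the utterance is not lost.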

Comparison with Similar Tools

Whisper (OpenAI) — strong multilingual ASR; FunASR offers faster non-autoregressive models and a production server
whisper.cpp — C++ Whisper inference; FunASR provides a broader toolkit with VAD, punctuation, and diarization
Faster Whisper — CTranslate2-based speedup; FunASR's Paraformer is natively non-autoregressive for even lower latency
Vosk — offline speech recognition; FunASR supports both streaming and batch with a wider model zoo
DeepSpeech — Mozilla's end-to-end ASR (archived); FunASR is actively maintained with newer architectures

FAQ

Q: Which languages are supported? A: FunASR ships models covering 50+ languages, with particular strength in Chinese (including 7 dialects and 26 accents), English, Japanese, and Korean.

Q: Can I fine-tune models on my own data? A: Yes. FunASR provides training scripts and recipes for fine-tuning any supported model on custom datasets.

Q: What is the recommended deployment for production? A: Use the Docker-based runtime server with GPU support. It handles concurrent WebSocket connections and delivers optimized throughput via ONNX Runtime.

Q: How does Paraformer compare to Whisper in speed? A: Paraformer decodes non-autoregressively, producing all output tokens in one parallel pass, whereas Whisper generates tokens one at a time. This makes Paraformer significantly faster, and the gap widens on long audio segments.

Related Assets

SenseVoice — Multilingual Speech Understanding Model

SpeechBrain — Open-Source All-in-One Speech and Audio Processing Toolkit

Fish Speech — Multilingual TTS for 80+ Languages

CosyVoice — Multilingual Voice Generation with LLM-Based TTS