Introduction
FunASR provides a complete pipeline for automatic speech recognition, from audio input to formatted text output. It bundles state-of-the-art pretrained models (Paraformer, SenseVoice, Whisper-compatible) with convenient Python APIs and a deployable gRPC/WebSocket server.
What FunASR Does
- Performs speech-to-text transcription for 50+ languages with pretrained models
- Detects voice activity to segment audio into speech and silence regions
- Restores punctuation and performs inverse text normalization on transcriptions
- Supports both offline (batch) and online (streaming) recognition modes
- Provides a runtime server for production deployment with GPU acceleration
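The capabilities above can be combined in a single call through the AutoModel interface. The following is a minimal sketch, not a definitive recipe: the model aliases "paraformer-zh", "fsmn-vad", and "ct-punc" are assumed to be available in the model zoo, and the helper names build_offline_pipeline and transcribe are hypothetical, introduced only for illustration.

```python
def build_offline_pipeline(asr="paraformer-zh", vad="fsmn-vad", punc="ct-punc"):
    """Return keyword arguments for funasr.AutoModel.

    The default aliases are assumptions; substitute the models you actually use.
    """
    return {"model": asr, "vad_model": vad, "punc_model": punc}


def transcribe(path, **pipeline_kwargs):
    """Run VAD, recognition, and punctuation restoration on one audio file."""
    # Imported lazily so build_offline_pipeline works without funasr installed.
    from funasr import AutoModel

    model = AutoModel(**build_offline_pipeline(**pipeline_kwargs))
    results = model.generate(input=path)  # list with one dict per input
    return results[0]["text"]
```

Calling transcribe("meeting.wav") (a placeholder path) would download the models on first use and return the punctuated transcript as a string.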
Architecture Overview
FunASR's core is built on PyTorch and wraps multiple ASR architectures (Paraformer, Conformer, Transformer, Whisper) behind a unified AutoModel interface. Paraformer is non-autoregressive: a predictor module estimates the number of output tokens, which lets the decoder produce them all in a single parallel pass. The runtime server is a C++ service that loads ONNX-exported models with ONNX Runtime for low-latency inference and exposes them over gRPC or WebSocket, with WebSocket used for streaming audio.
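The predictor idea can be illustrated with a continuous integrate-and-fire (CIF) style accumulator, the mechanism described in the Paraformer paper. This is a toy sketch in plain Python, not FunASR's implementation: the function name cif_fire, the threshold of 1.0, and the rounding of the total are illustrative assumptions. Each encoder frame contributes a weight; the sum of the weights is the estimated token count, and a token boundary fires whenever the running total reaches the threshold.

```python
def cif_fire(alphas, threshold=1.0):
    """Toy CIF accumulator.

    alphas: per-frame weights, assumed to lie in [0, 1].
    Returns (estimated_token_count, fire_frames), where fire_frames holds the
    frame indices at which the accumulated weight reaches the threshold.
    """
    acc = 0.0
    fire_frames = []
    for i, a in enumerate(alphas):
        acc += a
        # With weights of at most 1 and a threshold of 1, a frame fires at most once.
        if acc >= threshold:
            fire_frames.append(i)
            acc -= threshold  # carry the remainder into the next token
    estimated = round(sum(alphas))  # rounded here for simplicity
    return estimated, fire_frames
```

Because the token count is known before decoding starts, the decoder can fill every output position at once rather than waiting for the previous token.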
Self-Hosting & Configuration
- Install via pip: pip install funasr (Python 3.8+)
- Models download automatically from ModelScope or Hugging Face on first use
- Deploy the production server using the Docker image: funasr-runtime-sdk-gpu
- Configure the server via command-line flags for model paths, ports, and thread count
- Stream audio to the server over WebSocket for real-time transcription
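The streaming step can be sketched as a small client. Everything protocol-specific here is an assumption modeled on the project's reference WebSocket client and should be checked against the server version you deploy: the JSON field names, the "2pass" mode, the chunk-size defaults, the "binary" subprotocol, and the "text" field in replies. The function names frame_session and stream are hypothetical. The sketch assumes 16-bit mono PCM and a server reachable without TLS.

```python
import json


def frame_session(pcm_bytes, sample_rate=16000, chunk_size=(5, 10, 5),
                  chunk_interval=10, mode="2pass", wav_name="demo"):
    """Yield the messages for one session: JSON config, PCM chunks, end marker.

    Field names and defaults are assumptions, not a confirmed specification.
    """
    yield json.dumps({
        "mode": mode,
        "chunk_size": list(chunk_size),
        "chunk_interval": chunk_interval,
        "audio_fs": sample_rate,
        "wav_name": wav_name,
        "wav_format": "pcm",
        "is_speaking": True,
    })
    # Bytes per send for 16-bit mono PCM: 60 ms * chunk_size[1] / chunk_interval.
    stride = 60 * chunk_size[1] * sample_rate * 2 // (chunk_interval * 1000)
    for start in range(0, len(pcm_bytes), stride):
        yield pcm_bytes[start:start + stride]
    # Tell the server that no more audio will follow.
    yield json.dumps({"is_speaking": False})


async def stream(uri, pcm_bytes):
    """Send one session and print server replies until the connection closes."""
    import asyncio
    import websockets  # third-party package: pip install websockets

    async with websockets.connect(uri, subprotocols=["binary"]) as ws:
        async def sender():
            for message in frame_session(pcm_bytes):
                await ws.send(message)

        send_task = asyncio.create_task(sender())
        async for reply in ws:
            print(json.loads(reply).get("text", ""))
        await send_task
```

A call such as asyncio.run(stream("ws://localhost:10095", pcm_bytes)) would drive it, where the host and port are placeholders for wherever the server is listening. Keeping the framing in a plain generator makes it easy to inspect the message sequence without a running server.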
Key Features
- Paraformer achieves fast non-autoregressive decoding with high accuracy on Chinese and English
- Streaming mode delivers partial results with low latency for live captioning
- Supports hotword boosting to improve recognition of domain-specific terms
- Includes speaker diarization to distinguish who is speaking
- Production C++ runtime with ONNX optimization for enterprise deployment
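Streaming mode from Python can be sketched as a loop that feeds fixed-size chunks to the model while carrying a cache between calls. This follows the pattern in the project's published examples, but treat the details as assumptions: the model alias "paraformer-zh-streaming", the chunk_size and look-back values, and 16 kHz mono input. The helper names iter_chunks and stream_transcribe are hypothetical.

```python
def iter_chunks(samples, stride):
    """Yield (chunk, is_final) pairs that cover samples in order."""
    for start in range(0, len(samples), stride):
        yield samples[start:start + stride], start + stride >= len(samples)


def stream_transcribe(wav_path, chunk_size=(0, 10, 5)):
    """Yield partial transcripts for a 16 kHz mono file, one per chunk."""
    # Imported lazily so iter_chunks works without these packages installed.
    import soundfile
    from funasr import AutoModel

    model = AutoModel(model="paraformer-zh-streaming")  # assumed alias
    speech, sample_rate = soundfile.read(wav_path)
    if sample_rate != 16000:
        raise ValueError("this sketch assumes 16 kHz audio")

    stride = chunk_size[1] * 960  # 960 samples = 60 ms at 16 kHz
    cache = {}  # model state carried across chunks
    for chunk, is_final in iter_chunks(speech, stride):
        res = model.generate(
            input=chunk,
            cache=cache,
            is_final=is_final,  # flushes the last partial chunk
            chunk_size=list(chunk_size),
            encoder_chunk_look_back=4,
            decoder_chunk_look_back=1,
        )
        if res:
            yield res[0]["text"]
```

With chunk_size[1] set to 10, each chunk covers 600 ms of audio, so smaller values trade accuracy for lower latency in live captioning.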
Comparison with Similar Tools
- Whisper (OpenAI) — strong multilingual ASR; FunASR offers faster non-autoregressive models and a production server
- whisper.cpp — C++ Whisper inference; FunASR provides a broader toolkit with VAD, punctuation, and diarization
- Faster Whisper — CTranslate2-based speedup; FunASR's Paraformer is natively non-autoregressive for even lower latency
- Vosk — offline speech recognition; FunASR supports both streaming and batch with a wider model zoo
- DeepSpeech — Mozilla's end-to-end ASR (archived); FunASR is actively maintained with newer architectures
FAQ
Q: Which languages are supported? A: FunASR ships models covering 50+ languages, with particular strength in Chinese (including 7 dialects and 26 accents), English, Japanese, and Korean.
Q: Can I fine-tune models on my own data? A: Yes. FunASR provides training scripts and recipes for fine-tuning any supported model on custom datasets.
Q: What is the recommended deployment for production? A: Use the Docker-based runtime server with GPU support. It handles concurrent WebSocket connections and delivers optimized throughput via ONNX Runtime.
Q: How does Paraformer compare to Whisper in speed? A: Paraformer is generally faster. Its non-autoregressive decoder emits all tokens in a single parallel pass, while Whisper's autoregressive decoder generates one token per step, so the gap widens as the transcript gets longer.