Configs · May 19, 2026 · 3 min read

FunASR — End-to-End Speech Recognition Toolkit

FunASR is an open-source speech recognition toolkit from Alibaba DAMO Academy. It covers automatic speech recognition (ASR), voice activity detection (VAD), punctuation restoration, and inverse text normalization, ships pretrained models covering 50+ languages, and includes a deployable server with streaming support.

Native · 98/100 · Policy: allow
Agent surface: Any MCP/CLI agent
Kind: Skill
Install: Single
Trust: Established
Entrypoint: FunASR Overview
Universal CLI install command
npx tokrepo install 9e95d508-537e-11f1-9bc6-00163e2b0d79

Introduction

FunASR provides a complete pipeline for automatic speech recognition, from audio input to punctuated, formatted text output. It bundles pretrained models (Paraformer, SenseVoice, Whisper-compatible) with Python APIs and a deployable gRPC/WebSocket server.
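
As a rough illustration of the Python API, the sketch below chains ASR, VAD, and punctuation restoration through AutoModel. It assumes the funasr package is installed and that the model aliases paraformer-zh, fsmn-vad, and ct-punc resolve as they do in FunASR's README; the helper names and the file name meeting.wav are hypothetical.

```python
from typing import Any, Dict, List


def join_transcripts(results: List[Dict[str, Any]]) -> str:
    """Concatenate the "text" field of each result entry.

    Assumes each entry is a dict with a "text" key, the shape shown for
    AutoModel.generate in FunASR's README examples.
    """
    return " ".join(r["text"].strip() for r in results if r.get("text"))


def transcribe(path: str) -> str:
    # Imported lazily so join_transcripts works without funasr installed.
    from funasr import AutoModel

    # Model aliases follow FunASR's README; verify them against the
    # version you install.
    model = AutoModel(
        model="paraformer-zh",  # speech recognition
        vad_model="fsmn-vad",   # voice activity detection
        punc_model="ct-punc",   # punctuation restoration
    )
    return join_transcripts(model.generate(input=path))


# Usage (needs funasr, a model download, and a local audio file):
#   print(transcribe("meeting.wav"))  # hypothetical file name
```

The first call downloads the models, so it may take a while and needs network access.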

What FunASR Does

  • Performs speech-to-text transcription for 50+ languages with pretrained models
  • Detects voice activity to segment audio into speech and silence regions
  • Restores punctuation and performs inverse text normalization on transcriptions
  • Supports both offline (batch) and online (streaming) recognition modes
  • Provides a runtime server for production deployment with GPU acceleration
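
To make the voice activity detection step more concrete, here is a hedged sketch that runs the fsmn-vad model on its own and post-processes its output. It assumes the result format described in FunASR's README, a list of [begin, end] pairs in milliseconds under the "value" key; the helper names are hypothetical.

```python
from typing import List, Sequence, Tuple


def segments_to_samples(
    segments_ms: Sequence[Sequence[int]], sample_rate: int = 16000
) -> List[Tuple[int, int]]:
    """Convert [begin_ms, end_ms] pairs to (start, stop) sample indices."""
    return [
        (beg * sample_rate // 1000, end * sample_rate // 1000)
        for beg, end in segments_ms
    ]


def speech_ratio(segments_ms: Sequence[Sequence[int]], total_ms: int) -> float:
    """Fraction of the recording that was marked as speech."""
    voiced = sum(end - beg for beg, end in segments_ms)
    return voiced / total_ms if total_ms > 0 else 0.0


def detect_speech(path: str) -> List[List[int]]:
    # Lazy import so the helpers above work without funasr installed.
    from funasr import AutoModel

    model = AutoModel(model="fsmn-vad")  # alias from FunASR's README
    result = model.generate(input=path)
    # Assumed shape: [{"key": ..., "value": [[beg_ms, end_ms], ...]}]
    return result[0]["value"]
```

The sample indices can be used to slice a 16 kHz waveform into speech-only regions before recognition.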

Architecture Overview

FunASR's core is built on PyTorch and wraps multiple ASR architectures (Paraformer, Conformer, Transformer, Whisper) behind a unified AutoModel interface. Paraformer is non-autoregressive: a predictor module estimates how many tokens the utterance contains, so the decoder can produce all of them in a single parallel pass instead of one token at a time. The runtime server is written in C++ and runs ONNX-exported models with ONNX Runtime for low-latency inference; clients stream audio to it over WebSocket, and a gRPC interface is also provided.
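
The predictor idea can be illustrated with a toy version of continuous integrate-and-fire (CIF), the mechanism the Paraformer paper describes for its predictor. This is a simplified sketch in plain Python, not FunASR's implementation: the function names are hypothetical, weights are assumed to lie in [0, 1], and the real model works on batched tensors with learned weights.

```python
from typing import List


def estimate_token_count(alphas: List[float]) -> int:
    """Estimate the number of output tokens as the rounded sum of weights."""
    return int(round(sum(alphas)))


def cif(
    frames: List[List[float]], alphas: List[float], threshold: float = 1.0
) -> List[List[float]]:
    """Toy continuous integrate-and-fire over encoder frames.

    Accumulates per-frame weights; each time the running sum reaches
    `threshold`, one token embedding (a weighted sum of the frames seen
    since the last firing) is emitted. Assumes 0 <= alpha <= 1, so a
    single frame fires at most once.
    """
    dim = len(frames[0])
    tokens: List[List[float]] = []
    acc_weight = 0.0
    acc_vec = [0.0] * dim
    for frame, alpha in zip(frames, alphas):
        if acc_weight + alpha < threshold:
            # Keep integrating: not enough weight for a token yet.
            acc_weight += alpha
            acc_vec = [a + alpha * f for a, f in zip(acc_vec, frame)]
        else:
            # Fire: spend just enough of this frame to reach the threshold,
            # then carry the remainder over to the next token.
            used = threshold - acc_weight
            tokens.append([a + used * f for a, f in zip(acc_vec, frame)])
            rest = alpha - used
            acc_weight = rest
            acc_vec = [rest * f for f in frame]
    return tokens
```

Because the token count and all token embeddings are available after one sweep over the encoder output, a decoder can process every position at once, which is what allows single-pass parallel decoding.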

Self-Hosting & Configuration

  • Install via pip: pip install funasr (Python 3.8+)
  • Models download automatically from ModelScope or Hugging Face on first use
  • Deploy the production server using the Docker image: funasr-runtime-sdk-gpu
  • Configure the server via command-line flags for model paths, ports, and thread count
  • Stream audio to the server over WebSocket for real-time transcription
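
The WebSocket step above might look roughly like the client sketch below. The JSON field names (mode, chunk_size, chunk_interval, audio_fs, wav_name, wav_format, is_speaking) follow FunASR's sample WebSocket client as best recalled and should be treated as assumptions to check against the runtime documentation. The function names, the 60 ms chunk length, and the idle timeout are hypothetical, and the third-party websockets package is assumed to be installed.

```python
import asyncio
import json
from typing import Iterator, List


def start_message(wav_name: str, mode: str = "2pass", sample_rate: int = 16000) -> str:
    """First message: JSON config (field names are assumptions, see lead-in)."""
    return json.dumps({
        "mode": mode,
        "chunk_size": [5, 10, 5],
        "chunk_interval": 10,
        "audio_fs": sample_rate,
        "wav_name": wav_name,
        "wav_format": "pcm",
        "is_speaking": True,
    })


def end_message() -> str:
    """Last message: tells the server that no more audio will follow."""
    return json.dumps({"is_speaking": False})


def pcm_chunks(pcm: bytes, chunk_ms: int = 60, sample_rate: int = 16000) -> Iterator[bytes]:
    """Split 16-bit mono PCM into fixed-duration chunks."""
    step = sample_rate * 2 * chunk_ms // 1000  # 2 bytes per sample
    for i in range(0, len(pcm), step):
        yield pcm[i:i + step]


async def stream(uri: str, pcm: bytes, idle_timeout: float = 3.0) -> List[dict]:
    # Lazy third-party import: pip install websockets
    import websockets
    from websockets.exceptions import ConnectionClosed

    results: List[dict] = []
    async with websockets.connect(uri) as ws:
        await ws.send(start_message("demo"))
        for chunk in pcm_chunks(pcm):
            await ws.send(chunk)  # bytes are sent as binary frames
        await ws.send(end_message())
        # Simplification: replies are drained after all audio is sent.
        # A live-captioning client would receive concurrently instead.
        while True:
            try:
                raw = await asyncio.wait_for(ws.recv(), timeout=idle_timeout)
            except (asyncio.TimeoutError, ConnectionClosed):
                break
            results.append(json.loads(raw))
    return results


# Usage (hypothetical URI; needs a running server and raw PCM bytes):
#   asyncio.run(stream("ws://localhost:10095", pcm_bytes))
```

Depending on how the server was started, the URI may need the wss:// scheme and matching TLS settings.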

Key Features

  • Paraformer achieves fast non-autoregressive decoding with high accuracy on Chinese and English
  • Streaming mode delivers partial results with low latency for live captioning
  • Supports hotword boosting to improve recognition of domain-specific terms
  • Includes speaker diarization to distinguish who is speaking
  • Ships a C++ runtime that runs ONNX-exported models for production deployment
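
Streaming mode can also be driven from Python. The sketch below follows the streaming example in FunASR's README as best recalled: the model alias paraformer-zh-streaming, the chunk_size values, and the cache, is_final, and look-back arguments come from that example and should be verified against the installed version. The helper names are hypothetical, and the soundfile package is assumed to be installed.

```python
from typing import Iterator, List, Sequence, Tuple


def iter_chunks(
    speech: Sequence[float], chunk_stride: int
) -> Iterator[Tuple[Sequence[float], bool]]:
    """Yield (chunk, is_final) pairs that cover `speech` in order."""
    total = max(1, (len(speech) + chunk_stride - 1) // chunk_stride)
    for i in range(total):
        yield speech[i * chunk_stride:(i + 1) * chunk_stride], i == total - 1


def stream_transcribe(wav_path: str) -> List[str]:
    # Lazy imports so iter_chunks works without these packages installed.
    import soundfile
    from funasr import AutoModel

    chunk_size = [0, 10, 5]             # README setting, about 600 ms per chunk
    chunk_stride = chunk_size[1] * 960  # samples per chunk at 16 kHz
    model = AutoModel(model="paraformer-zh-streaming")
    speech, _sample_rate = soundfile.read(wav_path)

    cache: dict = {}  # carries model state from one chunk to the next
    partials: List[str] = []
    for chunk, is_final in iter_chunks(speech, chunk_stride):
        res = model.generate(
            input=chunk,
            cache=cache,
            is_final=is_final,
            chunk_size=chunk_size,
            encoder_chunk_look_back=4,
            decoder_chunk_look_back=1,
        )
        if res and res[0].get("text"):
            partials.append(res[0]["text"])
    return partials
```

Each element of the returned list is the partial text for one chunk, which is the unit a live-captioning front end would display as it arrives.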

Comparison with Similar Tools

  • Whisper (OpenAI): strong multilingual ASR; FunASR adds faster non-autoregressive models and a production server
  • whisper.cpp: C++ Whisper inference; FunASR provides a broader toolkit with VAD, punctuation, and diarization
  • Faster Whisper: CTranslate2-based speedup of Whisper's autoregressive decoding; FunASR's Paraformer is natively non-autoregressive, which removes the token-by-token decoding loop altogether
  • Vosk: offline speech recognition; FunASR supports both streaming and batch with a wider model zoo
  • DeepSpeech: Mozilla's end-to-end ASR (archived); FunASR is actively maintained with newer architectures

FAQ

Q: Which languages are supported? A: FunASR ships models covering 50+ languages, with particular strength in Chinese (including 7 dialects and 26 accents), English, Japanese, and Korean.

Q: Can I fine-tune models on my own data? A: Yes. FunASR provides training scripts and recipes for fine-tuning any supported model on custom datasets.

Q: What is the recommended deployment for production? A: Use the Docker-based runtime server with GPU support. It handles concurrent WebSocket connections and delivers optimized throughput via ONNX Runtime.

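Q: How does Paraformer compare to Whisper in speed? A: Whisper decodes autoregressively, one token per decoder step, so decoding time grows with transcript length. Paraformer predicts the token count up front and emits all tokens in one parallel pass, which makes it faster in general, with the gap widening on long audio segments. Actual speedups depend on model size and hardware.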
