Configs · May 19, 2026 · 3 min read

FunASR — End-to-End Speech Recognition Toolkit

FunASR is an open-source speech recognition toolkit by Alibaba DAMO Academy supporting ASR, voice activity detection, punctuation restoration, and text normalization. It ships pretrained models for 50+ languages and provides production-ready server deployment with streaming support.


Introduction

FunASR provides a complete pipeline for automatic speech recognition, from audio input to formatted text output. It bundles pretrained models (Paraformer, SenseVoice, Whisper) with Python APIs and a deployable gRPC/WebSocket server.

What FunASR Does

  • Performs speech-to-text transcription for 50+ languages with pretrained models
  • Detects voice activity to segment audio into speech and silence regions
  • Restores punctuation and performs inverse text normalization on transcriptions
  • Supports both offline (batch) and online (streaming) recognition modes
  • Provides a runtime server for production deployment with GPU acceleration
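
The offline pipeline described above (recognition, voice activity detection, punctuation) can be sketched with the AutoModel interface. This is a minimal sketch, not a reference implementation: the model aliases follow the examples in FunASR's README, while the names PIPELINE, load_model and transcribe are hypothetical helpers introduced here for illustration. The funasr import is deferred so the module can be loaded on a machine where the package is not installed yet.

```python
# Minimal offline transcription sketch.
# Assumption: the model aliases below match those used in FunASR's README.
# PIPELINE, load_model and transcribe are illustrative names, not FunASR APIs.
PIPELINE = {
    "model": "paraformer-zh",   # speech recognition
    "vad_model": "fsmn-vad",    # voice activity detection (segments long audio)
    "punc_model": "ct-punc",    # punctuation restoration
}


def load_model():
    """Build the combined ASR + VAD + punctuation model (downloads on first use)."""
    from funasr import AutoModel  # requires `pip install funasr`
    return AutoModel(**PIPELINE)


def transcribe(model, audio_path, hotword=None):
    """Transcribe one audio file and return the recognized text."""
    kwargs = {"input": audio_path}
    if hotword:
        # Optional hotword string to bias recognition toward domain terms.
        kwargs["hotword"] = hotword
    result = model.generate(**kwargs)
    # generate() returns a list with one dict per input; the text is under "text".
    return result[0]["text"]
```

Loading the model once and passing it to transcribe avoids re-initializing the weights for every file, which matters when processing a batch.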

Architecture Overview

FunASR's core is built on PyTorch and wraps multiple ASR architectures (Paraformer, Conformer, Transformer, Whisper) behind a unified AutoModel interface. The Paraformer model is non-autoregressive: a predictor module estimates how many tokens the utterance contains, so the decoder can emit all of them in a single parallel pass instead of one token at a time. The runtime server is written in C++ and loads ONNX-exported models with ONNX Runtime for low-latency inference. It exposes gRPC and WebSocket endpoints, and streaming audio is sent over WebSocket.
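
To make the predictor idea concrete, here is a toy version of continuous integrate-and-fire (CIF), the mechanism the Paraformer paper builds its predictor on. It is a simplified sketch under stated assumptions: real models learn the per-frame weights and also integrate the encoder states, whereas this example only tracks the weights to show how a token count falls out of them. The function name cif_fire and the sample weights are made up for illustration.

```python
def cif_fire(alphas, threshold=1.0):
    """Return the frame indices at which a token boundary fires.

    alphas: per-frame weights in [0, 1), one per encoder frame.
    A token fires whenever the accumulated weight reaches the threshold;
    the overflow is carried over to the next token.
    """
    fired = []
    acc = 0.0
    for frame, alpha in enumerate(alphas):
        acc += alpha
        if acc >= threshold:
            fired.append(frame)
            acc -= threshold
    return fired


# Illustrative weights for six encoder frames (not taken from a real model).
alphas = [0.25, 0.5, 0.5, 0.25, 0.75, 0.75]
boundaries = cif_fire(alphas)
print(boundaries)        # [2, 4, 5]
print(len(boundaries))   # 3 tokens, matching sum(alphas) == 3.0
```

Because the token count is known before decoding starts, the decoder can fill every position at once rather than waiting for the previous token.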

Self-Hosting & Configuration

  • Install via pip: pip install funasr (Python 3.8+)
  • Models download automatically from ModelScope or Hugging Face on first use
  • Deploy the production server using the Docker image: funasr-runtime-sdk-gpu
  • Configure the server via command-line flags for model paths, ports, and thread count
  • Stream audio to the server over WebSocket for real-time transcription
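
The WebSocket streaming step can be sketched as a sequence of frames: a JSON configuration message, binary audio chunks, and a JSON end marker. The sketch below only builds those frames with the standard library and leaves the network send to the caller. The JSON field names (mode, wav_name, wav_format, chunk_size, is_speaking) and the 60 ms chunk length are assumptions modeled on FunASR's sample WebSocket client, and should be checked against the runtime documentation for the server version you deploy. pcm_chunks and session_messages are hypothetical helper names.

```python
import json


def pcm_chunks(pcm, sample_rate=16000, chunk_ms=60, bytes_per_sample=2):
    """Split raw 16-bit mono PCM bytes into fixed-duration chunks."""
    step = sample_rate * chunk_ms // 1000 * bytes_per_sample  # 1920 bytes by default
    for start in range(0, len(pcm), step):
        yield pcm[start:start + step]


def session_messages(pcm, wav_name="demo"):
    """Yield the frames for one streaming session, in send order.

    str items are meant to be sent as text frames, bytes items as binary frames.
    Field names are assumptions; verify them against the runtime docs.
    """
    yield json.dumps({
        "mode": "online",
        "wav_name": wav_name,
        "wav_format": "pcm",
        "chunk_size": [5, 10, 5],
        "is_speaking": True,
    })
    yield from pcm_chunks(pcm)
    # Tell the server that no more audio will follow.
    yield json.dumps({"is_speaking": False})


# One second of silence at 16 kHz, 16-bit mono.
frames = list(session_messages(bytes(32000)))
print(len(frames))  # 19: one config frame, 17 audio chunks, one end marker
```

With a client library such as websockets, each item can be passed to the connection's send method in order, while a second task reads the partial results the server returns.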

Key Features

  • Paraformer achieves fast non-autoregressive decoding with high accuracy on Chinese and English
  • Streaming mode delivers partial results with low latency for live captioning
  • Supports hotword boosting to improve recognition of domain-specific terms
  • Includes speaker diarization to distinguish who is speaking
  • Production C++ runtime with ONNX optimization for enterprise deployment
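
Streaming mode can also be driven from Python by feeding the model one chunk at a time and carrying a cache between calls. The sketch below follows the streaming example in FunASR's README (model alias paraformer-zh-streaming, chunk_size [0, 10, 5], 960 samples per chunk unit at 16 kHz), but treat those values as assumptions to verify against the version you install. chunk_bounds and stream_file are hypothetical helper names, and soundfile is an extra dependency.

```python
CHUNK_SIZE = [0, 10, 5]              # assumed README setting, about 600 ms per chunk
CHUNK_STRIDE = CHUNK_SIZE[1] * 960   # 9600 samples at 16 kHz


def chunk_bounds(num_samples, stride=CHUNK_STRIDE):
    """Yield (start, end, is_final) sample ranges covering the whole signal."""
    starts = range(0, num_samples, stride)
    last = len(starts) - 1
    for i, start in enumerate(starts):
        yield start, min(start + stride, num_samples), i == last


def stream_file(wav_path):
    """Yield partial results for a 16 kHz mono WAV file, chunk by chunk."""
    import soundfile                 # requires `pip install soundfile`
    from funasr import AutoModel     # requires `pip install funasr`

    model = AutoModel(model="paraformer-zh-streaming")
    speech, _sample_rate = soundfile.read(wav_path)
    cache = {}                       # decoder state shared across chunks
    for start, end, is_final in chunk_bounds(len(speech)):
        yield model.generate(
            input=speech[start:end],
            cache=cache,
            is_final=is_final,       # flushes the last partial chunk
            chunk_size=CHUNK_SIZE,
            encoder_chunk_look_back=4,
            decoder_chunk_look_back=1,
        )
```

Smaller chunks lower the delay before the first partial result at the cost of more model calls, so the chunk size is the main latency knob for live captioning.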

Comparison with Similar Tools

  • Whisper (OpenAI) — strong multilingual ASR; FunASR offers faster non-autoregressive models and a production server
  • whisper.cpp — C++ Whisper inference; FunASR provides a broader toolkit with VAD, punctuation, and diarization
  • Faster Whisper — CTranslate2-based speedup; FunASR's Paraformer is natively non-autoregressive for even lower latency
  • Vosk — offline speech recognition; FunASR supports both streaming and batch with a wider model zoo
  • DeepSpeech — Mozilla's end-to-end ASR (archived); FunASR is actively maintained with newer architectures

FAQ

Q: Which languages are supported? A: FunASR ships models covering 50+ languages, with particular strength in Chinese (including 7 dialects and 26 accents), English, Japanese, and Korean.

Q: Can I fine-tune models on my own data? A: Yes. FunASR provides training scripts and recipes for fine-tuning any supported model on custom datasets.

Q: What is the recommended deployment for production? A: Use the Docker-based runtime server with GPU support. It handles concurrent WebSocket connections and delivers optimized throughput via ONNX Runtime.

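Q: How does Paraformer compare to Whisper in speed? A: Paraformer decodes non-autoregressively: it predicts every output token in one parallel pass, whereas Whisper's autoregressive decoder runs once per token. The gap grows with transcript length, so the difference is most visible on long audio segments.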
