Configs · May 19, 2026 · 1 min read

FunASR — End-to-End Speech Recognition Toolkit

FunASR is an open-source speech recognition toolkit by Alibaba DAMO Academy supporting ASR, voice activity detection, punctuation restoration, and text normalization. It ships pretrained models for 50+ languages and provides production-ready server deployment with streaming support.

Universal CLI install command
npx tokrepo install 9e95d508-537e-11f1-9bc6-00163e2b0d79

Introduction

FunASR provides a complete pipeline for automatic speech recognition, from audio input to formatted text output. It bundles state-of-the-art pretrained models (Paraformer, SenseVoice, Whisper-compatible) with convenient Python APIs and a deployable gRPC/WebSocket server.

What FunASR Does

  • Performs speech-to-text transcription for 50+ languages with pretrained models
  • Detects voice activity to segment audio into speech and silence regions
  • Restores punctuation and performs inverse text normalization on transcriptions
  • Supports both offline (batch) and online (streaming) recognition modes
  • Provides a runtime server for production deployment with GPU acceleration
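The capabilities listed above can be chained through the Python API. The following is a minimal sketch, not a reference implementation: it assumes the funasr package is installed, and the model aliases (paraformer-zh, fsmn-vad, ct-punc) are assumptions based on the FunASR model zoo that should be checked against the current model list. The audio file name in the usage comment is hypothetical.

```python
from typing import Any, Dict, List, Optional

# Assumed model aliases; verify them against the current FunASR model zoo.
PIPELINE_MODELS: Dict[str, str] = {
    "model": "paraformer-zh",   # speech-to-text
    "vad_model": "fsmn-vad",    # voice activity detection
    "punc_model": "ct-punc",    # punctuation restoration
}


def transcribe(audio_path: str,
               models: Optional[Dict[str, str]] = None) -> List[Dict[str, Any]]:
    """Run VAD, ASR and punctuation restoration on one audio file."""
    # Imported lazily so this module still loads where funasr is not installed.
    from funasr import AutoModel

    pipeline = AutoModel(**(models or PIPELINE_MODELS))  # downloads models on first use
    return pipeline.generate(input=audio_path)


# Usage (hypothetical file name):
#   for item in transcribe("meeting.wav"):
#       print(item["text"])
```

Keeping the model names in one dictionary makes it straightforward to swap in a different recognizer while leaving the VAD and punctuation stages unchanged.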

Architecture Overview

FunASR's core is built on PyTorch and wraps multiple ASR architectures (Paraformer, Conformer, Transformer, Whisper) behind a unified AutoModel interface. The Paraformer model is non-autoregressive: a predictor module estimates the number of output tokens, so the decoder can emit all of them in a single parallel pass instead of one token at a time. The runtime server is a C++ service that loads ONNX-exported models with ONNX Runtime for low-latency inference and accepts streaming audio over WebSocket, with a gRPC interface also provided.
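To make the predictor idea concrete, here is a toy illustration of the continuous integrate-and-fire (CIF) mechanism that the Paraformer paper uses for token-count estimation. It is plain Python, not FunASR code: the per-frame weights are made up, and the function name cif_fire is hypothetical.

```python
from typing import List, Tuple


def cif_fire(alphas: List[float], threshold: float = 1.0) -> Tuple[int, List[int]]:
    """Accumulate per-frame weights and 'fire' one token each time the
    running sum reaches the threshold.

    Returns the estimated token count and the frame indices that fired.
    """
    accumulated = 0.0
    fire_frames: List[int] = []
    for frame_index, alpha in enumerate(alphas):
        accumulated += alpha
        while accumulated >= threshold:
            fire_frames.append(frame_index)
            accumulated -= threshold  # carry the remainder into the next token
    return len(fire_frames), fire_frames


# Made-up weights for seven acoustic frames.
count, frames = cif_fire([0.25, 0.5, 0.5, 0.25, 0.75, 0.25, 0.5])
print(count, frames)  # 3 [2, 4, 6]
```

Because the token count is known before decoding starts, the decoder can be given that many output slots and fill them all at once, which is what makes single-pass decoding possible.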

Self-Hosting & Configuration

  • Install via pip: pip install funasr (Python 3.8+)
  • Models download automatically from ModelScope or Hugging Face on first use
  • Deploy the production server using the Docker image: funasr-runtime-sdk-gpu
  • Configure the server via command-line flags for model paths, ports, and thread count
  • Stream audio to the server over WebSocket for real-time transcription
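For the streaming step above, a client typically slices raw audio into short fixed-duration chunks before sending them. The sketch below shows only that slicing, under stated assumptions: 16 kHz mono 16-bit PCM input and a 600 ms chunk length are illustrative choices, and iter_pcm_chunks is a hypothetical helper. The handshake and message format the server expects are not covered here and should be taken from the runtime documentation.

```python
from typing import Iterator


def iter_pcm_chunks(pcm: bytes,
                    sample_rate: int = 16000,
                    sample_width: int = 2,
                    chunk_ms: int = 600) -> Iterator[bytes]:
    """Yield consecutive slices of raw mono PCM, each covering chunk_ms of audio.

    The last chunk may be shorter than the others.
    """
    bytes_per_chunk = sample_rate * sample_width * chunk_ms // 1000
    if bytes_per_chunk <= 0:
        raise ValueError("chunk_ms is too small for the given audio format")
    for start in range(0, len(pcm), bytes_per_chunk):
        yield pcm[start:start + bytes_per_chunk]


# One second of silence at 16 kHz, 16-bit mono is 32000 bytes.
one_second = bytes(32000)
print([len(chunk) for chunk in iter_pcm_chunks(one_second)])  # [19200, 12800]
```

Each yielded chunk would then be passed to whatever WebSocket client is in use as a binary frame. Smaller chunks lower the delay before partial results arrive, at the cost of more messages.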

Key Features

  • Paraformer achieves fast non-autoregressive decoding with high accuracy on Chinese and English
  • Streaming mode delivers partial results with low latency for live captioning
  • Supports hotword boosting to improve recognition of domain-specific terms
  • Includes speaker diarization to distinguish who is speaking
  • Production C++ runtime with ONNX optimization for enterprise deployment

Comparison with Similar Tools

  • Whisper (OpenAI) — strong multilingual ASR; FunASR offers faster non-autoregressive models and a production server
  • whisper.cpp — C++ Whisper inference; FunASR provides a broader toolkit with VAD, punctuation, and diarization
  • Faster Whisper — CTranslate2-based speedup; FunASR's Paraformer is natively non-autoregressive for even lower latency
  • Vosk — offline speech recognition; FunASR supports both streaming and batch with a wider model zoo
  • DeepSpeech — Mozilla's end-to-end ASR (archived); FunASR is actively maintained with newer architectures

FAQ

Q: Which languages are supported? A: FunASR ships models covering 50+ languages, with particular strength in Chinese (including 7 dialects and 26 accents), English, Japanese, and Korean.

Q: Can I fine-tune models on my own data? A: Yes. FunASR provides training scripts and recipes for fine-tuning supported models on custom datasets.

Q: What is the recommended deployment for production? A: Use the Docker-based runtime server with GPU support. It handles concurrent WebSocket connections and delivers optimized throughput via ONNX Runtime.

Q: How does Paraformer compare to Whisper in speed? A: Paraformer decodes non-autoregressively, producing all output tokens in one parallel pass, while Whisper generates them one at a time. The speed gap therefore grows with transcript length and is most visible on long audio segments.
