Configs · May 19, 2026 · 1 min read

CosyVoice — Multilingual Voice Generation with LLM-Based TTS

CosyVoice is an open-source, LLM-based text-to-speech system from Alibaba's FunAudioLLM team. It supports 9 languages and 18+ Chinese dialects with zero-shot voice cloning, streaming synthesis, and fine-grained prosody control.

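Agent Ready

This asset can be read and installed directly by an agent.

TokRepo provides a universal CLI command, an install contract, metadata JSON, adapter-specific install plans, and raw content links, so an agent can assess fit, risk, and next steps.

Native · 98/100 · Policy: Allowed
Agent entry: Any MCP/CLI agent
Type: Skill
Install: Single
Trust level: Established
Entry point: CosyVoice Overview
Universal CLI install command:
npx tokrepo install 7141df5f-537e-11f1-9bc6-00163e2b0d79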

Introduction

CosyVoice is a large-scale text-to-speech model that uses an LLM backbone to generate natural, expressive speech. It handles multilingual synthesis, voice cloning from a short reference clip, and controllable speaking styles without per-speaker fine-tuning.

What CosyVoice Does

  • Generates speech in 9 languages: Chinese, English, Japanese, Korean, German, Spanish, French, Italian, Russian
  • Performs zero-shot voice cloning from a few seconds of reference audio
  • Supports streaming TTS for real-time applications
  • Provides instruction-following synthesis for emotion and style control
  • Enables cross-lingual voice cloning (clone a voice and speak in a different language)

Architecture Overview

CosyVoice uses a two-stage pipeline. The first stage is an autoregressive LLM that converts text tokens and speaker embeddings into semantic speech tokens. The second stage is a flow-matching-based acoustic model that transforms semantic tokens into a mel spectrogram, which a HiFi-GAN vocoder renders into a waveform. Speaker identity is captured by a reference encoder that extracts a fixed-dimensional embedding from a short audio prompt.
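
The data flow described above can be sketched as a chain of plain functions. This is a toy illustration only: every function name, the token vocabulary size, the one-frame-per-token mapping, the 80 mel bins, and the 256-sample hop are assumptions made for the example, and none of the stand-ins reflect CosyVoice's actual modules or model code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SpeakerEmbedding:
    # Fixed-dimensional vector, independent of the prompt length.
    vector: List[float]


def reference_encoder(prompt_audio: List[float], dim: int = 4) -> SpeakerEmbedding:
    """Hypothetical stand-in: collapse a variable-length prompt into `dim` values."""
    chunk = max(1, len(prompt_audio) // dim)
    vector = [sum(prompt_audio[i * chunk:(i + 1) * chunk]) / chunk for i in range(dim)]
    return SpeakerEmbedding(vector)


def llm_stage(text: str, speaker: SpeakerEmbedding, tokens_per_char: int = 2) -> List[int]:
    """Stage 1 stand-in: emit semantic tokens one by one, each depending on the previous."""
    tokens: List[int] = []
    for ch in text:
        for _ in range(tokens_per_char):
            # The first token is conditioned on the speaker embedding.
            prev = tokens[-1] if tokens else int(abs(sum(speaker.vector)) * 100)
            tokens.append((ord(ch) + prev) % 4096)
    return tokens


def acoustic_stage(tokens: List[int], n_mels: int = 80) -> List[List[float]]:
    """Stage 2 stand-in: map each semantic token to one mel frame (toy assumption)."""
    return [[(t % 97) / 97.0] * n_mels for t in tokens]


def vocoder(mel: List[List[float]], hop_length: int = 256) -> List[float]:
    """Vocoder stand-in: expand each mel frame into `hop_length` waveform samples."""
    return [frame[0] for frame in mel for _ in range(hop_length)]


def synthesize(text: str, prompt_audio: List[float]) -> List[float]:
    speaker = reference_encoder(prompt_audio)
    tokens = llm_stage(text, speaker)
    mel = acoustic_stage(tokens)
    return vocoder(mel)


if __name__ == "__main__":
    waveform = synthesize("hi", [0.1] * 16000)
    print(len(waveform), "samples")
```

The point of the sketch is the shape of the interfaces: text and a fixed-size speaker embedding go in, discrete tokens come out of the first stage, and only the second stage and vocoder deal with continuous audio features.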

Self-Hosting & Configuration

  • Clone the repo and install dependencies (Python 3.10+, PyTorch 2.0+)
  • Download pretrained model weights via the provided script or from ModelScope/Hugging Face
  • Launch the Gradio web UI with webui.py for interactive testing
  • Configure GPU memory, batch size, and streaming chunk size in the config YAML
  • Deploy as an API server using the included FastAPI wrapper for production use

Key Features

  • LLM-based architecture produces more natural prosody than traditional TTS pipelines
  • Zero-shot cloning requires only 3-10 seconds of reference audio
  • Streaming mode enables sub-200ms first-chunk latency for real-time applications
  • Supports fine-tuning on custom data for domain adaptation
  • Covers 18+ Chinese regional dialects and accents
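
As a small illustration of the 3-10 second reference audio guideline above, the following sketch checks the duration of a WAV prompt before it is handed to a cloning call. It uses only Python's standard `wave` module; the function names, the status strings, and the idea of treating 10 seconds as an upper bound are assumptions for this example, not part of CosyVoice.

```python
import wave

# Bounds taken from the guideline above; treating them as hard limits is an assumption.
MIN_SECONDS = 3.0
MAX_SECONDS = 10.0


def clip_duration_seconds(path: str) -> float:
    """Return the duration of an uncompressed WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def check_reference_clip(path: str) -> str:
    """Classify a reference clip as 'too_short', 'ok', or 'longer_than_guideline'."""
    duration = clip_duration_seconds(path)
    if duration < MIN_SECONDS:
        return "too_short"
    if duration > MAX_SECONDS:
        return "longer_than_guideline"
    return "ok"
```

A check like this only validates length. It says nothing about background noise or clipping, which also affect cloning quality.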

Comparison with Similar Tools

  • Bark — generates speech, music, and sound effects; CosyVoice focuses on high-fidelity multilingual speech
  • F5-TTS — flow-matching TTS with zero-shot cloning; CosyVoice adds an LLM stage for better prosody
  • Kokoro — lightweight 82M-parameter TTS; CosyVoice trades model size for richer multilingual and style control
  • Fish Speech — multilingual TTS with VITS architecture; CosyVoice uses an LLM backbone for longer context
  • GPT-SoVITS — few-shot voice cloning focused on Chinese; CosyVoice supports 9 languages natively

FAQ

Q: How much reference audio is needed for voice cloning?
A: As little as 3 seconds, though 5-10 seconds of clean speech produces better results.

Q: Can CosyVoice run in real time?
A: Yes. Streaming mode delivers audio in chunks, with sub-200ms first-chunk latency, which suits voice assistants and live applications.
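
One way to check whether a streaming setup is fast enough is to time how long the first audio chunk takes to arrive. The sketch below measures this for any iterable of byte chunks. `fake_stream` is a hypothetical stand-in for a real streaming synthesis call, and the chunk size and delay are arbitrary values chosen for the example.

```python
import time
from typing import Iterable, Iterator, Tuple


def measure_first_chunk_latency(chunks: Iterable[bytes]) -> Tuple[float, int]:
    """Consume a chunk stream and return (seconds until first chunk, chunk count)."""
    start = time.perf_counter()
    first_latency = None
    count = 0
    for _chunk in chunks:
        if first_latency is None:
            first_latency = time.perf_counter() - start
        count += 1
    if first_latency is None:
        raise ValueError("stream produced no audio chunks")
    return first_latency, count


def fake_stream(n_chunks: int = 3, delay: float = 0.01) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming TTS generator."""
    for _ in range(n_chunks):
        time.sleep(delay)      # simulated synthesis time per chunk
        yield b"\x00" * 320    # placeholder audio bytes


if __name__ == "__main__":
    latency, total = measure_first_chunk_latency(fake_stream())
    print(f"first chunk after {latency * 1000:.1f} ms, {total} chunks total")
```

In a real deployment the measured value also includes network and queueing time, so it is worth measuring from the client side rather than inside the server.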

Q: What hardware is required?
A: A single GPU with 8 GB VRAM is sufficient for inference. Training and fine-tuning require more resources.

Q: Is commercial use allowed?
A: CosyVoice is released under the Apache 2.0 license, permitting commercial use.
