CosyVoice — Multilingual Voice Generation with LLM-Based TTS

Introduction

CosyVoice is a large-scale text-to-speech model that uses an LLM backbone to generate natural, expressive speech. It handles multilingual synthesis, voice cloning from a short reference clip, and controllable speaking styles without per-speaker fine-tuning.

What CosyVoice Does

Generates speech in 9 languages: Chinese, English, Japanese, Korean, German, Spanish, French, Italian, Russian
Performs zero-shot voice cloning from a few seconds of reference audio
Supports streaming TTS for real-time applications
Provides instruction-following synthesis for emotion and style control
Enables cross-lingual voice cloning (clone a voice and speak in a different language)

Architecture Overview

CosyVoice uses a two-stage pipeline. The first stage is an autoregressive LLM that converts text tokens and speaker embeddings into semantic speech tokens. The second stage is a flow-matching-based acoustic model that transforms semantic tokens into a mel spectrogram, which a HiFi-GAN vocoder renders into a waveform. Speaker identity is captured by a reference encoder that extracts a fixed-dimensional embedding from a short audio prompt.
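The data flow described above can be traced with a toy sketch. None of this is CosyVoice code: every function is a hypothetical stand-in, and the sizes (a 4-value embedding, a 64-token vocabulary, 80 mel bins, 256 samples per frame) are illustrative assumptions chosen only to make the shapes visible.

```python
from dataclasses import dataclass


@dataclass
class SpeakerEmbedding:
    values: list[float]  # fixed-size vector, regardless of prompt length


def reference_encoder(prompt_audio: list[float], dim: int = 4) -> SpeakerEmbedding:
    """Toy stand-in: summarise a prompt of any length into `dim` numbers."""
    mean = sum(prompt_audio) / len(prompt_audio)
    return SpeakerEmbedding([mean] * dim)


def llm_stage(text: str, speaker: SpeakerEmbedding) -> list[int]:
    """Stage 1 stand-in: text plus speaker embedding -> semantic speech tokens.

    The speaker argument is unused in this toy; it is kept in the signature
    only to mirror the conditioning described in the text.
    """
    tokens = []
    for ch in text:  # one token per step, mimicking autoregressive decoding
        tokens.append(ord(ch) % 64)
    return tokens


def acoustic_stage(semantic_tokens: list[int], speaker: SpeakerEmbedding,
                   n_mels: int = 80) -> list[list[float]]:
    """Stage 2 stand-in: one mel frame per semantic token, shifted by speaker."""
    bias = speaker.values[0]
    return [[tok / 64 + bias] * n_mels for tok in semantic_tokens]


def vocoder(mel: list[list[float]], hop: int = 256) -> list[float]:
    """Vocoder stand-in: expand each mel frame into `hop` waveform samples."""
    return [frame[0] for frame in mel for _ in range(hop)]


def synthesize(text: str, prompt_audio: list[float]) -> list[float]:
    speaker = reference_encoder(prompt_audio)
    semantic = llm_stage(text, speaker)
    mel = acoustic_stage(semantic, speaker)
    return vocoder(mel)
```

Calling `synthesize("hi", [0.0, 0.5, 1.0])` walks all four steps in order. The point of the sketch is the interface between stages: discrete tokens leave stage one, continuous mel frames leave stage two, and the speaker embedding is computed once from the prompt and reused.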

Self-Hosting & Configuration

Clone the repo and install dependencies (Python 3.10+, PyTorch 2.0+)
Download pretrained model weights via the provided script or from ModelScope/Hugging Face
Launch the Gradio web UI with webui.py for interactive testing
Configure GPU memory, batch size, and streaming chunk size in the config YAML
Deploy as an API server using the included FastAPI wrapper for production use
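Before working through the steps above, it can help to confirm that the environment meets the stated minimums (Python 3.10+, PyTorch 2.0+). The script below is a minimal sketch of such a check and is not part of the CosyVoice repository; the function names are hypothetical.

```python
import importlib.util
import re
import sys


def version_tuple(version: str) -> tuple[int, int]:
    """Extract (major, minor) from a version string such as '2.1.0+cu118'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def meets_minimum(version: str, minimum: tuple[int, int]) -> bool:
    """Return True when the version is at least the (major, minor) minimum."""
    return version_tuple(version) >= minimum


def preflight() -> list[str]:
    """Collect human-readable problems; an empty list means the checks passed."""
    problems = []
    if sys.version_info[:2] < (3, 10):
        problems.append("Python 3.10 or newer is required")
    if importlib.util.find_spec("torch") is None:
        problems.append("PyTorch is not installed")
    else:
        import torch  # imported lazily so the script also runs without PyTorch
        if not meets_minimum(torch.__version__, (2, 0)):
            problems.append(f"PyTorch 2.0+ is required, found {torch.__version__}")
    return problems


if __name__ == "__main__":
    for problem in preflight():
        print("FAIL:", problem)
```

The check only covers the two version requirements named in the list; GPU memory, model weights, and the YAML settings still need to be verified separately.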

Key Features

LLM-based architecture produces more natural prosody than traditional TTS pipelines
Zero-shot cloning requires only 3-10 seconds of reference audio
Streaming mode enables sub-200ms first-chunk latency for real-time applications
Supports fine-tuning on custom data for domain adaptation
Covers 18+ Chinese regional dialects and accents

Comparison with Similar Tools

Bark — generates speech, music, and sound effects; CosyVoice focuses on high-fidelity multilingual speech
F5-TTS — flow-matching TTS with zero-shot cloning; CosyVoice adds an LLM stage for better prosody
Kokoro — lightweight 82M-parameter TTS; CosyVoice trades model size for richer multilingual and style control
Fish Speech — multilingual TTS with VITS architecture; CosyVoice uses an LLM backbone for longer context
GPT-SoVITS — few-shot voice cloning focused on Chinese; CosyVoice supports 9 languages natively

FAQ

Q: How much reference audio is needed for voice cloning? A: As little as 3 seconds, though 5-10 seconds of clean speech produces better results.
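A reference clip can be checked against these lengths before it is passed to the model. The sketch below uses only Python's standard `wave` module and assumes the clip is an uncompressed PCM WAV file. The category labels are made up for illustration, and treating anything over 10 seconds as merely "long" is an assumption, since the guidance above does not say what happens beyond that point.

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def rate_reference_clip(seconds: float) -> str:
    """Classify a clip length against the 3 s minimum and 5-10 s guidance."""
    if seconds < 3.0:
        return "too short"
    if seconds < 5.0:
        return "minimum"
    if seconds <= 10.0:
        return "recommended"
    return "long"  # assumption: the source gives no guidance past 10 s
```

For example, `rate_reference_clip(wav_duration_seconds("speaker.wav"))` labels a clip before any synthesis is attempted; `speaker.wav` is a placeholder file name. Duration alone says nothing about background noise, so a clip that passes still needs to be clean speech.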

Q: Can CosyVoice run in real-time? A: Yes. Streaming mode delivers audio in chunks, with the first chunk arriving in under 200 ms, which makes it suitable for voice assistants and live applications.
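On the consuming side, streaming means playing each chunk as it arrives instead of waiting for the full utterance. The sketch below shows that pattern and how first-chunk latency could be measured. `fake_stream_tts` is a hypothetical stand-in, not the CosyVoice API; a real streaming call would take its place.

```python
import time
from typing import Callable, Iterable, Iterator


def fake_stream_tts(text: str, chunk_chars: int = 8) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming TTS call.

    Yields one block of silent 'audio' per slice of the input text.
    """
    for start in range(0, len(text), chunk_chars):
        piece = text[start:start + chunk_chars]
        # Two zero bytes per character stand in for 16-bit PCM samples.
        yield b"\x00\x00" * len(piece)


def consume_stream(chunks: Iterable[bytes],
                   play: Callable[[bytes], None]) -> tuple[float, int]:
    """Play chunks as they arrive; return (seconds to first chunk, total bytes)."""
    started = time.perf_counter()
    first_chunk_latency = None
    total = 0
    for chunk in chunks:
        if first_chunk_latency is None:
            # Time from the start of consumption to the first audio received.
            first_chunk_latency = time.perf_counter() - started
        play(chunk)
        total += len(chunk)
    if first_chunk_latency is None:
        raise ValueError("stream produced no audio")
    return first_chunk_latency, total
```

A call such as `consume_stream(fake_stream_tts("hello streaming world"), buffer.append)`, with `buffer` a plain list, collects the chunks and reports the latency. With a real model, the measured first-chunk latency is the figure that matters for interactive use, because playback can begin while later chunks are still being generated.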

Q: What hardware is required? A: A single GPU with 8 GB VRAM is sufficient for inference. Training and fine-tuning require more resources.

Q: Is commercial use allowed? A: CosyVoice is released under the Apache 2.0 license, permitting commercial use.

Related Assets

Fish Speech — Multilingual TTS for 80+ Languages

GPT-SoVITS — Few-Shot Voice Cloning and Text-to-Speech

Tortoise TTS — Multi-Voice Text-to-Speech Focused on Quality

SenseVoice — Multilingual Speech Understanding Model