# CosyVoice — Multilingual Voice Generation with LLM-Based TTS

> CosyVoice is an open-source text-to-speech system built on large language models by Alibaba's FunAudioLLM team. It supports 9 languages and 18+ Chinese dialects with zero-shot voice cloning, streaming synthesis, and fine-grained prosody control.

## Quick Use
```bash
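# --recursive also fetches the git submodules the repo depends on
git clone --recursive https://github.com/FunAudioLLM/CosyVoice.git
cd CosyVoice
pip install -r requirements.txt
# Launch the Gradio web UI (download pretrained weights first)
python webui.py --port 8080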
```

## Introduction
CosyVoice is a large-scale text-to-speech model that uses an LLM backbone to generate natural, expressive speech. It handles multilingual synthesis, voice cloning from a short reference clip, and controllable speaking styles without per-speaker fine-tuning.

## What CosyVoice Does
- Generates speech in 9 languages: Chinese, English, Japanese, Korean, German, Spanish, French, Italian, Russian
- Performs zero-shot voice cloning from a few seconds of reference audio
- Supports streaming TTS for real-time applications
- Provides instruction-following synthesis for emotion and style control
- Enables cross-lingual voice cloning (clone a voice and speak in a different language)

## Architecture Overview
CosyVoice uses a two-stage pipeline. The first stage is an autoregressive LLM that converts text tokens and speaker embeddings into semantic speech tokens. The second stage is a flow-matching-based acoustic model that transforms semantic tokens into a mel spectrogram, which a HiFi-GAN vocoder renders into a waveform. Speaker identity is captured by a reference encoder that extracts a fixed-dimensional embedding from a short audio prompt.
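
The data flow described above can be sketched with toy stand-ins. This is not CosyVoice code: every function name, the embedding size, the vocabulary size, the number of mel bins, and the hop length are illustrative assumptions, and each stage is replaced by trivial arithmetic so that only the shapes and the order of the stages are shown.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SpeakerEmbedding:
    vector: List[float]  # same length no matter how long the prompt is


def reference_encoder(prompt_audio: List[float], dim: int = 4) -> SpeakerEmbedding:
    """Toy stand-in: fold a variable-length prompt into `dim` averages."""
    sums = [0.0] * dim
    for i, sample in enumerate(prompt_audio):
        sums[i % dim] += sample
    n = max(len(prompt_audio), 1)
    return SpeakerEmbedding([s / n for s in sums])


def llm_stage(text: str, speaker: SpeakerEmbedding, vocab: int = 4096) -> List[int]:
    """Toy stand-in for stage 1: each token depends on the previous one."""
    bias = int(sum(speaker.vector) * 1000) % vocab
    tokens: List[int] = []
    for ch in text:
        prev = tokens[-1] if tokens else bias
        tokens.append((ord(ch) * 31 + prev) % vocab)
    return tokens


def flow_matching_stage(tokens: List[int], speaker: SpeakerEmbedding,
                        n_mels: int = 80, frames_per_token: int = 2) -> List[List[float]]:
    """Toy stand-in for stage 2: expand each token into mel frames."""
    offset = speaker.vector[0]
    return [[float(t % 7) + offset] * n_mels
            for t in tokens for _ in range(frames_per_token)]


def vocoder(mel: List[List[float]], hop_length: int = 256) -> List[float]:
    """Toy stand-in for the vocoder: `hop_length` samples per mel frame."""
    return [frame[0] / 10.0 for frame in mel for _ in range(hop_length)]


def synthesize(text: str, prompt_audio: List[float]) -> List[float]:
    speaker = reference_encoder(prompt_audio)   # short prompt -> fixed-size embedding
    tokens = llm_stage(text, speaker)           # text + speaker -> semantic tokens
    mel = flow_matching_stage(tokens, speaker)  # semantic tokens -> mel spectrogram
    return vocoder(mel)                         # mel spectrogram -> waveform
```

In this sketch the output length is fixed by the token count, frames per token, and hop length. In a real system the LLM stage decides how many semantic tokens to emit.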

## Self-Hosting & Configuration
- Clone the repo and install dependencies (Python 3.10+, PyTorch 2.0+)
- Download pretrained model weights via the provided script or from ModelScope/Hugging Face
- Launch the Gradio web UI with `webui.py` for interactive testing
- Configure GPU memory, batch size, and streaming chunk size in the config YAML
- Deploy as an API server using the included FastAPI wrapper for production use
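
Before installing, it can help to confirm that the interpreter and PyTorch meet the versions listed above. The following is a minimal sketch using only the standard library. The function name `check_environment` is hypothetical, and the check looks only at major version numbers of installed distributions.

```python
import sys
from importlib import metadata
from typing import Dict, List, Tuple


def check_environment(min_python: Tuple[int, int] = (3, 10),
                      min_major: Dict[str, int] = {"torch": 2}) -> List[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems: List[str] = []
    if sys.version_info[:2] < min_python:
        found = ".".join(map(str, sys.version_info[:2]))
        wanted = ".".join(map(str, min_python))
        problems.append(f"Python {found} found, {wanted}+ required")
    for dist, major in min_major.items():
        try:
            version = metadata.version(dist)  # e.g. "2.1.0+cu118"
        except metadata.PackageNotFoundError:
            problems.append(f"{dist} is not installed")
            continue
        if int(version.split(".")[0]) < major:
            problems.append(f"{dist} {version} found, {major}.0+ required")
    return problems


if __name__ == "__main__":
    issues = check_environment()
    print("environment OK" if not issues else "\n".join(issues))
```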

## Key Features
- LLM-based architecture produces more natural prosody than traditional TTS pipelines
- Zero-shot cloning requires only 3-10 seconds of reference audio
- Streaming mode enables sub-200 ms first-chunk latency for real-time applications
- Supports fine-tuning on custom data for domain adaptation
- Covers 18+ Chinese regional dialects and accents
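
First-chunk latency is the time between requesting synthesis and receiving the first piece of audio. One way to measure it on your own hardware is sketched below. The helper works with any Python iterator of audio chunks. `fake_streaming_tts` is a hypothetical stand-in that sleeps instead of running a model, so replace it with whatever streaming generator your deployment exposes.

```python
import time
from typing import Iterable, Iterator, Tuple, TypeVar

T = TypeVar("T")


def timed_first_chunk(chunks: Iterable[T]) -> Tuple[float, T, Iterator[T]]:
    """Return (seconds until first chunk, first chunk, remaining chunks)."""
    start = time.perf_counter()
    it = iter(chunks)
    first = next(it)  # blocks until the producer yields its first chunk
    latency = time.perf_counter() - start
    return latency, first, it


def fake_streaming_tts(n_chunks: int = 3, delay_s: float = 0.05) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming synthesizer."""
    for i in range(n_chunks):
        time.sleep(delay_s)    # pretend to compute the next chunk
        yield bytes([i]) * 320  # placeholder audio payload


if __name__ == "__main__":
    latency, first, rest = timed_first_chunk(fake_streaming_tts())
    print(f"first chunk after {latency * 1000:.0f} ms, {len(first)} bytes")
    for chunk in rest:
        pass  # play or forward the remaining chunks here
```

Measuring on the consumer side like this includes model warm-up and any network hop, which is usually the number that matters for an interactive application.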

## Comparison with Similar Tools
- **Bark** — generates speech, music, and sound effects; CosyVoice focuses on high-fidelity multilingual speech
- **F5-TTS** — flow-matching TTS with zero-shot cloning; CosyVoice adds an LLM stage for better prosody
- **Kokoro** — lightweight 82M-parameter TTS; CosyVoice trades model size for richer multilingual and style control
- **Fish Speech** — multilingual TTS with VITS architecture; CosyVoice uses an LLM backbone for longer context
- **GPT-SoVITS** — few-shot voice cloning focused on Chinese; CosyVoice supports 9 languages natively

## FAQ
**Q: How much reference audio is needed for voice cloning?**
A: As little as 3 seconds, though 5-10 seconds of clean speech produces better results.
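
A quick way to check whether a reference clip falls in that range is to read its duration with Python's standard `wave` module. This is a minimal sketch: the function names are hypothetical, the thresholds simply mirror the answer above, and `wave` only reads uncompressed PCM WAV files.

```python
import wave
from typing import Tuple


def wav_duration_seconds(path: str) -> float:
    """Duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def check_reference_clip(path: str, min_s: float = 3.0,
                         best: Tuple[float, float] = (5.0, 10.0)) -> str:
    """Classify a clip as 'too short', 'recommended', or 'usable'."""
    duration = wav_duration_seconds(path)
    if duration < min_s:
        return "too short"
    if best[0] <= duration <= best[1]:
        return "recommended"
    return "usable"
```

Duration says nothing about quality, so a clip that passes this check should still be listened to for noise, music, or overlapping speakers.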

**Q: Can CosyVoice run in real-time?**
A: Yes. Streaming mode delivers audio chunks with low latency, suitable for voice assistants and live applications.

**Q: What hardware is required?**
A: A single GPU with 8 GB VRAM is sufficient for inference. Training and fine-tuning require more resources.
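
To see how much memory your GPU reports, something like the following can be used. It is a sketch that assumes an NVIDIA GPU visible to PyTorch. The function names and the half-gigabyte tolerance are assumptions, added because cards sold as "8 GB" often report slightly less than 8 GiB.

```python
from typing import Optional


def total_vram_gb(device_index: int = 0) -> Optional[float]:
    """Total memory of a CUDA device in GiB, or None if unavailable."""
    try:
        import torch  # imported lazily so the check works without PyTorch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    props = torch.cuda.get_device_properties(device_index)
    return props.total_memory / 1024 ** 3


def meets_vram(vram_gb: Optional[float], needed_gb: float = 8.0,
               tolerance_gb: float = 0.5) -> bool:
    """True if the reported memory is within tolerance of the requirement."""
    return vram_gb is not None and vram_gb >= needed_gb - tolerance_gb


if __name__ == "__main__":
    vram = total_vram_gb()
    if vram is None:
        print("no CUDA device detected")
    else:
        print(f"{vram:.1f} GiB total, sufficient: {meets_vram(vram)}")
```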

**Q: Is commercial use allowed?**
A: CosyVoice is released under the Apache 2.0 license, permitting commercial use.

## Sources
- https://github.com/FunAudioLLM/CosyVoice
- https://fun-audio-llm.github.io/cosyvoice/

---
Source: https://tokrepo.com/en/workflows/asset-7141df5f
Author: AI Open Source