# CosyVoice: Multilingual Voice Generation with LLM-Based TTS

> CosyVoice is an open-source text-to-speech system built on large language models by Alibaba's FunAudioLLM team. It supports 9 languages and 18+ Chinese dialects with zero-shot voice cloning, streaming synthesis, and fine-grained prosody control.

## Quick Use

```bash
# --recursive also fetches the git submodules the repo depends on
git clone --recursive https://github.com/FunAudioLLM/CosyVoice.git
cd CosyVoice
pip install -r requirements.txt
python webui.py --port 8080
```

## Introduction

CosyVoice is a large-scale text-to-speech model that uses an LLM backbone to generate natural, expressive speech. It handles multilingual synthesis, voice cloning from a short reference clip, and controllable speaking styles without per-speaker fine-tuning.

## What CosyVoice Does

- Generates speech in 9 languages: Chinese, English, Japanese, Korean, German, Spanish, French, Italian, Russian
- Performs zero-shot voice cloning from a few seconds of reference audio
- Supports streaming TTS for real-time applications
- Provides instruction-following synthesis for emotion and style control
- Enables cross-lingual voice cloning (clone a voice and speak in a different language)

## Architecture Overview

CosyVoice uses a two-stage pipeline. The first stage is an autoregressive LLM that converts text tokens and speaker embeddings into semantic speech tokens. The second stage is a flow-matching-based acoustic model that transforms semantic tokens into a mel spectrogram, which a HiFi-GAN vocoder renders into a waveform. Speaker identity is captured by a reference encoder that extracts a fixed-dimensional embedding from a short audio prompt.
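
The data flow described in the architecture overview can be sketched as a chain of four functions. This is a toy, shape-level illustration only: every function name, constant, and formula below is a hypothetical stand-in chosen for the example, not CosyVoice's actual API or model code. It shows only which stage consumes which output.

```python
from typing import List

# All names and sizes here are hypothetical stand-ins, not CosyVoice's real API.
EMBED_DIM = 4         # toy speaker-embedding size
N_MELS = 8            # toy number of mel bins per frame
FRAMES_PER_TOKEN = 2  # toy upsampling from semantic tokens to mel frames
HOP = 4               # toy number of waveform samples per mel frame


def reference_encoder(prompt_audio: List[float]) -> List[float]:
    """Map a prompt of any length to a fixed-size embedding (toy: chunk means)."""
    n = len(prompt_audio)
    emb = []
    for i in range(EMBED_DIM):
        chunk = prompt_audio[i * n // EMBED_DIM:(i + 1) * n // EMBED_DIM]
        emb.append(sum(chunk) / len(chunk) if chunk else 0.0)
    return emb


def stage1_llm(text_tokens: List[int], speaker_emb: List[float]) -> List[int]:
    """Stage 1 stand-in: emit semantic tokens one by one, each depending on the last."""
    state = int(sum(abs(x) for x in speaker_emb) * 1000) % 512
    semantic = []
    for tok in text_tokens:
        state = (state * 31 + tok) % 512  # autoregressive dependence on prior output
        semantic.append(state)
    return semantic


def stage2_acoustic(semantic: List[int], speaker_emb: List[float]) -> List[List[float]]:
    """Stage 2 stand-in: expand semantic tokens into mel frames (no real flow matching)."""
    bias = sum(speaker_emb) / len(speaker_emb)
    mel = []
    for tok in semantic:
        for _ in range(FRAMES_PER_TOKEN):
            mel.append([(tok % (b + 2)) / (b + 2) + bias for b in range(N_MELS)])
    return mel


def vocoder(mel: List[List[float]]) -> List[float]:
    """Vocoder stand-in: turn each mel frame into HOP waveform samples."""
    wave = []
    for frame in mel:
        wave.extend([sum(frame) / len(frame)] * HOP)
    return wave


def synthesize(text: str, prompt_audio: List[float]) -> List[float]:
    text_tokens = [ord(c) for c in text]        # toy tokenizer
    emb = reference_encoder(prompt_audio)       # speaker identity
    semantic = stage1_llm(text_tokens, emb)     # text -> semantic tokens
    mel = stage2_acoustic(semantic, emb)        # semantic tokens -> mel spectrogram
    return vocoder(mel)                         # mel spectrogram -> waveform
```

Calling `synthesize("hi", [0.1] * 100)` walks a two-character input through all four stages. The point of the sketch is the ordering and the fact that the speaker embedding has the same size regardless of prompt length.
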
## Self-Hosting & Configuration

- Clone the repo and install dependencies (Python 3.10+, PyTorch 2.0+)
- Download pretrained model weights via the provided script or from ModelScope/Hugging Face
- Launch the Gradio web UI with `webui.py` for interactive testing
- Configure GPU memory, batch size, and streaming chunk size in the config YAML
- Deploy as an API server using the included FastAPI wrapper for production use

## Key Features

- LLM-based architecture produces more natural prosody than traditional TTS pipelines
- Zero-shot cloning requires only 3-10 seconds of reference audio
- Streaming mode enables sub-200ms first-chunk latency for real-time applications
- Supports fine-tuning on custom data for domain adaptation
- Covers 18+ Chinese regional dialects and accents

## Comparison with Similar Tools

- **Bark**: generates speech, music, and sound effects; CosyVoice focuses on high-fidelity multilingual speech
- **F5-TTS**: flow-matching TTS with zero-shot cloning; CosyVoice adds an LLM stage for better prosody
- **Kokoro**: lightweight 82M-parameter TTS; CosyVoice trades model size for richer multilingual and style control
- **Fish Speech**: multilingual TTS with VITS architecture; CosyVoice uses an LLM backbone for longer context
- **GPT-SoVITS**: few-shot voice cloning focused on Chinese; CosyVoice supports 9 languages natively

## FAQ

**Q: How much reference audio is needed for voice cloning?**
A: As little as 3 seconds, though 5-10 seconds of clean speech produces better results.

**Q: Can CosyVoice run in real-time?**
A: Yes. Streaming mode delivers audio chunks with low latency, suitable for voice assistants and live applications.

**Q: What hardware is required?**
A: A single GPU with 8 GB VRAM is sufficient for inference. Training and fine-tuning require more resources.

**Q: Is commercial use allowed?**
A: CosyVoice is released under the Apache 2.0 license, permitting commercial use.
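
The reference-audio guidance in the FAQ (3 seconds minimum, 5-10 seconds preferred) can be checked before a clip is passed to voice cloning. The following is a minimal sketch using only Python's standard `wave` module. The helper names and the category labels are hypothetical, and the `"long"` label for clips over 10 seconds is an assumption, since the guidance above says nothing about longer clips.

```python
import wave

# Thresholds taken from the FAQ above; the labels returned are hypothetical.
MIN_SECONDS = 3.0
PREFERRED_MIN = 5.0
PREFERRED_MAX = 10.0


def clip_duration_seconds(path: str) -> float:
    """Return the duration of an uncompressed WAV file in seconds."""
    with wave.open(path, "rb") as wf:
        return wf.getnframes() / wf.getframerate()


def assess_reference_clip(seconds: float) -> str:
    """Classify a clip length against the 3 s / 5-10 s guidance."""
    if seconds < MIN_SECONDS:
        return "too_short"
    if seconds < PREFERRED_MIN:
        return "usable"
    if seconds <= PREFERRED_MAX:
        return "preferred"
    return "long"  # assumption: no guidance is given above 10 s
```

For example, `assess_reference_clip(clip_duration_seconds("prompt.wav"))` classifies a local file, where `prompt.wav` is a placeholder path. The check covers length only; it says nothing about whether the speech is clean.
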
## Sources

- https://github.com/FunAudioLLM/CosyVoice
- https://fun-audio-llm.github.io/cosyvoice/

---

Source: https://tokrepo.com/en/workflows/asset-7141df5f
Author: AI Open Source