Configs · May 19, 2026 · 3 min read

CosyVoice — Multilingual Voice Generation with LLM-Based TTS

CosyVoice is an open-source text-to-speech system built on large language models by Alibaba's FunAudioLLM team. It supports 9 languages and 18+ Chinese dialects with zero-shot voice cloning, streaming synthesis, and fine-grained prosody control.

Agent-ready

This asset can be read and installed directly by agents

TokRepo exposes a universal CLI command, an install contract, JSON metadata, an adapter-specific plan, and the raw content to help agents assess fit, risk, and next steps.

Native · 98/100 · Policy: allow
Agent surface: Any MCP/CLI agent
Type: Skill
Installation: Single
Trust: Established
Entry point: CosyVoice Overview
Universal CLI command:
npx tokrepo install 7141df5f-537e-11f1-9bc6-00163e2b0d79

Introduction

CosyVoice is a large-scale text-to-speech model that uses an LLM backbone to generate natural, expressive speech. It handles multilingual synthesis, voice cloning from a short reference clip, and controllable speaking styles without per-speaker fine-tuning.

What CosyVoice Does

  • Generates speech in 9 languages: Chinese, English, Japanese, Korean, German, Spanish, French, Italian, Russian
  • Performs zero-shot voice cloning from a few seconds of reference audio
  • Supports streaming TTS for real-time applications
  • Provides instruction-following synthesis for emotion and style control
  • Enables cross-lingual voice cloning (clone a voice and speak in a different language)

Architecture Overview

CosyVoice generates speech in two modeling stages followed by a vocoder. The first stage is an autoregressive LLM that converts text tokens and a speaker embedding into semantic speech tokens. The second stage is a flow-matching acoustic model that transforms those semantic tokens into a mel spectrogram, which a HiFi-GAN vocoder then renders into a waveform. Speaker identity comes from a reference encoder that extracts a fixed-dimensional embedding from a short audio prompt.
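The data flow described above can be sketched with toy stand-ins. This is only an illustration of how the stages hand data to each other: every function, class, and size below is hypothetical, none of it is CosyVoice's actual code or API, and the "models" are trivial arithmetic rather than neural networks.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PipelineConfig:
    # Illustrative sizes only; these are NOT CosyVoice's real hyperparameters.
    speaker_dim: int = 4        # length of the fixed-size speaker embedding
    frames_per_token: int = 2   # mel frames produced per semantic token
    n_mels: int = 3             # mel bins per frame
    hop_length: int = 4         # waveform samples produced per mel frame


def reference_encoder(prompt_audio: List[float], cfg: PipelineConfig) -> List[float]:
    """Stand-in: collapse a variable-length prompt into a fixed-size embedding."""
    chunk = max(1, len(prompt_audio) // cfg.speaker_dim)
    embedding = []
    for i in range(cfg.speaker_dim):
        segment = prompt_audio[i * chunk:(i + 1) * chunk] or [0.0]
        embedding.append(sum(segment) / len(segment))
    return embedding


def llm_stage(text: str, speaker: List[float], cfg: PipelineConfig) -> List[int]:
    """Stage 1 stand-in: text plus speaker embedding -> semantic token ids."""
    bias = int(sum(abs(x) for x in speaker) * 10)
    return [(ord(ch) + bias) % 256 for ch in text]


def flow_matching_stage(tokens: List[int], cfg: PipelineConfig) -> List[List[float]]:
    """Stage 2 stand-in: semantic tokens -> mel spectrogram frames."""
    mel = []
    for tok in tokens:
        for _ in range(cfg.frames_per_token):
            mel.append([tok / 255.0] * cfg.n_mels)
    return mel


def vocoder(mel: List[List[float]], cfg: PipelineConfig) -> List[float]:
    """Vocoder stand-in: each mel frame becomes hop_length waveform samples."""
    samples = []
    for frame in mel:
        level = sum(frame) / len(frame)
        samples.extend([level] * cfg.hop_length)
    return samples


def synthesize(text: str, prompt_audio: List[float],
               cfg: PipelineConfig = PipelineConfig()) -> List[float]:
    speaker = reference_encoder(prompt_audio, cfg)
    tokens = llm_stage(text, speaker, cfg)
    mel = flow_matching_stage(tokens, cfg)
    return vocoder(mel, cfg)
```

The point of the sketch is the shape of the hand-offs: the prompt length does not affect the embedding size, and the output length depends only on the number of semantic tokens and the two upsampling factors.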

Self-Hosting & Configuration

  • Clone the repo and install dependencies (Python 3.10+, PyTorch 2.0+)
  • Download pretrained model weights via the provided script or from ModelScope/Hugging Face
  • Launch the Gradio web UI with webui.py for interactive testing
  • Configure GPU memory, batch size, and streaming chunk size in the config YAML
  • Deploy as an API server using the included FastAPI wrapper for production use
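The configuration step in the list above can be guarded with a small validation layer before the values reach the model. The sketch below assumes the YAML has already been parsed into a plain dictionary. All field names (`device`, `batch_size`, `stream_chunk_tokens`, `gpu_memory_limit_gb`) and their defaults are hypothetical placeholders, not keys taken from the CosyVoice repository, so check the shipped config file for the real ones.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class InferenceSettings:
    # Hypothetical field names and defaults, for illustration only.
    device: str = "cuda:0"
    batch_size: int = 1
    stream_chunk_tokens: int = 25
    gpu_memory_limit_gb: float = 8.0

    def __post_init__(self) -> None:
        # Reject values that could never work before any model is loaded.
        if self.batch_size < 1:
            raise ValueError("batch_size must be at least 1")
        if self.stream_chunk_tokens < 1:
            raise ValueError("stream_chunk_tokens must be at least 1")
        if self.gpu_memory_limit_gb <= 0:
            raise ValueError("gpu_memory_limit_gb must be positive")


def load_settings(raw: dict) -> InferenceSettings:
    """Build settings from a parsed config dict, rejecting unknown keys."""
    known = {f.name for f in fields(InferenceSettings)}
    unknown = sorted(set(raw) - known)
    if unknown:
        raise KeyError(f"unknown setting(s): {', '.join(unknown)}")
    return InferenceSettings(**raw)
```

Rejecting unknown keys catches typos in the config file early, which is usually cheaper than discovering them after the weights have loaded.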

Key Features

  • LLM-based architecture produces more natural prosody than traditional TTS pipelines
  • Zero-shot cloning requires only 3-10 seconds of reference audio
  • Streaming mode enables sub-200ms first-chunk latency for real-time applications
  • Supports fine-tuning on custom data for domain adaptation
  • Covers 18+ Chinese regional dialects and accents

Comparison with Similar Tools

  • Bark — generates speech, music, and sound effects; CosyVoice focuses on high-fidelity multilingual speech
  • F5-TTS — flow-matching TTS with zero-shot cloning; CosyVoice adds an LLM stage for better prosody
  • Kokoro — lightweight 82M-parameter TTS; CosyVoice trades model size for richer multilingual and style control
  • Fish Speech — multilingual TTS with VITS architecture; CosyVoice uses an LLM backbone for longer context
  • GPT-SoVITS — few-shot voice cloning focused on Chinese; CosyVoice supports 9 languages natively

FAQ

Q: How much reference audio is needed for voice cloning? A: As little as 3 seconds, though 5-10 seconds of clean speech produces better results.
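A reference clip can be checked against this guidance before it is passed to the model. The sketch below uses only Python's standard `wave` module and is not part of CosyVoice. The thresholds mirror the 3, 5, and 10 second figures mentioned in this page, and the helper names and return labels are made up for illustration.

```python
import io
import wave

MIN_SECONDS = 3.0          # below this, the page suggests cloning may not work well
RECOMMENDED_SECONDS = 5.0  # lower edge of the "5-10 seconds" recommendation
MAX_SECONDS = 10.0         # upper edge; assumed here, not a limit enforced by CosyVoice


def clip_duration_seconds(wav_file) -> float:
    """Return the duration of a WAV file given a path or file-like object."""
    with wave.open(wav_file, "rb") as wf:
        return wf.getnframes() / wf.getframerate()


def check_reference_clip(wav_file) -> str:
    """Classify a reference clip by duration only (audio quality is not checked)."""
    duration = clip_duration_seconds(wav_file)
    if duration < MIN_SECONDS:
        return "too_short"
    if duration < RECOMMENDED_SECONDS:
        return "usable"
    if duration <= MAX_SECONDS:
        return "recommended"
    return "longer_than_needed"


def make_silent_wav(seconds: float, sample_rate: int = 16000) -> io.BytesIO:
    """Build an in-memory mono 16-bit WAV of silence, handy for trying the check."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)
        wf.setframerate(sample_rate)
        wf.writeframes(b"\x00\x00" * int(seconds * sample_rate))
    buf.seek(0)
    return buf
```

For example, `check_reference_clip(make_silent_wav(7))` falls in the recommended band. A real check would also want to confirm the clip contains clean speech, which duration alone cannot tell.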

Q: Can CosyVoice run in real-time? A: Yes. Streaming mode delivers audio chunks with low latency, suitable for voice assistants and live applications.
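One way to verify latency claims on your own hardware is to time the arrival of the first audio chunk from a streaming generator. The sketch below is independent of CosyVoice: `fake_streaming_tts` is a hypothetical stand-in that sleeps instead of running a model, and `measure_stream` should work with any iterable that yields audio chunks.

```python
import time
from typing import Iterable, Iterator, Tuple


def fake_streaming_tts(text: str, chunk_chars: int = 8,
                       delay_s: float = 0.01) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming synthesizer; yields silent PCM chunks."""
    for start in range(0, len(text), chunk_chars):
        time.sleep(delay_s)  # pretend the model is computing this chunk
        piece = text[start:start + chunk_chars]
        yield b"\x00\x00" * (len(piece) * 160)


def measure_stream(chunks: Iterable[bytes]) -> Tuple[float, float, int]:
    """Return (first_chunk_latency_s, total_time_s, chunk_count) for a stream."""
    start = time.perf_counter()
    first_latency = None
    count = 0
    for _ in chunks:
        if first_latency is None:
            # Time from starting to consume the stream until the first chunk arrives.
            first_latency = time.perf_counter() - start
        count += 1
    total = time.perf_counter() - start
    if first_latency is None:
        raise ValueError("stream produced no chunks")
    return first_latency, total, count


if __name__ == "__main__":
    first, total, count = measure_stream(fake_streaming_tts("hello streaming world"))
    print(f"first chunk after {first * 1000:.1f} ms, {count} chunks in {total * 1000:.1f} ms")
```

Because generators are lazy, the timer starts before any synthesis work happens, so the first-chunk figure includes the model's startup cost for that request.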

Q: What hardware is required? A: A single GPU with 8 GB VRAM is sufficient for inference. Training and fine-tuning require more resources.

Q: Is commercial use allowed? A: CosyVoice is released under the Apache 2.0 license, permitting commercial use.
