Configs · May 19, 2026 · 3 min read

CosyVoice — Multilingual Voice Generation with LLM-Based TTS

CosyVoice is an open-source text-to-speech system built on large language models by Alibaba's FunAudioLLM team. It supports 9 languages and 18+ Chinese dialects with zero-shot voice cloning, streaming synthesis, and fine-grained prosody control.

Universal CLI command
npx tokrepo install 7141df5f-537e-11f1-9bc6-00163e2b0d79

Introduction

CosyVoice is a large-scale text-to-speech model that uses an LLM backbone to generate natural, expressive speech. It handles multilingual synthesis, voice cloning from a short reference clip, and controllable speaking styles without per-speaker fine-tuning.

What CosyVoice Does

  • Generates speech in 9 languages: Chinese, English, Japanese, Korean, German, Spanish, French, Italian, Russian
  • Performs zero-shot voice cloning from a few seconds of reference audio
  • Supports streaming TTS for real-time applications
  • Provides instruction-following synthesis for emotion and style control
  • Enables cross-lingual voice cloning (clone a voice and speak in a different language)

Architecture Overview

CosyVoice uses a two-stage modeling pipeline followed by a vocoder. The first stage is an autoregressive LLM that converts text tokens and a speaker embedding into semantic speech tokens. The second stage is a flow-matching acoustic model that transforms those semantic tokens into a mel spectrogram. A HiFi-GAN vocoder then renders the mel spectrogram into a waveform. Speaker identity comes from a reference encoder that extracts a fixed-dimensional embedding from a short audio prompt, so the embedding size does not depend on the prompt length.
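
To make the data flow concrete, here is a minimal Python sketch of the stages described above. Every function is a hypothetical stand-in that returns placeholder values with plausible shapes. None of the names, the 80 mel bins, the 256-sample hop, or the two-speech-tokens-per-text-token ratio come from CosyVoice itself; they are assumptions chosen only so the example runs.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SpeakerEmbedding:
    """Fixed-size speaker vector (hypothetical type, not a CosyVoice class)."""
    vector: List[float]


def encode_reference(prompt_samples: List[float], dim: int = 4) -> SpeakerEmbedding:
    """Stand-in reference encoder: pools a prompt of any length into `dim` values."""
    if not prompt_samples:
        raise ValueError("reference prompt is empty")
    chunk = max(1, len(prompt_samples) // dim)
    pooled = []
    for i in range(dim):
        part = prompt_samples[i * chunk:(i + 1) * chunk] or [0.0]
        pooled.append(sum(part) / len(part))
    return SpeakerEmbedding(pooled)


def llm_stage(text_tokens: List[int], speaker: SpeakerEmbedding) -> List[int]:
    """Stage 1 stand-in: emits semantic tokens one by one, each depending on the last."""
    prev = int(abs(sum(speaker.vector)) * 1000) % 1024  # speaker conditions the start state
    semantic = []
    for tok in text_tokens:
        for _ in range(2):  # assumed ratio: two speech tokens per text token
            prev = (prev * 31 + tok) % 1024
            semantic.append(prev)
    return semantic


def acoustic_stage(semantic_tokens: List[int], speaker: SpeakerEmbedding,
                   n_mels: int = 80) -> List[List[float]]:
    """Stage 2 stand-in: one mel frame per semantic token (no real flow matching here)."""
    bias = sum(speaker.vector) / len(speaker.vector)
    return [[tok / 1024 + bias] * n_mels for tok in semantic_tokens]


def vocoder(mel: List[List[float]], hop: int = 256) -> List[float]:
    """Vocoder stand-in: expands each mel frame into `hop` waveform samples."""
    return [frame[0] for frame in mel for _ in range(hop)]


def synthesize(text_tokens: List[int], prompt_samples: List[float]) -> List[float]:
    """Chains the stages: prompt -> embedding, text -> tokens -> mel -> waveform."""
    speaker = encode_reference(prompt_samples)
    semantic = llm_stage(text_tokens, speaker)
    mel = acoustic_stage(semantic, speaker)
    return vocoder(mel)
```

Calling `synthesize([5, 9, 12], [0.1, -0.2, 0.3, 0.05, 0.0, 0.2])` walks through all four steps. The point of the sketch is the ordering and the interfaces between stages, not the arithmetic inside each stub.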

Self-Hosting & Configuration

  • Clone the repo and install dependencies (Python 3.10+, PyTorch 2.0+)
  • Download pretrained model weights via the provided script or from ModelScope/Hugging Face
  • Launch the Gradio web UI with webui.py for interactive testing
  • Configure GPU memory, batch size, and streaming chunk size in the config YAML
  • Deploy as an API server using the included FastAPI wrapper for production use
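
Before cloning the repo, it can help to confirm that the environment meets the Python 3.10+ and PyTorch 2.0+ minimums listed above. The following is a small sketch using only the standard library. The helper names are illustrative and not part of CosyVoice; `torch` is assumed to be the installed distribution name for PyTorch.

```python
import sys
from importlib import metadata
from typing import Dict, Optional, Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '2.3.1+cu121' into (2, 3, 1); stops at the first non-numeric part."""
    parts = []
    for piece in text.split("+")[0].split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def meets_minimum(installed: str, minimum: Tuple[int, ...]) -> bool:
    """Compare versions after padding the installed one with zeros ('2' -> (2, 0))."""
    parsed = parse_version(installed)
    parsed = parsed + (0,) * (len(minimum) - len(parsed))
    return parsed >= minimum


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_environment() -> Dict[str, bool]:
    """Report whether each minimum from the setup list is satisfied."""
    torch_version = installed_version("torch")
    return {
        "python>=3.10": sys.version_info >= (3, 10),
        "torch>=2.0": torch_version is not None
        and meets_minimum(torch_version, (2, 0)),
    }
```

Printing the result of `check_environment()` gives a quick pass/fail per requirement. It does not check CUDA availability or GPU memory, which still need to be verified separately.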

Key Features

  • LLM-based architecture produces more natural prosody than traditional TTS pipelines
  • Zero-shot cloning requires only 3-10 seconds of reference audio
  • Streaming mode enables sub-200ms first-chunk latency for real-time applications
  • Supports fine-tuning on custom data for domain adaptation
  • Covers 18+ Chinese regional dialects and accents
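
The first-chunk latency figure is worth measuring on your own hardware. The sketch below times how long a chunk stream takes to yield its first item. It works with any Python iterable of audio chunks; `fake_stream` is a hypothetical stand-in for a real streaming synthesis call, and the 200 ms budget is simply the figure quoted in the feature list.

```python
import time
from typing import Callable, Iterable, Iterator, List, Tuple

FIRST_CHUNK_BUDGET_S = 0.2  # budget taken from the sub-200ms claim; adjust as needed


def measure_first_chunk(
    chunks: Iterable[bytes],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, List[bytes]]:
    """Consume a chunk stream; return (seconds until the first chunk, all chunks)."""
    start = clock()
    first_latency = None
    received: List[bytes] = []
    for chunk in chunks:
        if first_latency is None:
            first_latency = clock() - start  # only the first arrival is timed
        received.append(chunk)
    if first_latency is None:
        raise ValueError("stream produced no chunks")
    return first_latency, received


def fake_stream(n_chunks: int = 3, delay_s: float = 0.01) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming TTS generator."""
    for _ in range(n_chunks):
        time.sleep(delay_s)      # simulates per-chunk synthesis time
        yield b"\x00" * 320      # placeholder audio bytes


def within_budget(latency_s: float, budget_s: float = FIRST_CHUNK_BUDGET_S) -> bool:
    """True if the measured first-chunk latency fits the budget."""
    return latency_s < budget_s
```

To use it, pass the generator returned by your streaming call in place of `fake_stream()`. Because generators are lazy, the timer starts before the first chunk is computed, which is what a listener would experience.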

Comparison with Similar Tools

  • Bark — generates speech, music, and sound effects; CosyVoice focuses on high-fidelity multilingual speech
  • F5-TTS — flow-matching TTS with zero-shot cloning; CosyVoice adds an LLM stage for better prosody
  • Kokoro — lightweight 82M-parameter TTS; CosyVoice trades model size for richer multilingual and style control
  • Fish Speech — multilingual TTS with VITS architecture; CosyVoice uses an LLM backbone for longer context
  • GPT-SoVITS — few-shot voice cloning focused on Chinese; CosyVoice supports 9 languages natively

FAQ

Q: How much reference audio is needed for voice cloning? A: As little as 3 seconds, though 5-10 seconds of clean speech produces better results.
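
A reference clip can be checked against that 3 to 10 second range before it is used for cloning. This is a minimal sketch using Python's standard `wave` module, so it assumes an uncompressed WAV file. The function names and the returned labels are illustrative, not part of CosyVoice.

```python
import os
import tempfile
import wave

MIN_SECONDS = 3.0   # lower bound quoted for zero-shot cloning
MAX_SECONDS = 10.0  # upper end of the suggested range


def clip_duration_seconds(path: str) -> float:
    """Duration of a WAV file, computed as frame count divided by sample rate."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def check_reference_clip(path: str) -> str:
    """Classify a clip as 'too short', 'ok', or 'longer than needed'."""
    duration = clip_duration_seconds(path)
    if duration < MIN_SECONDS:
        return "too short"
    if duration > MAX_SECONDS:
        return "longer than needed"
    return "ok"


def write_silence(path: str, seconds: float, rate: int = 16000) -> None:
    """Test helper: write a mono 16-bit WAV of silence with the given length."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(seconds * rate))


def demo() -> str:
    """Create a 5-second silent clip in a temp directory and classify it."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "prompt.wav")
        write_silence(path, 5.0)
        return check_reference_clip(path)
```

The check only covers length. Clean speech still matters, so a clip that passes here can produce poor results if it contains noise or music.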

Q: Can CosyVoice run in real time? A: Yes. Streaming mode delivers audio in chunks, with first-chunk latency under 200 ms, which makes it suitable for voice assistants and live applications.

Q: What hardware is required? A: A single GPU with 8 GB VRAM is sufficient for inference. Training and fine-tuning require more resources.

Q: Is commercial use allowed? A: CosyVoice is released under the Apache 2.0 license, permitting commercial use.
