Scripts · Apr 2, 2026 · 3 min read

Coqui TTS — Deep Learning Text-to-Speech Engine

Generate speech in 1,100+ languages with voice cloning. XTTS v2 streams with under 200ms latency. 44K+ GitHub stars.

Introduction

Coqui TTS is one of the most comprehensive open-source text-to-speech libraries, with 44,900+ GitHub stars and pretrained models covering 1,100+ languages. Its flagship XTTS v2 model delivers production-quality multilingual speech, cloning a voice from just 6 seconds of reference audio and streaming with under 200ms latency. The library implements every major TTS architecture (VITS, Tacotron 2, Glow-TTS, Bark, Tortoise) behind a unified Python API and CLI. While the Coqui company shut down in 2023, the open-source project remains a go-to TTS toolkit for developers worldwide.

Works with: Python, CUDA GPUs, CPU (slower), any application via CLI or Python API. Best for developers adding voice to AI agents, chatbots, accessibility tools, or content creation pipelines. Setup time: under 3 minutes.


Coqui TTS Model Zoo & Features

Model Architectures

| Model      | Type          | Quality | Speed      | Voice Clone         |
|------------|---------------|---------|------------|---------------------|
| XTTS v2    | End-to-end    | ★★★★★   | Fast (GPU) | ✅ 6s reference      |
| VITS       | End-to-end    | ★★★★    | Very fast  | ❌                   |
| YourTTS    | Multi-speaker | ★★★★    | Fast       | ✅ zero-shot         |
| Bark       | Generative    | ★★★★    | Slow       | ❌ (but expressive)  |
| Tortoise   | Diffusion     | ★★★★★   | Very slow  | ✅ multiple clips    |
| Tacotron 2 | Spectrogram   | ★★★     | Medium     | ❌                   |
| Glow-TTS   | Flow-based    | ★★★     | Fast       | ❌                   |
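Every model in the zoo is addressed by a four-part ID of the form type/language/dataset/model, which is the string you pass to the `TTS` constructor. The `parse_model_name` helper below is illustrative only, not part of the library:

```python
def parse_model_name(model_id: str) -> dict:
    """Split a Coqui model ID of the form type/language/dataset/model."""
    model_type, language, dataset, model = model_id.split("/")
    return {
        "type": model_type,    # e.g. tts_models or vocoder_models
        "language": language,  # e.g. multilingual, en
        "dataset": dataset,    # e.g. multi-dataset, ljspeech
        "model": model,        # e.g. xtts_v2, vits
    }

parse_model_name("tts_models/multilingual/multi-dataset/xtts_v2")
# → {'type': 'tts_models', 'language': 'multilingual',
#    'dataset': 'multi-dataset', 'model': 'xtts_v2'}
```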

XTTS v2 — Flagship Model

The recommended model for most use cases:

from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to("cuda")

# 16 supported languages
languages = ["en", "es", "fr", "de", "it", "pt", "pl", "tr",
             "ru", "nl", "cs", "ar", "zh-cn", "ja", "hu", "ko"]

# Voice cloning from 6-second reference
tts.tts_to_file(
    text="This is my cloned voice speaking.",
    speaker_wav="reference.wav",  # Just 6 seconds needed
    language="en",
    file_path="cloned_output.wav"
)

Features:

  • 16 languages with natural prosody
  • Voice cloning from just 6 seconds of reference audio
  • Streaming with under 200ms latency
  • Emotion preservation from reference audio
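Passing a code outside this set to `language=` fails at synthesis time, so a small guard up front gives clearer errors. `validate_language` is a hypothetical helper, not part of the TTS API:

```python
# Supported XTTS v2 language codes (from the model card)
XTTS_V2_LANGUAGES = {
    "en", "es", "fr", "de", "it", "pt", "pl", "tr",
    "ru", "nl", "cs", "ar", "zh-cn", "ja", "hu", "ko",
}

def validate_language(code: str) -> str:
    """Normalize a language code and reject unsupported ones."""
    normalized = code.strip().lower()
    if normalized not in XTTS_V2_LANGUAGES:
        raise ValueError(
            f"XTTS v2 does not support '{code}'; "
            f"choose one of {sorted(XTTS_V2_LANGUAGES)}"
        )
    return normalized

validate_language("EN")  # returns "en"
```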

Streaming TTS

Streaming goes through the lower-level Xtts model API; the high-level TTS wrapper does not expose a streaming call. The checkpoint paths below are placeholders for a local XTTS v2 download:

import sounddevice as sd
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

# Load XTTS v2 from a local checkpoint directory
config = XttsConfig()
config.load_json("xtts_v2/config.json")
model = Xtts.init_from_config(config)
model.load_checkpoint(config, checkpoint_dir="xtts_v2/")
model.cuda()

# Compute voice-cloning conditioning from the reference audio
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(
    audio_path=["reference.wav"]
)

# Stream audio chunks in real time (24 kHz output)
chunks = model.inference_stream(
    "This streams in real-time with very low latency.",
    "en",
    gpt_cond_latent,
    speaker_embedding,
)

for chunk in chunks:
    sd.play(chunk.squeeze().cpu().numpy(), samplerate=24000)
    sd.wait()
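Whether streaming keeps up is simple arithmetic: a chunk of N samples at the model's 24 kHz output rate plays for N / 24000 seconds, so each chunk must be generated faster than the previous one plays. A sketch with made-up chunk sizes:

```python
SAMPLE_RATE = 24_000  # XTTS v2 output sample rate in Hz

def chunk_duration_ms(num_samples: int, sample_rate: int = SAMPLE_RATE) -> float:
    """Playback time of one audio chunk, in milliseconds."""
    return num_samples / sample_rate * 1000.0

# Hypothetical chunk sizes (in samples) from one streaming run
chunk_sizes = [4096, 4096, 2048]
playback_ms = [chunk_duration_ms(n) for n in chunk_sizes]
# A 4096-sample chunk plays for ~170.7 ms, so generating each
# chunk in under that time keeps the stream gap-free.
```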

Fine-Tuning

Train on your own voice data. The Python API does not expose a one-call fine-tuning method; XTTS v2 fine-tuning runs through the training recipes shipped in the repository:

# Install from source to get the training recipes
git clone https://github.com/coqui-ai/TTS
cd TTS && pip install -e .

# Fine-tune the XTTS v2 GPT component; edit the dataset paths inside the recipe
python recipes/ljspeech/xtts_v2/train_gpt_xtts.py
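Coqui's dataset loaders commonly consume an LJSpeech-style layout: a wavs/ directory of clips plus a pipe-delimited metadata.csv with file ID, raw text, and normalized text per row. A sketch that writes such a file (clip names and transcripts here are made up):

```python
import csv
import tempfile
from pathlib import Path

# Made-up rows: (file_id, raw_text, normalized_text)
rows = [
    ("clip_0001", "Hello there!", "hello there"),
    ("clip_0002", "Voice cloning is fun.", "voice cloning is fun"),
]

dataset = Path(tempfile.mkdtemp())
(dataset / "wavs").mkdir()  # one clip_XXXX.wav per metadata row goes here

# LJSpeech-style metadata: file_id|raw_text|normalized_text
with open(dataset / "metadata.csv", "w", newline="") as f:
    csv.writer(f, delimiter="|").writerows(rows)
```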

TTS Server

Run as a REST API:

tts-server --model_name tts_models/multilingual/multi-dataset/xtts_v2 --port 5002
# GET with the text in the query string; the response body is WAV audio
# (multi-speaker models also accept a speaker_id parameter)
curl "http://localhost:5002/api/tts?text=Hello%20world&language_id=en" \
  --output speech.wav
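The endpoint can also be called from Python with the standard library alone. `build_tts_url` is an illustrative helper, and the exact query parameters accepted (text, language_id, speaker_id) can vary by tts-server version:

```python
from urllib.parse import urlencode
from urllib.request import urlopen  # used once the server is running

def build_tts_url(base: str, text: str, language_id: str = "en") -> str:
    """Build a GET URL for the tts-server /api/tts endpoint."""
    query = urlencode({"text": text, "language_id": language_id})
    return f"{base}/api/tts?{query}"

url = build_tts_url("http://localhost:5002", "Hello world")
# urlopen(url).read() then returns the synthesized WAV bytes
```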

FAQ

Q: What is Coqui TTS? A: Coqui TTS is the most popular open-source text-to-speech library with 44,900+ GitHub stars, supporting 1,100+ languages, voice cloning, and multiple architectures (XTTS v2, VITS, Bark, Tortoise) via a unified Python API.

Q: Is Coqui TTS still maintained after the company shut down? A: The company closed in 2023, but the open-source library continues to be widely used and community-maintained. XTTS v2 remains one of the best open-source TTS models available.

Q: Is Coqui TTS free? A: Yes, open-source under MPL-2.0 (Mozilla Public License). Free for commercial and non-commercial use.


Source and acknowledgements

Created by Coqui AI. Licensed under MPL-2.0.

TTS — ⭐ 44,900+

Thanks to the Coqui AI team and community for building the most comprehensive open-source TTS toolkit.
