# Coqui TTS Model Zoo & Features

## Model Architectures
| Model | Type | Quality | Speed | Voice Clone |
|---|---|---|---|---|
| XTTS v2 | End-to-end | ★★★★★ | Fast (GPU) | ✅ 6s reference |
| VITS | End-to-end | ★★★★ | Very fast | ❌ |
| YourTTS | Multi-speaker | ★★★★ | Fast | ✅ |
| Bark | Generative | ★★★★ | Slow | ❌ (but expressive) |
| Tortoise | Diffusion | ★★★★★ | Very slow | ✅ |
| Tacotron 2 | Spectrogram | ★★★ | Medium | ❌ |
| Glow-TTS | Flow-based | ★★★ | Fast | ❌ |
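The trade-offs in the table can be captured in a small, hypothetical selection helper. The model names and their properties come from the table above; the function itself is illustrative and not part of the Coqui TTS API:

```python
# Hypothetical helper: pick a model family from the table's trade-offs.
# The decision rules mirror the table; this is not a Coqui TTS API.

def pick_model(need_cloning: bool, need_speed: bool) -> str:
    """Return a model name from the table matching the constraints."""
    if need_cloning and need_speed:
        return "XTTS v2"   # fast on GPU, clones from a 6 s reference
    if need_cloning:
        return "Tortoise"  # highest-quality cloning, but very slow
    if need_speed:
        return "VITS"      # very fast end-to-end synthesis
    return "Glow-TTS"      # lightweight flow-based option

print(pick_model(need_cloning=True, need_speed=True))  # XTTS v2
```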
## XTTS v2 — Flagship Model
The recommended model for most use cases:
```python
from TTS.api import TTS

# Load the multilingual XTTS v2 model onto the GPU
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to("cuda")

# 16 supported languages
languages = ["en", "es", "fr", "de", "it", "pt", "pl", "tr",
             "ru", "nl", "cs", "ar", "zh-cn", "ja", "hu", "ko"]

# Voice cloning from a 6-second reference
tts.tts_to_file(
    text="This is my cloned voice speaking.",
    speaker_wav="reference.wav",  # just 6 seconds needed
    language="en",
    file_path="cloned_output.wav",
)
```

Features:
- 16 languages with natural prosody
- Voice cloning from just 6 seconds of reference audio
- Streaming with under 200ms latency
- Emotion preservation from reference audio
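Since cloning works from roughly 6 seconds of reference audio, it can be worth sanity-checking the reference file before synthesis. This is an illustrative helper built on Python's standard `wave` module, not part of the Coqui API:

```python
import wave

MIN_REFERENCE_SECONDS = 6.0  # XTTS v2 clones from ~6 s of reference audio

def reference_is_long_enough(path: str) -> bool:
    """Check that a WAV reference file meets the minimum duration."""
    with wave.open(path, "rb") as wav:
        duration = wav.getnframes() / wav.getframerate()
    return duration >= MIN_REFERENCE_SECONDS
```

A check like this catches too-short clips early, before spending GPU time on a cloning run that will produce a poor voice match.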
## Streaming TTS
```python
from TTS.api import TTS
import sounddevice as sd
import numpy as np

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to("cuda")

# Stream audio chunks in real time
chunks = tts.tts_stream(
    text="This streams in real-time with very low latency.",
    speaker_wav="reference.wav",
    language="en",
)

for chunk in chunks:
    sd.play(np.array(chunk), samplerate=24000)
    sd.wait()
```
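Streamed chunks arrive as float arrays; devices or file formats that expect 16-bit PCM need a one-line conversion at the 24 kHz output rate. A minimal NumPy sketch (the conversion is generic audio handling, not Coqui-specific):

```python
import numpy as np

XTTS_SAMPLE_RATE = 24000  # XTTS v2 streams audio at 24 kHz

def chunk_to_int16(chunk) -> np.ndarray:
    """Convert a float audio chunk in [-1, 1] to 16-bit PCM samples."""
    samples = np.clip(np.asarray(chunk, dtype=np.float32), -1.0, 1.0)
    return (samples * 32767.0).astype(np.int16)
```

Clipping before scaling guards against occasional out-of-range samples that would otherwise wrap around and produce audible pops.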
## Fine-Tuning

Train on your own voice data:
```python
from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

tts.fine_tune(
    dataset_path="my_voice_dataset/",
    output_path="my_finetuned_model/",
    num_epochs=10,
    batch_size=4,
)
```
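Coqui's training recipes commonly consume an LJSpeech-style dataset: a `metadata.csv` of pipe-separated `audio_id|transcript` lines next to a `wavs/` directory. Assuming that layout, a quick validator can flag problems before a long training run; the function itself is illustrative, not part of the library:

```python
import os

def check_ljspeech_layout(dataset_path: str) -> list[str]:
    """Return a list of problems with an LJSpeech-style dataset layout.

    Expected layout (a common Coqui convention, assumed here):
        dataset_path/
            metadata.csv   # lines like: audio_id|transcript
            wavs/          # audio_id.wav files
    """
    problems = []
    meta = os.path.join(dataset_path, "metadata.csv")
    wavs = os.path.join(dataset_path, "wavs")
    if not os.path.isfile(meta):
        problems.append("missing metadata.csv")
    else:
        with open(meta, encoding="utf-8") as f:
            for n, line in enumerate(f, 1):
                if line.strip() and "|" not in line:
                    problems.append(f"metadata.csv line {n}: no '|' separator")
    if not os.path.isdir(wavs):
        problems.append("missing wavs/ directory")
    return problems
```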
## TTS Server

Run as a REST API:
```bash
# Start the server on port 5002
tts-server --model_name tts_models/multilingual/multi-dataset/xtts_v2 --port 5002
```

```bash
# POST text, get audio back
curl -X POST http://localhost:5002/api/tts \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello world", "language": "en"}' \
  --output speech.wav
```
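The same request can be issued from Python with the standard library. This sketch mirrors the curl call above (the `/api/tts` JSON endpoint is taken from that example); the request builder is split out so it can be inspected without a running server:

```python
import json
import urllib.request

def build_tts_request(text: str, language: str = "en",
                      url: str = "http://localhost:5002/api/tts"):
    """Build the POST request mirroring the curl example above."""
    body = json.dumps({"text": text, "language": language}).encode("utf-8")
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )

def synthesize_to_file(text: str, path: str, language: str = "en") -> None:
    """POST text to a running tts-server and save the returned audio."""
    req = build_tts_request(text, language)
    with urllib.request.urlopen(req) as resp, open(path, "wb") as out:
        out.write(resp.read())
```

With the server from the snippet above running, `synthesize_to_file("Hello world", "speech.wav")` saves the synthesized audio to disk.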
## FAQ

**Q: What is Coqui TTS?**
A: Coqui TTS is the most popular open-source text-to-speech library, with 44,900+ GitHub stars. It supports 1,100+ languages, voice cloning, and multiple architectures (XTTS v2, VITS, Bark, Tortoise) through a unified Python API.

**Q: Is Coqui TTS still maintained after the company shut down?**
A: The company closed in 2023, but the open-source library continues to be widely used and community-maintained. XTTS v2 remains one of the best open-source TTS models available.

**Q: Is Coqui TTS free?**
A: Yes. It is open source under MPL-2.0 (Mozilla Public License) and free for both commercial and non-commercial use.