Coqui TTS — Deep Learning Text-to-Speech Engine
Generate speech with pretrained models covering 1100+ languages, and clone voices with XTTS v2, which streams with under 200ms latency. 44K+ GitHub stars.
Safe staging for this asset
This asset is staged first. The copied prompt asks the agent to inspect the staged files before enabling scripts, MCP config, or global config.
npx -y tokrepo@latest install a059dce2-6275-4ea0-a57b-e885248d8e95 --target codex
Stages the files first; activation requires reviewing the README and the staged plan.
What it is
Coqui TTS is an open-source deep learning text-to-speech engine that supports over 1100 languages. Its XTTS v2 model enables voice cloning from short audio samples with streaming output under 200ms latency. You can generate speech from text, clone voices, and fine-tune models on custom datasets.
Coqui TTS targets developers building voice interfaces, accessibility tools, content creation pipelines, and any application that needs high-quality synthesized speech without proprietary API costs.
How it saves time or tokens
Coqui TTS runs locally, eliminating per-request API costs from cloud TTS services. Pretrained models are available for over 1100 languages, so many projects need no training at all. Voice cloning requires only a few seconds of reference audio, avoiding expensive studio recording sessions. The streaming API enables real-time voice output for interactive applications.
How to use
- Install Coqui TTS:
pip install TTS
- Generate speech from the command line:
tts --text 'Hello, this is a test.' \
--model_name tts_models/en/ljspeech/tacotron2-DDC \
--out_path output.wav
- Clone a voice with XTTS:
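from TTS.api import TTS

# Load the multilingual XTTS v2 model (downloaded on first use).
tts = TTS('tts_models/multilingual/multi-dataset/xtts_v2')

# speaker_wav is the reference recording of the voice to clone.
tts.tts_to_file(
    text='Hello, this is my cloned voice.',
    speaker_wav='reference_audio.wav',
    language='en',
    file_path='cloned_output.wav'
)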
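To synthesize many lines with the command-line tool, one option is to generate one `tts` invocation per line of text. The sketch below is only an illustration: `build_tts_commands` and the `line_001.wav` naming scheme are hypothetical, and it uses only the three flags shown in the command-line step above.

```python
from pathlib import Path

# Model name taken from the command-line example above.
DEFAULT_MODEL = "tts_models/en/ljspeech/tacotron2-DDC"


def build_tts_commands(lines, out_dir, model_name=DEFAULT_MODEL):
    """Return one argv list per non-empty line of text (hypothetical helper)."""
    out_dir = Path(out_dir)
    # Drop blank lines so output files are numbered without gaps.
    texts = [line.strip() for line in lines if line.strip()]
    return [
        [
            "tts",
            "--text", text,
            "--model_name", model_name,
            "--out_path", str(out_dir / f"line_{index:03d}.wav"),
        ]
        for index, text in enumerate(texts, start=1)
    ]
```

Each returned list can be passed to `subprocess.run(argv, check=True)`. Building argv lists rather than shell strings avoids quoting problems when the text contains apostrophes or other shell metacharacters.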
Example
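Synthesize speech in memory and play it back directly. Note that tts.tts() returns the complete waveform before playback starts, so this is not chunked streaming:
from TTS.api import TTS
import sounddevice as sd
import numpy as np

tts = TTS('tts_models/multilingual/multi-dataset/xtts_v2')

# Generate the full waveform as a sequence of float samples.
wav = tts.tts(
    text='Text to speech played back from memory.',
    speaker_wav='reference.wav',
    language='en'
)

# XTTS v2 produces 24 kHz audio; block until playback finishes.
sd.play(np.array(wav), samplerate=24000)
sd.wait()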
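One simple way to shorten the wait before the first audio is to synthesize sentence by sentence, so playback of the first sentence can begin while later ones are still pending. The sketch below is an approximation of that idea, not the library's streaming API: `split_sentences` and `synthesize_in_chunks` are hypothetical helpers, the regex splitter is deliberately naive, and `synthesize` is any callable you supply.

```python
import re
from typing import Callable, Iterator, List, Sequence


def split_sentences(text: str) -> List[str]:
    """Split after '.', '!' or '?' followed by whitespace (naive splitter)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def synthesize_in_chunks(
    text: str, synthesize: Callable[[str], Sequence[float]]
) -> Iterator[Sequence[float]]:
    """Lazily yield one waveform per sentence, in order."""
    for sentence in split_sentences(text):
        yield synthesize(sentence)


# Possible use with the objects from the example above (not executed here):
#   synth = lambda s: tts.tts(text=s, speaker_wav='reference.wav', language='en')
#   for chunk in synthesize_in_chunks(long_text, synth):
#       sd.play(np.array(chunk), samplerate=24000)
#       sd.wait()
```

Because the generator is lazy, each sentence is synthesized only when the previous one has been consumed. Abbreviations such as "Dr." will be split incorrectly by this regex, so a proper sentence tokenizer may be preferable for real text.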
Related on TokRepo
- Voice tools — text-to-speech and voice AI resources
- AI coding tools — developer tools and libraries
Common pitfalls
- XTTS v2 requires a GPU for reasonable inference speed. CPU inference works but is too slow for real-time applications.
- Voice cloning quality depends on reference audio quality. Use clean, noise-free recordings of at least 6 seconds for best results.
- Model downloads are large (several GB). Plan for storage and bandwidth when deploying to new environments.
Frequently asked questions
Can Coqui TTS run offline?
Yes. All models run locally after download. No internet connection or API key is needed for inference. This makes it suitable for on-premises and privacy-sensitive deployments.
How many languages does Coqui TTS support?
Coqui TTS supports over 1100 languages through its multilingual models. XTTS v2 specifically handles 17 languages with high quality. Other models cover additional languages.
Can I fine-tune models on my own data?
Yes. Coqui TTS provides training scripts for fine-tuning on custom datasets. You need transcribed audio data in the expected format. Fine-tuning XTTS requires a GPU with at least 16GB VRAM.
What license does Coqui TTS use?
Coqui TTS code is released under the Mozilla Public License 2.0. Individual model weights may have their own licenses. Check each model's license before commercial use.
How does voice cloning work in XTTS v2?
XTTS v2 takes a short reference audio clip (3-10 seconds) and extracts speaker characteristics. It then generates new speech in that voice from any text input. No training or fine-tuning is needed for zero-shot cloning.
Cited sources (3)
- Coqui TTS GitHub: Coqui TTS supports 1100+ languages with XTTS v2 voice cloning
- Coqui TTS README: XTTS v2 streams with under 200ms latency
- Coqui TTS License: Mozilla Public License 2.0
Similar assets
Parler-TTS — High-Quality Text-to-Speech Training and Inference Library
Parler-TTS by Hugging Face provides inference and training capabilities for high-quality text-to-speech models with natural prosody and controllable speaker attributes described in plain text.
Zonos — Multilingual TTS with Voice Cloning
Zonos is an open-weight TTS model trained on 200K+ hours of speech. 7.2K+ stars. Voice cloning, 5 languages, emotion control. Apache 2.0.
Tortoise TTS — Multi-Voice Text-to-Speech Focused on Quality
A multi-voice TTS system trained with an emphasis on audio quality. Uses autoregressive and diffusion models to produce natural, expressive speech from text.
Deepgram Aura TTS — Text-to-Speech for Voice Agents
Deepgram Aura TTS produces natural English TTS with 250ms TTFA. Streaming WebSocket, 12 voices, tuned for conversational agents not narration.