ChatTTS — Expressive Text-to-Speech for Dialogue
Generate natural conversational speech with laughter, pauses, and emotion. Optimized for dialogue scenarios. 39K+ GitHub stars.
Safe staging for this asset
This asset is staged first. The copied prompt tells the agent to inspect the staged files and ask before activating scripts, MCP config, or global config.
npx -y tokrepo@latest install 101b6e58-6a37-48b9-a74e-639d32a0ee65 --target codex
Stages files first; activation requires review of the staged README and plan.
What it is
ChatTTS is an open-source text-to-speech model designed specifically for dialogue scenarios. Unlike standard TTS systems that produce flat, robotic speech, ChatTTS generates expressive audio with laughter, pauses, interjections, and emotional variation. It supports both English and Chinese, and can produce speech that sounds like a natural conversation.
It targets developers building chatbots, voice assistants, podcast generators, and any application where AI-generated speech needs to sound human and conversational.
How it saves time or tokens
ChatTTS can replace paid commercial TTS APIs for conversational use cases. The model runs locally, so there are no per-character costs or API rate limits. On a suitable GPU it generates audio from text in seconds, and the expressive controls (laughter, pauses) are embedded as text tokens rather than requiring a separate audio processing pipeline.
How to use
- Install and set up:
pip install ChatTTS
- Generate speech:
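import ChatTTS
import torch
import torchaudio
chat = ChatTTS.Chat()
chat.load(compile=False)  # Downloads model on first run
texts = ['Hey, have you tried this new AI tool? It is amazing.']
wavs = chat.infer(texts)  # Returns one NumPy array per input text
# torchaudio.save needs a (channels, samples) tensor, so convert and add a channel axis.
# If your ChatTTS version already returns 2-D arrays, drop .unsqueeze(0).
torchaudio.save('output.wav', torch.from_numpy(wavs[0]).unsqueeze(0), 24000)
- Add expressiveness with control tokens:
# Use special tokens for laughter, pauses, etc.
texts = ['So I tried to deploy it and [laugh] it actually worked on the first try.']
wavs = chat.infer(texts)
torchaudio.save('expressive.wav', torch.from_numpy(wavs[0]).unsqueeze(0), 24000)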
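When dialogue lines are assembled programmatically, the control tokens can be inserted by a small helper instead of being typed by hand. The sketch below is illustrative only: `token`, `build_line`, and `ALLOWED_TOKENS` are hypothetical names, not part of ChatTTS. The allowed set contains `[laugh]` from the step above plus `[uv_break]`, which the upstream README uses for a short pause; check the repository for the tokens your model version accepts.

```python
# Assumption: adjust this set to the tokens your ChatTTS version accepts.
ALLOWED_TOKENS = frozenset({"laugh", "uv_break"})


def token(name: str) -> str:
    """Return the bracketed control token, rejecting names not in the allow-list."""
    if name not in ALLOWED_TOKENS:
        raise ValueError(f"unknown control token: {name!r}")
    return f"[{name}]"


def build_line(*fragments: str) -> str:
    """Join text fragments and tokens with single spaces, skipping empty pieces."""
    return " ".join(f.strip() for f in fragments if f.strip())


line = build_line(
    "So I tried to deploy it and",
    token("laugh"),
    "it actually worked on the first try.",
)
print(line)
```

The resulting string can be passed to `chat.infer([line])` exactly like the hand-written text. Validating token names up front catches typos such as `[laguh]`, which would otherwise reach the model as ordinary text.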
Example
import ChatTTS
import torch
import torchaudio
chat = ChatTTS.Chat()
chat.load(compile=False)
# Generate a dialogue with different speakers
speaker_a = chat.sample_random_speaker()
speaker_b = chat.sample_random_speaker()
lines = [
    ('What do you think about using AI for code review?', speaker_a),
    ('Honestly? [laugh] It catches things I miss all the time.', speaker_b),
    ('Same here. The false positive rate is still annoying though.', speaker_a),
]
# enumerate gives each line a stable index for its file name
for i, (text, speaker) in enumerate(lines):
    params = ChatTTS.Chat.InferCodeParams(spk_emb=speaker)
    wavs = chat.infer([text], params_infer_code=params)
    # infer returns NumPy arrays; torchaudio.save needs a (channels, samples) tensor
    torchaudio.save(f'line_{i}.wav', torch.from_numpy(wavs[0]).unsqueeze(0), 24000)
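The example samples new speakers on every run, so the voices change between runs. One way to keep them stable is to cache whatever `sample_random_speaker()` returns and reload it later. The following is a hedged sketch rather than a ChatTTS feature: `load_or_create_speaker` is a hypothetical helper, it receives the sampler as an argument so it does not import ChatTTS itself, and it assumes the returned value can be pickled (the return type has differed across ChatTTS versions).

```python
import pickle
from pathlib import Path
from typing import Any, Callable


def load_or_create_speaker(path: str, sample_fn: Callable[[], Any]) -> Any:
    """Return the speaker stored at `path`; sample and save one if the file is absent."""
    file = Path(path)
    if file.exists():
        # Only unpickle files you created yourself; pickle can execute code on load.
        with file.open("rb") as fh:
            return pickle.load(fh)
    speaker = sample_fn()
    file.parent.mkdir(parents=True, exist_ok=True)
    with file.open("wb") as fh:
        pickle.dump(speaker, fh)
    return speaker
```

In the example, `speaker_a = chat.sample_random_speaker()` would become `speaker_a = load_or_create_speaker('voices/speaker_a.pkl', chat.sample_random_speaker)`, where the file path is an arbitrary choice.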
Related on TokRepo
- AI tools for voice -- Voice synthesis and recognition tools
- AI tools for content -- Content creation and generation tools
Common pitfalls
- ChatTTS requires PyTorch and a GPU for fast inference. CPU inference works but is significantly slower. An NVIDIA GPU with at least 4GB VRAM is recommended.
- The model downloads on first run (several GB). Ensure you have adequate disk space and bandwidth for the initial setup.
- Audio quality varies by input text length and complexity. Very long texts should be split into sentences for best results.
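The last pitfall can be handled by splitting long input before calling `chat.infer`. Below is a minimal sketch, assuming sentence boundaries are marked by `.`, `!`, `?` followed by whitespace, or by the full-width marks `。！？`. The helper name `split_into_chunks` and the 200-character default are assumptions for illustration, not values recommended by ChatTTS.

```python
import re
from typing import List

# Split after ASCII end punctuation followed by whitespace, or directly after
# full-width end punctuation (which is usually not followed by a space).
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+|(?<=[。！？])")


def split_into_chunks(text: str, max_chars: int = 200) -> List[str]:
    """Pack whole sentences into chunks of at most `max_chars` characters.

    A single sentence longer than `max_chars` is kept whole rather than cut mid-sentence.
    """
    sentences = [s.strip() for s in _SENTENCE_END.split(text) if s and s.strip()]
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be passed as one element of the list given to `chat.infer`, and the resulting audio arrays concatenated in order. Requiring whitespace after ASCII punctuation keeps decimals such as `3.5` intact.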
Frequently Asked Questions
What languages does ChatTTS support?
ChatTTS primarily supports English and Chinese. The model was trained on conversational data in both languages. Other languages may work with reduced quality but are not officially supported. Check the project repository for updates on language coverage.
Can I control which voice is used?
Yes. ChatTTS supports speaker embeddings. You can sample random speakers with sample_random_speaker() or save and reuse specific speaker embeddings for a consistent voice across sessions. This lets you create distinct character voices for dialogue generation.
Does ChatTTS need a GPU?
A GPU is strongly recommended. ChatTTS uses PyTorch and runs best on NVIDIA GPUs with CUDA support. CPU inference is possible but 5-10x slower. For production use, a GPU with at least 4 GB VRAM provides real-time or near-real-time generation.
How does ChatTTS compare to commercial TTS services?
ChatTTS excels at conversational expressiveness -- laughter, pauses, and emotional variation -- which many commercial services handle poorly. Commercial services (ElevenLabs, Azure TTS) may offer higher raw audio quality and more voice options, but ChatTTS is free, runs locally, and has no API costs.
Can I use ChatTTS in production?
Yes. ChatTTS can be integrated into production apps via its Python API. For high-throughput scenarios, run it as a microservice behind an API endpoint. Be mindful of licensing -- check the project repository for the current license terms before commercial deployment.
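The microservice idea can be sketched with the Python standard library alone. This is one possible shape, not a production design: `to_wav_bytes` and `make_handler` are hypothetical helpers, the JSON request format (`{"text": "..."}`) is an assumption, and `synthesize` stands in for whatever function wraps `chat.infer` in your application.

```python
import io
import json
import wave
from array import array
from http.server import BaseHTTPRequestHandler
from typing import Callable, Iterable

SAMPLE_RATE = 24000  # same rate as the examples on this page


def to_wav_bytes(samples: Iterable[float], sample_rate: int = SAMPLE_RATE) -> bytes:
    """Encode mono float samples (clipped to [-1, 1]) as 16-bit PCM WAV bytes."""
    pcm = array("h", (int(max(-1.0, min(1.0, float(s))) * 32767) for s in samples))
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wf.setframerate(sample_rate)
        wf.writeframes(pcm.tobytes())
    return buf.getvalue()


def make_handler(synthesize: Callable[[str], Iterable[float]]):
    """Build a request handler class that answers POST requests with WAV audio."""

    class TTSHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            try:
                text = json.loads(self.rfile.read(length))["text"]
            except (ValueError, KeyError, TypeError):
                self.send_error(400, "expected a JSON body with a 'text' field")
                return
            body = to_wav_bytes(synthesize(text))
            self.send_response(200)
            self.send_header("Content-Type", "audio/wav")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    return TTSHandler
```

To run it, pass the handler to `http.server.HTTPServer(("127.0.0.1", 8000), make_handler(synthesize)).serve_forever()`, where `synthesize` returns a flat sequence of samples for one text (flatten the array first if your ChatTTS version returns 2-D output). The standard-library server handles one request at a time, so a high-throughput deployment would need a proper web framework and request batching.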