Scripts · Apr 2, 2026 · 2 min read

ChatTTS — Expressive Text-to-Speech for Dialogue

Generate natural conversational speech with laughter, pauses, and emotion. Optimized for dialogue scenarios. 39K+ GitHub stars.

Introduction

ChatTTS is an open-source text-to-speech model with 39,000+ GitHub stars, optimized for natural, expressive conversational speech. Where many TTS systems sound flat or robotic, ChatTTS produces laughter, pauses, filler words, and emotional variation, which suits chatbots, virtual assistants, podcasts, and audiobooks. Trained on 100,000+ hours of dialogue data, it supports fine-grained prosody control through special tokens and generates 24 kHz audio. It supports English and Chinese.

Works with: Python, PyTorch, CUDA GPUs (recommended), CPU (slower). Best for developers building conversational AI that needs natural-sounding speech output. Setup time: under 5 minutes.


ChatTTS Features

Natural Dialogue Speech

ChatTTS excels at conversational scenarios:

Feature       Description
Laughter      Insert [laugh] for natural laughing
Pauses        Control pause duration with [uv_break]
Filler words  Natural "um", "uh" generation
Emotion       Convey happiness, surprise, thoughtfulness
Prosody       Pitch, speed, and emphasis control
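
The word-level tokens in the table are written directly into the input text. The following is a minimal sketch of doing that with plain string handling; the helper names `join_with_pauses` and `add_laugh` are hypothetical and are not part of ChatTTS.

```python
def join_with_pauses(clauses):
    """Join clauses with an explicit [uv_break] pause token between them."""
    cleaned = [c.strip() for c in clauses if c.strip()]
    return " [uv_break] ".join(cleaned)


def add_laugh(sentence):
    """Append a [laugh] token to the end of a sentence."""
    return f"{sentence.strip()} [laugh]"


text = add_laugh(join_with_pauses(["Well", "I did not expect that"]))
print(text)  # Well [uv_break] I did not expect that [laugh]
```

The resulting string would be one element of the list passed to `chat.infer`. When text is annotated by hand like this, it may be worth checking whether the automatic text-refinement step should be skipped so the tokens are not rewritten; that depends on the ChatTTS version in use.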

Prosody Control

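import ChatTTS

# Setup assumed by the snippets in this section (API of recent ChatTTS releases)
chat = ChatTTS.Chat()
chat.load(compile=False)

texts = ["So, what do you think about that?"]  # example input

# Control speaking style with parameters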
params_infer = ChatTTS.Chat.InferCodeParams(
    spk_emb=None,       # Speaker embedding (None = random)
    temperature=0.3,     # Lower = more stable, higher = more expressive
    top_P=0.7,
    top_K=20,
)

# Refine prosody
params_refine = ChatTTS.Chat.RefineTextParams(
    prompt='[oral_2][laugh_0][break_6]',  # oral filler + no laugh + long breaks
)

wavs = chat.infer(
    texts,
    params_infer_code=params_infer,
    params_refine_text=params_refine,
)
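
The snippet above stops at `wavs` and does not show how to store the audio. As a sketch, assuming each element of `wavs` is a one-dimensional sequence of float samples in [-1, 1] at the 24 kHz rate quoted in this article, the standard-library `wave` module can write it as 16-bit PCM. `save_wav` is a hypothetical helper, and the sine tone below only stands in for real model output.

```python
import math
import struct
import wave

SAMPLE_RATE = 24_000  # output rate quoted in this article


def save_wav(path, samples, sample_rate=SAMPLE_RATE):
    """Write float samples in [-1, 1] to a mono 16-bit PCM WAV file."""
    pcm = bytearray()
    for s in samples:
        clipped = max(-1.0, min(1.0, float(s)))  # guard against overshoot
        pcm += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(path, "wb") as f:
        f.setnchannels(1)
        f.setsampwidth(2)  # 2 bytes = 16 bits per sample
        f.setframerate(sample_rate)
        f.writeframes(bytes(pcm))


# Stand-in for wavs[0]: half a second of a 440 Hz tone
tone = [0.3 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE)
        for n in range(SAMPLE_RATE // 2)]
save_wav("output_0.wav", tone)
```

With real output, the call would look like `save_wav("output_0.wav", wavs[0])`, provided the array really is one-dimensional; some setups may return a different shape and need flattening first.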

Speaker Consistency

# Generate a random speaker (assumes `chat` is a loaded ChatTTS.Chat instance)
rand_spk = chat.sample_random_speaker()

# Use the same speaker for multiple utterances
params = ChatTTS.Chat.InferCodeParams(spk_emb=rand_spk)

wavs = chat.infer(
    ["First sentence.", "Second sentence.", "Third sentence."],
    params_infer_code=params,
)
# All 3 outputs sound like the same person
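
Speaker consistency matters most when a long passage is split into several utterances. The following is a rough sketch of such a split, using a simple punctuation-based regular expression. That rule is an assumption of this example, not ChatTTS's own text handling, and `split_utterances` is a hypothetical helper.

```python
import re


def split_utterances(text, max_chars=120):
    """Split text into sentences, then pack consecutive sentences into
    chunks of at most max_chars characters where possible."""
    # A sentence is a run of non-terminators followed by any terminators.
    pieces = re.findall(r"[^.!?。！？]+[.!?。！？]*", text)
    sentences = [p.strip() for p in pieces if p.strip()]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)  # current chunk is full, start a new one
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


utterances = split_utterances(
    "First sentence. Second sentence! Is this the third? Yes.", max_chars=35
)
print(utterances)
# ['First sentence. Second sentence!', 'Is this the third? Yes.']
```

The resulting list could be passed to `chat.infer` together with a single `InferCodeParams` carrying one speaker embedding, so that every chunk is rendered in the same voice.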

Performance

  • Speed: ~5x real-time on GPU (about 5 seconds of audio generated per second of compute)
  • Quality: 24 kHz output with natural prosody; MOS described as competitive with commercial TTS
  • Languages: English and Chinese
  • Model size: ~800MB
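
To check the speed figure on your own hardware, the arithmetic is simple. The sketch below uses made-up timings (10 seconds of audio generated in 2 seconds) purely for illustration; the function names are hypothetical.

```python
SAMPLE_RATE = 24_000  # Hz, as quoted above


def audio_seconds(num_samples, sample_rate=SAMPLE_RATE):
    """Duration of a clip given its number of samples."""
    return num_samples / sample_rate


def speed_multiple(audio_s, wall_clock_s):
    """Seconds of audio produced per second of compute."""
    return audio_s / wall_clock_s


def real_time_factor(audio_s, wall_clock_s):
    """RTF = compute time / audio duration; below 1.0 is faster than real time."""
    return wall_clock_s / audio_s


duration = audio_seconds(240_000)       # 240,000 samples at 24 kHz = 10.0 s
print(speed_multiple(duration, 2.0))    # 5.0
print(real_time_factor(duration, 2.0))  # 0.2
```

In practice `wall_clock_s` would come from timing the `chat.infer` call, for example with `time.perf_counter()`, and `num_samples` from the length of the returned waveform.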

Special Tokens

[laugh]     - Insert laughter
[uv_break]  - Insert a pause
[oral_0-9]  - Filler word frequency (0=none, 9=very frequent)
[laugh_0-2] - Laughter frequency
[break_0-7] - Pause frequency and duration
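
The sentence-level tokens are concatenated into the `prompt` string of `RefineTextParams`, as in the Prosody Control example. The following is a small sketch of a builder that validates levels before formatting them. `build_refine_prompt` is a hypothetical helper, and the allowed ranges (oral 0-9, laugh 0-2, break 0-7) are an assumption taken from the upstream README; adjust them if your version differs.

```python
# Assumed level ranges (from the upstream README, not guaranteed for every version)
LEVEL_RANGES = {"oral": range(0, 10), "laugh": range(0, 3), "break": range(0, 8)}


def build_refine_prompt(oral=None, laugh=None, pause=None):
    """Build a sentence-level prompt such as '[oral_2][laugh_0][break_6]'."""
    parts = []
    for name, level in (("oral", oral), ("laugh", laugh), ("break", pause)):
        if level is None:
            continue  # omit this control and leave it to the model's default
        allowed = LEVEL_RANGES[name]
        if not isinstance(level, int) or level not in allowed:
            raise ValueError(
                f"{name} level must be an integer in {allowed.start}..{allowed.stop - 1}"
            )
        parts.append(f"[{name}_{level}]")
    return "".join(parts)


print(build_refine_prompt(oral=2, laugh=0, pause=6))  # [oral_2][laugh_0][break_6]
```

The returned string would be passed as `ChatTTS.Chat.RefineTextParams(prompt=...)`. Validating up front gives a clear error instead of a silently ignored or mis-tokenized control.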

FAQ

Q: What is ChatTTS? A: ChatTTS is an open-source TTS model with 39,000+ GitHub stars, optimized for natural conversational speech with laughter, pauses, and emotion. Trained on 100K+ hours of dialogue data.

Q: How is ChatTTS different from Coqui TTS or Bark? A: ChatTTS is optimized for dialogue: it focuses on conversational prosody, laughter, and natural filler words. Coqui TTS is a general-purpose TTS toolkit. Bark generates creative audio but is slower. ChatTTS is a strong choice for chatbot and assistant speech.

Q: Is ChatTTS free? A: The code is open-source under AGPL-3.0. The released model weights are published under CC BY-NC 4.0, so they are free for non-commercial use only. Check the upstream license terms before any commercial deployment.



Source and acknowledgments

Created by 2noise. Licensed under AGPL-3.0.

ChatTTS — ⭐ 39,000+
