# Chatterbox: State-of-the-Art Open Source Text-to-Speech

> A high-quality open-source TTS model by Resemble AI that delivers natural-sounding speech with fine-grained control over prosody, emotion, and expressiveness.

## Install

Save the Quick Use commands as a script file and run it.

## Quick Use

```bash
pip install chatterbox-tts
python -c "
import torchaudio
from chatterbox.tts import ChatterboxTTS

# Load the pretrained model onto the GPU
model = ChatterboxTTS.from_pretrained(device='cuda')

# Generate a waveform and save it at the model's sample rate
wav = model.generate('Hello, welcome to Chatterbox TTS.')
torchaudio.save('output.wav', wav, model.sr)
"
```

## Introduction

Chatterbox is Resemble AI's open-source text-to-speech system. It aims for state-of-the-art voice quality while remaining lightweight and easy to use. It generates natural, expressive speech from text and supports voice cloning, emotion control, and fine-grained prosody adjustments through a simple Python API.

## What Chatterbox Does

- Generates high-quality speech from text with natural prosody and intonation
- Supports zero-shot voice cloning from a short reference audio clip
- Provides control over emotion, pace, and expressiveness via text prompts
- Runs inference on consumer GPUs with fast generation speeds
- Offers a simple Python API that generates audio in a few lines of code

## Architecture Overview

Chatterbox uses a neural codec language model architecture: speech is encoded into discrete tokens, and the model generates those tokens autoregressively, conditioned on the text input. The model combines a text encoder, a duration predictor, and a multi-stage token decoder that progressively refines audio quality. Voice cloning works by encoding a reference audio clip into a speaker embedding that conditions the generation process.
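
As a rough illustration of how a reference clip conditions generation, the sketch below wraps the calls from Quick Use in two small helpers. It assumes that `ChatterboxTTS.generate()` accepts an `audio_prompt_path` keyword argument, as shown in the upstream README; that argument is not described in this document, so check it against the installed version. The helper names `build_generate_kwargs` and `synthesize` are hypothetical and exist only for this example.

```python
from pathlib import Path
from typing import Optional


def build_generate_kwargs(reference_clip: Optional[str] = None) -> dict:
    """Build keyword arguments for ChatterboxTTS.generate().

    With no reference clip, nothing extra is passed and the model uses
    its default voice. With a clip, its path is passed as
    `audio_prompt_path` (assumed keyword, taken from the upstream README).
    """
    if reference_clip is None:
        return {}
    clip = Path(reference_clip)
    if not clip.is_file():
        raise FileNotFoundError(f"reference clip not found: {clip}")
    return {"audio_prompt_path": str(clip)}


def synthesize(text: str, out_path: str,
               reference_clip: Optional[str] = None,
               device: str = "cuda") -> None:
    """Generate speech for `text` and save it as a WAV file."""
    # Imported lazily so build_generate_kwargs() can be used and tested
    # on a machine without the model or a GPU installed.
    import torchaudio
    from chatterbox.tts import ChatterboxTTS

    model = ChatterboxTTS.from_pretrained(device=device)
    wav = model.generate(text, **build_generate_kwargs(reference_clip))
    torchaudio.save(out_path, wav, model.sr)


# Example call (file names are placeholders):
# synthesize("Hello in a cloned voice.", "cloned.wav", reference_clip="speaker.wav")
```

Validating the clip path before loading the model means a mistyped file name fails immediately rather than after the weights have been downloaded and loaded.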

## Self-Hosting & Configuration

- Install via pip with CUDA-enabled PyTorch for GPU acceleration
- Model weights are downloaded automatically from Hugging Face Hub on first run
- Requires approximately 4 GB of VRAM for inference on a single GPU
- Supports batch generation for processing multiple utterances efficiently
- Exposes configuration options for sample rate, audio format, and generation temperature

## Key Features

- Near-human speech quality on standard TTS benchmarks
- Zero-shot voice cloning from a 10-second reference clip
- Controllable emotion and expressiveness through natural language descriptions
- Fast inference suitable for real-time applications
- MIT license, which permits commercial deployment

## Comparison with Similar Tools

- **Bark**: Multi-modal audio generation including music and effects; Chatterbox focuses on speech quality and naturalness
- **Kokoro TTS**: Lightweight 82M-parameter model; Chatterbox offers higher fidelity at the cost of a larger model
- **F5-TTS**: Flow-matching approach; Chatterbox uses codec language modeling for better prosody control
- **Fish Speech**: Multilingual focus; Chatterbox prioritizes English speech quality and voice cloning accuracy

## FAQ

**Q: What languages does Chatterbox support?**
A: The initial release focuses on English, with community efforts underway for additional languages.

**Q: Can I use Chatterbox commercially?**
A: Yes, the model is released under the MIT license, which permits commercial use.

**Q: How long does it take to generate speech?**
A: On a modern GPU, Chatterbox generates speech at roughly 10x real-time speed.

**Q: Does voice cloning require training?**
A: No, voice cloning is zero-shot. Provide a short reference audio clip and the model adapts on the fly.

## Sources

- https://github.com/resemble-ai/chatterbox
- https://www.resemble.ai/chatterbox

---

Source: https://tokrepo.com/en/workflows/asset-a6af5d44
Author: Script Depot