Scripts · May 18, 2026 · 3 min read

Chatterbox — State-of-the-Art Open Source Text-to-Speech

A high-quality open-source TTS model by Resemble AI that delivers natural-sounding speech with fine-grained control over prosody, emotion, and expressiveness.

Universal CLI install command
npx tokrepo install a6af5d44-5293-11f1-9bc6-00163e2b0d79

Introduction

Chatterbox is Resemble AI's open-source text-to-speech system that achieves state-of-the-art voice quality while remaining lightweight and easy to use. It generates natural, expressive speech from text with support for voice cloning, emotion control, and fine-grained prosody adjustments through a simple Python API.

What Chatterbox Does

  • Generates high-quality speech from text with natural prosody and intonation
  • Supports zero-shot voice cloning from a short reference audio clip
  • Provides control over emotional intensity and pacing through generation parameters (the project README documents exaggeration and cfg_weight)
  • Runs inference on consumer GPUs with fast generation speeds
  • Offers a simple Python API with just a few lines of code to generate audio
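
A minimal sketch of the "few lines of code" workflow is shown here. It assumes the package exposes `ChatterboxTTS` in `chatterbox.tts` with `from_pretrained`, `generate`, and an `sr` attribute, and that `generate` accepts an `audio_prompt_path` argument for cloning. These names follow the upstream README as best recalled and should be checked against the installed version. The `synthesize` wrapper itself is hypothetical.

```python
from pathlib import Path
from typing import Optional


def synthesize(text: str, out_path: str, reference_clip: Optional[str] = None) -> Path:
    """Generate speech for `text` and write it to `out_path` as a WAV file.

    `reference_clip` is an optional path to a short recording of the target
    voice; when omitted, the model's default voice is used.
    """
    if not text.strip():
        raise ValueError("text must not be empty")

    # Heavy imports are deferred so this module can be imported on machines
    # that do not have the GPU stack installed.
    import torchaudio
    from chatterbox.tts import ChatterboxTTS  # assumed import path

    model = ChatterboxTTS.from_pretrained(device="cuda")

    kwargs = {}
    if reference_clip is not None:
        kwargs["audio_prompt_path"] = reference_clip  # assumed parameter name

    wav = model.generate(text, **kwargs)
    torchaudio.save(out_path, wav, model.sr)  # model.sr: output sample rate
    return Path(out_path)
```

Calling `synthesize("Hello from Chatterbox.", "hello.wav")` would write a clip in the default voice; passing `reference_clip="speaker.wav"` would request a cloned voice instead.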

Architecture Overview

Chatterbox uses a neural codec language model architecture that encodes speech into discrete tokens and generates them autoregressively conditioned on text input. The model combines a text encoder, a duration predictor, and a multi-stage token decoder that progressively refines audio quality. Voice cloning works by encoding a reference audio clip into a speaker embedding that conditions the generation process.
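
The autoregressive loop described above can be illustrated with a toy sketch. This is not Chatterbox's model: the scoring rule, vocabulary size, and function name are invented for illustration, and a real system would use a neural network to produce the per-token scores. The sketch only shows the control flow, where each new token depends on the text, the speaker embedding, and the tokens generated so far.

```python
import random


def toy_generate_tokens(text, speaker_embedding, vocab_size=16, max_tokens=20, seed=0):
    """Toy stand-in for autoregressive codec-token generation.

    Each step samples the next discrete token from weights that depend on the
    text, the speaker embedding, and the previously generated token.
    """
    rng = random.Random(seed)

    # Conditioning signal: a fixed integer derived from text and speaker embedding.
    cond = sum(ord(c) for c in text) + int(1000 * sum(speaker_embedding))

    tokens = []
    for step in range(max_tokens):
        prev = tokens[-1] if tokens else 0
        # Hypothetical scores: one positive weight per candidate token.
        weights = [
            1 + ((cond + prev * 7 + step * 3 + cand) % vocab_size)
            for cand in range(vocab_size)
        ]
        # Sample the next token in proportion to its weight.
        tokens.append(rng.choices(range(vocab_size), weights=weights, k=1)[0])
    return tokens
```

In a real codec language model the resulting token sequence would then be passed to a decoder that turns it back into a waveform.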

Self-Hosting & Configuration

  • Install via pip with CUDA-enabled PyTorch for GPU acceleration
  • Model weights are downloaded automatically from Hugging Face Hub on first run
  • Requires approximately 4 GB of VRAM for inference on a single GPU
  • Supports batch generation for processing multiple utterances efficiently
  • Configuration options for sample rate, audio format, and generation temperature
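
Before loading the model, a deployment script might check whether the GPU meets the approximate 4 GB figure quoted above. The helper names below are hypothetical, and the threshold is taken from this page rather than measured. The sketch falls back to CPU when PyTorch or CUDA is unavailable.

```python
REQUIRED_VRAM_GB = 4.0  # approximate requirement quoted on this page


def has_enough_vram(total_bytes: int, required_gb: float = REQUIRED_VRAM_GB) -> bool:
    """Return True if a device with `total_bytes` of memory meets the requirement."""
    return total_bytes / 1024**3 >= required_gb


def pick_device() -> str:
    """Choose "cuda" when a sufficiently large GPU is present, else "cpu"."""
    try:
        import torch
    except ImportError:
        return "cpu"
    if torch.cuda.is_available():
        total = torch.cuda.get_device_properties(0).total_memory
        if has_enough_vram(total):
            return "cuda"
    return "cpu"
```

The check uses total device memory, so it does not account for memory already held by other processes.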

Key Features

  • Near-human speech quality on standard TTS benchmarks
  • Zero-shot voice cloning from a 10-second reference clip
  • Controllable emotion and expressiveness through an exaggeration parameter set at generation time
  • Fast inference suitable for real-time applications
  • MIT license, which permits commercial deployment

Comparison with Similar Tools

  • Bark — Multi-modal audio generation including music and effects; Chatterbox focuses on speech quality with better naturalness
  • Kokoro TTS — Lightweight 82M parameter model; Chatterbox offers higher fidelity at the cost of larger model size
  • F5-TTS — Flow-matching approach; Chatterbox uses codec language modeling for better prosody control
  • Fish Speech — Multilingual focus; Chatterbox prioritizes English speech quality and voice cloning accuracy

FAQ

Q: What languages does Chatterbox support? A: The initial release focuses on English, with community efforts underway for additional languages.

Q: Can I use Chatterbox commercially? A: Yes, the model is released under the MIT license, which permits commercial use.

Q: How long does it take to generate speech? A: On a modern GPU, Chatterbox generates speech at roughly 10x real-time speed.
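
To make the "10x real-time" figure concrete, the arithmetic can be sketched as follows. The function names are illustrative, and the default speedup is simply the rough figure quoted in the answer, not a measured benchmark.

```python
def estimated_generation_seconds(audio_seconds: float, speedup: float = 10.0) -> float:
    """Estimate wall-clock time needed to generate `audio_seconds` of speech."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return audio_seconds / speedup


def realtime_speedup(audio_seconds: float, wall_seconds: float) -> float:
    """Compute the measured speedup: seconds of audio produced per second of compute."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return audio_seconds / wall_seconds
```

Under that assumption, a 60-second narration would take about 6 seconds to generate. Timing a real run and passing the numbers to `realtime_speedup` shows how a given GPU compares.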

Q: Does voice cloning require training? A: No, voice cloning is zero-shot. Provide a short reference audio clip and the model adapts on the fly.
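
Since cloning depends on the reference clip, it can help to check the clip's length before passing it to the model. The sketch below uses only the standard-library `wave` module and assumes an uncompressed WAV file. The 10-second default mirrors the figure listed under Key Features, and the helper names are hypothetical.

```python
import wave


def clip_duration_seconds(path: str) -> float:
    """Return the duration of an uncompressed WAV file in seconds."""
    with wave.open(path, "rb") as wf:
        return wf.getnframes() / wf.getframerate()


def is_usable_reference(path: str, min_seconds: float = 10.0) -> bool:
    """Return True if the clip is at least `min_seconds` long."""
    return clip_duration_seconds(path) >= min_seconds
```

A clip that passes this check still needs to be clean, single-speaker audio; duration is only the easiest property to validate.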
