# Tortoise TTS — Multi-Voice Text-to-Speech Focused on Quality

> A multi-voice TTS system trained with an emphasis on audio quality. Uses autoregressive and diffusion models to produce natural, expressive speech from text.

## Install

Save the content below to `.claude/skills/` or append to your `CLAUDE.md`:

## Quick Use
```bash
pip install tortoise-tts
python -m tortoise.do_tts --text "Hello world" --voice random --preset fast
```

## Introduction
Tortoise TTS is a text-to-speech system designed to produce high-quality, natural-sounding audio. It uses an autoregressive decoder paired with a diffusion model to generate speech that closely mimics human prosody, making it one of the most realistic open-source TTS systems available.

## What Tortoise TTS Does
- Converts text into natural-sounding speech using a multi-stage generative pipeline
- Supports voice cloning from short reference audio clips (as few as 3 seconds)
- Provides multiple quality presets trading speed for audio fidelity
- Includes several built-in voices and supports custom voice creation
- Generates speech with varied intonation and natural pauses

## Architecture Overview
Tortoise uses a three-stage pipeline. First, an autoregressive Transformer generates discrete audio tokens from text, conditioned on voice embeddings extracted from reference clips; several candidate sequences are sampled and the upstream implementation reranks them with a contrastive model (CLVP) before continuing. Next, a DDPM diffusion model decodes the selected tokens into a mel spectrogram. Finally, a UnivNet vocoder converts the spectrogram to a raw waveform. This multi-stage approach prioritizes output quality over inference speed.
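
The data flow of the three main stages can be sketched with placeholder functions. This is not Tortoise's code: every function below is a hypothetical stand-in that only mimics the shape of each stage's input and output, and all sizes (codebook size, mel bins, frames per token, samples per frame) are illustrative assumptions rather than the model's real settings.

```python
import random
from typing import List

# All sizes below are illustrative assumptions, not Tortoise's real settings.
CODEBOOK_SIZE = 1024      # number of distinct discrete audio tokens
MEL_BINS = 80             # rows in the toy mel spectrogram
FRAMES_PER_TOKEN = 4      # spectrogram frames produced per token
SAMPLES_PER_FRAME = 256   # waveform samples produced per frame


def autoregressive_stage(text: str, voice_embedding: List[float], seed: int = 0) -> List[int]:
    """Stand-in for stage 1: text + voice conditioning -> discrete audio tokens."""
    rng = random.Random(seed + len(voice_embedding))
    # A real model samples tokens one at a time; this toy emits one random token per character.
    return [rng.randrange(CODEBOOK_SIZE) for _ in text]


def diffusion_stage(tokens: List[int], steps: int) -> List[List[float]]:
    """Stand-in for stage 2: tokens -> mel spectrogram, updated over `steps` iterations."""
    n_frames = len(tokens) * FRAMES_PER_TOKEN
    mel = [[1.0] * n_frames for _ in range(MEL_BINS)]
    for _ in range(steps):
        # A real DDPM removes predicted noise at each step; this toy just scales the values.
        mel = [[0.9 * v for v in row] for row in mel]
    return mel


def vocoder_stage(mel: List[List[float]]) -> List[float]:
    """Stand-in for stage 3: mel spectrogram -> raw waveform samples (silence here)."""
    n_frames = len(mel[0]) if mel else 0
    return [0.0] * (n_frames * SAMPLES_PER_FRAME)


def synthesize(text: str, voice_embedding: List[float], diffusion_steps: int) -> List[float]:
    tokens = autoregressive_stage(text, voice_embedding)
    mel = diffusion_stage(tokens, diffusion_steps)
    return vocoder_stage(mel)
```

The point of the sketch is the hand-off between stages: token count fixes the spectrogram length, and the spectrogram length fixes the waveform length, while the number of diffusion steps only adds compute.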

## Self-Hosting & Configuration
- Install with `pip install tortoise-tts`; a CUDA-enabled PyTorch build is needed for GPU inference
- Requires a GPU with at least 6 GB VRAM; runs on CPU but very slowly
- Voice references stored as WAV files in the `voices/` directory, organized by speaker name
- Quality presets (`ultra_fast`, `fast`, `standard`, `high_quality`) control the number of autoregressive samples and diffusion steps
- Run headless for batch processing or integrate into Python scripts via the API
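
For the scripted use mentioned in the last bullet, a minimal batch sketch might look like the following. The `TextToSpeech`, `load_voice`, and `tts_with_preset` names and the 24 kHz output rate follow the upstream repository's examples as best I know them; verify them against your installed version. The helper `output_name`, the default voice `tom` (assumed to be one of the bundled voices), and the `results` output folder are hypothetical choices for illustration.

```python
from pathlib import Path

PRESETS = ("ultra_fast", "fast", "standard", "high_quality")


def output_name(voice: str, index: int, preset: str) -> str:
    """Hypothetical naming helper: one WAV file per input line."""
    if preset not in PRESETS:
        raise ValueError(f"unknown preset: {preset!r}")
    return f"{voice}_{preset}_{index:03d}.wav"


def synthesize_lines(lines, voice="tom", preset="fast", out_dir="results"):
    # Heavy imports stay inside the function so the helper above can be used
    # without PyTorch or Tortoise installed.
    import torchaudio
    from tortoise.api import TextToSpeech
    from tortoise.utils.audio import load_voice

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    tts = TextToSpeech()  # fetches model weights on first use
    voice_samples, conditioning_latents = load_voice(voice)

    written = []
    for i, text in enumerate(lines):
        audio = tts.tts_with_preset(
            text,
            voice_samples=voice_samples,
            conditioning_latents=conditioning_latents,
            preset=preset,
        )
        path = out / output_name(voice, i, preset)
        # Upstream examples save at 24 kHz; treat the rate as an assumption.
        torchaudio.save(str(path), audio.squeeze(0).cpu(), 24000)
        written.append(path)
    return written
```

Calling `synthesize_lines(["First sentence.", "Second sentence."])` would then write one file per line. Loading the model once and reusing it across lines matters here, since model start-up is expensive compared with a single short utterance.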

## Key Features
- Among the most natural-sounding open-source TTS systems available
- Voice cloning from minimal reference audio without fine-tuning
- Multiple quality presets for different latency requirements
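- Prompt-based steering of emotion and speaking style: upstream documents bracketed cues such as `[I am really sad,]` that influence delivery and are redacted from the spoken output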
- Fully offline operation with no API keys or cloud dependencies

## Comparison with Similar Tools
- **Bark** — supports music and sound effects alongside speech; Tortoise focuses purely on speech quality
- **Coqui TTS** — broader model zoo and multilingual support; Tortoise offers superior single-speaker quality
- **StyleTTS 2** — faster inference with style-based synthesis; Tortoise produces richer prosody at the cost of speed
- **Fish Speech** — optimized for multilingual real-time use; Tortoise prioritizes output naturalness
- **F5-TTS** — flow matching approach with faster generation; Tortoise remains a benchmark for quality-first synthesis

## FAQ
**Q: How long does generation take?**
A: On an NVIDIA RTX 3090, the `fast` preset generates roughly 2 seconds of audio per second of wall time. The `high_quality` preset is 4-5x slower.
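
The throughput figure above is easier to compare across presets when written as a real-time factor (RTF): wall-clock seconds divided by seconds of audio produced. The sketch below only does that arithmetic with the numbers quoted in this answer; it measures nothing, and the function names are hypothetical.

```python
def real_time_factor(wall_seconds: float, audio_seconds: float) -> float:
    """RTF < 1 means faster than real time; RTF > 1 means slower."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return wall_seconds / audio_seconds


def estimate_wall_time(audio_seconds: float, rtf: float) -> float:
    """Wall-clock seconds needed to produce `audio_seconds` of speech."""
    return audio_seconds * rtf


# Figure quoted above: `fast` yields about 2 s of audio per 1 s of wall time.
fast_rtf = real_time_factor(wall_seconds=1.0, audio_seconds=2.0)   # 0.5
# `high_quality` is described as 4-5x slower than `fast`.
hq_rtf_range = (fast_rtf * 4, fast_rtf * 5)                        # (2.0, 2.5)

# Estimated wall time for a 60-second narration under each assumption.
print(estimate_wall_time(60, fast_rtf))                    # 30.0
print([estimate_wall_time(60, r) for r in hq_rtf_range])   # [120.0, 150.0]
```

Note that RTF describes throughput, not latency: a batch pipeline can have an RTF below 1 and still be unsuitable for interactive use if no audio is available until the whole utterance has been generated.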

**Q: Can I clone any voice?**
A: Tortoise can approximate a voice from 3-30 seconds of clean reference audio. More reference clips improve consistency and speaker similarity.
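
Before placing clips in a speaker folder, it can help to check their lengths. The sketch below uses only Python's standard `wave` module, so it handles plain PCM WAV files and nothing else; the function names are hypothetical, and the default 3 to 30 second bounds are simply taken from the range quoted above rather than from any limit Tortoise enforces.

```python
import wave
from pathlib import Path


def clip_seconds(path) -> float:
    """Duration of an uncompressed PCM WAV file, computed from its header."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def check_voice_dir(voice_dir, min_seconds=3.0, max_seconds=30.0):
    """Return {filename: (duration, within_bounds)} for every .wav in a speaker folder."""
    report = {}
    for path in sorted(Path(voice_dir).glob("*.wav")):
        duration = clip_seconds(path)
        report[path.name] = (duration, min_seconds <= duration <= max_seconds)
    return report
```

For example, `check_voice_dir("voices/alice")` (a hypothetical speaker folder) returns one entry per clip, which makes it easy to spot recordings that are too short or too long before running synthesis.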

**Q: Does it support languages other than English?**
A: Tortoise is primarily trained on English data. Community forks exist for other languages, but quality varies.

**Q: Is Tortoise TTS suitable for real-time applications?**
A: No. The multi-stage pipeline is designed for offline batch generation. For real-time needs, consider lighter models like StyleTTS 2 or Kokoro.

## Sources
- https://github.com/neonbjb/tortoise-tts
- https://nonint.com/static/tortoise_v2_examples.html

---
Source: https://tokrepo.com/en/workflows/tortoise-tts-multi-voice-text-speech-focused-quality-66712f72
Author: AI Open Source