# Tortoise TTS: Multi-Voice Text-to-Speech Focused on Quality

> A multi-voice TTS system trained with an emphasis on audio quality. Uses autoregressive and diffusion models to produce natural, expressive speech from text.

## Install

Save the content below to `.claude/skills/` or append to your `CLAUDE.md`:

## Quick Use

```bash
pip install tortoise-tts
python -m tortoise.do_tts --text "Hello world" --voice random --preset fast
```

## Introduction

Tortoise TTS is a text-to-speech system designed to produce high-quality, natural-sounding audio. It uses an autoregressive decoder paired with a diffusion model to generate speech that closely mimics human prosody, making it one of the most realistic open-source TTS systems available.

## What Tortoise TTS Does

- Converts text into natural-sounding speech using a multi-stage generative pipeline
- Supports voice cloning from short reference audio clips (as few as 3 seconds)
- Provides multiple quality presets trading speed for audio fidelity
- Includes several built-in voices and supports custom voice creation
- Generates speech with varied intonation and natural pauses

## Architecture Overview

Tortoise uses a three-stage pipeline. First, an autoregressive Transformer generates discrete audio tokens from text, conditioned on voice embeddings extracted from reference clips. Next, a DDPM diffusion model refines these tokens into a mel spectrogram. Finally, a UnivNet vocoder converts the spectrogram to a raw waveform. This multi-stage approach prioritizes output quality over inference speed.
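
The data flow described above (text to discrete tokens, tokens to mel frames, mel frames to waveform) can be illustrated with a toy sketch. Everything here is a placeholder: the function names, the constants `CODEBOOK_SIZE`, `N_MELS`, and `HOP`, and the arithmetic inside each stage are hypothetical stand-ins chosen for illustration, not Tortoise's models or API. The only properties it tries to mirror are the shape of the pipeline and the idea that more refinement steps bring the second stage closer to its target.

```python
"""Toy data-flow sketch of a three-stage TTS pipeline (placeholders, not Tortoise's models)."""
import math
import random
from typing import List

CODEBOOK_SIZE = 256  # hypothetical size of the discrete token vocabulary
N_MELS = 4           # hypothetical number of mel channels per frame
HOP = 8              # hypothetical number of waveform samples per mel frame


def autoregressive_stage(text: str, voice_embedding: List[float]) -> List[int]:
    """Stage 1 stand-in: one token per character, each depending on the previous token."""
    prev = int(sum(voice_embedding) * 1000) % CODEBOOK_SIZE  # voice conditioning
    tokens: List[int] = []
    for ch in text:
        prev = (prev * 31 + ord(ch)) % CODEBOOK_SIZE
        tokens.append(prev)
    return tokens


def diffusion_stage(tokens: List[int], steps: int, seed: int = 0) -> List[List[float]]:
    """Stage 2 stand-in: start from noise and step toward a token-derived target.

    This is a simple iterative refinement, not a real DDPM sampler.
    """
    rng = random.Random(seed)
    target = [[math.sin(t * (m + 1) / CODEBOOK_SIZE) for m in range(N_MELS)]
              for t in tokens]
    mel = [[rng.gauss(0.0, 1.0) for _ in range(N_MELS)] for _ in tokens]
    for _ in range(steps):
        # Each step halves the remaining distance to the target.
        mel = [[x + 0.5 * (y - x) for x, y in zip(frame, goal)]
               for frame, goal in zip(mel, target)]
    return mel


def vocoder_stage(mel: List[List[float]]) -> List[float]:
    """Stage 3 stand-in: expand every mel frame into HOP waveform samples."""
    wave: List[float] = []
    for frame in mel:
        level = sum(frame) / len(frame)
        wave.extend(level * math.sin(2 * math.pi * i / HOP) for i in range(HOP))
    return wave


def synthesize(text: str, voice_embedding: List[float], steps: int) -> List[float]:
    """Chain the three stages: text -> tokens -> mel frames -> waveform samples."""
    tokens = autoregressive_stage(text, voice_embedding)
    mel = diffusion_stage(tokens, steps)
    return vocoder_stage(mel)


if __name__ == "__main__":
    audio = synthesize("Hello world", [0.1, -0.2, 0.3], steps=30)
    print(len(audio))  # one token per character, HOP samples per token
```

In this sketch the `steps` argument plays the role that a quality preset plays in the text: a larger value costs more iterations and leaves the intermediate mel frames closer to their target.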

## Self-Hosting & Configuration

- Install with `pip install tortoise-tts`, alongside PyTorch and its CUDA dependencies
- A GPU with at least 6 GB of VRAM is recommended; CPU inference works but is very slow
- Voice references are stored as WAV files in the `voices/` directory, organized by speaker name
- Quality presets (`ultra_fast`, `fast`, `standard`, `high_quality`) control the number of autoregressive samples and diffusion steps
- Run headless for batch processing or integrate into Python scripts via the API

## Key Features

- Among the most natural-sounding open-source TTS systems available
- Voice cloning from minimal reference audio without fine-tuning
- Multiple quality presets for different latency requirements
- Built-in conditioning system for controlling emotion and speaking style
- Fully offline operation with no API keys or cloud dependencies

## Comparison with Similar Tools

- **Bark**: supports music and sound effects alongside speech; Tortoise focuses purely on speech quality
- **Coqui TTS**: broader model zoo and multilingual support; Tortoise offers superior single-speaker quality
- **StyleTTS 2**: faster inference with style-based synthesis; Tortoise produces richer prosody at the cost of speed
- **Fish Speech**: optimized for multilingual real-time use; Tortoise prioritizes output naturalness
- **F5-TTS**: flow matching approach with faster generation; Tortoise remains a benchmark for quality-first synthesis

## FAQ

**Q: How long does generation take?**
A: On an NVIDIA RTX 3090, the `fast` preset generates roughly 2 seconds of audio per second of wall time. The `high_quality` preset is 4-5x slower.

**Q: Can I clone any voice?**
A: Tortoise can approximate a voice from 3-30 seconds of clean reference audio. More reference clips improve consistency and speaker similarity.

**Q: Does it support languages other than English?**
A: Tortoise is trained primarily on English data. Community forks exist for other languages, but quality varies.

**Q: Is Tortoise TTS suitable for real-time applications?**
A: No. The multi-stage pipeline is designed for offline batch generation. For real-time needs, consider lighter models like StyleTTS 2 or Kokoro.

## Sources

- https://github.com/neonbjb/tortoise-tts
- https://nonint.com/static/tortoise_v2_examples.html

---

Source: https://tokrepo.com/en/workflows/tortoise-tts-multi-voice-text-speech-focused-quality-66712f72
Author: AI Open Source