May 1, 2026 · 3 min read

Tortoise TTS — Multi-Voice Text-to-Speech Focused on Quality

A multi-voice TTS system trained with an emphasis on audio quality. Uses autoregressive and diffusion models to produce natural, expressive speech from text.

Introduction

Tortoise TTS is a text-to-speech system designed to produce high-quality, natural-sounding audio. It uses an autoregressive decoder paired with a diffusion model to generate speech that closely mimics human prosody, making it one of the most realistic open-source TTS systems available.

What Tortoise TTS Does

  • Converts text into natural-sounding speech using a multi-stage generative pipeline
  • Supports voice cloning from short reference audio clips (as few as 3 seconds)
  • Provides multiple quality presets trading speed for audio fidelity
  • Includes several built-in voices and supports custom voice creation
  • Generates speech with varied intonation and natural pauses

Architecture Overview

Tortoise uses a three-stage pipeline. First, an autoregressive Transformer generates discrete audio tokens from text, conditioned on voice embeddings extracted from the reference clips. Next, a denoising diffusion model (DDPM) decodes those tokens into a mel spectrogram. Finally, a UnivNet vocoder converts the spectrogram into a raw waveform. This staged design deliberately trades inference speed for output quality.
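The staged hand-off can be sketched with placeholder functions. This is a toy illustration of the data flow only; every name below is invented and none of it is Tortoise's actual API:

```python
# Illustrative sketch of Tortoise's three-stage hand-off.
# All functions are stand-ins, not real Tortoise code.

def autoregressive_stage(text: str, voice_embedding: list[float]) -> list[int]:
    """Stage 1: map text + voice conditioning to discrete audio tokens."""
    # Stand-in: one fake token per character, offset by a voice-derived value.
    offset = int(sum(voice_embedding)) % 7
    return [(ord(c) + offset) % 256 for c in text]

def diffusion_stage(tokens: list[int], steps: int = 80) -> list[float]:
    """Stage 2: iteratively refine tokens into a (fake) mel spectrogram."""
    mel = [float(t) for t in tokens]
    for _ in range(steps):
        # Each refinement step smooths the signal slightly.
        mel = [(a + b) / 2 for a, b in zip(mel, mel[1:] + mel[-1:])]
    return mel

def vocoder_stage(mel: list[float]) -> list[float]:
    """Stage 3: expand the spectrogram into a raw waveform (here, 4x upsampling)."""
    return [m / 256.0 for m in mel for _ in range(4)]

def synthesize(text: str, voice_embedding: list[float]) -> list[float]:
    tokens = autoregressive_stage(text, voice_embedding)
    mel = diffusion_stage(tokens)
    return vocoder_stage(mel)

wave = synthesize("hello", [0.1, 0.2, 0.3])
print(len(wave))  # 5 characters -> 5 tokens -> 20 samples after 4x upsampling
```

The point of the sketch is the interface between stages: tokens carry content, the diffusion step adds acoustic detail, and the vocoder only ever sees a spectrogram.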

Self-Hosting & Configuration

  • Install via pip: pip install tortoise-tts (PyTorch with CUDA support must be set up first)
  • Requires a GPU with at least 6 GB VRAM; runs on CPU but very slowly
  • Voice references stored as WAV files in the voices/ directory, organized by speaker name
  • Quality presets (ultra_fast, fast, standard, high_quality) control the number of diffusion steps
  • Run headless for batch processing or integrate into Python scripts via the API
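For scripted use, the Python API follows roughly the pattern shown in the project README. The voice name and output path below are examples, and argument details may differ between versions:

```python
import torchaudio

from tortoise.api import TextToSpeech
from tortoise.utils.audio import load_voice

# Loads the autoregressive, diffusion, and vocoder models
# (several GB of weights are downloaded on first run).
tts = TextToSpeech()

# 'tom' is one of the built-in voices shipped in the voices/ directory.
voice_samples, conditioning_latents = load_voice('tom')

# The preset controls the speed/quality trade-off
# (ultra_fast, fast, standard, high_quality).
gen = tts.tts_with_preset(
    "Tortoise trades speed for audio quality.",
    voice_samples=voice_samples,
    conditioning_latents=conditioning_latents,
    preset='fast',
)

# The output is a mono waveform tensor at 24 kHz.
torchaudio.save('generated.wav', gen.squeeze(0).cpu(), 24000)
```

Because model loading dominates startup time, batch jobs should construct TextToSpeech once and reuse it across every utterance.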

Key Features

  • Among the most natural-sounding open-source TTS systems available
  • Voice cloning from minimal reference audio without fine-tuning
  • Multiple quality presets for different latency requirements
  • Built-in conditioning system for controlling emotion and speaking style
  • Fully offline operation with no API keys or cloud dependencies

Comparison with Similar Tools

  • Bark — supports music and sound effects alongside speech; Tortoise focuses purely on speech quality
  • Coqui TTS — broader model zoo and multilingual support; Tortoise offers superior single-speaker quality
  • StyleTTS 2 — faster inference with style-based synthesis; Tortoise produces richer prosody at the cost of speed
  • Fish Speech — optimized for multilingual real-time use; Tortoise prioritizes output naturalness
  • F5-TTS — flow matching approach with faster generation; Tortoise remains a benchmark for quality-first synthesis

FAQ

Q: How long does generation take? A: On an NVIDIA RTX 3090, the fast preset generates roughly 2 seconds of audio per second of wall time. The high_quality preset is 4-5x slower.
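As a rough planning aid, the throughput figures above translate into wall-clock estimates like this. The numbers are the article's estimates for an RTX 3090, not fresh measurements:

```python
def estimated_wall_time(audio_seconds: float, preset: str) -> float:
    """Rough wall-clock estimate (seconds) based on the figures above."""
    # fast preset: ~2 s of audio per 1 s of wall time -> real-time factor 0.5
    # high_quality: quoted as 4-5x slower; 4.5x is used as a midpoint here
    rtf = {'fast': 0.5, 'high_quality': 0.5 * 4.5}
    return audio_seconds * rtf[preset]

print(estimated_wall_time(60, 'fast'))          # 30.0 s for a minute of audio
print(estimated_wall_time(60, 'high_quality'))  # 135.0 s for a minute of audio
```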

Q: Can I clone any voice? A: Tortoise can approximate a voice from 3-30 seconds of clean reference audio. More reference clips improve consistency and speaker similarity.
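In practice, cloning a new voice means dropping reference clips into a speaker-named folder and pointing generation at it. A sketch assuming the repository's do_tts.py script; the folder and file names are placeholders:

```shell
# Create a folder named after the new speaker inside the voices directory
mkdir -p tortoise/voices/myvoice

# Copy 3-30 seconds of clean reference audio (WAV) into it
cp clip1.wav clip2.wav tortoise/voices/myvoice/

# Generate speech with the cloned voice from the command line
python tortoise/do_tts.py --text "Hello from my cloned voice." \
    --voice myvoice --preset standard
```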

Q: Does it support languages other than English? A: Tortoise is primarily trained on English data. Community forks exist for other languages, but quality varies.

Q: Is Tortoise TTS suitable for real-time applications? A: No. The multi-stage pipeline is designed for offline batch generation. For real-time needs, consider lighter models like StyleTTS 2 or Kokoro.
