Skills · May 1, 2026 · 3 min read

Tortoise TTS — Multi-Voice Text-to-Speech Focused on Quality

A multi-voice TTS system trained with an emphasis on audio quality. Uses autoregressive and diffusion models to produce natural, expressive speech from text.

Agent-ready installation

This asset can be installed once you have chosen a runtime, reviewed the plan, and run the matching command.

Native · 98/100 · Policy: allow
Agent surface: Any MCP/CLI agent
Type: Skill
Installation: Single
Trust: Established
Entry point: Tortoise TTS Overview

Direct install command:
npx -y tokrepo@latest install 66712f72-453a-11f1-9bc6-00163e2b0d79 --target codex

Run this after confirming the plan in a dry run.

Introduction

Tortoise TTS is a text-to-speech system designed to produce high-quality, natural-sounding audio. It uses an autoregressive decoder paired with a diffusion model to generate speech that closely mimics human prosody, making it one of the most realistic open-source TTS systems available.

What Tortoise TTS Does

  • Converts text into natural-sounding speech using a multi-stage generative pipeline
  • Supports voice cloning from short reference audio clips (as few as 3 seconds)
  • Provides multiple quality presets trading speed for audio fidelity
  • Includes several built-in voices and supports custom voice creation
  • Generates speech with varied intonation and natural pauses

Architecture Overview

Tortoise uses a three-stage pipeline. First, an autoregressive Transformer generates discrete audio tokens from text, conditioned on voice embeddings extracted from reference clips. Next, a DDPM diffusion model refines these tokens into a mel spectrogram. Finally, a UnivNet vocoder converts the spectrogram to a raw waveform. This multi-stage approach prioritizes output quality over inference speed.
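
The data flow described above can be sketched as three composed functions. This is only a shape-level illustration: the stage functions are hypothetical placeholders that return dummy values, not Tortoise's actual models or API, and the sizes used (one token per character, 80 mel bins, 256 samples per frame) are assumptions chosen for readability.

```python
from typing import List

# Hypothetical placeholder stages. They only mimic the *shape* of the data
# handed from one stage to the next and do not run any real model.

MEL_BINS = 80            # assumed mel-spectrogram height (illustrative)
SAMPLES_PER_FRAME = 256  # assumed vocoder hop size (illustrative)


def autoregressive_stage(text: str, voice_embedding: List[float]) -> List[int]:
    """Stage 1: text + voice conditioning -> discrete audio tokens (dummy)."""
    # One dummy token per character; a real model samples tokens one by one.
    return [ord(ch) % 256 for ch in text]


def diffusion_stage(tokens: List[int]) -> List[List[float]]:
    """Stage 2: discrete tokens -> mel-spectrogram frames (dummy zeros)."""
    return [[0.0] * MEL_BINS for _ in tokens]


def vocoder_stage(mel: List[List[float]]) -> List[float]:
    """Stage 3: mel spectrogram -> raw waveform samples (dummy silence)."""
    return [0.0] * (len(mel) * SAMPLES_PER_FRAME)


def synthesize(text: str, voice_embedding: List[float]) -> List[float]:
    """Chain the three stages in the order described in the text."""
    tokens = autoregressive_stage(text, voice_embedding)
    mel = diffusion_stage(tokens)
    return vocoder_stage(mel)


waveform = synthesize("Hello", voice_embedding=[0.0] * 4)
print(len(waveform))  # 5 tokens * 256 samples per frame = 1280
```

Each stage consumes only the previous stage's output, which is why the slow parts (token sampling and diffusion) cannot be skipped without changing the result.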

Self-Hosting & Configuration

  • Install via pip: pip install tortoise-tts (requires PyTorch; a CUDA build is needed for GPU inference)
  • Requires a GPU with at least 6 GB VRAM; runs on CPU but very slowly
  • Voice references stored as WAV files in the voices/ directory, organized by speaker name
  • Quality presets (ultra_fast, fast, standard, high_quality) trade speed for fidelity, mainly by changing how many autoregressive samples and diffusion steps are used
  • Run headless for batch processing or integrate into Python scripts via the API
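
A minimal sketch of scripted use, following the usage pattern shown in the project's README (TextToSpeech, load_voice, tts_with_preset). The helper names list_voices and synthesize_to_file, the voices_dir argument, and the 24 kHz output rate are assumptions of this sketch and should be checked against the version you install. The heavy imports are deferred so the helpers can be loaded on a machine without PyTorch.

```python
from pathlib import Path
from typing import List

# Preset names as listed above.
PRESETS = ("ultra_fast", "fast", "standard", "high_quality")


def list_voices(voices_dir: str) -> List[str]:
    """Return speaker names: sub-directories holding at least one WAV file."""
    root = Path(voices_dir)
    if not root.is_dir():
        return []
    return sorted(
        d.name for d in root.iterdir()
        if d.is_dir() and any(d.glob("*.wav"))
    )


def synthesize_to_file(text: str, voice: str, preset: str = "fast",
                       out_path: str = "generated.wav",
                       voices_dir: str = "voices") -> str:
    """Generate speech for `text` in `voice` and write it to `out_path`."""
    if preset not in PRESETS:
        raise ValueError(f"unknown preset {preset!r}; expected one of {PRESETS}")

    # Deferred imports: these need tortoise-tts and PyTorch to be installed.
    import torchaudio
    from tortoise.api import TextToSpeech
    from tortoise.utils.audio import load_voice

    tts = TextToSpeech()  # loads model weights on first use
    # Assumption: extra_voice_dirs lets load_voice search a custom directory.
    voice_samples, conditioning_latents = load_voice(
        voice, extra_voice_dirs=[voices_dir]
    )
    audio = tts.tts_with_preset(
        text,
        voice_samples=voice_samples,
        conditioning_latents=conditioning_latents,
        preset=preset,
    )
    # Assumption: output is a 24 kHz waveform tensor, as in the README example.
    torchaudio.save(out_path, audio.squeeze(0).cpu(), 24000)
    return out_path
```

For batch jobs, calling list_voices("voices") first and looping over the result keeps one script working as speakers are added to the directory.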

Key Features

  • Among the most natural-sounding open-source TTS systems available
  • Voice cloning from minimal reference audio without fine-tuning
  • Multiple quality presets for different latency requirements
  • Built-in conditioning system for controlling emotion and speaking style
  • Fully offline operation with no API keys or cloud dependencies

Comparison with Similar Tools

  • Bark — supports music and sound effects alongside speech; Tortoise focuses purely on speech quality
  • Coqui TTS — broader model zoo and multilingual support; Tortoise offers superior single-speaker quality
  • StyleTTS 2 — faster inference with style-based synthesis; Tortoise produces richer prosody at the cost of speed
  • Fish Speech — optimized for multilingual real-time use; Tortoise prioritizes output naturalness
  • F5-TTS — flow matching approach with faster generation; Tortoise remains a benchmark for quality-first synthesis

FAQ

Q: How long does generation take? A: On an NVIDIA RTX 3090, the fast preset generates roughly 2 seconds of audio per second of wall time. The high_quality preset is 4-5x slower.
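
As a worked example of the figures in the answer above, the following sketch estimates wall-clock time for a clip of a given length. It assumes "4-5x slower" means the high_quality preset takes 4 to 5 times as long as fast; the function names are illustrative, and real timings depend on hardware and text.

```python
# Figures quoted in the answer above (RTX 3090); treat them as rough guides.
FAST_AUDIO_SECONDS_PER_WALL_SECOND = 2.0
HIGH_QUALITY_SLOWDOWN_RANGE = (4.0, 5.0)  # assumed multiplier relative to fast


def fast_wall_time(audio_seconds: float) -> float:
    """Estimated wall-clock seconds to generate `audio_seconds` with fast."""
    return audio_seconds / FAST_AUDIO_SECONDS_PER_WALL_SECOND


def high_quality_wall_time(audio_seconds: float) -> tuple:
    """Estimated (low, high) wall-clock seconds with high_quality."""
    base = fast_wall_time(audio_seconds)
    low, high = HIGH_QUALITY_SLOWDOWN_RANGE
    return (base * low, base * high)


print(fast_wall_time(60))          # 30.0
print(high_quality_wall_time(60))  # (120.0, 150.0)
```

So one minute of speech takes roughly 30 seconds with fast and about 2 to 2.5 minutes with high_quality under these assumptions.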

Q: Can I clone any voice? A: Tortoise can approximate a voice from 3-30 seconds of clean reference audio. More reference clips improve consistency and speaker similarity.
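
Before cloning, it can help to check that each reference clip falls inside the 3 to 30 second range mentioned above. The sketch below uses only Python's standard wave module, so it handles uncompressed PCM WAV files only; the function names and the hard thresholds are assumptions of this example, not part of Tortoise.

```python
import wave

MIN_SECONDS = 3.0   # lower bound quoted above
MAX_SECONDS = 30.0  # upper bound quoted above


def clip_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_usable_reference(path: str) -> bool:
    """True if the clip length is within the suggested reference range."""
    return MIN_SECONDS <= clip_duration_seconds(path) <= MAX_SECONDS
```

Duration is only one criterion; background noise and music in the clip still have to be checked by ear.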

Q: Does it support languages other than English? A: Tortoise is primarily trained on English data. Community forks exist for other languages, but quality varies.

Q: Is Tortoise TTS suitable for real-time applications? A: No. The multi-stage pipeline is designed for offline batch generation. For real-time needs, consider lighter models like StyleTTS 2 or Kokoro.
