Configs · May 13, 2026 · 3 min read

GPT-SoVITS — Few-Shot Voice Cloning and Text-to-Speech

An open-source TTS system that clones a speaker's voice from as little as one minute of reference audio, combining GPT-style language modeling with VITS synthesis for natural speech generation.

Introduction

GPT-SoVITS is an open-source text-to-speech system that achieves voice cloning from as little as one minute of reference audio. It combines GPT-based language modeling for prosody with VITS (Variational Inference with adversarial learning for end-to-end TTS) for high-quality waveform synthesis.

What GPT-SoVITS Does

  • Clones a speaker's voice from 1-10 minutes of reference audio recordings
  • Generates natural-sounding speech in the cloned voice from text input
  • Supports cross-lingual voice cloning across Chinese, English, and Japanese
  • Provides a web UI for training, inference, and audio management
  • Includes tools for dataset preparation, annotation, and audio preprocessing
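The slicing step in dataset preparation cuts long recordings into short utterances before annotation. A minimal sketch of that logic, operating on segment boundaries in seconds (the function name, segment lengths, and merge rule are illustrative, not GPT-SoVITS's actual API):

```python
def slice_audio(duration_s: float, segment_s: float = 10.0, min_s: float = 2.0):
    """Split a recording of `duration_s` seconds into segment boundaries.

    Segments shorter than `min_s` are merged into the previous one,
    so the slicer never emits a tiny trailing clip.
    """
    segments = []
    start = 0.0
    while start < duration_s:
        end = min(start + segment_s, duration_s)
        if segments and end - start < min_s:
            # Extend the previous segment instead of emitting a tiny clip
            segments[-1] = (segments[-1][0], end)
        else:
            segments.append((start, end))
        start = end
    return segments

print(slice_audio(25.0))  # [(0.0, 10.0), (10.0, 20.0), (20.0, 25.0)]
```

Real slicers also cut at silence rather than at fixed offsets; the boundary bookkeeping is the same.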

Architecture Overview

GPT-SoVITS uses a two-stage pipeline. First, a GPT-based model predicts semantic tokens from text, capturing prosody and rhythm. Then a VITS-based model converts these tokens into a high-fidelity waveform matching the target speaker's voice characteristics. Speaker embedding is extracted from reference audio using a pretrained encoder, enabling few-shot adaptation.
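The two-stage flow can be sketched as plain Python. Both classes are dependency-free stand-ins for the real neural models (a hash replaces the GPT stage so the sketch stays deterministic); only the data flow — text to semantic tokens, then tokens plus a speaker embedding to waveform samples — mirrors the pipeline described above:

```python
import hashlib

class SemanticModel:
    """Stage 1 stand-in: maps text to a sequence of semantic token IDs.

    The real GPT stage is a neural model capturing prosody and rhythm;
    a hash keeps this sketch deterministic and dependency-free.
    """
    def predict_tokens(self, text: str) -> list[int]:
        digest = hashlib.sha256(text.encode()).digest()
        return [b % 1024 for b in digest[:16]]  # 16 tokens from a 1024-ID vocab

class Vocoder:
    """Stage 2 stand-in: renders tokens to waveform samples,
    conditioned on a speaker embedding extracted from reference audio."""
    def synthesize(self, tokens: list[int], speaker_embedding: list[float]) -> list[float]:
        gain = sum(speaker_embedding) / len(speaker_embedding)
        return [gain * (t / 1024.0) for t in tokens]

def tts(text: str, speaker_embedding: list[float]) -> list[float]:
    tokens = SemanticModel().predict_tokens(text)            # prosody stage
    return Vocoder().synthesize(tokens, speaker_embedding)   # waveform stage

samples = tts("Hello world", speaker_embedding=[0.5, 0.7, 0.3])
```

The key design point survives the simplification: the speaker identity enters only at stage 2, which is why a new voice needs only a new embedding, not a retrained text model.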

Self-Hosting & Configuration

  • Requires Python 3.9+ with PyTorch and CUDA for GPU-accelerated training and inference
  • Pretrained base models are downloaded automatically on first run
  • Training a voice clone takes 30-60 minutes on a consumer GPU with 1 minute of audio
  • The web UI runs locally with no external API dependencies
  • Supports CPU-only inference at reduced speed for machines without GPUs
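Given the requirements above, a small preflight check can tell whether an environment supports GPU training or only CPU inference. This is a sketch, not part of GPT-SoVITS's own setup scripts; it probes for PyTorch without importing it, so it runs even before installation:

```python
import importlib.util
import sys

def preflight(min_version: tuple = (3, 9)) -> dict:
    """Report whether this environment meets the Python 3.9+ requirement
    and whether PyTorch is installed (needed for training and inference)."""
    return {
        "python_ok": sys.version_info[:2] >= min_version,
        # find_spec only checks for the package; it does not import torch
        "torch_available": importlib.util.find_spec("torch") is not None,
    }

print(preflight())
```

If `torch_available` is true, `torch.cuda.is_available()` then distinguishes GPU-accelerated training from CPU-only inference.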

Key Features

  • One-minute voice cloning produces recognizable speaker identity and style
  • Cross-lingual synthesis supports Chinese, English, and Japanese text
  • Built-in dataset tools handle audio slicing, denoising, and automatic transcription
  • Fine-tuning from pretrained models converges quickly even on consumer hardware
  • Batch inference mode for generating large volumes of audio efficiently
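Batch inference amounts to grouping input lines so each synthesis call processes several texts at once. A minimal, stdlib-only sketch of that batching (the batch size and helper name are illustrative):

```python
from typing import Iterable, Iterator, List

def batched(texts: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Group input lines into fixed-size batches, one synthesis call each.

    The final batch may be smaller than `batch_size`.
    """
    batch: List[str] = []
    for text in texts:
        batch.append(text)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

lines = ["line one", "line two", "line three", "line four", "line five"]
print(list(batched(lines, 2)))
```

Each yielded batch would then be handed to the model in one forward pass, which is where the efficiency gain over one-line-at-a-time synthesis comes from.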

Comparison with Similar Tools

  • Bark — generates speech with music and effects; GPT-SoVITS specializes in voice cloning fidelity
  • Coqui TTS — broader TTS toolkit; GPT-SoVITS achieves better few-shot cloning quality
  • Fish Speech — multilingual TTS; GPT-SoVITS offers a more mature training pipeline
  • F5-TTS — flow-matching approach; GPT-SoVITS uses GPT + VITS with established community support
  • Kokoro — lightweight TTS; GPT-SoVITS provides deeper voice cloning from minimal data

FAQ

Q: How much audio data is needed to clone a voice? A: As little as 1 minute for basic cloning, though 5-10 minutes yields better results.

Q: Can it run on CPU only? A: Yes, inference works on CPU but is significantly slower. Training requires a CUDA GPU.

Q: Is the output suitable for production use? A: Output quality is high enough for many use cases, but evaluate it against your specific requirements before deploying.

Q: What audio formats are supported? A: WAV is the primary format. MP3 and other formats are converted automatically during preprocessing.
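Since WAV is the primary format, it can be useful to inspect a reference file's parameters before preprocessing. Python's standard `wave` module handles this; the snippet below writes a short silent WAV and reads its parameters back (a self-contained sketch, not part of the project's tooling):

```python
import struct
import wave

def write_test_wav(path: str, sample_rate: int = 16000, seconds: float = 0.01) -> None:
    """Write a short silent mono 16-bit PCM WAV file."""
    n_frames = int(sample_rate * seconds)
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)          # mono
        wf.setsampwidth(2)          # 16-bit samples
        wf.setframerate(sample_rate)
        wf.writeframes(struct.pack(f"<{n_frames}h", *([0] * n_frames)))

def wav_params(path: str) -> tuple:
    """Return (channels, sample_width_bytes, sample_rate) of a WAV file."""
    with wave.open(path, "rb") as wf:
        return wf.getnchannels(), wf.getsampwidth(), wf.getframerate()

write_test_wav("ref.wav")
print(wav_params("ref.wav"))  # (1, 2, 16000)
```

Non-WAV inputs such as MP3 need a decoder (e.g. ffmpeg) first, which is the conversion the preprocessing step performs automatically.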
