Cette page est affichée en anglais. Une traduction française est en cours.

SkillsMar 31, 2026·2 min de lecture

F5-TTS — Flow Matching Text-to-Speech

F5-TTS is a diffusion transformer TTS system with flow matching. 14.3K+ GitHub stars. Multi-speaker, voice chat, Gradio UI, CLI inference, 0.04 RTF on L20 GPU. MIT code.

Script Depot · Community

Prêt pour agents

Installation agent prête

Cet actif peut être installé après choix du runtime, vérification du plan et exécution de la commande adaptée.

Native · 98/100Policy : autoriser

Surface agent

Tout agent MCP/CLI

Type

Skill

Installation

Single

Confiance

Confiance : Established

Point d'entrée

F5-TTS — Flow Matching Text-to-Speech

Commande d'installation directe

npx -y tokrepo@latest install 093755c4-a497-4f6d-9e00-4c41cbd49c90 --target codex

À exécuter après confirmation du plan en dry-run.

TL;DR

F5-TTS is a diffusion transformer TTS system with multi-speaker support and 0.04 real-time factor.

§01

What it is

F5-TTS is a diffusion transformer-based text-to-speech system using flow matching with ConvNeXt V2 architecture. It delivers multi-speaker and multi-style speech synthesis, voice chat powered by Qwen2.5-3B-Instruct, a Gradio web interface for inference and fine-tuning, and CLI inference.

With Triton and TensorRT-LLM optimization, F5-TTS achieves 0.0394 real-time factor on L20 GPU, making it one of the fastest open-source TTS systems available. MIT licensed code with CC-BY-NC pre-trained models.

§02

How it saves time or tokens

F5-TTS provides a complete TTS pipeline in a single package: reference audio input, text input, and high-quality speech output. Traditional TTS setups require assembling multiple components (text normalization, acoustic model, vocoder). F5-TTS handles all stages internally. The Gradio UI enables quick prototyping without writing code. The CLI interface integrates into automation pipelines. Voice cloning from a reference audio sample eliminates the need for training custom voice models.

§03

How to use

Install F5-TTS:

pip install f5-tts

CLI inference with a reference audio:

f5-tts_infer-cli \
  --model F5TTS_v1_Base \
  --ref_audio ref.wav \
  --ref_text 'Reference text matching the audio' \
  --gen_text 'Text you want to generate as speech'

Launch the Gradio web UI:

f5-tts_infer-gradio
# For voice chat with Qwen2.5
f5-tts_infer-gradio --voicechat

§04

Example

Using F5-TTS in a Python script for batch generation:

from f5_tts.api import F5TTS

tts = F5TTS(model_type='F5-TTS', ckpt_file='F5TTS_v1_Base')

# Generate speech from reference audio
wav, sr, _ = tts.infer(
    ref_file='speaker_reference.wav',
    ref_text='This is the reference transcript.',
    gen_text='Generate this text in the same voice.',
    seed=42
)

# Save output
import soundfile as sf
sf.write('output.wav', wav, sr)

# Batch generation with multiple texts
texts = [
    'Welcome to the product demo.',
    'Here are the key features.',
    'Thank you for watching.'
]
for i, text in enumerate(texts):
    wav, sr, _ = tts.infer(
        ref_file='speaker_reference.wav',
        ref_text='Reference text.',
        gen_text=text
    )
    sf.write(f'segment_{i}.wav', wav, sr)

§05

Related on TokRepo

AI tools for voice — More text-to-speech and voice tools on TokRepo.
Featured workflows — Discover curated AI tools.

§06

Common pitfalls

Reference audio quality directly affects output quality. Use clean, noise-free recordings of at least 5 seconds for the best voice cloning results.
The CC-BY-NC license on pre-trained models restricts commercial use. Train your own models for commercial applications.
Running F5-TTS without GPU acceleration is very slow. A CUDA-capable GPU is recommended for practical use.

Questions fréquentes

What is flow matching in F5-TTS?+

Flow matching is a diffusion-based generative method that trains a model to transform noise into speech spectrograms. Compared to traditional diffusion, flow matching provides faster inference with fewer denoising steps while maintaining high audio quality.

Can F5-TTS clone any voice?+

F5-TTS performs zero-shot voice cloning from a reference audio sample. Provide a short audio clip (5-15 seconds) and its transcript, and F5-TTS generates new speech in that voice. Quality depends on reference audio clarity.

What hardware does F5-TTS require?+

For inference, a GPU with at least 4GB VRAM is recommended. The TensorRT-optimized version achieves 0.04 real-time factor on an L20 GPU. CPU inference works but is significantly slower.

What is the voice chat feature?+

The voice chat mode combines F5-TTS with Qwen2.5-3B-Instruct to create an interactive voice conversation system. You speak, the system transcribes your speech, generates a text response, and speaks it back.

Does F5-TTS support multiple languages?+

The base model primarily supports English and Chinese. Community models extend support to other languages. Fine-tuning on your target language's data is supported via the Gradio UI.

Sources citées (3)

F5-TTS GitHub— F5-TTS diffusion transformer TTS system
arXiv paper— Flow matching for generative models
ConvNeXt V2 paper— ConvNeXt V2 architecture

En lien sur TokRepo

Voice tools Featured workflows Coding tools

🙏

Source et remerciements

Created by SWivid. Code: MIT, Models: CC-BY-NC. SWivid/F5-TTS — 14,300+ GitHub stars

Fil de discussion

Connectez-vous pour rejoindre la discussion.

Aucun commentaire pour l'instant. Soyez le premier à partager votre avis.

Actifs similaires

Index TTS — Industrial Zero-Shot Text-to-Speech System

A controllable and efficient zero-shot text-to-speech system built for industrial use, supporting voice cloning and cross-lingual synthesis with high-quality output.

Skills

Script Depot

Tortoise TTS — Multi-Voice Text-to-Speech Focused on Quality

A multi-voice TTS system trained with an emphasis on audio quality. Uses autoregressive and diffusion models to produce natural, expressive speech from text.

Skills

AI Open Source

StyleTTS 2 — Human-Level Text-to-Speech via Style Diffusion

A TTS system that achieves human-level speech synthesis through style diffusion and adversarial training with large speech language models. Fast inference with natural prosody.

Skills

Script Depot

Parler-TTS — High-Quality Text-to-Speech Training and Inference Library

Parler-TTS by Hugging Face provides inference and training capabilities for high-quality text-to-speech models with natural prosody and controllable speaker attributes described in plain text.

Scripts

Script Depot