Skills · Mar 31, 2026 · 2 min read

F5-TTS — Flow Matching Text-to-Speech

F5-TTS is a diffusion transformer TTS system with flow matching. 14.3K+ GitHub stars. Multi-speaker, voice chat, Gradio UI, CLI inference, 0.04 RTF on L20 GPU. MIT code.

Agent ready

Agent installation ready

This asset can be installed once you have chosen a runtime, reviewed the plan, and run the matching command.

Native · 98/100 · Policy: allow
Agent surface: Any MCP/CLI agent
Type: Skill
Installation: Single
Trust: Established
Entry point:
F5-TTS — Flow Matching Text-to-Speech
Direct install command
npx -y tokrepo@latest install 093755c4-a497-4f6d-9e00-4c41cbd49c90 --target codex

Run this after confirming the plan in a dry run.

TL;DR
F5-TTS is a diffusion transformer TTS system with multi-speaker support and 0.04 real-time factor.
§01

What it is

F5-TTS is a diffusion transformer-based text-to-speech system using flow matching with ConvNeXt V2 architecture. It delivers multi-speaker and multi-style speech synthesis, voice chat powered by Qwen2.5-3B-Instruct, a Gradio web interface for inference and fine-tuning, and CLI inference.

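With Triton and TensorRT-LLM optimization, F5-TTS reaches a real-time factor (RTF) of 0.0394 on an NVIDIA L20 GPU, meaning one second of audio takes about 0.04 seconds to synthesize. This makes it one of the fastest open-source TTS systems available. The code is MIT licensed; the pre-trained models are released under CC-BY-NC.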
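Real-time factor is the ratio of synthesis time to the duration of the audio produced, so lower is better. The following is a minimal sketch of the arithmetic behind the reported figure. The function names are illustrative and not part of F5-TTS, and the 10-second clip is an assumed example rather than a measured benchmark.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Return synthesis time divided by the duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def speedup_over_real_time(rtf: float) -> float:
    """Return how many seconds of audio are produced per second of compute."""
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return 1.0 / rtf


# Assumed example: 10 s of audio synthesized in 0.394 s.
rtf = real_time_factor(0.394, 10.0)
print(f"RTF {rtf:.4f}, about {speedup_over_real_time(rtf):.1f}x faster than real time")
```

An RTF below 1.0 means synthesis keeps up with playback, which is the threshold that matters for interactive use.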

§02

How it saves time or tokens

F5-TTS provides a complete TTS pipeline in a single package: reference audio input, text input, and high-quality speech output. Traditional TTS setups require assembling multiple components (text normalization, acoustic model, vocoder). F5-TTS handles all stages internally. The Gradio UI enables quick prototyping without writing code. The CLI interface integrates into automation pipelines. Voice cloning from a reference audio sample eliminates the need for training custom voice models.

§03

How to use

  1. Install F5-TTS:
pip install f5-tts
  2. CLI inference with a reference audio:
f5-tts_infer-cli \
  --model F5TTS_v1_Base \
  --ref_audio ref.wav \
  --ref_text 'Reference text matching the audio' \
  --gen_text 'Text you want to generate as speech'
  3. Launch the Gradio web UI:
f5-tts_infer-gradio
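# For voice chat with Qwen2.5. In upstream releases voice chat appears as a tab
# inside the Gradio app; run `f5-tts_infer-gradio --help` to confirm that your
# version accepts the flag below.
f5-tts_infer-gradio --voicechat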
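For automation, the CLI call from the steps above can be wrapped in a small script. This is a sketch that uses only the flags shown in this section; `build_infer_command` and `run_inference` are hypothetical helper names, and the dry-run default prints the command instead of executing it, so the example runs without F5-TTS installed.

```python
import shlex
import subprocess


def build_infer_command(ref_audio, ref_text, gen_text, model="F5TTS_v1_Base"):
    """Build the argv list for the f5-tts_infer-cli call shown above."""
    return [
        "f5-tts_infer-cli",
        "--model", model,
        "--ref_audio", ref_audio,
        "--ref_text", ref_text,
        "--gen_text", gen_text,
    ]


def run_inference(ref_audio, ref_text, gen_text, dry_run=True):
    """Print the command in dry-run mode; otherwise execute it and raise on failure."""
    cmd = build_infer_command(ref_audio, ref_text, gen_text)
    if dry_run:
        print(shlex.join(cmd))
        return cmd
    subprocess.run(cmd, check=True)
    return cmd


cmd = run_inference("ref.wav", "Reference text matching the audio", "Hello from a script.")
```

Passing the arguments as a list avoids shell quoting problems when the text contains apostrophes or quotes.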
§04

Example

Using F5-TTS in a Python script for batch generation:

import soundfile as sf
from f5_tts.api import F5TTS

# Load the pretrained base model. Recent releases take `model`; older ones
# used `model_type='F5-TTS'`, so check the signature in your installed version.
tts = F5TTS(model='F5TTS_v1_Base')

# Generate speech in the voice of the reference audio
wav, sr, _ = tts.infer(
    ref_file='speaker_reference.wav',
    ref_text='This is the reference transcript.',
    gen_text='Generate this text in the same voice.',
    seed=42
)

# Save output
sf.write('output.wav', wav, sr)

# Batch generation with multiple texts
texts = [
    'Welcome to the product demo.',
    'Here are the key features.',
    'Thank you for watching.'
]
for i, text in enumerate(texts):
    wav, sr, _ = tts.infer(
        ref_file='speaker_reference.wav',
        # ref_text must be the transcript of the reference audio
        ref_text='This is the reference transcript.',
        gen_text=text
    )
    sf.write(f'segment_{i}.wav', wav, sr)
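When the input is one long script rather than a prepared list, it can be split into segments before the batch loop. The helper below is a hypothetical sketch that uses only the standard library; `split_into_segments` and the `max_chars` limit are assumptions for illustration and are not part of the F5-TTS API.

```python
import re


def split_into_segments(text: str, max_chars: int = 200) -> list[str]:
    """Split text at sentence ends, packing sentences into chunks of at most max_chars.

    A single sentence longer than max_chars is kept whole rather than cut mid-sentence.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    segments: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            segments.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        segments.append(current)
    return segments


script = "Welcome to the product demo. Here are the key features. Thank you for watching."
for i, segment in enumerate(split_into_segments(script, max_chars=60)):
    print(i, segment)
```

Each returned segment can then be passed as `gen_text` in the loop above.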
§06

Common pitfalls

  • Reference audio quality directly affects output quality. Use clean, noise-free recordings of at least 5 seconds for the best voice cloning results.
  • The CC-BY-NC license on pre-trained models restricts commercial use. Train your own models for commercial applications.
  • Running F5-TTS without GPU acceleration is very slow. A CUDA-capable GPU is recommended for practical use.
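The reference-length guideline above can be checked before running inference. This sketch uses Python's standard `wave` module, so it only handles PCM WAV files; `check_reference_audio` is a hypothetical helper, and the 5-second default mirrors the guideline in this section rather than a limit enforced by F5-TTS.

```python
import os
import tempfile
import wave


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds (frames / sample rate)."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def check_reference_audio(path: str, min_seconds: float = 5.0) -> bool:
    """Return True when the clip is at least min_seconds long."""
    return wav_duration_seconds(path) >= min_seconds


# Demo with a generated 6-second silent clip; replace with your own recording.
demo_path = os.path.join(tempfile.mkdtemp(), "ref.wav")
with wave.open(demo_path, "wb") as out:
    out.setnchannels(1)      # mono
    out.setsampwidth(2)      # 16-bit samples
    out.setframerate(16000)  # 16 kHz
    out.writeframes(b"\x00\x00" * 16000 * 6)

print(check_reference_audio(demo_path))
```

A duration check does not measure noise or clarity, so it complements listening to the clip rather than replacing it.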

Frequently asked questions

What is flow matching in F5-TTS?

Flow matching is a generative training method closely related to diffusion: the model learns a velocity field that transports noise to speech mel spectrograms. Compared with traditional diffusion, it typically needs fewer sampling steps at inference while maintaining high audio quality.
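To make the idea concrete, here is a toy sketch in plain Python, not F5-TTS code. It assumes the straight-line path commonly used with flow matching, x_t = (1 - t) * x0 + t * x1, whose velocity is x1 - x0. A trained model would predict this velocity from x_t and t; here the exact velocity stands in for the model, and fixed-step Euler integration carries a noise vector to the target.

```python
def euler_sample(x0, velocity_fn, num_steps):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


noise = [0.5, -1.2, 0.3]    # stand-in for a Gaussian noise sample
target = [1.0, 2.0, -0.5]   # stand-in for a mel-spectrogram frame


def ideal_velocity(x, t):
    """Exact velocity of the straight path from noise to target (constant in t)."""
    return [x1 - x0 for x0, x1 in zip(noise, target)]


sample = euler_sample(noise, ideal_velocity, num_steps=4)
print(sample)
```

Because the ideal path is a straight line, a handful of Euler steps is enough in this toy case. A learned velocity field is only approximately straight, which is why real systems still use more steps.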

Can F5-TTS clone any voice?

F5-TTS performs zero-shot voice cloning from a reference audio sample. Provide a short audio clip (5-15 seconds) and its transcript, and F5-TTS generates new speech in that voice. Quality depends on reference audio clarity.

What hardware does F5-TTS require?

For inference, a GPU with at least 4GB VRAM is recommended. The TensorRT-optimized version achieves 0.04 real-time factor on an L20 GPU. CPU inference works but is significantly slower.

What is the voice chat feature?

The voice chat mode combines F5-TTS with Qwen2.5-3B-Instruct to create an interactive voice conversation system. You speak, the system transcribes your speech, generates a text response, and speaks it back.
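The three stages of a voice chat turn can be sketched as a simple function composition. This is an illustration of the flow described above, not the F5-TTS implementation: `voice_chat_turn` and its callable parameters are hypothetical, and the stub stages stand in for a speech recognizer, the language model, and the TTS model.

```python
from typing import Callable


def voice_chat_turn(
    user_audio: bytes,
    transcribe: Callable[[bytes], str],
    respond: Callable[[str], str],
    synthesize: Callable[[str], bytes],
) -> tuple[str, str, bytes]:
    """Run one turn: speech to text, text to reply, reply to speech."""
    user_text = transcribe(user_audio)
    reply_text = respond(user_text)
    reply_audio = synthesize(reply_text)
    return user_text, reply_text, reply_audio


# Stub stages so the sketch runs without any models installed.
user_text, reply_text, reply_audio = voice_chat_turn(
    b"fake-audio",
    transcribe=lambda audio: "hello",
    respond=lambda text: f"You said: {text}",
    synthesize=lambda text: text.encode("utf-8"),
)
print(user_text, "->", reply_text)
```

Keeping each stage behind a callable makes it possible to swap any one of them without touching the others.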

Does F5-TTS support multiple languages?

The base model primarily supports English and Chinese. Community models extend support to other languages. Fine-tuning on your target language's data is supported via the Gradio UI.

Source and acknowledgments

Created by SWivid. Code: MIT, Models: CC-BY-NC. SWivid/F5-TTS — 14,300+ GitHub stars