Scripts · Mar 31, 2026 · 2 min read

F5-TTS — Flow Matching Text-to-Speech

F5-TTS is a diffusion-transformer TTS system built on flow matching, with 14.3K+ GitHub stars. Multi-speaker synthesis, voice chat, a Gradio UI, CLI inference, and 0.04 RTF on an L20 GPU. MIT-licensed code.

TL;DR
F5-TTS is a diffusion transformer TTS system with multi-speaker support and 0.04 real-time factor.
§01

What it is

F5-TTS is a diffusion transformer-based text-to-speech system using flow matching with ConvNeXt V2 architecture. It delivers multi-speaker and multi-style speech synthesis, voice chat powered by Qwen2.5-3B-Instruct, a Gradio web interface for inference and fine-tuning, and CLI inference.

With Triton and TensorRT-LLM optimization, F5-TTS achieves a 0.0394 real-time factor on an L20 GPU, making it one of the fastest open-source TTS systems available. The code is MIT licensed; the pre-trained models are CC-BY-NC.

§02

How it saves time or tokens

F5-TTS provides a complete TTS pipeline in a single package: reference audio input, text input, and high-quality speech output. Traditional TTS setups require assembling multiple components (text normalization, acoustic model, vocoder). F5-TTS handles all stages internally. The Gradio UI enables quick prototyping without writing code. The CLI interface integrates into automation pipelines. Voice cloning from a reference audio sample eliminates the need for training custom voice models.
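For automation, the CLI can be wrapped from Python with `subprocess`; a minimal sketch (the `build_f5_cmd` helper is hypothetical, and the flags mirror the CLI example shown under "How to use"):

```python
import subprocess

def build_f5_cmd(ref_audio, ref_text, gen_text, model="F5TTS_v1_Base"):
    """Assemble the f5-tts_infer-cli argument list for one utterance."""
    return [
        "f5-tts_infer-cli",
        "--model", model,
        "--ref_audio", ref_audio,
        "--ref_text", ref_text,
        "--gen_text", gen_text,
    ]

def synthesize(ref_audio, ref_text, gen_text):
    """Run the CLI for one utterance, raising if synthesis fails."""
    subprocess.run(build_f5_cmd(ref_audio, ref_text, gen_text), check=True)

# Build (but do not run) a command for inspection
cmd = build_f5_cmd("ref.wav", "Reference text.", "Hello world.")
```

Wrapping the command builder separately from the runner makes it easy to fan the same reference voice out over many generated lines in a pipeline.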

§03

How to use

  1. Install F5-TTS:
pip install f5-tts
  2. CLI inference with a reference audio:
f5-tts_infer-cli \
  --model F5TTS_v1_Base \
  --ref_audio ref.wav \
  --ref_text 'Reference text matching the audio' \
  --gen_text 'Text you want to generate as speech'
  3. Launch the Gradio web UI:
f5-tts_infer-gradio
# For voice chat with Qwen2.5
f5-tts_infer-gradio --voicechat
§04

Example

Using F5-TTS in a Python script for batch generation:

import soundfile as sf

from f5_tts.api import F5TTS

# Load the pre-trained model by name; pass ckpt_file only for a local checkpoint
tts = F5TTS(model='F5TTS_v1_Base')

# Generate speech from reference audio
wav, sr, _ = tts.infer(
    ref_file='speaker_reference.wav',
    ref_text='This is the reference transcript.',
    gen_text='Generate this text in the same voice.',
    seed=42
)

# Save output
sf.write('output.wav', wav, sr)

# Batch generation with multiple texts, reusing the same reference voice
texts = [
    'Welcome to the product demo.',
    'Here are the key features.',
    'Thank you for watching.'
]
for i, text in enumerate(texts):
    wav, sr, _ = tts.infer(
        ref_file='speaker_reference.wav',
        ref_text='This is the reference transcript.',
        gen_text=text
    )
    sf.write(f'segment_{i}.wav', wav, sr)
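The batch loop writes one file per segment; to stitch them into a single track, a linear crossfade avoids clicks at the joins. A minimal numpy sketch (assumes all segments share one sample rate; 24 kHz is used here for illustration):

```python
import numpy as np

def crossfade_concat(segments, sr, fade_ms=50):
    """Join audio segments with a linear crossfade; all must share rate sr."""
    fade = int(sr * fade_ms / 1000)
    out = segments[0].astype(np.float32)
    for seg in segments[1:]:
        seg = seg.astype(np.float32)
        n = min(fade, len(out), len(seg))
        ramp = np.linspace(0.0, 1.0, n)
        # Blend the tail of the running track with the head of the next segment
        out[-n:] = out[-n:] * (1.0 - ramp) + seg[:n] * ramp
        out = np.concatenate([out, seg[n:]])
    return out

# Synthetic 1-second segments standing in for TTS output
sr = 24000
a = np.ones(sr)
b = np.ones(sr) * 0.5
joined = crossfade_concat([a, b], sr)
```

The joined array can then be written out once with `sf.write` instead of shipping many small files.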
§05

Common pitfalls

  • Reference audio quality directly affects output quality. Use clean, noise-free recordings of at least 5 seconds for the best voice cloning results.
  • The CC-BY-NC license on pre-trained models restricts commercial use. Train your own models for commercial applications.
  • Running F5-TTS without GPU acceleration is very slow. A CUDA-capable GPU is recommended for practical use.

Frequently Asked Questions

What is flow matching in F5-TTS?

Flow matching is a diffusion-based generative method that trains a model to transform noise into speech spectrograms. Compared to traditional diffusion, flow matching provides faster inference with fewer denoising steps while maintaining high audio quality.
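As a toy illustration of the idea (a stand-in for F5-TTS's actual model, which operates on mel spectrograms), the conditional velocity field along straight noise-to-data paths can be integrated with a few Euler steps:

```python
import numpy as np

rng = np.random.default_rng(0)

# Target "data": a fixed 1-D point; in TTS this would be a mel spectrogram.
x1 = np.array([2.0])

def velocity(x_t, t):
    # On the straight path x_t = (1 - t) * x0 + t * x1, the conditional
    # flow-matching target velocity is x1 - x0 = (x1 - x_t) / (1 - t).
    # A trained network approximates this field; here we use the exact one.
    return (x1 - x_t) / (1.0 - t)

# Euler integration from noise toward data in a handful of steps
x = rng.standard_normal(1)          # x0 ~ N(0, 1)
steps = 8
for i in range(steps):
    t = i / steps
    x = x + velocity(x, t) / steps  # dt = 1 / steps

# x now lands (numerically) on x1
```

Because the paths are straight, even coarse Euler steps reach the target, which is why flow matching needs far fewer steps than classic diffusion samplers.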

Can F5-TTS clone any voice?

F5-TTS performs zero-shot voice cloning from a reference audio sample. Provide a short audio clip (5-15 seconds) and its transcript, and F5-TTS generates new speech in that voice. Quality depends on reference audio clarity.

What hardware does F5-TTS require?

For inference, a GPU with at least 4GB VRAM is recommended. The TensorRT-optimized version achieves 0.04 real-time factor on an L20 GPU. CPU inference works but is significantly slower.
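Real-time factor is synthesis wall-clock time divided by the duration of the audio produced; a quick sketch with made-up timings shows how to read the 0.04 figure:

```python
def real_time_factor(synthesis_seconds, audio_seconds):
    """RTF < 1 means faster than real time; 0.04 means 25x real time."""
    return synthesis_seconds / audio_seconds

# e.g. 0.4 s of compute to synthesize a 10 s clip
rtf = real_time_factor(0.4, 10.0)
speedup = 1.0 / rtf  # how many seconds of audio per second of compute
```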

What is the voice chat feature?

The voice chat mode combines F5-TTS with Qwen2.5-3B-Instruct to create an interactive voice conversation system. You speak, the system transcribes your speech, generates a text response, and speaks it back.

Does F5-TTS support multiple languages?

The base model primarily supports English and Chinese. Community models extend support to other languages. Fine-tuning on your target language's data is supported via the Gradio UI.


Source & Thanks

Created by SWivid. Code: MIT, Models: CC-BY-NC. SWivid/F5-TTS — 14,300+ GitHub stars
