F5-TTS — Flow Matching Text-to-Speech
F5-TTS is a diffusion transformer TTS system with flow matching. 14.3K+ GitHub stars. Multi-speaker, voice chat, Gradio UI, CLI inference, 0.04 RTF on an L20 GPU. MIT-licensed code, CC-BY-NC models.
What it is
F5-TTS is a diffusion transformer-based text-to-speech system using flow matching with ConvNeXt V2 architecture. It delivers multi-speaker and multi-style speech synthesis, voice chat powered by Qwen2.5-3B-Instruct, a Gradio web interface for inference and fine-tuning, and CLI inference.
With Triton and TensorRT-LLM optimization, F5-TTS achieves 0.0394 real-time factor on L20 GPU, making it one of the fastest open-source TTS systems available. MIT licensed code with CC-BY-NC pre-trained models.
How it saves time or tokens
F5-TTS provides a complete TTS pipeline in a single package: reference audio input, text input, and high-quality speech output. Traditional TTS setups require assembling multiple components (text normalization, acoustic model, vocoder). F5-TTS handles all stages internally. The Gradio UI enables quick prototyping without writing code. The CLI interface integrates into automation pipelines. Voice cloning from a reference audio sample eliminates the need for training custom voice models.
How to use
- Install F5-TTS:
pip install f5-tts
- CLI inference with a reference audio:
f5-tts_infer-cli \
--model F5TTS_v1_Base \
--ref_audio ref.wav \
--ref_text 'Reference text matching the audio' \
--gen_text 'Text you want to generate as speech'
- Launch the Gradio web UI:
f5-tts_infer-gradio
# The web UI includes a voice-chat tab powered by Qwen2.5-3B-Instruct
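For repeatable runs, the CLI can also read its options from a TOML config file (the repository ships a basic example config). The key names below mirror the CLI flags but should be treated as a sketch, not a verified schema:

```toml
# my_config.toml — pass with: f5-tts_infer-cli -c my_config.toml
model = "F5TTS_v1_Base"
ref_audio = "ref.wav"
ref_text = "Reference text matching the audio"
gen_text = "Text you want to generate as speech"
remove_silence = false
output_dir = "output"
```

Keeping reference paths and transcripts in one file avoids retyping long flag values across runs.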
Example
Using F5-TTS in a Python script for batch generation:
import soundfile as sf
from f5_tts.api import F5TTS

# Downloads the pretrained checkpoint on first use
tts = F5TTS(model='F5TTS_v1_Base')

# Generate speech from reference audio
wav, sr, _ = tts.infer(
    ref_file='speaker_reference.wav',
    ref_text='This is the reference transcript.',
    gen_text='Generate this text in the same voice.',
    seed=42,
)

# Save output
sf.write('output.wav', wav, sr)

# Batch generation with multiple texts
texts = [
    'Welcome to the product demo.',
    'Here are the key features.',
    'Thank you for watching.',
]
for i, text in enumerate(texts):
    wav, sr, _ = tts.infer(
        ref_file='speaker_reference.wav',
        ref_text='This is the reference transcript.',
        gen_text=text,
    )
    sf.write(f'segment_{i}.wav', wav, sr)
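Batch output like the above often needs to be stitched into one file. Since `infer` returns NumPy waveform arrays, joining segments is plain array work; the sketch below assumes 24 kHz output (F5-TTS's vocoder sample rate) and uses synthetic segments in place of real inference results:

```python
import numpy as np

def concat_segments(wavs, sr, gap_s=0.3):
    """Join per-sentence waveforms with a short silence gap between them."""
    gap = np.zeros(int(sr * gap_s), dtype=np.float32)
    parts = []
    for i, w in enumerate(wavs):
        parts.append(np.asarray(w, dtype=np.float32))
        if i < len(wavs) - 1:
            parts.append(gap)
    return np.concatenate(parts)

# Demo with three synthetic one-second segments at 24 kHz
sr = 24000
segments = [np.full(sr, 0.1, dtype=np.float32) for _ in range(3)]
joined = concat_segments(segments, sr, gap_s=0.3)
print(len(joined))  # 86400 samples: 3 segments + 2 gaps
```

Write `joined` with `sf.write('combined.wav', joined, sr)` to get a single narration track.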
Related on TokRepo
- AI tools for voice — More text-to-speech and voice tools on TokRepo.
- Featured workflows — Discover curated AI tools.
Common pitfalls
- Reference audio quality directly affects output quality. Use clean, noise-free recordings of at least 5 seconds for the best voice cloning results.
- The CC-BY-NC license on pre-trained models restricts commercial use. Train your own models for commercial applications.
- Running F5-TTS without GPU acceleration is very slow. A CUDA-capable GPU is recommended for practical use.
Frequently Asked Questions
What is flow matching?
Flow matching is a diffusion-based generative method that trains a model to transform noise into speech spectrograms. Compared to traditional diffusion, flow matching provides faster inference with fewer denoising steps while maintaining high audio quality.
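The few-step property comes from integrating a learned velocity field along a near-straight path from noise to data. This deliberately minimal 1-D sketch (not F5-TTS's actual sampler) hard-codes the straight-line velocity that conditional flow matching trains toward and transports a noise sample onto a target with a handful of Euler steps:

```python
import numpy as np

# Toy 1-D flow: a real model predicts a velocity field v(x, t);
# here we hard-code the straight-line field that conditional flow
# matching trains toward, so few Euler steps suffice.
rng = np.random.default_rng(0)
target = 2.0               # stands in for a spectrogram value
x = rng.standard_normal()  # start from pure noise
x0 = x
steps = 8                  # few steps, the selling point of flow matching
dt = 1.0 / steps
for _ in range(steps):
    v = target - x0        # constant velocity along the straight path
    x = x + v * dt
print(abs(x - target) < 1e-9)  # True: noise transported onto the target
```

With a curved (diffusion-like) path, the same step budget would leave visible integration error; the straight path is what makes low step counts viable.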
How does voice cloning work?
F5-TTS performs zero-shot voice cloning from a reference audio sample. Provide a short audio clip (5-15 seconds) and its transcript, and F5-TTS generates new speech in that voice. Quality depends on reference audio clarity.
What hardware do I need?
For inference, a GPU with at least 4GB VRAM is recommended. The TensorRT-optimized version achieves 0.04 real-time factor on an L20 GPU. CPU inference works but is significantly slower.
What is voice chat mode?
The voice chat mode combines F5-TTS with Qwen2.5-3B-Instruct to create an interactive voice conversation system. You speak, the system transcribes your speech, generates a text response, and speaks it back.
Which languages are supported?
The base model primarily supports English and Chinese. Community models extend support to other languages. Fine-tuning on your target language's data is supported via the Gradio UI.
Citations (3)
- F5-TTS GitHub — F5-TTS diffusion transformer TTS system
- arXiv paper — Flow matching for generative models
- ConvNeXt V2 paper — ConvNeXt V2 architecture
Source & Thanks
Created by SWivid. Code: MIT; models: CC-BY-NC. SWivid/F5-TTS — 14,300+ GitHub stars.