F5-TTS — Flow Matching Text-to-Speech
F5-TTS is a diffusion transformer TTS system with flow matching. 14.3K+ GitHub stars. Multi-speaker, voice chat, Gradio UI, CLI inference, 0.04 RTF on L20 GPU. MIT code.
Agent-ready installation
This asset can be installed after choosing a runtime, checking the plan, and running the matching command.
npx -y tokrepo@latest install 093755c4-a497-4f6d-9e00-4c41cbd49c90 --target codex
Run this after confirming the plan with a dry run.
What it is
F5-TTS is a text-to-speech system built on a diffusion transformer trained with flow matching, using ConvNeXt V2 in its architecture. It provides multi-speaker and multi-style speech synthesis, voice chat powered by Qwen2.5-3B-Instruct, a Gradio web interface for inference and fine-tuning, and CLI inference.
With Triton and TensorRT-LLM optimization, F5-TTS reaches a real-time factor (RTF) of 0.0394 on an L20 GPU, which places it among the faster open-source TTS systems. The code is MIT licensed; the pre-trained models are released under CC-BY-NC.
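Real-time factor is the ratio of synthesis time to the duration of the audio produced, so values below 1 mean faster than real time. The sketch below is only arithmetic on the figure quoted above; the function names are illustrative and not part of F5-TTS.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent generating / duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def synthesis_time(rtf: float, audio_seconds: float) -> float:
    """Estimated wall-clock time needed to generate audio_seconds of speech."""
    return rtf * audio_seconds


# At the quoted RTF of 0.0394, a 60-second clip needs roughly 2.4 seconds of compute.
estimate = synthesis_time(0.0394, 60.0)
print(f"{estimate:.3f} s")
```

The quoted figure applies to the optimized deployment on an L20 GPU; other hardware or an unoptimized install will give a different ratio.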
How it saves time or tokens
F5-TTS provides a complete TTS pipeline in a single package: reference audio input, text input, and high-quality speech output. Traditional TTS setups require assembling multiple components (text normalization, acoustic model, vocoder). F5-TTS handles all stages internally. The Gradio UI enables quick prototyping without writing code. The CLI interface integrates into automation pipelines. Voice cloning from a reference audio sample eliminates the need for training custom voice models.
How to use
- Install F5-TTS:
pip install f5-tts
- CLI inference with a reference audio:
f5-tts_infer-cli \
--model F5TTS_v1_Base \
--ref_audio ref.wav \
--ref_text 'Reference text matching the audio' \
--gen_text 'Text you want to generate as speech'
- Launch the Gradio web UI:
f5-tts_infer-gradio
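# Voice chat with Qwen2.5 runs inside the same Gradio app. A dedicated
# --voicechat flag may not exist in your installed version, so list the
# supported options first:
f5-tts_infer-gradio --help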
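For automation, the CLI inference call shown above can be assembled from Python. This is a minimal sketch: build_infer_command and run_inference are hypothetical helpers, and they use only the flags that appear in the CLI example. Actually running the command requires f5-tts to be installed.

```python
import subprocess
from typing import List


def build_infer_command(ref_audio: str, ref_text: str, gen_text: str,
                        model: str = "F5TTS_v1_Base") -> List[str]:
    """Return the argv list for f5-tts_infer-cli (hypothetical helper)."""
    return [
        "f5-tts_infer-cli",
        "--model", model,
        "--ref_audio", ref_audio,
        "--ref_text", ref_text,
        "--gen_text", gen_text,
    ]


def run_inference(ref_audio: str, ref_text: str, gen_text: str) -> None:
    """Run the CLI; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(build_infer_command(ref_audio, ref_text, gen_text), check=True)


# Build (but do not run) the command from the CLI example.
cmd = build_infer_command("ref.wav", "Reference text matching the audio",
                          "Text you want to generate as speech")
print(cmd)
```

Passing the arguments as a list rather than a single shell string avoids quoting problems when the reference or generated text contains apostrophes or spaces.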
Example
Using F5-TTS in a Python script for batch generation:
import soundfile as sf
from f5_tts.api import F5TTS

# Load the v1 base model by name. ckpt_file is meant for a local checkpoint
# path, not a model name (older releases used model_type= instead of model=).
tts = F5TTS(model='F5TTS_v1_Base')

# The reference transcript must match what is spoken in the reference clip
ref_file = 'speaker_reference.wav'
ref_text = 'This is the reference transcript.'

# Generate speech in the reference voice; a fixed seed makes the run repeatable
wav, sr, _ = tts.infer(
    ref_file=ref_file,
    ref_text=ref_text,
    gen_text='Generate this text in the same voice.',
    seed=42,
)

# Save output
sf.write('output.wav', wav, sr)

# Batch generation with multiple texts, reusing the same reference voice
texts = [
    'Welcome to the product demo.',
    'Here are the key features.',
    'Thank you for watching.',
]
for i, text in enumerate(texts):
    wav, sr, _ = tts.infer(
        ref_file=ref_file,
        ref_text=ref_text,
        gen_text=text,
    )
    sf.write(f'segment_{i}.wav', wav, sr)
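The per-sentence segments from the batch loop can also be joined into one track. A minimal sketch, assuming each segment is a 1-D NumPy array at the same sample rate; join_segments, the 0.3 s pause, and the 24 kHz demo rate are illustrative choices, not part of the F5-TTS API.

```python
import numpy as np


def join_segments(segments, sample_rate: int, pause_seconds: float = 0.3) -> np.ndarray:
    """Concatenate 1-D audio arrays, inserting silence between them."""
    if not segments:
        return np.zeros(0, dtype=np.float32)
    silence = np.zeros(int(round(sample_rate * pause_seconds)), dtype=np.float32)
    parts = []
    for index, segment in enumerate(segments):
        if index > 0:
            parts.append(silence)  # pause before every segment except the first
        parts.append(np.asarray(segment, dtype=np.float32))
    return np.concatenate(parts)


# Stand-in data: two half-second tones at an assumed 24 kHz sample rate.
sr = 24000
t = np.arange(sr // 2) / sr
demo = [np.sin(2 * np.pi * 220 * t), np.sin(2 * np.pi * 330 * t)]
track = join_segments(demo, sr)
print(track.shape[0] / sr, "seconds")
```

In real use, the arrays returned by successive infer calls would take the place of the demo tones, and the joined track could be written with a single sf.write call.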
Related on TokRepo
- AI tools for voice — More text-to-speech and voice tools on TokRepo.
- Featured workflows — Discover curated AI tools.
Common pitfalls
- Reference audio quality directly affects output quality. Use clean, noise-free recordings of at least 5 seconds for the best voice cloning results.
- The CC-BY-NC license on pre-trained models restricts commercial use. Train your own models for commercial applications.
- Running F5-TTS without GPU acceleration is very slow. A CUDA-capable GPU is recommended for practical use.
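The duration advice above can be checked before inference. A minimal sketch using only the standard-library wave module, so it handles uncompressed PCM WAV files only; check_reference and write_silence are hypothetical helpers, and the 5-second threshold is taken from the guidance above rather than from any F5-TTS API.

```python
import os
import tempfile
import wave


def wav_duration_seconds(path: str) -> float:
    """Duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as handle:
        return handle.getnframes() / handle.getframerate()


def check_reference(path: str, min_seconds: float = 5.0) -> bool:
    """True if the clip is at least min_seconds long."""
    return wav_duration_seconds(path) >= min_seconds


def write_silence(path: str, seconds: float, sample_rate: int = 16000) -> None:
    """Write a silent 16-bit mono WAV, used here only as demo input."""
    with wave.open(path, "wb") as handle:
        handle.setnchannels(1)
        handle.setsampwidth(2)
        handle.setframerate(sample_rate)
        handle.writeframes(b"\x00\x00" * int(sample_rate * seconds))


# Demo with generated files: one clip too short, one long enough.
demo_dir = tempfile.mkdtemp()
short_clip = os.path.join(demo_dir, "short.wav")
long_clip = os.path.join(demo_dir, "long.wav")
write_silence(short_clip, 2.0)
write_silence(long_clip, 6.0)
print(check_reference(short_clip), check_reference(long_clip))
```

This only checks length. Noise and clarity, which matter just as much for cloning quality, still need a listen.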
Frequently asked questions
What is flow matching?
Flow matching is a generative method closely related to diffusion: it trains a model to transform noise into speech spectrograms. Compared with traditional diffusion, it typically needs fewer sampling steps at inference time while maintaining high audio quality.
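To make the idea concrete, here is a toy sketch of flow matching with a straight-line path from noise to data, one common choice of path. It is not F5-TTS code: the velocity function is the exact target for a single known data point, standing in for a trained network, and all names are illustrative.

```python
import random


def interpolate(noise, data, t):
    """Point on the straight path from noise (t=0) to data (t=1)."""
    return [(1.0 - t) * n + t * d for n, d in zip(noise, data)]


def target_velocity(noise, data):
    """Regression target during training: the constant direction data - noise."""
    return [d - n for n, d in zip(noise, data)]


def euler_sample(noise, velocity_fn, steps):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = list(noise)
    dt = 1.0 / steps
    for k in range(steps):
        v = velocity_fn(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


random.seed(0)
data = [0.5, -1.0, 2.0]  # stand-in for one spectrogram frame
noise = [random.gauss(0.0, 1.0) for _ in data]

# Stand-in for a trained model: returns the ideal velocity for this one pair.
ideal = target_velocity(noise, data)
result = euler_sample(noise, lambda x, t: ideal, steps=4)
print([round(value, 6) for value in result])
```

Because the path is straight and the velocity here is exact, a handful of Euler steps lands on the data point. A trained network only approximates this velocity field, which is why real systems still use more steps.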
Can F5-TTS clone a voice?
Yes. F5-TTS performs zero-shot voice cloning from a reference audio sample. Provide a short audio clip (5-15 seconds) and its transcript, and F5-TTS generates new speech in that voice. Quality depends on the clarity of the reference audio.
What hardware does F5-TTS need?
For inference, a GPU with at least 4GB VRAM is recommended. The TensorRT-optimized version achieves a real-time factor of 0.04 on an L20 GPU. CPU inference works but is significantly slower.
How does voice chat mode work?
The voice chat mode combines F5-TTS with Qwen2.5-3B-Instruct to create an interactive voice conversation system. You speak; the system transcribes your speech, generates a text response, and speaks it back.
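The turn structure described above can be sketched as three stages wired together. Everything here is hypothetical: transcribe, generate_reply, and synthesize are stub callables standing in for a speech recognizer, Qwen2.5-3B-Instruct, and F5-TTS, and they do not reflect the actual interfaces of the Gradio app.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class VoiceChat:
    """One conversation: speech in, text reply, speech out."""
    transcribe: Callable[[bytes], str]
    generate_reply: Callable[[List[Tuple[str, str]], str], str]
    synthesize: Callable[[str], bytes]
    history: List[Tuple[str, str]] = field(default_factory=list)

    def turn(self, user_audio: bytes) -> bytes:
        user_text = self.transcribe(user_audio)                     # speech to text
        reply_text = self.generate_reply(self.history, user_text)   # text to reply
        self.history.append((user_text, reply_text))                # keep context
        return self.synthesize(reply_text)                          # reply to speech


# Stub stages so the sketch runs without any model installed.
chat = VoiceChat(
    transcribe=lambda audio: audio.decode("utf-8"),
    generate_reply=lambda history, text: f"You said: {text}",
    synthesize=lambda text: text.encode("utf-8"),
)
reply_audio = chat.turn(b"hello")
print(reply_audio)
```

Keeping the stages as injected callables makes each one replaceable, for example swapping the stub synthesizer for a call into a real TTS model.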
Which languages are supported?
The base model primarily supports English and Chinese. Community models extend support to other languages. Fine-tuning on data in your target language is supported via the Gradio UI.
Source and credits
Created by SWivid. Code: MIT, Models: CC-BY-NC. SWivid/F5-TTS — 14,300+ GitHub stars
Similar assets
Index TTS — Industrial Zero-Shot Text-to-Speech System
A controllable and efficient zero-shot text-to-speech system built for industrial use, supporting voice cloning and cross-lingual synthesis with high-quality output.
Tortoise TTS — Multi-Voice Text-to-Speech Focused on Quality
A multi-voice TTS system trained with an emphasis on audio quality. Uses autoregressive and diffusion models to produce natural, expressive speech from text.
StyleTTS 2 — Human-Level Text-to-Speech via Style Diffusion
A TTS system that achieves human-level speech synthesis through style diffusion and adversarial training with large speech language models. Fast inference with natural prosody.
Parler-TTS — High-Quality Text-to-Speech Training and Inference Library
Parler-TTS by Hugging Face provides inference and training capabilities for high-quality text-to-speech models with natural prosody and controllable speaker attributes described in plain text.