Scripts · Mar 31, 2026 · 2 min read

F5-TTS — Flow Matching Text-to-Speech

F5-TTS is a diffusion-transformer TTS system built on flow matching, with 14.3K+ GitHub stars. Multi-speaker synthesis, voice chat, a Gradio UI, CLI inference, and 0.04 RTF on an L20 GPU. MIT-licensed code.

TL;DR
F5-TTS is a diffusion transformer TTS system with multi-speaker support and 0.04 real-time factor.
§01

What it is

F5-TTS is a diffusion transformer-based text-to-speech system using flow matching with ConvNeXt V2 architecture. It delivers multi-speaker and multi-style speech synthesis, voice chat powered by Qwen2.5-3B-Instruct, a Gradio web interface for inference and fine-tuning, and CLI inference.

With Triton and TensorRT-LLM optimization, F5-TTS achieves a 0.0394 real-time factor on an L20 GPU, making it one of the fastest open-source TTS systems available. The code is MIT licensed; the pre-trained models are CC-BY-NC.

§02

How it saves time or tokens

F5-TTS provides a complete TTS pipeline in a single package: reference audio input, text input, and high-quality speech output. Traditional TTS setups require assembling multiple components (text normalization, acoustic model, vocoder). F5-TTS handles all stages internally. The Gradio UI enables quick prototyping without writing code. The CLI interface integrates into automation pipelines. Voice cloning from a reference audio sample eliminates the need for training custom voice models.
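For automation, the CLI can be wrapped from Python with `subprocess`; a minimal sketch (the `build_f5_cmd` helper is hypothetical, and the flags mirror the CLI example shown under "How to use"):

```python
import subprocess

def build_f5_cmd(ref_audio, ref_text, gen_text, model="F5TTS_v1_Base"):
    """Assemble the f5-tts_infer-cli argument list for one utterance."""
    return [
        "f5-tts_infer-cli",
        "--model", model,
        "--ref_audio", ref_audio,
        "--ref_text", ref_text,
        "--gen_text", gen_text,
    ]

def synthesize(ref_audio, ref_text, gen_text):
    """Run the CLI for one utterance, raising if synthesis fails."""
    subprocess.run(build_f5_cmd(ref_audio, ref_text, gen_text), check=True)

# Build (but do not run) a command for inspection
cmd = build_f5_cmd("ref.wav", "Reference text.", "Hello world.")
```

Wrapping the command builder separately from the runner makes it easy to fan the same reference voice out over many generated lines in a pipeline.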

§03

How to use

  1. Install F5-TTS:
pip install f5-tts
  2. CLI inference with a reference audio:
f5-tts_infer-cli \
  --model F5TTS_v1_Base \
  --ref_audio ref.wav \
  --ref_text 'Reference text matching the audio' \
  --gen_text 'Text you want to generate as speech'
  3. Launch the Gradio web UI:
f5-tts_infer-gradio
# For voice chat with Qwen2.5
f5-tts_infer-gradio --voicechat
§04

Example

Using F5-TTS in a Python script for batch generation:

import soundfile as sf

from f5_tts.api import F5TTS

# Load the pre-trained model by name; pass ckpt_file only for a local checkpoint
tts = F5TTS(model='F5TTS_v1_Base')

# Generate speech from reference audio
wav, sr, _ = tts.infer(
    ref_file='speaker_reference.wav',
    ref_text='This is the reference transcript.',
    gen_text='Generate this text in the same voice.',
    seed=42
)

# Save output
sf.write('output.wav', wav, sr)

# Batch generation with multiple texts, reusing the same reference voice
texts = [
    'Welcome to the product demo.',
    'Here are the key features.',
    'Thank you for watching.'
]
for i, text in enumerate(texts):
    wav, sr, _ = tts.infer(
        ref_file='speaker_reference.wav',
        ref_text='This is the reference transcript.',
        gen_text=text
    )
    sf.write(f'segment_{i}.wav', wav, sr)
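The batch loop writes one file per segment; to stitch them into a single track, a linear crossfade avoids clicks at the joins. A minimal numpy sketch (assumes all segments share one sample rate; 24 kHz is used here for illustration):

```python
import numpy as np

def crossfade_concat(segments, sr, fade_ms=50):
    """Join audio segments with a linear crossfade; all must share rate sr."""
    fade = int(sr * fade_ms / 1000)
    out = segments[0].astype(np.float32)
    for seg in segments[1:]:
        seg = seg.astype(np.float32)
        n = min(fade, len(out), len(seg))
        ramp = np.linspace(0.0, 1.0, n)
        # Blend the tail of the running track with the head of the next segment
        out[-n:] = out[-n:] * (1.0 - ramp) + seg[:n] * ramp
        out = np.concatenate([out, seg[n:]])
    return out

# Synthetic 1-second segments standing in for TTS output
sr = 24000
a = np.ones(sr)
b = np.ones(sr) * 0.5
joined = crossfade_concat([a, b], sr)
```

The joined array can then be written out once with `sf.write` instead of shipping many small files.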
§05

Common pitfalls

  • Reference audio quality directly affects output quality. Use clean, noise-free recordings of at least 5 seconds for the best voice cloning results.
  • The CC-BY-NC license on pre-trained models restricts commercial use. Train your own models for commercial applications.
  • Running F5-TTS without GPU acceleration is very slow. A CUDA-capable GPU is recommended for practical use.

Frequently Asked Questions

What is flow matching in F5-TTS?

Flow matching is a diffusion-based generative method that trains a model to transform noise into speech spectrograms. Compared to traditional diffusion, flow matching provides faster inference with fewer denoising steps while maintaining high audio quality.
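As a toy illustration of the idea (a stand-in for F5-TTS's actual model, which operates on mel spectrograms), the conditional velocity field along straight noise-to-data paths can be integrated with a few Euler steps:

```python
import numpy as np

rng = np.random.default_rng(0)

# Target "data": a fixed 1-D point; in TTS this would be a mel spectrogram.
x1 = np.array([2.0])

def velocity(x_t, t):
    # On the straight path x_t = (1 - t) * x0 + t * x1, the conditional
    # flow-matching target velocity is x1 - x0 = (x1 - x_t) / (1 - t).
    # A trained network approximates this field; here we use the exact one.
    return (x1 - x_t) / (1.0 - t)

# Euler integration from noise toward data in a handful of steps
x = rng.standard_normal(1)          # x0 ~ N(0, 1)
steps = 8
for i in range(steps):
    t = i / steps
    x = x + velocity(x, t) / steps  # dt = 1 / steps

# x now lands (numerically) on x1
```

Because the paths are straight, even coarse Euler steps reach the target, which is why flow matching needs far fewer steps than classic diffusion samplers.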

Can F5-TTS clone any voice?

F5-TTS performs zero-shot voice cloning from a reference audio sample. Provide a short audio clip (5-15 seconds) and its transcript, and F5-TTS generates new speech in that voice. Quality depends on reference audio clarity.

What hardware does F5-TTS require?

For inference, a GPU with at least 4GB VRAM is recommended. The TensorRT-optimized version achieves 0.04 real-time factor on an L20 GPU. CPU inference works but is significantly slower.
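Real-time factor is synthesis wall-clock time divided by the duration of the audio produced; a quick sketch with made-up timings shows how to read the 0.04 figure:

```python
def real_time_factor(synthesis_seconds, audio_seconds):
    """RTF < 1 means faster than real time; 0.04 means 25x real time."""
    return synthesis_seconds / audio_seconds

# e.g. 0.4 s of compute to synthesize a 10 s clip
rtf = real_time_factor(0.4, 10.0)
speedup = 1.0 / rtf  # how many seconds of audio per second of compute
```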

What is the voice chat feature?

The voice chat mode combines F5-TTS with Qwen2.5-3B-Instruct to create an interactive voice conversation system. You speak, the system transcribes your speech, generates a text response, and speaks it back.

Does F5-TTS support multiple languages?

The base model primarily supports English and Chinese. Community models extend support to other languages. Fine-tuning on your target language's data is supported via the Gradio UI.


Source & Thanks

Created by SWivid. Code: MIT, Models: CC-BY-NC. SWivid/F5-TTS — 14,300+ GitHub stars
