Configs · Mar 31, 2026 · 2 min read

Fish Speech — Multilingual TTS for 80+ Languages

Fish Speech is an open-source TTS system supporting 80+ languages, with 29K+ GitHub stars. It offers a 4B dual-AR model, voice cloning, emotional control with 15K+ tags, and real-time inference.

TL;DR
Fish Speech is an open-source TTS system supporting 80+ languages with a 4B parameter model, voice cloning, and emotional control via 15K+ expression tags.
§01

What it is

Fish Speech is an open-source text-to-speech system that supports over 80 languages. It uses a 4B-parameter dual-AR (dual autoregressive) model architecture for speech synthesis. Key features include voice cloning from short audio samples, emotional and stylistic control through 15K+ expression tags, and real-time inference.

It targets developers building multilingual voice applications, content creators needing voiceovers in multiple languages, and researchers working on speech synthesis.

§02

How it saves time or tokens

Fish Speech consolidates multilingual TTS into a single model: one model covers 80+ languages, so you do not have to maintain a separate TTS model per language. Voice cloning from a short sample removes the need for extensive voice recording sessions, and the emotional control tags let you adjust tone and expression without re-recording.

§03

How to use

  1. Install:
     pip install fish-speech
  2. Generate speech:
     fish-speech tts 'Hello, this is Fish Speech!' --output hello.wav
  3. Voice cloning:
     from fish_speech import FishSpeech

     model = FishSpeech()
     audio = model.generate(
         text='This is a cloned voice speaking.',
         reference_audio='reference.wav',
     )
     audio.save('cloned.wav')
  4. Run with Docker:
     docker pull fishaudio/fish-speech
     docker run -p 7860:7860 fishaudio/fish-speech
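
For more than a handful of lines, the command from step 2 can be generated per line rather than typed by hand. The following is a minimal sketch, not part of Fish Speech: `build_tts_command` and `plan_batch` are hypothetical helpers, and the `fish-speech tts TEXT --output FILE` form is simply copied from step 2 and has not been checked against the project's actual CLI. The script only prints the commands so they can be reviewed first.

```python
import shlex
from pathlib import Path

# Assumption: the `fish-speech tts TEXT --output FILE` form is copied from
# step 2 above; it has not been verified against the real command-line tool.
def build_tts_command(text: str, output: Path) -> list[str]:
    """Return the argv list for one synthesis call (hypothetical helper)."""
    return ["fish-speech", "tts", text, "--output", str(output)]

def plan_batch(lines: list[str], out_dir: Path) -> list[list[str]]:
    """Build one command per non-empty line, numbering the output files."""
    texts = [line.strip() for line in lines if line.strip()]
    return [
        build_tts_command(text, out_dir / f"line_{i:03d}.wav")
        for i, text in enumerate(texts, start=1)
    ]

if __name__ == "__main__":
    script = ["Hello, this is Fish Speech!", "", "Bonjour."]
    for argv in plan_batch(script, Path("out")):
        # Print shell-quoted commands for review; hand argv to subprocess.run
        # only after confirming the CLI exists on your machine.
        print(shlex.join(argv))
```

Building an argv list instead of a single shell string avoids quoting problems when the text itself contains apostrophes or spaces.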
§04

Example

Feature            Fish Speech
Languages          80+
Model size         4B parameters
Voice cloning      Yes, from short audio
Emotion control    15K+ expression tags
Inference speed    Real-time on GPU
Output format      WAV, MP3
License            Apache 2.0
§05

Common pitfalls

  • The 4B model requires a GPU with at least 8 GB of VRAM. Smaller model variants are available for lower-end hardware, at reduced quality.
  • Voice cloning quality depends on the reference audio. Clean recordings of 10-30 seconds produce the best results. Background noise and multiple speakers in the reference degrade cloning quality.
  • Expression tags require familiarity with the tag vocabulary. Refer to the documentation for the full list of supported emotional and stylistic tags.
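
The reference-audio pitfall above lends itself to a quick pre-flight check. This is a small sketch using only Python's standard `wave` module; it assumes the reference clip is an uncompressed PCM WAV file, and the function names are illustrative rather than part of Fish Speech. It checks only the 10-30 second range mentioned above and cannot detect background noise or multiple speakers.

```python
import wave

# Bounds taken from the pitfall above: 10-30 seconds of clean audio.
MIN_SECONDS, MAX_SECONDS = 10.0, 30.0

def clip_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()

def duration_warnings(seconds: float) -> list[str]:
    """Return warnings if a clip falls outside the recommended range."""
    if seconds < MIN_SECONDS:
        return [f"too short: {seconds:.1f}s, recommended minimum is {MIN_SECONDS:.0f}s"]
    if seconds > MAX_SECONDS:
        return [f"too long: {seconds:.1f}s, recommended maximum is {MAX_SECONDS:.0f}s"]
    return []
```

A typical use would be `duration_warnings(clip_seconds('reference.wav'))` before passing the file to the cloning call; an empty list means the clip length is within the recommended range.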

Frequently Asked Questions

How does Fish Speech compare to ElevenLabs?

ElevenLabs is a commercial cloud service with excellent quality and ease of use. Fish Speech is open source and runs locally with no per-character costs. Fish Speech supports more languages (80+) and is free for commercial use under Apache 2.0. ElevenLabs offers a more polished API and higher baseline quality for English.

Can Fish Speech run on CPU?

Fish Speech can run on CPU but inference is significantly slower than GPU. For real-time applications, a GPU is essential. For batch processing where speed is less critical, CPU inference is functional. The model benefits greatly from CUDA acceleration.

What is the voice cloning quality like?

Voice cloning from a 10-30 second clean sample produces good speaker similarity. The cloned voice captures the general timbre and speaking style of the reference. It is not identical to the original speaker but is recognizable. Longer and cleaner reference audio improves quality.

Does Fish Speech support streaming?

Yes. Fish Speech supports streaming inference where audio chunks are generated and played as they are produced. This enables real-time voice applications like chatbots and virtual assistants where low latency is important.
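
The consumer side of chunked streaming can be sketched without touching Fish Speech itself. In the example below, `fake_chunks` is a hypothetical stand-in for whatever streaming call the library exposes (its real name and signature are not given here), and the 16-bit mono format at 44100 Hz is an assumption made for illustration. The point is only that each chunk is handled as soon as it arrives instead of waiting for the full utterance.

```python
import wave
from typing import Iterable, Iterator

def fake_chunks(n_chunks: int = 5, frames_per_chunk: int = 2205) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming TTS call: yields silent 16-bit mono PCM."""
    for _ in range(n_chunks):
        yield b"\x00\x00" * frames_per_chunk

def write_stream(chunks: Iterable[bytes], path: str, sample_rate: int = 44100) -> int:
    """Write each chunk to a WAV file as it arrives; return total frames written."""
    frames = 0
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)            # assumed mono
        wav.setsampwidth(2)            # assumed 16-bit samples
        wav.setframerate(sample_rate)  # assumed sample rate
        for chunk in chunks:
            # A live application would hand the chunk to an audio player here.
            wav.writeframes(chunk)
            frames += len(chunk) // 2  # 2 bytes per mono 16-bit frame
    return frames
```

Because the loop consumes a plain iterable, the same function works unchanged once `fake_chunks` is swapped for a real chunk source that yields PCM bytes in the same format.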

Is Fish Speech free for commercial use?

Yes. Fish Speech is released under the Apache 2.0 license, which permits commercial use without fees or restrictions. You can use it in products, modify the code, and distribute derivatives. Model weights are also freely available.

Source & Thanks

Created by Fish Audio. Research license. fishaudio/fish-speech — 29,000+ GitHub stars
