Fish Speech — Multilingual TTS for 80+ Languages
Fish Speech is an open-source TTS system that supports 80+ languages and has 29K+ GitHub stars. It offers a 4B-parameter dual-AR model, voice cloning, emotional control with 15K+ tags, and real-time inference.
What it is
Fish Speech is an open-source text-to-speech system that supports over 80 languages. It uses a 4B parameter dual-AR (autoregressive) model architecture for high-quality speech synthesis. Key features include voice cloning from short audio samples, emotional and stylistic control through 15K+ expression tags, and real-time inference speeds.
It targets developers building multilingual voice applications, content creators needing voiceovers in multiple languages, and researchers working on speech synthesis.
How it saves time or tokens
Fish Speech consolidates multilingual TTS into a single model. Instead of using different TTS models for different languages, you use one model that handles 80+ languages. Voice cloning from a short sample eliminates the need for extensive voice recording sessions. The emotional control tags let you adjust tone and expression without re-recording.
How to use
- Install:
    pip install fish-speech
- Generate speech:
    fish-speech tts 'Hello, this is Fish Speech!' --output hello.wav
- Voice cloning:
    from fish_speech import FishSpeech
    model = FishSpeech()
    audio = model.generate(
        text='This is a cloned voice speaking.',
        reference_audio='reference.wav',
    )
    audio.save('cloned.wav')
- Run with Docker:
    docker pull fishaudio/fish-speech
    docker run -p 7860:7860 fishaudio/fish-speech
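After starting the container, it can help to confirm that something is actually listening on the published port before pointing a client at it. The following is a minimal sketch using only the Python standard library. It assumes the service is published on localhost port 7860, as in the Docker command above; the helper name `is_port_open` is illustrative and not part of Fish Speech.

```python
import socket


def is_port_open(host: str = "127.0.0.1", port: int = 7860, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection raises OSError (refused, timed out, unreachable) on failure
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling `is_port_open()` right after `docker run` may return False for a while, since the model has to load before the server starts accepting connections. A True result only shows that the port accepts TCP connections, not that the model is ready to synthesize.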
Example
| Feature | Fish Speech |
|---|---|
| Languages | 80+ |
| Model size | 4B parameters |
| Voice cloning | Yes, from short audio |
| Emotion control | 15K+ expression tags |
| Inference speed | Real-time on GPU |
| Output format | WAV, MP3 |
| License | Apache 2.0 (confirm against the repository's LICENSE file) |
Related on TokRepo
- AI tools for voice -- voice AI and TTS tools
- AI tools for content -- content creation tools
Common pitfalls
- The 4B model requires a GPU with at least 8GB VRAM. Smaller model variants are available for lower-end hardware but with reduced quality.
- Voice cloning quality depends on the reference audio. Clean recordings of 10-30 seconds produce the best results. Background noise and multiple speakers in the reference degrade cloning quality.
- Expression tags require familiarity with the tag vocabulary. Refer to the documentation for the full list of supported emotional and stylistic tags.
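The reference-audio guidance above (a clean recording of 10-30 seconds) can be partly checked before cloning. The sketch below uses Python's standard `wave` module to read a PCM WAV file and report whether its length falls in that range. The function name and the returned keys are illustrative, and the check covers length only: it cannot detect background noise or multiple speakers.

```python
import wave


def check_reference_wav(path: str, min_s: float = 10.0, max_s: float = 30.0) -> dict:
    """Report basic properties of a PCM WAV file and whether its length is in range."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        channels = wav.getnchannels()
    duration = frames / rate  # seconds of audio
    return {
        "duration_s": duration,
        "sample_rate": rate,
        "channels": channels,
        "length_ok": min_s <= duration <= max_s,
    }
```

For example, `check_reference_wav('reference.wav')["length_ok"]` is False for a 3-second clip, which is a cue to record a longer sample before attempting a clone.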
Frequently Asked Questions
How does Fish Speech compare to ElevenLabs?
ElevenLabs is a commercial cloud service with excellent quality and ease of use. Fish Speech is open source and runs locally with no per-character costs. Fish Speech supports more languages (80+); check the repository's license terms before using it commercially. ElevenLabs offers a more polished API and higher baseline quality for English.
Can Fish Speech run on CPU?
Fish Speech can run on CPU, but inference is significantly slower than on a GPU. Real-time applications need a GPU. For batch processing where speed is less critical, CPU inference is workable. The model benefits greatly from CUDA acceleration.
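One common way to quantify "real-time" on a given machine is the real-time factor (RTF): wall-clock synthesis time divided by the duration of the audio produced. An RTF below 1.0 means audio is generated faster than it plays back. The sketch below measures it for any synthesis callable. `synthesize` is a hypothetical stand-in, not a Fish Speech API: it is assumed to run TTS on the text and return the length of the generated audio in seconds.

```python
import time
from typing import Callable


def real_time_factor(synthesize: Callable[[str], float], text: str) -> float:
    """Return wall-clock synthesis time divided by seconds of audio produced.

    `synthesize` is a hypothetical callable that runs TTS on `text` and
    returns the duration of the generated audio in seconds.
    """
    start = time.perf_counter()
    audio_seconds = synthesize(text)
    elapsed = time.perf_counter() - start
    if audio_seconds <= 0:
        raise ValueError("synthesize() must report a positive audio duration")
    return elapsed / audio_seconds
```

Running the same measurement on CPU and on GPU with a representative sentence gives a concrete basis for deciding whether a machine is suitable for interactive use or only for batch jobs.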
How good is the voice cloning?
Voice cloning from a clean 10-30 second sample produces good speaker similarity. The cloned voice captures the general timbre and speaking style of the reference. It is not identical to the original speaker but is recognizable. Longer and cleaner reference audio improves quality.
Does Fish Speech support streaming?
Yes. Fish Speech supports streaming inference, where audio chunks are generated and played as they are produced. This enables real-time voice applications such as chatbots and virtual assistants, where low latency is important.
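The consumer side of a chunked stream can be sketched without any TTS dependency. In the example below, `fake_tts_stream` is a stand-in generator that yields silent PCM chunks. It is not the Fish Speech streaming API, and the 44100 Hz, 16-bit mono format is an assumption chosen only for illustration. The point is the pattern: handle each chunk as soon as it arrives instead of waiting for the full utterance.

```python
import wave
from typing import Iterable, Iterator


def write_stream_to_wav(chunks: Iterable[bytes], path: str,
                        sample_rate: int = 44100, channels: int = 1,
                        sample_width: int = 2) -> int:
    """Write raw PCM chunks to a WAV file as they arrive; return bytes written."""
    total = 0
    with wave.open(path, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width)
        wav.setframerate(sample_rate)
        for chunk in chunks:
            # A real-time client would hand the chunk to an audio player here.
            wav.writeframes(chunk)
            total += len(chunk)
    return total


def fake_tts_stream(n_chunks: int = 3, chunk_frames: int = 1024) -> Iterator[bytes]:
    """Stand-in generator yielding silent 16-bit mono PCM chunks."""
    for _ in range(n_chunks):
        yield b"\x00\x00" * chunk_frames
```

Swapping `fake_tts_stream()` for a real chunk source keeps the loop unchanged, provided the sample rate, channel count, and sample width passed to `write_stream_to_wav` match what that source actually emits.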
Can I use Fish Speech commercially?
Check the license first. The feature summary lists Apache 2.0, which permits commercial use, modification, and distribution of derivatives, but the project credit lists a research license. Confirm the terms that apply to both the code and the model weights in the fishaudio/fish-speech repository.
Source & Thanks
Created by Fish Audio. Research license. fishaudio/fish-speech — 29,000+ GitHub stars