Skills · Mar 31, 2026 · 2 min read

Fish Speech — Multilingual TTS for 80+ Languages

Fish Speech is an open-source text-to-speech (TTS) system that supports 80+ languages and has 29K+ GitHub stars. It offers a 4B-parameter dual-AR model, voice cloning, emotional control through 15K+ tags, and real-time inference.

AI Open Source · Community

Agent ready

Ready-to-run agent install

This asset can be installed after the agent chooses its runtime, checks the plan, and runs the matching command.

Native · 98/100 · Policy: allow

Agent surface: Any MCP/CLI agent
Kind: Skill
Install: Single
Trust: Established
Entrypoint: Fish Speech — Multilingual TTS for 80+ Languages

Direct install command

npx -y tokrepo@latest install 88c15e9c-439c-4e70-8b8f-cd04efe928c0 --target codex

Run after dry-run confirms the install plan.

TL;DR

Fish Speech is an open-source TTS system supporting 80+ languages with a 4B parameter model, voice cloning, and emotional control via 15K+ expression tags.

§01

What it is

Fish Speech is an open-source text-to-speech system that supports over 80 languages. It uses a 4B parameter dual-AR (autoregressive) model architecture for high-quality speech synthesis. Key features include voice cloning from short audio samples, emotional and stylistic control through 15K+ expression tags, and real-time inference speeds.

It targets developers building multilingual voice applications, content creators needing voiceovers in multiple languages, and researchers working on speech synthesis.

§02

How it saves time or tokens

Fish Speech consolidates multilingual TTS into a single model. Instead of using different TTS models for different languages, you use one model that handles 80+ languages. Voice cloning from a short sample eliminates the need for extensive voice recording sessions. The emotional control tags let you adjust tone and expression without re-recording.

§03

How to use

Install:

pip install fish-speech

Generate speech:

fish-speech tts 'Hello, this is Fish Speech!' --output hello.wav

Voice cloning:

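from fish_speech import FishSpeech

# Load the model once and reuse it for every generate() call.
model = FishSpeech()

# Synthesize the text in the voice of the reference clip.
audio = model.generate(
    text='This is a cloned voice speaking.',
    reference_audio='reference.wav',  # clean 10-30 second clip of one speaker
)
audio.save('cloned.wav')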

Run with Docker:

docker pull fishaudio/fish-speech
docker run -p 7860:7860 fishaudio/fish-speech
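
A container can take a while to load model weights before it accepts connections. The following is a minimal sketch, using only the Python standard library, of how a script might wait for the published port before sending requests. The helper name `wait_for_port` and its defaults are illustrative assumptions, not part of Fish Speech, and a successful TCP connection only shows that the port is open, not that the model has finished loading.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout_s: float = 60.0,
                  interval_s: float = 0.5) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # create_connection raises OSError while nothing is listening yet.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval_s)
```

After starting the container, `wait_for_port("127.0.0.1", 7860, timeout_s=120.0)` returns `True` as soon as port 7860 accepts connections.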

§04

Example

Feature	Fish Speech
Languages	80+
Model size	4B parameters
Voice cloning	Yes, from short audio
Emotion control	15K+ expression tags
Inference speed	Real-time on GPU
Output format	WAV, MP3
License	Apache 2.0

§05

Related on TokRepo

AI tools for voice -- voice AI and TTS tools
AI tools for content -- content creation tools

§06

Common pitfalls

The 4B model requires a GPU with at least 8GB VRAM. Smaller model variants are available for lower-end hardware but with reduced quality.
Voice cloning quality depends on the reference audio. Clean recordings of 10-30 seconds produce the best results. Background noise and multiple speakers in the reference degrade cloning quality.
Expression tags require familiarity with the tag vocabulary. Refer to the documentation for the full list of supported emotional and stylistic tags.
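
The reference-audio guidance above (clean clips of 10-30 seconds) can be partly checked before a clip is passed to the model. The sketch below uses Python's standard `wave` module to measure the duration of a PCM WAV file. The function name `check_reference_wav` and its thresholds are assumptions taken from the range quoted on this page, and the check cannot detect background noise or multiple speakers.

```python
import wave


def check_reference_wav(path: str, min_s: float = 10.0,
                        max_s: float = 30.0) -> list[str]:
    """Return warnings for a PCM WAV reference clip; an empty list means the length is in range."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
    duration_s = frames / rate

    warnings = []
    if duration_s < min_s:
        warnings.append(f"clip is {duration_s:.1f}s, shorter than {min_s:.0f}s")
    elif duration_s > max_s:
        warnings.append(f"clip is {duration_s:.1f}s, longer than {max_s:.0f}s")
    return warnings
```

Running `check_reference_wav('reference.wav')` before cloning catches clips that are too short or too long; listening to the clip is still the only way to judge noise and speaker count.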

Frequently Asked Questions

How does Fish Speech compare to ElevenLabs?

ElevenLabs is a commercial cloud service with excellent quality and ease of use. Fish Speech is open source and runs locally with no per-character costs. Fish Speech supports more languages (80+) and is free for commercial use under Apache 2.0. ElevenLabs offers a more polished API and higher baseline quality for English.

Can Fish Speech run on CPU?

Fish Speech can run on CPU but inference is significantly slower than GPU. For real-time applications, a GPU is essential. For batch processing where speed is less critical, CPU inference is functional. The model benefits greatly from CUDA acceleration.
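
Whether a given machine is fast enough for real-time use can be measured with the real-time factor (RTF): synthesis time divided by the duration of the audio produced. An RTF below 1 means audio is generated faster than it plays back. The sketch below is a hedged illustration; `real_time_factor` and the `synthesize` callable are hypothetical names, and the callable is assumed to wrap whatever TTS call you use and return the length of the generated audio in seconds.

```python
import time
from typing import Callable


def real_time_factor(synthesize: Callable[[str], float], text: str) -> float:
    """Time one synthesis call and return elapsed seconds / audio seconds."""
    start = time.perf_counter()
    audio_seconds = synthesize(text)  # assumed to return the audio duration
    elapsed = time.perf_counter() - start
    if audio_seconds <= 0:
        raise ValueError("synthesize must report a positive audio duration")
    return elapsed / audio_seconds
```

Measuring the same sentence on CPU and on GPU with this helper gives a direct comparison for your hardware.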

What is the voice cloning quality like?

Voice cloning from a 10-30 second clean sample produces good speaker similarity. The cloned voice captures the general timbre and speaking style of the reference. It is not identical to the original speaker but is recognizable. Longer and cleaner reference audio improves quality.

Does Fish Speech support streaming?

Yes. Fish Speech supports streaming inference where audio chunks are generated and played as they are produced. This enables real-time voice applications like chatbots and virtual assistants where low latency is important.
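
The consumer side of chunked streaming can be sketched without the model itself. In the example below, `fake_tts_stream` is a stand-in that yields 16-bit mono PCM chunks of a test tone; it is not a Fish Speech API, and the 16 kHz sample rate is an assumption for the demo. `write_stream_to_wav` handles each chunk as soon as it arrives, which is the same loop a player or network sender would run.

```python
import math
import struct
import wave
from typing import Iterator

SAMPLE_RATE = 16000  # assumed for the demo; use the rate your model emits


def fake_tts_stream(seconds: float, chunk_ms: int = 100) -> Iterator[bytes]:
    """Stand-in for a streaming TTS call: yields PCM chunks of a 440 Hz tone."""
    total = int(seconds * SAMPLE_RATE)
    per_chunk = SAMPLE_RATE * chunk_ms // 1000
    for start in range(0, total, per_chunk):
        n = min(per_chunk, total - start)
        samples = (
            int(12000 * math.sin(2 * math.pi * 440 * (start + i) / SAMPLE_RATE))
            for i in range(n)
        )
        yield struct.pack(f"<{n}h", *samples)


def write_stream_to_wav(chunks: Iterator[bytes], path: str) -> int:
    """Write chunks as they arrive and return the number of frames written."""
    frames = 0
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 16-bit samples
        wav.setframerate(SAMPLE_RATE)
        for chunk in chunks:
            wav.writeframes(chunk)  # a player would consume the chunk here
            frames += len(chunk) // 2
    return frames
```

Calling `write_stream_to_wav(fake_tts_stream(0.5), 'tone.wav')` writes half a second of audio in five chunks; swapping the generator for a real streaming source leaves the consumer loop unchanged.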

Is Fish Speech free for commercial use?

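Fish Speech is listed here under the Apache 2.0 license, which permits commercial use, modification, and distribution of derivatives without fees, provided you keep the license and attribution notices. The source credit on this page mentions a research license, so confirm which terms apply to the code and to the model weights in the repository before using it in a product.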



Source & Thanks

Created by Fish Audio. Research license. fishaudio/fish-speech — 29,000+ GitHub stars

