
Scripts · Apr 1, 2026 · 1 min read

Zonos — Multilingual TTS with Voice Cloning

Zonos is an open-weight TTS model trained on 200K+ hours of speech. 7.2K+ stars. Voice cloning, 5 languages, emotion control. Apache 2.0.

Script Depot · Community

Agent-ready

Safe staging for this asset

This asset lands in staging first. The copied prompt asks the agent to inspect the staged files before activating scripts, MCP config, or global config.

Stage only · 17/100 · Policy: staging

Agent surface: Any MCP/CLI agent
Type: Script
Installation: Stage only
Trust: Established
Entry: zonos.md

Safe staging command

npx -y tokrepo@latest install 9b6992d2-2369-45f0-9f8e-6c0c834c649b --target codex

Files are staged first; activation requires reviewing the README and the staged plan.

TL;DR

Zonos generates natural speech from text with zero-shot voice cloning across 5 languages and fine-grained emotion control.

§01

What it is

Zonos is an open-weight text-to-speech model by Zyphra, trained on more than 200,000 hours of multilingual speech data. It generates natural-sounding speech from text with zero-shot voice cloning from brief audio samples. Zonos supports English, Japanese, Chinese, French, and German, with controls for speaking rate, pitch, emotion, and audio quality.

Zonos targets developers building multilingual voice applications, accessibility tools, and content creation pipelines. It runs locally on GPUs with 6GB+ VRAM and includes both a Python API and a Gradio web interface.

§02

How it saves time or tokens

Zonos enables voice cloning from a single audio sample without fine-tuning, eliminating the hours of recording and training that traditional TTS customization requires. The model achieves approximately 2x real-time factor on an RTX 4090, meaning it generates speech faster than playback speed. The Gradio interface provides a visual way to adjust emotion, pitch, and rate without writing code.
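To make the speed figure concrete, here is a minimal sketch of the arithmetic. The function name `generation_time_seconds` and its `speedup` parameter are illustrative and not part of Zonos; the default of 2.0 is the approximate RTX 4090 figure quoted above, and real throughput will vary with hardware and settings.

```python
def generation_time_seconds(audio_seconds: float, speedup: float = 2.0) -> float:
    """Estimate wall-clock seconds needed to synthesize `audio_seconds` of speech.

    `speedup` is how many seconds of audio are produced per second of compute.
    The default of 2.0 is an assumption taken from the approximate figure above.
    """
    if audio_seconds < 0 or speedup <= 0:
        raise ValueError("audio_seconds must be >= 0 and speedup must be > 0")
    return audio_seconds / speedup


# A 10-minute narration at roughly 2x real time takes about 5 minutes to generate.
print(generation_time_seconds(600))  # 300.0
```

On a slower GPU, pass a smaller `speedup` (for example a value you measured yourself) to get a more realistic estimate.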

§03

How to use

Install Zonos: clone the repository and run pip install -e . inside it (requires a GPU with 6GB+ VRAM; the eSpeak system library must also be installed for phonemization).
Load the model with Zonos.from_pretrained, then generate speech through the Python API.
Alternatively, launch the Gradio web interface with uv run gradio_interface.py for interactive control.
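Before working through the steps above, it can help to confirm that the machine meets the stated GPU requirement. The following is a rough preflight sketch, not part of Zonos: the names `preflight`, `gpu_vram_gb`, and `MIN_VRAM_GB` are hypothetical, and the check assumes PyTorch is the CUDA runtime in use. It uses decimal gigabytes so that a nominal 6GB card is not rejected for reporting slightly less than 6 GiB.

```python
import shutil

MIN_VRAM_GB = 6.0  # requirement stated in this article


def gpu_vram_gb():
    """Return total memory of CUDA device 0 in decimal GB, or None if unavailable."""
    try:
        import torch  # imported lazily so the check also runs before PyTorch is installed
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_properties(0).total_memory / 1e9


def preflight(min_vram_gb: float = MIN_VRAM_GB) -> dict:
    """Collect simple yes/no answers about the local environment."""
    vram = gpu_vram_gb()
    return {
        "vram_gb": vram,
        "gpu_ok": vram is not None and vram >= min_vram_gb,
        "uv_found": shutil.which("uv") is not None,  # only needed for `uv run`
    }


print(preflight())
```

If `gpu_ok` is false, the install will still succeed, but generation is unlikely to be usable on that machine.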

§04

Example

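import torchaudio
from zonos.conditioning import make_cond_dict
from zonos.model import Zonos

# Load the pretrained transformer checkpoint onto the GPU
# (this flow follows the usage shown in the Zonos README)
model = Zonos.from_pretrained('Zyphra/Zonos-v0.1-transformer', device='cuda')

# Turn a short reference clip into a speaker embedding for voice cloning
ref_audio, sr = torchaudio.load('reference_speaker.wav')
speaker = model.make_speaker_embedding(ref_audio, sr)

# Bundle text, speaker, and language (an eSpeak code such as 'en-us') into conditioning
cond_dict = make_cond_dict(
    text='Hello, this is a voice cloning demonstration.',
    speaker=speaker,
    language='en-us',
)
conditioning = model.prepare_conditioning(cond_dict)

# Generate audio codes, then decode them into a waveform
codes = model.generate(conditioning)
wavs = model.autoencoder.decode(codes).cpu()

# Save at the autoencoder's native sampling rate rather than a hard-coded one
torchaudio.save('output.wav', wavs[0], model.autoencoder.sampling_rate)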

§05

Related on TokRepo

AI Tools for Voice -- explore voice synthesis and processing tools for AI applications
AI Tools for Content -- discover content creation workflows including audio generation

§06

Common pitfalls

Zonos requires a CUDA-compatible GPU with at least 6GB VRAM; CPU inference is not practical for real-time use.
Voice cloning quality depends heavily on the reference audio; use clean, noise-free samples of at least 5 seconds for best results.
The model weights are large (several GB); ensure sufficient disk space and bandwidth for the initial download from Hugging Face.
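The 5-second guideline for reference audio is easy to check before cloning. This is a minimal sketch using only the Python standard library; the helper names (`wav_duration_seconds`, `is_long_enough`, `write_silence`) are hypothetical, it handles uncompressed PCM WAV files only, and it checks duration alone, not noise or clarity.

```python
import os
import tempfile
import wave

MIN_REFERENCE_SECONDS = 5.0  # minimum suggested in the pitfalls above


def wav_duration_seconds(path: str) -> float:
    """Return the duration of an uncompressed PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def is_long_enough(path: str, min_seconds: float = MIN_REFERENCE_SECONDS) -> bool:
    """True if the clip meets the minimum duration; says nothing about noise."""
    return wav_duration_seconds(path) >= min_seconds


def write_silence(path: str, seconds: float, sample_rate: int = 16000) -> None:
    """Write a mono 16-bit silent WAV, used here only to exercise the check."""
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(b"\x00\x00" * int(seconds * sample_rate))


# Demonstrate the check on two generated clips.
with tempfile.TemporaryDirectory() as tmp:
    short_clip = os.path.join(tmp, "short.wav")
    long_clip = os.path.join(tmp, "long.wav")
    write_silence(short_clip, 2.0)
    write_silence(long_clip, 6.0)
    short_ok = is_long_enough(short_clip)
    long_ok = is_long_enough(long_clip)

print(short_ok, long_ok)  # False True
```

In practice you would call `is_long_enough('reference_speaker.wav')` on your own recording and still listen to it for background noise.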

Frequently asked questions

What languages does Zonos support?

Zonos supports five languages: English, Japanese, Chinese, French, and German. All five are covered by the 200,000+ hours of multilingual speech the model was trained on.

How does zero-shot voice cloning work?

You provide a brief audio sample of the target speaker. Zonos extracts speaker characteristics from this sample and applies them to the generated speech without any fine-tuning or training step.

What GPU is required to run Zonos?

Zonos requires a CUDA-compatible GPU with at least 6GB VRAM. An RTX 4090 achieves approximately 2x real-time generation speed. Smaller GPUs work but produce speech more slowly.

Is Zonos free to use commercially?

Yes. Zonos is released under the Apache 2.0 license, which permits commercial use, modification, and distribution without royalty fees.

Can I control emotions in the generated speech?

Yes. Zonos provides fine-grained controls for emotion, speaking rate, pitch, and audio quality. These parameters can be adjusted via the Python API or the Gradio web interface.
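One way to think about emotion control is as a set of weights over emotion categories. The sketch below is a standalone helper, not a Zonos API: the `EMOTIONS` label list and its order are an assumption based on the sliders in the Gradio interface, and normalizing the weights to sum to 1 is a design choice of this example. Check the Zonos repository for the exact parameter names and expected ranges before passing such a vector to the model.

```python
# Assumed label order; confirm against the Zonos repository before relying on it.
EMOTIONS = (
    "happiness", "sadness", "disgust", "fear",
    "surprise", "anger", "other", "neutral",
)


def emotion_vector(**weights: float) -> list[float]:
    """Build a normalized weight vector from named emotion weights.

    Unspecified emotions get weight 0; the result sums to 1.
    """
    unknown = set(weights) - set(EMOTIONS)
    if unknown:
        raise ValueError(f"unknown emotions: {sorted(unknown)}")
    if any(value < 0 for value in weights.values()):
        raise ValueError("emotion weights must be non-negative")
    raw = [float(weights.get(name, 0.0)) for name in EMOTIONS]
    total = sum(raw)
    if total <= 0:
        raise ValueError("at least one emotion weight must be positive")
    return [value / total for value in raw]


# Mostly happy with a neutral undertone.
print(emotion_vector(happiness=3, neutral=1))
# [0.75, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.25]
```

Keeping the labels in one tuple means a typo such as `hapiness=1` fails loudly instead of silently producing a neutral voice.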



Source and acknowledgments

Zyphra/Zonos — 7,200+ GitHub stars

