Skills · Apr 28, 2026 · 3 min read

OpenVoice — Instant Voice Cloning with Tone and Style Control

OpenVoice is an open-source voice cloning framework from MyShell AI that reproduces a speaker's voice from a short audio sample while giving independent control over emotion, accent, rhythm, and language.

Agent-ready install

This asset can be installed after choosing the runtime, reviewing the plan, and running the corresponding command.

Native · 98/100 · Policy: allow
Agent surface: Any MCP/CLI agent
Type: Skill
Install: Single
Trust: Established
Entry: OpenVoice Overview
Direct install command:
npx -y tokrepo@latest install ae7169ee-42b9-11f1-9bc6-00163e2b0d79 --target codex

Run the command after confirming the plan with a dry run.

Introduction

OpenVoice is a voice cloning library developed by MyShell AI and researchers from MIT and Tsinghua University. It can replicate a target speaker's voice from a brief reference clip and synthesize speech in multiple languages, while allowing fine-grained control over style parameters like emotion, accent, and speaking pace.

What OpenVoice Does

  • Clones a voice from a short reference audio clip (as little as a few seconds)
  • Synthesizes speech in English, Spanish, French, Chinese, Japanese, and Korean (the languages V2 supports natively)
  • Provides independent control over emotion, rhythm, pauses, and intonation
  • Supports cross-lingual voice cloning where the reference and output languages differ
  • Runs locally without sending audio data to external services

Architecture Overview

OpenVoice uses a two-stage pipeline. The first stage is a base TTS model that generates speech with controllable style parameters (emotion, speed, pitch). The second stage is a tone color converter that transfers the target speaker's voice characteristics onto the base output. This decoupled design allows flexible style manipulation without retraining the voice cloning component.
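
The decoupling described above can be illustrated with a toy model. This is a minimal sketch, not OpenVoice's API: the `Utterance` type and the `base_tts` and `convert_tone_color` functions are hypothetical stand-ins that pass a small record around instead of audio. It only shows why the style chosen in stage one survives the timbre swap in stage two.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Utterance:
    """Toy stand-in for audio: the text plus the attributes each stage controls."""
    text: str
    emotion: str
    speed: float
    timbre: str  # whose voice the audio sounds like


def base_tts(text: str, emotion: str = "default", speed: float = 1.0) -> Utterance:
    """Stage 1 (hypothetical): render text in a fixed base voice with the chosen style."""
    return Utterance(text=text, emotion=emotion, speed=speed, timbre="base_speaker")


def convert_tone_color(utt: Utterance, target_timbre: str) -> Utterance:
    """Stage 2 (hypothetical): replace only the timbre; style fields pass through unchanged."""
    return replace(utt, timbre=target_timbre)


styled = base_tts("Hello there.", emotion="cheerful", speed=1.2)
cloned = convert_tone_color(styled, target_timbre="reference_speaker")
print(cloned)
```

Because stage two never touches `emotion` or `speed`, changing the style means re-running only stage one, and changing the target voice means re-running only stage two.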

Self-Hosting & Configuration

  • Install via pip or clone the repository and install dependencies
  • Download pre-trained checkpoints for the base speaker and tone color converter
  • Requires Python 3.9+ and PyTorch; GPU recommended for real-time synthesis
  • Reference audio should be clean speech without background music or noise
  • Adjust emotion, speed, and pitch parameters in the generation call
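
Before a first run, the requirements in the list above can be checked with a short pre-flight script. This is a sketch under assumptions: the checkpoint directory names below are illustrative and may not match the layout of the release you download, and `preflight` is a hypothetical helper, not part of OpenVoice.

```python
import sys
from pathlib import Path

MIN_PYTHON = (3, 9)  # minimum version from the requirements list

# Assumed layout; adjust to wherever you unpacked the checkpoints.
ASSUMED_CHECKPOINT_DIRS = ("checkpoints/base_speakers", "checkpoints/converter")


def preflight(root: Path, python_version=sys.version_info) -> list:
    """Return a list of human-readable problems; an empty list means ready."""
    problems = []
    if tuple(python_version[:2]) < MIN_PYTHON:
        problems.append(f"Python {MIN_PYTHON[0]}.{MIN_PYTHON[1]}+ required")
    for rel in ASSUMED_CHECKPOINT_DIRS:
        if not (root / rel).is_dir():
            problems.append(f"missing checkpoint directory: {rel}")
    return problems


if __name__ == "__main__":
    for problem in preflight(Path(".")):
        print("WARN:", problem)
```

The script deliberately avoids importing PyTorch, so it can run before any dependencies are installed.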

Key Features

  • Near-instant voice cloning from a few seconds of reference audio
  • Decoupled style and timbre control for creative flexibility
  • Cross-lingual synthesis without language-specific voice samples
  • Fully local inference with no cloud dependency
  • MIT-licensed for both research and commercial applications

Comparison with Similar Tools

  • Coqui TTS — broader TTS toolkit; voice cloning requires more reference data
  • Bark — generates speech, music, and sound effects; less precise voice cloning
  • XTTS — Coqui's cloning model; similar quality but different architecture
  • Fish Speech — multilingual TTS; focuses on naturalness over cloning fidelity
  • F5-TTS — flow-matching approach; strong zero-shot but fewer style controls

FAQ

Q: How much reference audio is needed? A: A clean clip of 5-30 seconds works well. Longer clips can improve timbre accuracy but are not required.
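
As a rough illustration of that guidance, the length of a reference clip can be checked with Python's standard `wave` module before it is passed to the cloning step. The helper names are hypothetical, the 5 to 30 second bounds come from the answer above, and the sketch handles uncompressed PCM WAV files only.

```python
import wave

MIN_SECONDS, MAX_SECONDS = 5.0, 30.0  # range suggested in the answer above


def clip_seconds(path: str) -> float:
    """Duration of a PCM WAV file in seconds (frame count divided by sample rate)."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def in_suggested_range(seconds: float) -> bool:
    """True when the duration falls inside the suggested reference-clip range."""
    return MIN_SECONDS <= seconds <= MAX_SECONDS
```

A clip outside the range is not necessarily unusable; the check only flags clips that are worth a second look.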

Q: Can I use OpenVoice for real-time applications? A: On a modern GPU, synthesis is faster than real-time. CPU inference is possible but significantly slower.
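
"Faster than real-time" is usually expressed as a real-time factor (RTF): synthesis time divided by the duration of the audio produced, where a value below 1.0 means the system keeps up with playback. The sketch below shows one way to measure it on your own hardware; `timed` and `real_time_factor` are hypothetical helpers, and the synthesis function you time is whatever call your setup uses.

```python
import time


def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent synthesizing / duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start
```

For example, if generating a 10-second clip takes 2 seconds, `real_time_factor(2.0, 10.0)` gives 0.2, which is five times faster than playback.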

Q: Does it handle singing or non-speech audio? A: OpenVoice is designed for speech synthesis. For singing, consider dedicated singing voice synthesis tools.

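Q: Is the output watermarked? A: It depends on the version and how it is configured. The upstream tone color converter code includes an audio watermarking step based on the wavmark package, which appears to be enabled by default and can be disabled, so check the release you install. Either way, users are responsible for ethical use and for complying with local regulations.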
