Scripts · Jul 5, 2026 · 3 min read

VoxCPM — Tokenizer-Free Multilingual Text-to-Speech with Voice Cloning

Open-source TTS model by OpenBMB that generates natural multilingual speech and clones voices without traditional tokenization.

Agent-ready installation

This asset can be installed after you choose the runtime, review the plan, and run the corresponding command.

  • Native · 98/100
  • Policy: allow
  • Agent surface: any MCP/CLI agent
  • Type: Skill
  • Installation: Single
  • Trust: Established
  • Entry: VoxCPM Overview

Direct install command

npx -y tokrepo@latest install 76273a21-7808-11f1-9bc6-00163e2b0d79 --target codex

Run this only after confirming the plan with a dry run.

Introduction

VoxCPM is an open-source text-to-speech system developed by OpenBMB. Its "tokenizer-free" design models speech in a continuous space instead of converting it to discrete tokens. It generates natural, expressive speech in multiple languages and supports zero-shot voice cloning from short audio samples.

What VoxCPM Does

  • Generates multilingual speech as continuous representations, without a discrete speech tokenizer
  • Performs zero-shot voice cloning from a few seconds of reference audio
  • Supports creative voice design with controllable speaker attributes
  • Delivers high-fidelity audio output comparable to commercial TTS systems
  • Handles code-switching and mixed-language text naturally

Architecture Overview

VoxCPM models speech as continuous representations rather than discrete tokens. The model is built on the MiniCPM foundation and employs a flow-matching decoder to produce high-quality audio. Because nothing is snapped to a codebook, this tokenizer-free design avoids the information loss that quantization introduces, which is intended to yield more natural prosody.
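The point about quantization loss can be illustrated with a toy example. The sketch below is not VoxCPM code: it assumes a single made-up latent feature and a hypothetical evenly spaced codebook, and only shows that snapping continuous values to codebook entries introduces an error that a continuous path does not.

```python
import math


def quantize(value: float, levels: int, lo: float = -1.0, hi: float = 1.0) -> float:
    """Snap a value in [lo, hi] to the nearest of `levels` evenly spaced entries."""
    step = (hi - lo) / (levels - 1)
    clipped = min(max(value, lo), hi)
    index = round((clipped - lo) / step)
    return lo + index * step


def mean_squared_error(a, b) -> float:
    """Average squared difference between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Toy "latent" trajectory (an assumption for illustration): one smooth
# feature sampled over 200 frames.
latent = [0.9 * math.sin(2 * math.pi * t / 50) for t in range(200)]

# Continuous path: values are carried through unchanged.
continuous = list(latent)

# Discrete path: every frame is replaced by its nearest codebook entry.
coarse = [quantize(v, levels=8) for v in latent]
fine = [quantize(v, levels=256) for v in latent]

print("continuous MSE:", mean_squared_error(latent, continuous))
print("8-level MSE:   ", mean_squared_error(latent, coarse))
print("256-level MSE: ", mean_squared_error(latent, fine))
```

A larger codebook shrinks the error but never removes it, which is the bottleneck a continuous representation sidesteps.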

Self-Hosting & Configuration

  • Install via pip with PyTorch and CUDA support for GPU acceleration
  • At least 8 GB of VRAM recommended for inference; 24 GB for fine-tuning
  • Configure language and speaker settings through YAML config files
  • Deploy as an API server with the built-in FastAPI endpoint
  • Supports ONNX export for edge deployment scenarios
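As a quick pre-flight check, the VRAM guidance above can be turned into a small helper. This is a minimal sketch: the function name and thresholds-as-hard-limits are assumptions for illustration (the figures are recommendations, not enforced minimums), and nothing here calls into VoxCPM itself.

```python
INFERENCE_MIN_GB = 8    # recommended VRAM for inference, per the list above
FINE_TUNE_MIN_GB = 24   # recommended VRAM for fine-tuning, per the list above


def supported_workloads(vram_gb: float) -> list[str]:
    """Return the workloads whose recommended VRAM the given GPU meets."""
    workloads = []
    if vram_gb >= INFERENCE_MIN_GB:
        workloads.append("inference")
    if vram_gb >= FINE_TUNE_MIN_GB:
        workloads.append("fine-tuning")
    return workloads


for vram in (6, 12, 24):
    print(f"{vram} GB ->", supported_workloads(vram) or "below recommendation")
```

On a machine with PyTorch installed, `torch.cuda.get_device_properties(0).total_memory / 1024**3` is one way to obtain the value to pass in.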

Key Features

  • Tokenizer-free architecture avoids discrete bottlenecks in speech generation
  • True-to-life voice cloning captures speaker timbre, rhythm, and emotion
  • Multi-language support spanning Chinese, English, Japanese, Korean, and more
  • Creative voice design lets you specify age, gender, and speaking style
  • Lightweight model variants available for resource-constrained environments

Comparison with Similar Tools

  • Bark — generates speech plus music and effects but lacks precise voice cloning
  • Fish Speech — fast multilingual TTS with fewer languages and no tokenizer-free design
  • Kokoro — extremely lightweight at 82M parameters but limited language coverage
  • F5-TTS — flow-matching TTS with strong quality but no creative voice design controls
  • ChatTTS — dialogue-optimized TTS focused on conversational expressiveness

FAQ

Q: What hardware do I need to run VoxCPM? A: A modern NVIDIA GPU with at least 8 GB VRAM is recommended. CPU inference is possible but significantly slower.

Q: How much reference audio is needed for voice cloning? A: As little as 3 to 5 seconds of clean speech can produce a recognizable clone, though 10 to 30 seconds yields better quality.
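Before submitting a reference clip, its length can be checked against these ranges. The sketch below uses only Python's standard `wave` module. The bucket names and the treatment of the gap between the two stated ranges are assumptions for illustration, and the silent clip is synthetic test data, not a real reference recording.

```python
import io
import wave


def wav_duration_seconds(source) -> float:
    """Duration of a PCM WAV file, given a path or a binary file-like object."""
    with wave.open(source, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def rate_reference_clip(seconds: float) -> str:
    """Classify a clip length against the guidance in the answer above."""
    if seconds < 3:
        return "too short"
    if seconds < 10:
        return "usable"      # assumption: anything from 3 s up to 10 s
    if seconds <= 30:
        return "preferred"
    return "longer than needed"


def make_silent_wav(seconds: float, sample_rate: int = 16000) -> io.BytesIO:
    """Build an in-memory mono 16-bit WAV of silence (synthetic test data)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * int(seconds * sample_rate))
    buffer.seek(0)
    return buffer


clip = make_silent_wav(4.0)
duration = wav_duration_seconds(clip)
print(duration, rate_reference_clip(duration))
```

Duration is only one factor; the answer above also asks for clean speech, which this check does not assess.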

Q: Can VoxCPM handle mixed-language sentences? A: Yes. The tokenizer-free design handles code-switching between supported languages within a single utterance.

Q: Is VoxCPM suitable for real-time applications? A: Streaming inference is supported, achieving near-real-time latency on modern GPUs.
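One common way to quantify "near-real-time" is the real-time factor (RTF): synthesis time divided by the duration of the audio produced. The sketch below shows how it could be measured. `fake_synthesize` is a hypothetical stand-in that returns silence, not a VoxCPM call, so swap in whatever synthesis function you actually use.

```python
import time


def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF below 1.0 means audio is produced faster than it plays back."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def timed(fn, *args):
    """Call fn(*args) and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


def fake_synthesize(text: str, sample_rate: int = 16000) -> list[float]:
    # Hypothetical stand-in for a TTS call: 0.1 s of silence per character.
    return [0.0] * int(0.1 * len(text) * sample_rate)


samples, elapsed = timed(fake_synthesize, "Hello")
audio_seconds = len(samples) / 16000
print("RTF:", real_time_factor(elapsed, audio_seconds))
```

For streaming use, time to the first audio chunk matters as much as overall RTF, so it is worth measuring both on your own hardware.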
