Scripts · Jul 5, 2026 · 3 min read

VoxCPM — Tokenizer-Free Multilingual Text-to-Speech with Voice Cloning

Open-source TTS model by OpenBMB that generates natural multilingual speech and clones voices without discrete speech tokenization.

Agent-ready installation

This asset can be installed once you have chosen a runtime, reviewed the plan, and run the matching command.

Native · 98/100 · Policy: allow
Agent surface: Any MCP/CLI agent
Type: Skill
Installation: Single
Trust: Established
Entry point: VoxCPM Overview
Direct install command:
npx -y tokrepo@latest install 76273a21-7808-11f1-9bc6-00163e2b0d79 --target codex

Run this after confirming the plan in a dry run.

Introduction

VoxCPM is an open-source text-to-speech system developed by OpenBMB. It is tokenizer-free in the sense that it models speech in a continuous space instead of converting audio into discrete tokens. It generates natural, expressive speech in multiple languages and supports zero-shot voice cloning from short audio samples.

What VoxCPM Does

  • Generates multilingual speech without phoneme conversion or discrete audio tokens
  • Performs zero-shot voice cloning from a few seconds of reference audio
  • Supports creative voice design with controllable speaker attributes
  • Delivers high-fidelity audio output comparable to commercial TTS systems
  • Handles code-switching and mixed-language text naturally

Architecture Overview

VoxCPM models speech as continuous representations rather than discrete audio tokens. The model is built on the MiniCPM foundation and uses a flow-matching decoder to produce high-quality audio. Because nothing is snapped to a fixed codebook, this tokenizer-free design avoids the information loss that quantization introduces and leaves room for more natural prosody.
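The point about quantization loss can be illustrated with a toy example. The sketch below is not VoxCPM code: it assumes a one-dimensional sine wave as a stand-in for a speech feature track and a uniform scalar quantizer as a stand-in for a discrete codebook, then measures how much error each codebook size introduces. The names `quantize` and `rms_error` are hypothetical helpers written for this illustration.

```python
import math


def quantize(value: float, levels: int, lo: float = -1.0, hi: float = 1.0) -> float:
    """Snap `value` to the nearest of `levels` evenly spaced points in [lo, hi]."""
    step = (hi - lo) / (levels - 1)
    clipped = min(max(value, lo), hi)
    index = round((clipped - lo) / step)
    return lo + index * step


def rms_error(signal, levels: int) -> float:
    """Root-mean-square error introduced by quantizing every sample."""
    squared = [(x - quantize(x, levels)) ** 2 for x in signal]
    return math.sqrt(sum(squared) / len(squared))


# Toy "feature track": one period of a sine wave sampled 200 times.
signal = [math.sin(2 * math.pi * n / 200) for n in range(200)]

coarse_error = rms_error(signal, levels=8)    # small codebook
fine_error = rms_error(signal, levels=256)    # larger codebook

print(f"8 levels:   RMS error {coarse_error:.4f}")
print(f"256 levels: RMS error {fine_error:.4f}")
print("continuous: RMS error 0.0000 (values are kept as-is)")
```

A larger codebook shrinks the error but never removes it, which is the trade-off a continuous representation sidesteps.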

Self-Hosting & Configuration

  • Install via pip with PyTorch and CUDA support for GPU acceleration
  • At least 8 GB of VRAM is recommended for inference, and 24 GB for fine-tuning
  • Configure language and speaker settings through YAML config files
  • Deploy as an API server with the built-in FastAPI endpoint
  • Supports ONNX export for edge deployment scenarios
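The VRAM guidance in the list above can be turned into a quick pre-flight check. This is a minimal sketch, not part of VoxCPM: `recommended_workloads` is a hypothetical helper, and the only inputs it relies on are the 8 GB and 24 GB figures quoted above.

```python
# Thresholds taken from the recommendations listed above.
INFERENCE_MIN_GB = 8.0
FINE_TUNING_MIN_GB = 24.0


def recommended_workloads(vram_gb: float) -> list[str]:
    """Return the workloads whose recommended VRAM floor is met."""
    workloads = []
    if vram_gb >= INFERENCE_MIN_GB:
        workloads.append("gpu-inference")
    if vram_gb >= FINE_TUNING_MIN_GB:
        workloads.append("fine-tuning")
    return workloads


for gb in (6, 12, 24):
    plan = recommended_workloads(gb) or ["below recommended GPU spec"]
    print(f"{gb} GB -> {plan}")
```

On a machine with PyTorch and CUDA, the available memory can be read with `torch.cuda.get_device_properties(0).total_memory`, which reports bytes; dividing by `1024 ** 3` gives a value suitable for `vram_gb`.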

Key Features

  • Tokenizer-free architecture avoids discrete bottlenecks in speech generation
  • True-to-life voice cloning captures speaker timbre, rhythm, and emotion
  • Multi-language support spanning Chinese, English, Japanese, Korean, and more
  • Creative voice design lets you specify age, gender, and speaking style
  • Lightweight model variants available for resource-constrained environments

Comparison with Similar Tools

  • Bark — generates speech plus music and effects but lacks precise voice cloning
  • Fish Speech — fast multilingual TTS with fewer languages and no tokenizer-free design
  • Kokoro — extremely lightweight at 82M parameters but limited language coverage
  • F5-TTS — flow-matching TTS with strong quality but no creative voice design controls
  • ChatTTS — dialogue-optimized TTS focused on conversational expressiveness

FAQ

Q: What hardware do I need to run VoxCPM? A: A modern NVIDIA GPU with at least 8 GB VRAM is recommended. CPU inference is possible but significantly slower.

Q: How much reference audio is needed for voice cloning? A: As little as 3-5 seconds of clean speech can produce recognizable clones, though 10-30 seconds yields better quality.
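A reference clip can be checked against these durations before it is used for cloning. The sketch below uses only Python's standard `wave` module and assumes an uncompressed PCM WAV file. The helper names and the rating labels are hypothetical, and treating everything from 3 up to 10 seconds as "usable" is an assumption, since the answer above does not say how 5 to 10 seconds should be rated.

```python
import io
import wave


def clip_duration_seconds(wav_file) -> float:
    """Duration of a PCM WAV file, given a path or a binary file object."""
    with wave.open(wav_file, "rb") as clip:
        return clip.getnframes() / clip.getframerate()


def rate_reference_clip(seconds: float) -> str:
    """Classify a clip length against the ranges quoted in the FAQ answer."""
    if seconds < 3:
        return "too short"
    if seconds < 10:
        return "usable"
    if seconds <= 30:
        return "recommended"
    return "longer than the quoted range"


# Demo: build 4 seconds of 16 kHz mono silence in memory and rate it.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as out:
    out.setnchannels(1)
    out.setsampwidth(2)       # 16-bit samples
    out.setframerate(16000)
    out.writeframes(b"\x00\x00" * 16000 * 4)
buffer.seek(0)

demo_seconds = clip_duration_seconds(buffer)
demo_rating = rate_reference_clip(demo_seconds)
print(f"{demo_seconds:.1f} s -> {demo_rating}")
```

Duration is only one part of "clean speech"; this check does not detect noise, clipping, or silence.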

Q: Can VoxCPM handle mixed-language sentences? A: Yes. The tokenizer-free design handles code-switching between supported languages within a single utterance.

Q: Is VoxCPM suitable for real-time applications? A: Streaming inference is supported, achieving near-real-time latency on modern GPUs.
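"Near-real-time" is usually quantified with the real-time factor (RTF): synthesis time divided by the duration of the audio produced. The sketch below is a generic calculation, not a VoxCPM benchmark; the function names are hypothetical and the timings in the demo are made-up inputs, not measurements.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Synthesis time divided by the duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def keeps_up_with_playback(rtf: float) -> bool:
    """An RTF below 1.0 means audio is generated faster than it plays."""
    return rtf < 1.0


# Illustrative numbers only: 2 s of compute for 10 s of speech.
example_rtf = real_time_factor(synthesis_seconds=2.0, audio_seconds=10.0)
print(f"RTF = {example_rtf:.2f}, keeps up: {keeps_up_with_playback(example_rtf)}")
```

For streaming use, the delay before the first audio chunk arrives matters as much as the overall RTF, so both are worth measuring on the target GPU.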
