Scripts · July 5, 2026 · 1 min read

VoxCPM — Tokenizer-Free Multilingual Text-to-Speech with Voice Cloning

Open-source TTS model by OpenBMB that generates natural multilingual speech and clones voices without traditional tokenization.

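Agent-ready

Agents can install this asset directly: the agent first selects the current runtime, reviews the install plan, and then runs the matching command.

Native · 98/100 · Policy: allowed
Agent entry: any MCP/CLI agent
Type: Skill
Install: Single
Trust level: Established
Entry: VoxCPM Overview

Direct install command:
npx -y tokrepo@latest install 76273a21-7808-11f1-9bc6-00163e2b0d79 --target codex

Do a dry run to confirm the install plan before running this command.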

Introduction

VoxCPM is an open-source text-to-speech system developed by OpenBMB that models speech in a continuous space instead of converting it into discrete tokens. It generates natural, expressive speech in multiple languages and supports zero-shot voice cloning from short audio samples.

What VoxCPM Does

  • Generates multilingual speech without converting audio into discrete speech tokens
  • Performs zero-shot voice cloning from a few seconds of reference audio
  • Supports creative voice design with controllable speaker attributes
  • Delivers high-fidelity audio output comparable to commercial TTS systems
  • Handles code-switching and mixed-language text naturally

Architecture Overview

VoxCPM models speech as continuous representations rather than as sequences of discrete audio tokens. The model is built on the MiniCPM foundation and employs a flow-matching decoder to produce high-quality audio. This tokenizer-free design avoids the information loss that quantization introduces and enables more natural prosody.
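
To make the flow-matching idea concrete, here is a toy sketch of how sampling works in general: start from noise and integrate a velocity field from t = 0 to t = 1 with Euler steps. This is not VoxCPM's decoder. The function name `euler_flow_sample` is hypothetical, and the closed-form velocity below stands in for the learned, text-conditioned network a real model would use.

```python
import random


def euler_flow_sample(target, steps=10, seed=0):
    """Move a noise sample toward `target` by Euler-integrating dx/dt = v(x, t)."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]  # x at t = 0 is Gaussian noise
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        # Toy velocity for the straight-line path x_t = (1 - t) * x0 + t * x1.
        # A real decoder would call a trained network here instead.
        v = [(x1 - xi) / (1.0 - t) for xi, x1 in zip(x, target)]
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


if __name__ == "__main__":
    print(euler_flow_sample([0.5, -1.0, 2.0], steps=10))
```

The number of integration steps is the usual quality/latency knob in this family of decoders: fewer steps are faster, more steps follow the learned field more closely.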

Self-Hosting & Configuration

  • Install via pip with PyTorch and CUDA support for GPU acceleration
  • Minimum 8 GB VRAM recommended for inference; 24 GB for fine-tuning
  • Configure language and speaker settings through YAML config files
  • Deploy as an API server with the built-in FastAPI endpoint
  • Supports ONNX export for edge deployment scenarios
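
As a rough pre-flight check, the VRAM guidance above can be encoded in a small helper. This is a sketch under stated assumptions: `supported_workloads` is a hypothetical name that is not part of VoxCPM, and the thresholds are simply the 8 GB and 24 GB recommendations from the list.

```python
INFERENCE_MIN_GB = 8   # recommended minimum for inference (from the list above)
FINETUNE_MIN_GB = 24   # recommended minimum for fine-tuning (from the list above)


def supported_workloads(vram_gb):
    """Return the workloads a GPU with `vram_gb` of memory is recommended for."""
    workloads = []
    if vram_gb >= INFERENCE_MIN_GB:
        workloads.append("inference")
    if vram_gb >= FINETUNE_MIN_GB:
        workloads.append("fine-tuning")
    return workloads


if __name__ == "__main__":
    # With PyTorch installed, total memory in bytes is available via
    # torch.cuda.get_device_properties(0).total_memory; here the value is passed in.
    for gb in (4, 12, 24):
        print(gb, "GB ->", supported_workloads(gb) or ["CPU or smaller variant"])
```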

Key Features

  • Tokenizer-free architecture avoids discrete bottlenecks in speech generation
  • True-to-life voice cloning captures speaker timbre, rhythm, and emotion
  • Multi-language support spanning Chinese, English, Japanese, Korean, and more
  • Creative voice design lets you specify age, gender, and speaking style
  • Lightweight model variants available for resource-constrained environments

Comparison with Similar Tools

  • Bark — generates speech plus music and effects but lacks precise voice cloning
  • Fish Speech — fast multilingual TTS with fewer languages and no tokenizer-free design
  • Kokoro — extremely lightweight at 82M parameters but limited language coverage
  • F5-TTS — flow-matching TTS with strong quality but no creative voice design controls
  • ChatTTS — dialogue-optimized TTS focused on conversational expressiveness

FAQ

Q: What hardware do I need to run VoxCPM? A: A modern NVIDIA GPU with at least 8 GB VRAM is recommended. CPU inference is possible but significantly slower.

Q: How much reference audio is needed for voice cloning? A: As little as 3-5 seconds of clean speech can produce recognizable clones, though 10-30 seconds yields better quality.
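
Before cloning, it can help to check how long a reference clip actually is. The sketch below uses only Python's standard `wave` module and assumes an uncompressed WAV file. The function name `reference_clip_quality` and the labels are hypothetical; the 3-second and 10-second cutoffs come from the answer above.

```python
import wave


def reference_clip_quality(path):
    """Return (duration_in_seconds, label) for a WAV reference clip."""
    with wave.open(path, "rb") as wf:
        seconds = wf.getnframes() / wf.getframerate()
    if seconds < 3:
        label = "too short"    # below the 3-5 second minimum
    elif seconds < 10:
        label = "usable"       # recognizable clone expected
    else:
        label = "recommended"  # 10 seconds or more yields better quality
    return seconds, label
```

For example, a 4-second clip would be reported as "usable", while a 15-second clip would be "recommended".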

Q: Can VoxCPM handle mixed-language sentences? A: Yes. The tokenizer-free design handles code-switching between supported languages within a single utterance.

Q: Is VoxCPM suitable for real-time applications? A: Streaming inference is supported, achieving near-real-time latency on modern GPUs.
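
"Near-real-time" is usually quantified with the real-time factor (RTF): synthesis time divided by the duration of the audio produced. An RTF below 1.0 means audio is generated faster than it plays back. The helper names below are hypothetical and the timings in the example are made-up inputs, not measured VoxCPM results.

```python
def real_time_factor(synthesis_seconds, audio_seconds):
    """RTF = time spent synthesizing / duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def keeps_up_with_playback(synthesis_seconds, audio_seconds):
    """True when generation is faster than playback (RTF < 1.0)."""
    return real_time_factor(synthesis_seconds, audio_seconds) < 1.0


if __name__ == "__main__":
    # Illustrative numbers only: 2 s of compute for 10 s of audio.
    print(real_time_factor(2.0, 10.0), keeps_up_with_playback(2.0, 10.0))
```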
