# VoxCPM: Tokenizer-Free Multilingual Text-to-Speech with Voice Cloning

> Open-source TTS model by OpenBMB that generates natural multilingual speech and clones voices without discrete speech tokenization.

## Quick Use

```bash
pip install voxcpm
python -m voxcpm.demo --text "Hello world" --output hello.wav
```

## Introduction

VoxCPM is an open-source text-to-speech system developed by OpenBMB. "Tokenizer-free" here means that it models speech in a continuous space instead of converting audio into discrete tokens. It generates natural, expressive speech in multiple languages and supports zero-shot voice cloning from short audio samples.

## What VoxCPM Does

- Generates multilingual speech without relying on phoneme conversion or discrete speech tokens
- Performs zero-shot voice cloning from a few seconds of reference audio
- Supports creative voice design with controllable speaker attributes
- Delivers high-fidelity audio output comparable to commercial TTS systems
- Handles code-switching and mixed-language text naturally

## Architecture Overview

VoxCPM takes a continuous speech representation approach: it models audio in a continuous space rather than as a sequence of discrete tokens. The model is built on the MiniCPM foundation and employs a flow-matching decoder to produce high-quality audio. Because no quantization step is involved, the design avoids the information loss that discrete codebooks introduce and enables more natural prosody.
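
The flow-matching step described in the architecture overview can be illustrated with a toy sketch in pure Python. This is not VoxCPM's decoder: `toy_velocity`, `euler_sample`, and the three-value `target_frame` are hypothetical names chosen for illustration, and the closed-form velocity stands in for what would be a trained neural network. The sketch only shows the general sampling idea: start from Gaussian noise and integrate a velocity field from t = 0 to t = 1, producing continuous values with no quantization step.

```python
import random


def toy_velocity(x, t, target):
    """Hypothetical stand-in for a learned velocity network.

    For the straight-line path x_t = (1 - t) * x0 + t * x1, the velocity
    that carries the current point x_t to x1 is (x1 - x_t) / (1 - t).
    """
    return [(x1 - xi) / (1.0 - t) for xi, x1 in zip(x, target)]


def euler_sample(target, num_steps=10, seed=0):
    """Integrate dx/dt = v(x, t) from t=0 (noise) to t=1 (data) with Euler steps."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]  # start from Gaussian noise
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step / num_steps  # t stays below 1, so the division is safe
        v = toy_velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Assumed toy "audio frame": three continuous feature values.
target_frame = [0.25, -0.5, 0.75]
sample = euler_sample(target_frame)
print(sample)
```

In a real system the velocity would be predicted by a network conditioned on text and speaker information rather than computed from a known target, and the output would be a sequence of continuous acoustic features, not a single three-value frame.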
## Self-Hosting & Configuration

- Install via pip with PyTorch and CUDA support for GPU acceleration
- Minimum 8 GB VRAM recommended for inference; 24 GB for fine-tuning
- Configure language and speaker settings through YAML config files
- Deploy as an API server with the built-in FastAPI endpoint
- Supports ONNX export for edge deployment scenarios

## Key Features

- Tokenizer-free architecture avoids discrete bottlenecks in speech generation
- True-to-life voice cloning captures speaker timbre, rhythm, and emotion
- Multi-language support spanning Chinese, English, Japanese, Korean, and more
- Creative voice design lets you specify age, gender, and speaking style
- Lightweight model variants available for resource-constrained environments

## Comparison with Similar Tools

- **Bark**: generates speech plus music and effects but lacks precise voice cloning
- **Fish Speech**: fast multilingual TTS with fewer languages and no tokenizer-free design
- **Kokoro**: extremely lightweight at 82M parameters but limited language coverage
- **F5-TTS**: flow-matching TTS with strong quality but no creative voice design controls
- **ChatTTS**: dialogue-optimized TTS focused on conversational expressiveness

## FAQ

**Q: What hardware do I need to run VoxCPM?**
A: A modern NVIDIA GPU with at least 8 GB VRAM is recommended. CPU inference is possible but significantly slower.

**Q: How much reference audio is needed for voice cloning?**
A: As little as 3-5 seconds of clean speech can produce recognizable clones, though 10-30 seconds yields better quality.

**Q: Can VoxCPM handle mixed-language sentences?**
A: Yes. The tokenizer-free design handles code-switching between supported languages within a single utterance.

**Q: Is VoxCPM suitable for real-time applications?**
A: Streaming inference is supported, achieving near-real-time latency on modern GPUs.
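
The FAQ's guidance on reference audio length can be checked before attempting a clone. The following is a minimal sketch using only Python's standard `wave` module; it does not call VoxCPM at all. The helper names (`wav_duration_seconds`, `classify_reference`, `make_silent_wav`) and the three labels are hypothetical, and the thresholds simply restate the 3-second and 10-second figures from the FAQ.

```python
import io
import wave

MIN_SECONDS = 3.0           # lower bound quoted in the FAQ
RECOMMENDED_SECONDS = 10.0  # start of the range the FAQ says yields better quality


def wav_duration_seconds(source):
    """Return the duration of a PCM WAV given a path or a binary file object."""
    with wave.open(source, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def classify_reference(duration):
    """Map a duration in seconds to a hypothetical suitability label."""
    if duration < MIN_SECONDS:
        return "too_short"
    if duration < RECOMMENDED_SECONDS:
        return "usable"
    return "recommended"


def make_silent_wav(seconds, sample_rate=16000):
    """Build an in-memory mono 16-bit WAV of silence, used here only for the demo."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * int(seconds * sample_rate))
    buffer.seek(0)
    return buffer


for seconds in (1, 4, 12):
    duration = wav_duration_seconds(make_silent_wav(seconds))
    print(seconds, classify_reference(duration))
```

For a real recording, pass the file path to `wav_duration_seconds` instead of the in-memory buffer. Duration is only one factor; the FAQ also asks for clean speech, which this check does not measure.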
## Sources

- https://github.com/OpenBMB/VoxCPM
- https://openbmb.cn/

---

Source: https://tokrepo.com/en/workflows/asset-76273a21
Author: Script Depot