Introduction
Voicebox is an open-source AI voice studio that provides voice cloning, text-to-speech synthesis, and dictation capabilities in a polished desktop-quality interface. It runs locally using GPU acceleration and supports multiple TTS backends, giving creators full control over voice generation without cloud dependencies.
What Voicebox Does
- Clones voices from short audio samples for personalized TTS
- Synthesizes speech from text with adjustable speed, pitch, and emotion
- Provides a dictation mode for voice-to-text transcription
- Supports multiple TTS model backends such as Qwen3-TTS, and uses Whisper for speech-to-text in dictation mode
- Runs entirely locally with CUDA or MLX acceleration
Architecture Overview
Voicebox is a TypeScript application with an Electron or web-based frontend and a local Python inference backend. The frontend provides an audio workstation-style interface for managing voice profiles, editing text, and monitoring generation. The backend orchestrates model loading, inference, and audio post-processing through a WebSocket connection, supporting hot-swapping between different TTS engines.
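To make the frontend-backend split concrete, here is a minimal sketch of what messages over that WebSocket link might look like. The message types and field names below are illustrative assumptions, not Voicebox's actual protocol; consult the source for the real schema.

```python
import json

# Hypothetical message shapes for the frontend <-> backend WebSocket link.
# Names like "synthesize" and "audio_chunk" are assumptions for illustration.

def make_synthesis_request(text: str, voice_id: str, engine: str = "qwen3-tts") -> str:
    """Serialize a synthesis request the frontend might send over the socket."""
    return json.dumps({
        "type": "synthesize",
        "payload": {"text": text, "voice_id": voice_id, "engine": engine},
    })

def parse_backend_event(raw: str) -> dict:
    """Decode a backend event, e.g. a progress or audio-chunk notification."""
    event = json.loads(raw)
    if event.get("type") not in {"progress", "audio_chunk", "done", "error"}:
        raise ValueError(f"unknown event type: {event.get('type')}")
    return event

msg = make_synthesis_request("Hello, world.", voice_id="narrator-01")
percent = parse_backend_event('{"type": "progress", "payload": {"percent": 42}}')["payload"]["percent"]
```

Keeping the protocol as small typed messages like this is what makes hot-swapping engines practical: the frontend only needs the envelope to stay stable while the backend switches models.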
Self-Hosting & Configuration
- Clone the repository and install Node.js and Python dependencies
- Install CUDA toolkit for NVIDIA GPUs or use MLX on Apple Silicon
- Download voice model checkpoints via the built-in model manager
- Configure default voice profiles and output format in settings
- Optionally run headless as an API server for integration with other tools
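The profile, output-format, and headless-API settings from the steps above might live in a JSON settings file along these lines. Every key here is an illustrative assumption; check the app's settings UI or documentation for the real schema.

```json
{
  "defaultVoiceProfile": "narrator-01",
  "output": { "format": "wav", "sampleRate": 48000 },
  "backend": { "engine": "qwen3-tts", "device": "cuda" },
  "api": { "headless": false, "port": 8000 }
}
```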
Key Features
- Voice cloning from audio samples as short as 10 seconds
- Multiple TTS backends with one-click switching
- Real-time waveform preview and audio editing
- Batch text-to-speech for processing scripts and documents
- Local-first architecture with no data leaving your machine
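Batch text-to-speech for long scripts and documents typically starts by splitting the input into sentence-aligned chunks that fit the model's context. A minimal sketch of that preprocessing step, using only the standard library (Voicebox's own batching logic may differ):

```python
import re

def chunk_text(document: str, max_chars: int = 400) -> list[str]:
    """Split a document into sentence-aligned chunks for batch synthesis.

    Chunks never break mid-sentence and stay under max_chars where possible,
    so each synthesis call gets natural phrase boundaries.
    """
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks

parts = chunk_text("First sentence. Second sentence! Third?", max_chars=20)
```

Each chunk can then be synthesized independently and the resulting audio concatenated, which also allows chunks to be processed in parallel across a GPU batch.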
Comparison with Similar Tools
- ElevenLabs — cloud-based voice API; Voicebox is fully local and open-source
- Bark — generative TTS that can produce nonverbal sounds such as laughter and sighs; Voicebox provides a full studio interface
- Kokoro — lightweight TTS model; Voicebox wraps multiple backends in a rich UI
- F5-TTS — flow-matching synthesis; Voicebox integrates it as one of several engines
FAQ
Q: What GPU is required? A: An NVIDIA GPU with 6+ GB VRAM or Apple Silicon Mac with MLX support is recommended.
Q: How long does voice cloning take? A: Cloning a voice profile from a 10-second sample typically completes in under a minute.
Q: Can I use cloned voices commercially? A: The software is open-source, but you are responsible for ensuring you have consent and legal rights for any voice you clone.
Q: Does it support real-time synthesis? A: Yes, streaming synthesis is available for interactive applications.
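The streaming answer above can be sketched as an async consumer that plays audio chunks as they arrive rather than waiting for the full clip. The generator below simulates the backend with placeholder byte chunks; the real transport and chunk format are defined by Voicebox's streaming endpoint.

```python
import asyncio

async def stream_synthesis(text: str):
    """Simulate a streaming TTS backend that yields audio chunks as bytes.

    A stand-in for the real streaming endpoint: each yielded value here is
    just the encoded word, where real code would yield PCM audio frames.
    """
    for word in text.split():
        await asyncio.sleep(0)   # real code would await the model here
        yield word.encode()      # placeholder for an audio frame

async def play(text: str) -> int:
    received = 0
    async for chunk in stream_synthesis(text):
        received += len(chunk)   # real code would feed an audio device
    return received

total = asyncio.run(play("hello streaming world"))
```

The key property for interactive use is that the first chunk is available as soon as it is generated, so playback latency is bounded by the first chunk's synthesis time rather than the whole utterance.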