# Voicebox — Open-Source AI Voice Studio

> An open-source AI voice studio for voice cloning, text-to-speech, dictation, and audio creation that runs locally with GPU acceleration on macOS and Linux.

## Quick Use
```bash
git clone https://github.com/jamiepine/voicebox.git
cd voicebox
npm install
npm run dev
```

## Introduction
Voicebox is an open-source AI voice studio that provides voice cloning, text-to-speech synthesis, and dictation in a polished desktop interface. It runs locally with GPU acceleration and supports multiple TTS backends, so creators control voice generation without depending on cloud services.

## What Voicebox Does
- Clones voices from short audio samples for personalized TTS
- Synthesizes speech from text with adjustable speed, pitch, and emotion
- Provides a dictation mode for voice-to-text transcription
- Supports multiple model backends, including Qwen3-TTS for speech synthesis and Whisper for speech-to-text transcription
- Runs entirely locally with CUDA or MLX acceleration

## Architecture Overview
Voicebox is a TypeScript application with an Electron or web-based frontend and a local Python inference backend. The frontend provides an audio workstation-style interface for managing voice profiles, editing text, and monitoring generation. The backend orchestrates model loading, inference, and audio post-processing through a WebSocket connection, supporting hot-swapping between different TTS engines.
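
The hot-swapping idea described above can be illustrated with a small sketch. This is not Voicebox's actual backend code: `TTSEngine`, `EngineManager`, and `EchoEngine` are hypothetical names invented for the example, and the stand-in engine returns tagged bytes rather than audio. The sketch assumes that only one engine is held in memory at a time and that the previous engine is unloaded before the next one is built, which would free GPU memory first.

```python
from typing import Callable, Dict, Optional, Protocol


class TTSEngine(Protocol):
    """Hypothetical engine interface (an assumption, not Voicebox's API)."""

    def synthesize(self, text: str) -> bytes: ...

    def unload(self) -> None: ...


class EngineManager:
    """Keeps at most one engine loaded and swaps it on request."""

    def __init__(self) -> None:
        self._factories: Dict[str, Callable[[], TTSEngine]] = {}
        self._active_name: Optional[str] = None
        self._active: Optional[TTSEngine] = None

    def register(self, name: str, factory: Callable[[], TTSEngine]) -> None:
        # Factories are stored, not called, so nothing loads until needed.
        self._factories[name] = factory

    def switch(self, name: str) -> None:
        if name not in self._factories:
            raise KeyError(f"unknown engine: {name}")
        if name == self._active_name:
            return  # already active, nothing to do
        # Unload the old engine before building the new one.
        if self._active is not None:
            self._active.unload()
            self._active, self._active_name = None, None
        self._active = self._factories[name]()
        self._active_name = name

    def synthesize(self, text: str) -> bytes:
        if self._active is None:
            raise RuntimeError("no engine loaded")
        return self._active.synthesize(text)


class EchoEngine:
    """Stand-in engine that returns tagged bytes instead of audio."""

    def __init__(self, tag: str) -> None:
        self.tag = tag
        self.loaded = True

    def synthesize(self, text: str) -> bytes:
        return f"{self.tag}:{text}".encode()

    def unload(self) -> None:
        self.loaded = False


manager = EngineManager()
manager.register("engine-a", lambda: EchoEngine("a"))
manager.register("engine-b", lambda: EchoEngine("b"))
manager.switch("engine-a")
first = manager.synthesize("hello")
manager.switch("engine-b")  # engine-a is unloaded before engine-b is built
second = manager.synthesize("hello")
```

Registering factories rather than engine instances keeps startup cheap, since a model would only be loaded when it is first selected.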

## Self-Hosting & Configuration
- Clone the repository and install Node.js and Python dependencies
- Install CUDA toolkit for NVIDIA GPUs or use MLX on Apple Silicon
- Download voice model checkpoints via the built-in model manager
- Configure default voice profiles and output format in settings
- Optionally run headless as an API server for integration with other tools
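
As a rough illustration of the CUDA-or-MLX choice in the steps above, the following sketch picks an accelerator from basic host information. The function names are hypothetical, and the detection is a heuristic: it treats the presence of the `nvidia-smi` command as a sign of a usable NVIDIA GPU, and the `"cpu"` fallback is an assumption of this example rather than something the project documents.

```python
import platform
import shutil


def pick_accelerator(system: str, machine: str, has_cuda: bool) -> str:
    """Return 'cuda', 'mlx', or 'cpu' for a described host.

    Hypothetical helper for illustration; not part of Voicebox.
    """
    if has_cuda:
        return "cuda"
    # Apple Silicon Macs report Darwin/arm64 when Python runs natively.
    if system == "Darwin" and machine == "arm64":
        return "mlx"
    return "cpu"  # assumed fallback, not confirmed by the project


def detect_accelerator() -> str:
    """Apply pick_accelerator to the current machine."""
    # Heuristic: nvidia-smi on PATH usually means an NVIDIA driver is installed.
    has_cuda = shutil.which("nvidia-smi") is not None
    return pick_accelerator(platform.system(), platform.machine(), has_cuda)


print(detect_accelerator())
```

Note that a Python interpreter running under Rosetta on an Apple Silicon Mac reports `x86_64`, so this check would fall through to the fallback in that case.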

## Key Features
- Voice cloning from audio samples as short as 10 seconds
- Multiple TTS backends with one-click switching
- Real-time waveform preview and audio editing
- Batch text-to-speech for processing scripts and documents
- Local-first architecture with no data leaving your machine
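
Batch text-to-speech over a long script usually involves splitting the text into pieces a model can handle one at a time. The sketch below shows one simple way to do that, grouping whole sentences into chunks under a character limit. The function name `split_script` and the default limit of 200 characters are assumptions made for this example; how Voicebox itself segments text is not described here.

```python
import re
from typing import List


def split_script(text: str, max_chars: int = 200) -> List[str]:
    """Group whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept as its own chunk
    rather than being cut mid-sentence.
    """
    # Split after ., !, or ? followed by whitespace; drop empty pieces.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow, so close the chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


script = "Welcome to the show. Today we talk about voices. Stay tuned!"
for index, chunk in enumerate(split_script(script, max_chars=45), start=1):
    print(index, chunk)
```

Each chunk could then be passed to whichever engine is active, with the resulting audio segments concatenated in order.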

## Comparison with Similar Tools
- **ElevenLabs** — cloud-based voice API; Voicebox is fully local and open-source
- **Bark** — generates speech with effects; Voicebox provides a full studio interface
- **Kokoro** — lightweight TTS model; Voicebox wraps multiple backends in a rich UI
- **F5-TTS** — flow-matching synthesis; Voicebox integrates it as one of several engines

## FAQ
**Q: What GPU is required?**
A: An NVIDIA GPU with at least 6 GB of VRAM, or an Apple Silicon Mac with MLX support, is recommended.

**Q: How long does voice cloning take?**
A: Cloning a voice profile from a 10-second sample typically completes in under a minute.

**Q: Can I use cloned voices commercially?**
A: The software is open-source, but you are responsible for ensuring you have consent and legal rights for any voice you clone.

**Q: Does it support real-time synthesis?**
A: Yes, streaming synthesis is available for interactive applications.

## Sources
- https://github.com/jamiepine/voicebox

---
Source: https://tokrepo.com/en/workflows/asset-74dca6e7
Author: Script Depot