Scripts · May 15, 2026 · 3 min read

Voicebox — Open-Source AI Voice Studio

An open-source AI voice studio for voice cloning, text-to-speech, dictation, and audio creation that runs locally with GPU acceleration on macOS and Linux.

Native · 98/100
Policy: allow
Agent surface: Any MCP/CLI agent
Kind: Skill
Install: Single
Trust: Established
Entrypoint: Voicebox Overview

Universal CLI install command:
npx tokrepo install 74dca6e7-5079-11f1-9bc6-00163e2b0d79

Introduction

Voicebox is an open-source AI voice studio that provides voice cloning, text-to-speech synthesis, and dictation capabilities in a polished desktop-quality interface. It runs locally using GPU acceleration and supports multiple TTS backends, giving creators full control over voice generation without cloud dependencies.

What Voicebox Does

  • Clones voices from short audio samples for personalized TTS
  • Synthesizes speech from text with adjustable speed, pitch, and emotion
  • Provides a dictation mode for voice-to-text transcription
  • Supports multiple model backends, including Qwen3-TTS for speech synthesis and Whisper for transcription
  • Runs entirely locally with CUDA or MLX acceleration

Architecture Overview

Voicebox is a TypeScript application with an Electron or web-based frontend and a local Python inference backend. The frontend provides an audio workstation-style interface for managing voice profiles, editing text, and monitoring generation. The backend orchestrates model loading, inference, and audio post-processing through a WebSocket connection, supporting hot-swapping between different TTS engines.
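
To make the hot-swapping idea concrete, here is a minimal Python sketch of how a backend might keep one engine loaded at a time and route frontend messages to it. It is an illustration under stated assumptions, not Voicebox's actual code: the `EngineRegistry` class, the message shapes (`switch`, `synthesize`), and the stub engines are all hypothetical, and the WebSocket transport is left out so the routing logic can be read on its own.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional

# Hypothetical engine signature: text in, raw audio bytes out.
SynthFn = Callable[[str], bytes]


@dataclass
class EngineRegistry:
    """Keeps at most one engine loaded and swaps it on request."""

    loaders: Dict[str, Callable[[], SynthFn]] = field(default_factory=dict)
    active_name: Optional[str] = None
    _active: Optional[SynthFn] = None

    def register(self, name: str, loader: Callable[[], SynthFn]) -> None:
        # Store a loader rather than a loaded model, so nothing is
        # loaded until the engine is actually selected.
        self.loaders[name] = loader

    def switch(self, name: str) -> None:
        if name not in self.loaders:
            raise KeyError(f"unknown engine: {name}")
        if name == self.active_name:
            return  # already active, nothing to do
        self._active = None  # release the old engine before loading the new one
        self._active = self.loaders[name]()
        self.active_name = name

    def handle(self, message: dict) -> dict:
        """Route one decoded frontend message (assumed JSON-like dict)."""
        kind = message.get("type")
        if kind == "switch":
            try:
                self.switch(message["engine"])
            except KeyError as exc:
                return {"type": "error", "reason": str(exc.args[0])}
            return {"type": "switched", "engine": self.active_name}
        if kind == "synthesize":
            if self._active is None:
                return {"type": "error", "reason": "no engine loaded"}
            audio = self._active(message["text"])
            return {"type": "audio", "engine": self.active_name, "bytes": len(audio)}
        return {"type": "error", "reason": f"unknown message type: {kind}"}


def make_stub(tag: str) -> Callable[[], SynthFn]:
    """Stand-in loader: returns a fake engine that echoes tagged text as bytes."""
    def load() -> SynthFn:
        return lambda text: f"{tag}:{text}".encode("utf-8")
    return load


registry = EngineRegistry()
registry.register("engine_a", make_stub("a"))
registry.register("engine_b", make_stub("b"))
```

In a real backend, each loader would load model weights and `handle` would be called from the WebSocket receive loop. Registering loaders instead of loaded models is what keeps switching cheap to request: only the selected engine occupies GPU memory.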

Self-Hosting & Configuration

  • Clone the repository and install Node.js and Python dependencies
  • Install CUDA toolkit for NVIDIA GPUs or use MLX on Apple Silicon
  • Download voice model checkpoints via the built-in model manager
  • Configure default voice profiles and output format in settings
  • Optionally run headless as an API server for integration with other tools
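
Before installing GPU dependencies, it can help to check which acceleration path a machine is likely to use. The sketch below is a rough heuristic, not part of Voicebox: the function names are hypothetical, the presence of `nvidia-smi` on the PATH is used as a stand-in for "an NVIDIA driver is installed", and the CPU fallback is an assumption rather than documented behavior.

```python
import importlib.util
import platform
import shutil


def pick_accelerator(system: str, machine: str, has_cuda: bool, has_mlx: bool) -> str:
    """Pure decision function so the rule can be tested without real hardware."""
    if system == "Darwin" and machine == "arm64" and has_mlx:
        return "mlx"   # Apple Silicon with the MLX package installed
    if system == "Linux" and has_cuda:
        return "cuda"  # NVIDIA driver tooling found on Linux
    return "cpu"       # assumed fallback; not confirmed by the Voicebox docs


def detect_accelerator() -> str:
    """Gather facts about the current machine and apply the rule above."""
    has_cuda = shutil.which("nvidia-smi") is not None      # heuristic only
    has_mlx = importlib.util.find_spec("mlx") is not None  # is MLX importable?
    return pick_accelerator(platform.system(), platform.machine(), has_cuda, has_mlx)


if __name__ == "__main__":
    print(detect_accelerator())
```

Separating the decision from the detection keeps the rule easy to adjust, for example if a setup needs to prefer CPU inference even when a GPU is present.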

Key Features

  • Voice cloning from audio samples as short as 10 seconds
  • Multiple TTS backends with one-click switching
  • Real-time waveform preview and audio editing
  • Batch text-to-speech for processing scripts and documents
  • Local-first architecture with no data leaving your machine
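
Batch text-to-speech over a long script usually means feeding the engine manageable pieces rather than one huge string. The following sketch shows one plausible way to prepare a script for batch processing by grouping whole sentences into chunks under a length limit. How Voicebox splits text internally is not stated here, so `split_script` and its `max_chars` parameter are assumptions for illustration.

```python
import re
from typing import List


def split_script(text: str, max_chars: int = 200) -> List[str]:
    """Group whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept intact rather than
    cut mid-sentence.
    """
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]

    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)  # close the chunk before it overflows
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    for i, chunk in enumerate(split_script("One. Two! Three?", max_chars=9)):
        print(i, chunk)
```

Each chunk can then be synthesized in turn and the resulting audio segments concatenated. Splitting on sentence boundaries avoids cutting a phrase in the middle, which tends to produce unnatural pauses.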

Comparison with Similar Tools

  • ElevenLabs — cloud-based voice API; Voicebox is fully local and open-source
  • Bark — generates speech with effects; Voicebox provides a full studio interface
  • Kokoro — lightweight TTS model; Voicebox wraps multiple backends in a rich UI
  • F5-TTS — flow-matching synthesis; Voicebox integrates it as one of several engines

FAQ

Q: What GPU is required? A: An NVIDIA GPU with at least 6 GB of VRAM, or an Apple Silicon Mac with MLX support, is recommended.

Q: How long does voice cloning take? A: Cloning a voice profile from a 10-second sample typically completes in under a minute.

Q: Can I use cloned voices commercially? A: The software is open-source, but you are responsible for ensuring you have consent and legal rights for any voice you clone.

Q: Does it support real-time synthesis? A: Yes, streaming synthesis is available for interactive applications.
