Scripts · May 15, 2026 · 3 min read

Voicebox — Open-Source AI Voice Studio

An open-source AI voice studio for voice cloning, text-to-speech, dictation, and audio creation that runs locally with GPU acceleration on macOS and Linux.

Agent-ready

This asset can be read and installed directly by agents.

TokRepo exposes a universal CLI command, an install contract, JSON metadata, an adapter-specific plan, and the raw content to help agents assess fit, risk, and next actions.

Native · 98/100 · Policy: allow
Agent surface: Any MCP/CLI agent
Type: Skill
Installation: Single
Trust: Established
Entry point: Voicebox Overview
Universal CLI command: npx tokrepo install 74dca6e7-5079-11f1-9bc6-00163e2b0d79

Introduction

Voicebox is an open-source AI voice studio that provides voice cloning, text-to-speech synthesis, and dictation capabilities in a polished desktop-quality interface. It runs locally using GPU acceleration and supports multiple TTS backends, giving creators full control over voice generation without cloud dependencies.

What Voicebox Does

  • Clones voices from short audio samples for personalized TTS
  • Synthesizes speech from text with adjustable speed, pitch, and emotion
  • Provides a dictation mode for voice-to-text transcription
  • Supports multiple model backends, including Qwen3-TTS for synthesis and Whisper for transcription
  • Runs entirely locally with CUDA or MLX acceleration

Architecture Overview

Voicebox is a TypeScript application with an Electron or web-based frontend and a local Python inference backend. The frontend provides an audio workstation-style interface for managing voice profiles, editing text, and monitoring generation. The backend handles model loading, inference, and audio post-processing, communicates with the frontend over a WebSocket connection, and supports hot-swapping between different TTS engines.
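
To make the hot-swapping idea concrete, here is a minimal sketch of how a backend could keep one engine loaded and replace it on request. It is not Voicebox's actual code: TTSEngine, EngineRegistry, and EchoEngine are hypothetical names chosen for illustration, and the stand-in engine returns tagged bytes instead of audio.

```python
from typing import Callable, Dict, Optional, Protocol


class TTSEngine(Protocol):
    """Hypothetical engine interface (an assumption, not Voicebox's API)."""

    def synthesize(self, text: str) -> bytes: ...

    def unload(self) -> None: ...


class EngineRegistry:
    """Keeps at most one engine active and swaps it on request."""

    def __init__(self) -> None:
        self._factories: Dict[str, Callable[[], TTSEngine]] = {}
        self._active_name: Optional[str] = None
        self._active: Optional[TTSEngine] = None

    def register(self, name: str, factory: Callable[[], TTSEngine]) -> None:
        # Store a factory so the model is only loaded when it is selected.
        self._factories[name] = factory

    def switch(self, name: str) -> None:
        if name not in self._factories:
            raise KeyError(f"unknown engine: {name}")
        if name == self._active_name:
            return  # already active, nothing to do
        # Build the new engine first so a failed load leaves the old one usable.
        new_engine = self._factories[name]()
        if self._active is not None:
            self._active.unload()
        self._active, self._active_name = new_engine, name

    def synthesize(self, text: str) -> bytes:
        if self._active is None:
            raise RuntimeError("no engine selected")
        return self._active.synthesize(text)


class EchoEngine:
    """Stand-in engine for the demo: returns tagged bytes, not real audio."""

    def __init__(self, tag: str) -> None:
        self.tag = tag
        self.loaded = True

    def synthesize(self, text: str) -> bytes:
        return f"{self.tag}:{text}".encode()

    def unload(self) -> None:
        self.loaded = False


registry = EngineRegistry()
registry.register("engine-a", lambda: EchoEngine("a"))
registry.register("engine-b", lambda: EchoEngine("b"))
registry.switch("engine-a")
first_clip = registry.synthesize("hello")
registry.switch("engine-b")
second_clip = registry.synthesize("hello")
```

Loading the new engine before unloading the old one keeps the studio usable if a load fails, at the cost of briefly holding two models in memory. On a GPU with limited VRAM the opposite order may be the better trade-off.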

Self-Hosting & Configuration

  • Clone the repository and install Node.js and Python dependencies
  • Install CUDA toolkit for NVIDIA GPUs or use MLX on Apple Silicon
  • Download voice model checkpoints via the built-in model manager
  • Configure default voice profiles and output format in settings
  • Optionally run headless as an API server for integration with other tools
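
The CUDA-or-MLX choice in the steps above can be sketched as a small helper. This is an illustration under stated assumptions, not part of Voicebox: pick_accelerator and detect_accelerator are hypothetical names, the caller supplies whether CUDA is available, and the "cpu" return value is only a placeholder for "no supported GPU found" (the source does not say whether CPU-only inference is supported).

```python
import platform


def pick_accelerator(system: str, machine: str, cuda_available: bool) -> str:
    """Return "mlx", "cuda", or "cpu" for the described host.

    Hypothetical helper: the labels are illustrative, not Voicebox settings.
    """
    # Apple Silicon reports system "Darwin" and machine "arm64".
    if system == "Darwin" and machine == "arm64":
        return "mlx"
    if cuda_available:
        return "cuda"
    # Placeholder for "no supported GPU"; CPU support is an assumption.
    return "cpu"


def detect_accelerator(cuda_available: bool = False) -> str:
    """Apply pick_accelerator to the machine this script runs on."""
    return pick_accelerator(platform.system(), platform.machine(), cuda_available)


print(detect_accelerator())
```

Keeping the decision in a pure function that takes the host description as arguments makes it easy to check each branch without access to the matching hardware.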

Key Features

  • Voice cloning from audio samples as short as 10 seconds
  • Multiple TTS backends with one-click switching
  • Real-time waveform preview and audio editing
  • Batch text-to-speech for processing scripts and documents
  • Local-first architecture with no data leaving your machine
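
Batch text-to-speech usually starts by cutting a long script into pieces a model can handle one at a time. The sketch below shows one simple way to do that; split_script and its 200-character default are assumptions made for illustration, not Voicebox's actual chunking logic.

```python
import re
from typing import List


def split_script(text: str, max_chars: int = 200) -> List[str]:
    """Group whole sentences into chunks of up to max_chars characters.

    A single sentence longer than max_chars is kept whole rather than cut.
    """
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow, so close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


demo_chunks = split_script("One. Two. Three.", max_chars=9)
print(demo_chunks)  # ['One. Two.', 'Three.']
```

Splitting on sentence boundaries avoids cutting a clip in the middle of a phrase, which tends to matter more for speech output than hitting an exact chunk size.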

Comparison with Similar Tools

  • ElevenLabs — cloud-based voice API; Voicebox is fully local and open-source
  • Bark — generates speech with effects; Voicebox provides a full studio interface
  • Kokoro — lightweight TTS model; Voicebox wraps multiple backends in a rich UI
  • F5-TTS — flow-matching synthesis; Voicebox integrates it as one of several engines

FAQ

Q: What GPU is required? A: An NVIDIA GPU with at least 6 GB of VRAM or an Apple Silicon Mac with MLX support is recommended.

Q: How long does voice cloning take? A: Cloning a voice profile from a 10-second sample typically completes in under a minute.

Q: Can I use cloned voices commercially? A: The software is open-source, but you are responsible for ensuring you have consent and legal rights for any voice you clone.

Q: Does it support real-time synthesis? A: Yes, streaming synthesis is available for interactive applications.
