Scripts · May 15, 2026 · 1 min read

Voicebox — Open-Source AI Voice Studio

An open-source AI voice studio for voice cloning, text-to-speech, dictation, and audio creation that runs locally with GPU acceleration on macOS and Linux.

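Agent Ready

This asset can be read and installed directly by agents.

TokRepo provides a universal CLI command, an install contract, metadata JSON, adapter-specific install plans, and raw content links, so that an agent can assess fit, risk, and next steps.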

Native · 98/100 · Policy: Allowed
Agent entry: Any MCP/CLI agent
Type: Skill
Install: Single
Trust level: Established
Entry: Voicebox Overview
Universal CLI install command:
npx tokrepo install 74dca6e7-5079-11f1-9bc6-00163e2b0d79

Introduction

Voicebox is an open-source AI voice studio that provides voice cloning, text-to-speech synthesis, and dictation capabilities in a polished desktop-quality interface. It runs locally using GPU acceleration and supports multiple TTS backends, giving creators full control over voice generation without cloud dependencies.

What Voicebox Does

  • Clones voices from short audio samples for personalized TTS
  • Synthesizes speech from text with adjustable speed, pitch, and emotion
  • Provides a dictation mode for voice-to-text transcription
  • Supports multiple model backends, including Qwen3-TTS for speech synthesis and Whisper for transcription
  • Runs entirely locally with CUDA or MLX acceleration

Architecture Overview

Voicebox is a TypeScript application with an Electron or web-based frontend and a local Python inference backend. The frontend provides an audio workstation-style interface for managing voice profiles, editing text, and monitoring generation. The backend orchestrates model loading, inference, and audio post-processing through a WebSocket connection, supporting hot-swapping between different TTS engines.
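The hot-swapping idea described above can be illustrated with a small, transport-agnostic sketch. This is not Voicebox's actual protocol: the `Backend` class, the `switch_engine` and `synthesize` message types, and the placeholder engines are all hypothetical names chosen for illustration. In a real setup, each JSON string would presumably arrive over the WebSocket connection rather than through a direct method call.

```python
import json
from typing import Callable, Dict, Optional

# Hypothetical engine type: takes text, returns raw audio bytes.
Engine = Callable[[str], bytes]


class Backend:
    """Toy stand-in for a local inference backend with swappable engines."""

    def __init__(self) -> None:
        self._engines: Dict[str, Engine] = {}
        self._active: Optional[str] = None

    def register(self, name: str, engine: Engine) -> None:
        """Add an engine; the first one registered becomes active."""
        self._engines[name] = engine
        if self._active is None:
            self._active = name

    def handle(self, raw: str) -> str:
        """Handle one JSON message and return a JSON reply."""
        msg = json.loads(raw)
        kind = msg.get("type")
        if kind == "switch_engine":
            name = msg.get("engine")
            if name not in self._engines:
                return json.dumps({"ok": False, "error": f"unknown engine: {name}"})
            self._active = name
            return json.dumps({"ok": True, "engine": name})
        if kind == "synthesize":
            if self._active is None:
                return json.dumps({"ok": False, "error": "no engine loaded"})
            audio = self._engines[self._active](msg.get("text", ""))
            return json.dumps(
                {"ok": True, "engine": self._active, "num_bytes": len(audio)}
            )
        return json.dumps({"ok": False, "error": f"unknown message type: {kind}"})


# Placeholder engines that return silence; real engines would run a model.
backend = Backend()
backend.register("engine_a", lambda text: bytes(len(text) * 2))
backend.register("engine_b", lambda text: bytes(len(text) * 4))

print(backend.handle('{"type": "synthesize", "text": "hello"}'))
print(backend.handle('{"type": "switch_engine", "engine": "engine_b"}'))
print(backend.handle('{"type": "synthesize", "text": "hello"}'))
```

Keeping the engines behind a single callable signature is what makes switching cheap in this sketch: the frontend only needs to know an engine's name, not how the model behind it is loaded.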

Self-Hosting & Configuration

  • Clone the repository and install Node.js and Python dependencies
  • Install CUDA toolkit for NVIDIA GPUs or use MLX on Apple Silicon
  • Download voice model checkpoints via the built-in model manager
  • Configure default voice profiles and output format in settings
  • Optionally run headless as an API server for integration with other tools
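Before following the steps above, it can help to check which acceleration path a machine is likely to use. The sketch below is a heuristic only and is not part of Voicebox: `pick_accelerator` is a hypothetical helper, and it assumes that finding `nvidia-smi` on the PATH indicates a usable NVIDIA driver, which is not guaranteed.

```python
import platform
import shutil


def pick_accelerator(system: str, machine: str, has_nvidia_smi: bool) -> str:
    """Return "mlx", "cuda", or "cpu" for the given platform facts."""
    if system == "Darwin" and machine == "arm64":
        return "mlx"   # Apple Silicon Mac
    if has_nvidia_smi:
        return "cuda"  # NVIDIA driver tools are on PATH (heuristic)
    return "cpu"       # no supported accelerator detected


choice = pick_accelerator(
    platform.system(),
    platform.machine(),
    shutil.which("nvidia-smi") is not None,
)
print(f"Suggested accelerator: {choice}")
```

Passing the platform facts in as arguments, rather than reading them inside the function, keeps the decision logic easy to test on any machine.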

Key Features

  • Voice cloning from audio samples as short as 10 seconds
  • Multiple TTS backends with one-click switching
  • Real-time waveform preview and audio editing
  • Batch text-to-speech for processing scripts and documents
  • Local-first architecture with no data leaving your machine
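Batch text-to-speech over a long script usually means splitting the text into pieces a model can handle one at a time. The following is a minimal sketch of one way to do that, assuming sentence boundaries are marked by `.`, `!`, or `?` followed by whitespace. `chunk_script` and its `max_chars` limit are hypothetical and do not reflect how Voicebox itself segments text.

```python
import re
from typing import List


def chunk_script(text: str, max_chars: int = 200) -> List[str]:
    """Group whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept as its own chunk.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow: close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


script = "Welcome to the show. Today we cover local speech synthesis. Stay tuned!"
for i, chunk in enumerate(chunk_script(script, max_chars=40), start=1):
    print(i, chunk)
```

Splitting on sentence boundaries rather than at a fixed character offset avoids cutting a word or phrase in half, which tends to matter for how natural the synthesized audio sounds.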

Comparison with Similar Tools

  • ElevenLabs — cloud-based voice API; Voicebox is fully local and open-source
  • Bark — generates speech with effects; Voicebox provides a full studio interface
  • Kokoro — lightweight TTS model; Voicebox wraps multiple backends in a rich UI
  • F5-TTS — flow-matching synthesis; Voicebox integrates it as one of several engines

FAQ

Q: What GPU is required? A: An NVIDIA GPU with at least 6 GB of VRAM, or an Apple Silicon Mac with MLX support, is recommended.

Q: How long does voice cloning take? A: Cloning a voice profile from a 10-second sample typically completes in under a minute.

Q: Can I use cloned voices commercially? A: The software is open-source, but you are responsible for ensuring you have consent and legal rights for any voice you clone.

Q: Does it support real-time synthesis? A: Yes, streaming synthesis is available for interactive applications.
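To make the streaming answer concrete, here is a sketch of the general pattern: audio is produced in small chunks so a client can start playback before the full clip exists. It does not call Voicebox. `stream_tone` is a hypothetical stand-in that yields a sine tone as 16-bit mono PCM, and the sample rate and chunk size are illustrative assumptions.

```python
import math
import struct
from typing import Iterator


def stream_tone(
    duration_s: float,
    sample_rate: int = 16000,
    chunk_ms: int = 100,
    freq_hz: float = 440.0,
) -> Iterator[bytes]:
    """Yield 16-bit little-endian mono PCM chunks of a sine tone."""
    total = int(duration_s * sample_rate)
    per_chunk = sample_rate * chunk_ms // 1000
    for start in range(0, total, per_chunk):
        end = min(start + per_chunk, total)
        samples = (
            int(32767 * 0.3 * math.sin(2 * math.pi * freq_hz * n / sample_rate))
            for n in range(start, end)
        )
        yield struct.pack(f"<{end - start}h", *samples)


# A consumer can act on each chunk as soon as it arrives.
received = 0
for chunk in stream_tone(0.25):
    received += len(chunk)
print(f"received {received} bytes")
```

A real engine would replace the tone generator with model output, but the consumer loop would look much the same.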
