Esta página se muestra en inglés. Una traducción al español está en curso.
ScriptsJul 2, 2026·3 min de lectura

Speech-to-Speech — Open-Source Voice AI Agent Builder by Hugging Face

Build local voice-driven AI agents with a modular pipeline connecting speech recognition, language models, and text-to-speech synthesis.

Listo para agents

Instalación lista para agent

Este activo puede instalarse después de elegir el runtime, revisar el plan y ejecutar el comando correspondiente.

Native · 98/100Política: permitir
Superficie agent
Cualquier agent MCP/CLI
Tipo
Skill
Instalación
Single
Confianza
Confianza: Established
Entrada
Speech-to-Speech Overview
Comando de instalación directa
npx -y tokrepo@latest install 41588400-7658-11f1-9bc6-00163e2b0d79 --target codex

Ejecutar después de confirmar el plan con dry-run.

Introduction

Speech-to-Speech (S2S) is an open-source project by Hugging Face that chains speech recognition, a language model, and text-to-speech into a real-time voice conversation pipeline. It runs locally and supports swapping individual components to customize the voice agent.

What Speech-to-Speech Does

  • Captures audio input and transcribes it using Whisper or other ASR models
  • Passes the transcript to an LLM (local or API-based) for response generation
  • Synthesizes the LLM response into speech using TTS models like Parler-TTS or MeloTTS
  • Handles voice activity detection to manage turn-taking in conversations
  • Streams audio output for low-latency conversational interaction

Architecture Overview

The pipeline consists of four modular stages: Voice Activity Detection (VAD) detects when the user is speaking, Speech-to-Text (STT) transcribes the audio, a Language Model (LM) generates a text response, and Text-to-Speech (TTS) synthesizes the reply. Each stage runs as an independent module communicating via queues, allowing components to be swapped independently. The pipeline supports both local models (via Transformers) and API-based models (OpenAI, Anthropic) for the LM stage.

Self-Hosting & Configuration

  • Clone the repository and install dependencies with pip on Python 3.10+
  • Requires a CUDA-capable GPU for real-time performance with local models
  • Configure the STT model with --stt-model-name (default: Whisper distil-large-v3)
  • Set the LLM with --llm-model-name for local models or --llm-url for API endpoints
  • Choose a TTS engine with --tts-model-name (Parler-TTS, MeloTTS, or others)

Key Features

  • Fully local operation with no cloud dependencies when using open-source models
  • Modular design lets you swap STT, LLM, and TTS components independently
  • Voice Activity Detection with configurable thresholds for natural turn-taking
  • Streaming output reduces perceived latency during conversations
  • Multi-language support depending on the chosen STT and TTS models

Comparison with Similar Tools

  • LiveKit Agents — Cloud-native voice agent framework; S2S is simpler and runs locally
  • Pipecat — Real-time voice AI framework with WebRTC; S2S focuses on local pipeline simplicity
  • Moshi — End-to-end speech model; S2S uses a modular pipeline of separate components
  • Vocode — Voice agent platform with telephony integrations; S2S is a lightweight local pipeline
  • Whisper + GPT + TTS — Manual integration of the same components; S2S provides a ready-made pipeline

FAQ

Q: Can this run without a GPU? A: CPU inference is possible but significantly slower. A CUDA GPU is recommended for real-time conversation.

Q: What languages does it support? A: Language support depends on the STT and TTS models chosen. Whisper supports 99 languages for transcription; TTS language coverage varies by model.

Q: Can I use commercial LLM APIs instead of local models? A: Yes. Set the --llm-url flag to point at an OpenAI-compatible API endpoint.

Q: Is there a web interface? A: The default interface is a terminal-based audio pipeline. Community forks add Gradio and web-based UIs.

Sources

Discusión

Inicia sesión para unirte a la discusión.
Aún no hay comentarios. Sé el primero en compartir tus ideas.

Activos relacionados