Speech-to-Speech — Open-Source Voice AI Agent Builder by Hugging Face

Introduction

Speech-to-Speech (S2S) is an open-source project by Hugging Face that chains speech recognition, a language model, and text-to-speech into a real-time voice conversation pipeline. It runs locally and supports swapping individual components to customize the voice agent.

What Speech-to-Speech Does

Captures audio input and transcribes it using Whisper or other ASR models
Passes the transcript to an LLM (local or API-based) for response generation
Synthesizes the LLM response into speech using TTS models like Parler-TTS or MeloTTS
Handles voice activity detection to manage turn-taking in conversations
Streams audio output for low-latency conversational interaction

Architecture Overview

The pipeline consists of four modular stages: Voice Activity Detection (VAD) detects when the user is speaking, Speech-to-Text (STT) transcribes the audio, a Language Model (LM) generates a text response, and Text-to-Speech (TTS) synthesizes the reply. Each stage runs as an independent module communicating via queues, allowing components to be swapped independently. The pipeline supports both local models (via Transformers) and API-based models (OpenAI, Anthropic) for the LM stage.

Self-Hosting & Configuration

Clone the repository and install dependencies with pip on Python 3.10+
Requires a CUDA-capable GPU for real-time performance with local models
Configure the STT model with --stt-model-name (default: Whisper distil-large-v3)
Set the LLM with --llm-model-name for local models or --llm-url for API endpoints
Choose a TTS engine with --tts-model-name (Parler-TTS, MeloTTS, or others)

Key Features

Fully local operation with no cloud dependencies when using open-source models
Modular design lets you swap STT, LLM, and TTS components independently
Voice Activity Detection with configurable thresholds for natural turn-taking
Streaming output reduces perceived latency during conversations
Multi-language support depending on the chosen STT and TTS models

Comparison with Similar Tools

LiveKit Agents — Cloud-native voice agent framework; S2S is simpler and runs locally
Pipecat — Real-time voice AI framework with WebRTC; S2S focuses on local pipeline simplicity
Moshi — End-to-end speech model; S2S uses a modular pipeline of separate components
Vocode — Voice agent platform with telephony integrations; S2S is a lightweight local pipeline
Whisper + GPT + TTS — Manual integration of the same components; S2S provides a ready-made pipeline

FAQ

Q: Can this run without a GPU? A: CPU inference is possible but significantly slower. A CUDA GPU is recommended for real-time conversation.

Q: What languages does it support? A: Language support depends on the STT and TTS models chosen. Whisper supports 99 languages for transcription; TTS language coverage varies by model.

Q: Can I use commercial LLM APIs instead of local models? A: Yes. Set the --llm-url flag to point at an OpenAI-compatible API endpoint.

Q: Is there a web interface? A: The default interface is a terminal-based audio pipeline. Community forks add Gradio and web-based UIs.

Speech-to-Speech — Open-Source Voice AI Agent Builder by Hugging Face

Instalación lista para agent

Introduction

What Speech-to-Speech Does

Architecture Overview

Self-Hosting & Configuration

Key Features

Comparison with Similar Tools

FAQ

Sources

Discusión

Activos relacionados

OmniVoice Studio — Open-Source Voice Cloning and TTS Desktop App

Voicebox — Open-Source AI Voice Studio

Chatterbox — State-of-the-Art Open Source Text-to-Speech

SpeechBrain — Open-Source All-in-One Speech and Audio Processing Toolkit