Introduction
Speech-to-Speech (S2S) is an open-source project by Hugging Face that chains speech recognition, a language model, and text-to-speech into a real-time voice conversation pipeline. It runs locally and supports swapping individual components to customize the voice agent.
What Speech-to-Speech Does
- Captures audio input and transcribes it using Whisper or other ASR models
- Passes the transcript to an LLM (local or API-based) for response generation
- Synthesizes the LLM response into speech using TTS models like Parler-TTS or MeloTTS
- Handles voice activity detection to manage turn-taking in conversations
- Streams audio output for low-latency conversational interaction
Architecture Overview
The pipeline consists of four modular stages: Voice Activity Detection (VAD) detects when the user is speaking, Speech-to-Text (STT) transcribes the audio, a Language Model (LM) generates a text response, and Text-to-Speech (TTS) synthesizes the reply. Each stage runs as an independent module communicating via queues, allowing components to be swapped independently. The pipeline supports both local models (via Transformers) and API-based models (OpenAI, Anthropic) for the LM stage.
Self-Hosting & Configuration
- Clone the repository and install dependencies with pip on Python 3.10+
- Requires a CUDA-capable GPU for real-time performance with local models
- Configure the STT model with
--stt-model-name(default: Whisper distil-large-v3) - Set the LLM with
--llm-model-namefor local models or--llm-urlfor API endpoints - Choose a TTS engine with
--tts-model-name(Parler-TTS, MeloTTS, or others)
Key Features
- Fully local operation with no cloud dependencies when using open-source models
- Modular design lets you swap STT, LLM, and TTS components independently
- Voice Activity Detection with configurable thresholds for natural turn-taking
- Streaming output reduces perceived latency during conversations
- Multi-language support depending on the chosen STT and TTS models
Comparison with Similar Tools
- LiveKit Agents — Cloud-native voice agent framework; S2S is simpler and runs locally
- Pipecat — Real-time voice AI framework with WebRTC; S2S focuses on local pipeline simplicity
- Moshi — End-to-end speech model; S2S uses a modular pipeline of separate components
- Vocode — Voice agent platform with telephony integrations; S2S is a lightweight local pipeline
- Whisper + GPT + TTS — Manual integration of the same components; S2S provides a ready-made pipeline
FAQ
Q: Can this run without a GPU? A: CPU inference is possible but significantly slower. A CUDA GPU is recommended for real-time conversation.
Q: What languages does it support? A: Language support depends on the STT and TTS models chosen. Whisper supports 99 languages for transcription; TTS language coverage varies by model.
Q: Can I use commercial LLM APIs instead of local models?
A: Yes. Set the --llm-url flag to point at an OpenAI-compatible API endpoint.
Q: Is there a web interface? A: The default interface is a terminal-based audio pipeline. Community forks add Gradio and web-based UIs.