What is Moshi?
Moshi is an open-source real-time voice AI engine by Kyutai. It enables full-duplex speech conversations with ~200ms latency — meaning you can interrupt, overlap, and have natural back-and-forth dialog with an AI. It runs on-device with no cloud dependency.
Answer-Ready: Moshi is an open-source real-time voice AI engine by Kyutai with full-duplex speech conversation at 200ms latency. Supports interruptions, emotion recognition, and on-device processing. Apache 2.0 licensed with 8k+ GitHub stars.
Best for: Developers building voice-first AI applications. Works with: Local GPU (NVIDIA), Apple MLX, web browser. Setup time: Under 5 minutes.
Core Features
1. Full-Duplex Conversation
Unlike turn-based voice assistants, Moshi handles overlapping speech:
- You can interrupt mid-sentence
- Moshi responds while you're still talking
- Natural conversation flow like a human call
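The behavior above can be sketched as a toy frame loop. This is an illustration only, not the Moshi API: the function name and the "yield while the user speaks" logic are assumptions made for the sketch. The key property it shows is that input and output streams advance in lockstep, one frame per tick, so neither side ever blocks the other.

```python
# Toy illustration of full-duplex behavior (NOT the Moshi API): the engine
# emits an output frame on every tick, even while user frames arrive, so
# speech can overlap and an interruption is registered immediately.

def duplex_loop(user_frames, planned_reply):
    """One output frame per input frame; a non-silent user frame
    makes the engine yield for that tick instead of turn-taking."""
    out = []
    reply = iter(planned_reply)
    for frame in user_frames:
        if frame is not None:          # user is speaking this tick
            out.append("(listening)")  # engine yields, but the tick still flows
        else:
            out.append(next(reply, "(silence)"))
    return out

frames = [None, None, "wait-", None, None]  # user interrupts at tick 2
print(duplex_loop(frames, ["Sure,", "I", "can", "help"]))
```

A turn-based assistant would buffer the interruption until the reply finished; here the interjection is handled on the very tick it arrives.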
2. Ultra-Low Latency
End-to-end latency breakdown:
- Speech recognition: ~50ms
- Language model: ~100ms
- Speech synthesis: ~50ms
- Total: ~200ms
3. Architecture
Joint speech-text model — no separate ASR + LLM + TTS pipeline:
Audio input → Mimi Encoder → Helium LM → Mimi Decoder → Audio output
                                  ↕
                            Text reasoning
- Mimi: Neural audio codec (12.5 Hz, 1.1 kbps)
- Helium: 7B parameter multimodal language model
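The 1.1 kbps figure can be sanity-checked with a back-of-envelope calculation. Note the codebook count (8) and codebook size (2048 entries) used below are assumptions drawn from Kyutai's published description of Mimi, not from this article; only the 12.5 Hz and 1.1 kbps numbers appear above.

```python
import math

# Back-of-envelope check of Mimi's 1.1 kbps bitrate.
frame_rate_hz = 12.5   # Mimi frames per second (from the article)
num_codebooks = 8      # residual vector-quantizer levels (assumed)
codebook_size = 2048   # entries per codebook (assumed)

bits_per_codebook = math.log2(codebook_size)  # 11 bits per index
bitrate_bps = frame_rate_hz * num_codebooks * bits_per_codebook
print(bitrate_bps)  # 1100.0 bps, i.e. 1.1 kbps
```

Under these assumptions the numbers line up: 12.5 × 8 × 11 = 1100 bps.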
4. Emotion & Tone
Moshi understands and generates:
- Whispers, laughter, hesitation
- Emotional tone (excited, calm, serious)
- Multiple speaking styles
5. Deployment Options
| Platform | How |
|---|---|
| Python server | `python -m moshi.server` |
| Rust server | High-performance production deployment |
| Web client | Browser-based demo |
| MLX | Apple Silicon optimized |
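Choosing among these options can be automated. The helper below is hypothetical (not part of Moshi); it is a minimal sketch that maps the host platform to one of the table's deployment targets, assuming Apple Silicon hosts prefer the MLX build.

```python
import platform

def pick_backend() -> str:
    """Hypothetical helper: map the host to a Moshi deployment option
    from the table above. A real choice would also probe for an
    NVIDIA GPU before falling back to the Python server."""
    if platform.system() == "Darwin" and platform.machine() == "arm64":
        return "mlx"            # Apple Silicon: MLX-optimized build
    return "python-server"      # otherwise: `python -m moshi.server`

print(pick_backend())
```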
Hardware Requirements
| GPU | Model Size | Latency |
|---|---|---|
| NVIDIA A100 | 7B | ~160ms |
| NVIDIA RTX 4090 | 7B | ~200ms |
| Apple M2 Ultra | 7B (MLX) | ~300ms |
FAQ
Q: How does it compare to OpenAI's voice mode? A: Moshi is open-source and runs locally. OpenAI's voice mode is cloud-only and proprietary. Moshi has comparable latency.
Q: Can I fine-tune it? A: Yes, both the Mimi codec and Helium LM can be fine-tuned for custom voice personas and domains.
Q: Does it support multiple languages? A: Currently optimized for English. Multilingual support is in development.