What is Moshi?
Moshi is Kyutai's open-source real-time voice AI model. It supports full-duplex conversation at roughly 200 ms end-to-end latency, handles interruptions and emotional cues such as whispers and laughter, and runs entirely on local hardware.
In one sentence: open-source, full-duplex voice AI at ~200 ms latency, with interruption and emotion support, runnable locally (8k+ GitHub stars).
For: Developers building voice-first AI applications.
Core Features
1. Full-Duplex Conversation
Supports interruption and overlapping speech, like natural conversation: Moshi listens and speaks on concurrent audio streams rather than taking strict turns.
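The listen-while-speaking pattern can be sketched with two concurrent coroutines. This is a toy illustration of the full-duplex idea, not Moshi's actual API: the names (`listen`, `speak`, the `[interrupts]` marker) are invented for the example, and simulated queues stand in for real audio streams.

```python
import asyncio

async def full_duplex_demo() -> list[str]:
    """Toy full-duplex loop: input and output run at the same time."""
    events: list[str] = []
    interrupted = asyncio.Event()

    async def listen() -> None:
        # Stand-in for a microphone stream: the user talks, then interrupts.
        for chunk in ("hello", "[interrupts]"):
            await asyncio.sleep(0.02)
            events.append(f"heard: {chunk}")
            if chunk == "[interrupts]":
                interrupted.set()  # signal the speaker to stop immediately

    async def speak() -> None:
        # Stand-in for audio playback: keeps emitting until interrupted.
        while not interrupted.is_set():
            events.append("speaking")
            await asyncio.sleep(0.01)

    # Both directions run concurrently -- the essence of full duplex.
    await asyncio.gather(listen(), speak())
    return events

events = asyncio.run(full_duplex_demo())
```

In the resulting event log, "speaking" entries interleave with "heard" entries, and playback stops as soon as the interruption arrives; a half-duplex (turn-taking) loop could not react until its own turn ended.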
2. 200ms Latency
Roughly 200 ms end-to-end latency, achieved without a round trip to the cloud.
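A rough sense of where the 200 ms figure comes from: Moshi's Mimi codec operates on 12.5 Hz frames (80 ms of audio per frame), so two frame durations is a natural theoretical floor for a frame-synchronous model. The overhead term below is an illustrative assumption for this sketch, not a measured benchmark.

```python
# Back-of-the-envelope latency budget for a frame-synchronous voice model.
# Mimi encodes audio at 12.5 Hz, i.e. one frame every 80 ms.
frame_hz = 12.5
frame_ms = 1000 / frame_hz          # 80.0 ms per frame

# The model must ingest a full input frame and emit a full output frame,
# so two frame durations is the theoretical floor (assuming no lookahead).
theoretical_floor_ms = 2 * frame_ms  # 160.0 ms

# Compute and audio-buffering overhead (illustrative assumption: ~40 ms)
# brings the practical figure to roughly the quoted 200 ms.
overhead_ms = 40
practical_ms = theoretical_floor_ms + overhead_ms
print(practical_ms)
```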
3. Emotion and Tone
Understands and generates whispers, laughter, hesitation, and more.
4. Local Deployment
Multi-platform support: NVIDIA GPUs (PyTorch), Apple Silicon (MLX), and the browser.
FAQ
Q: How does it compare to OpenAI's voice mode? A: Moshi is open source and runs locally; OpenAI's voice mode is cloud-based and closed. Latency is comparable.
Q: Does it support Chinese? A: Currently English-first; multi-language is in development.