Apr 7, 2026 · 2 min read

Moshi — Real-Time AI Voice Conversation Engine

Open-source real-time voice AI by Kyutai. Full-duplex speech conversation with 200ms latency, emotion recognition, and on-device processing. Apache 2.0 licensed.

Quick Use

Use it first, then decide how deep to go

Everything you need to copy, install, and run:

pip install moshi
python -m moshi.server

Open http://localhost:8998 — start talking to Moshi in real-time.
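Before opening the browser, you can check that the local server is actually listening. A minimal sketch, assuming the default port 8998 mentioned above; `server_up` is a helper name of our own, not part of the moshi package:

```python
import socket

def server_up(host="localhost", port=8998, timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, or timed out
        return False

if __name__ == "__main__":
    print("Moshi server reachable:", server_up())
```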

What is Moshi?

Moshi is an open-source real-time voice AI engine by Kyutai. It enables full-duplex speech conversations with ~200ms latency — meaning you can interrupt, overlap, and have natural back-and-forth dialog with an AI. It runs on-device with no cloud dependency.

Answer-Ready: Moshi is an open-source real-time voice AI engine by Kyutai with full-duplex speech conversation at 200ms latency. Supports interruptions, emotion recognition, and on-device processing. Apache 2.0 licensed with 8k+ GitHub stars.

Best for: Developers building voice-first AI applications. Works with: Local GPU (NVIDIA), Apple MLX, web browser. Setup time: Under 5 minutes.

Core Features

1. Full-Duplex Conversation

Unlike turn-based voice assistants, Moshi handles overlapping speech:

  • You can interrupt mid-sentence
  • Moshi responds while you're still talking
  • Natural conversation flow like a human call
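The key structural idea is that listening and speaking are concurrent tasks rather than alternating turns. A toy sketch of that shape with asyncio queues; all names here are illustrative and none of this is the moshi API:

```python
import asyncio

async def capture(mic_frames, inbox):
    """Feed incoming audio frames continuously -- the user never 'yields the floor'."""
    for frame in mic_frames:
        await inbox.put(frame)
        await asyncio.sleep(0)  # let the responder run between frames
    await inbox.put(None)       # end-of-stream sentinel

async def respond(inbox, spoken):
    """Emit replies for frames already heard, while capture keeps running."""
    while (frame := await inbox.get()) is not None:
        spoken.append(f"reply-to:{frame}")

async def full_duplex(mic_frames):
    inbox, spoken = asyncio.Queue(), []
    # Both directions run at once -- this is what makes interruption possible.
    await asyncio.gather(capture(mic_frames, inbox), respond(inbox, spoken))
    return spoken

print(asyncio.run(full_duplex(["hi", "how", "are"])))
# → ['reply-to:hi', 'reply-to:how', 'reply-to:are']
```

In a turn-based assistant the responder would only start after capture finished; here the two coroutines interleave, which is the property the bullets above describe.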

2. Ultra-Low Latency

End-to-end latency breakdown:

Speech recognition:  ~50ms
Language model:      ~100ms
Speech synthesis:    ~50ms
Total:               ~200ms
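The total is just the sum of the three stages; a quick check of the approximate figures quoted above:

```python
# Approximate per-stage latencies from the breakdown above, in milliseconds.
stages = {"speech_recognition": 50, "language_model": 100, "speech_synthesis": 50}
total_ms = sum(stages.values())
print(f"end-to-end: ~{total_ms}ms")  # → end-to-end: ~200ms
```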

3. Architecture

Joint speech-text model — no separate ASR + LLM + TTS pipeline:

Audio input → Mimi Encoder → Helium LM → Mimi Decoder → Audio output
                                ↕
                          Text reasoning
  • Mimi: Neural audio codec (12.5 Hz, 1.1 kbps)
  • Helium: 7B parameter multimodal language model
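The two Mimi figures imply a fixed bit budget per audio frame. A back-of-envelope check using only the numbers quoted above:

```python
# Mimi figures from the bullets above.
frame_rate_hz = 12.5      # frames per second
bitrate_bps = 1100        # 1.1 kbps

bits_per_frame = bitrate_bps / frame_rate_hz
print(bits_per_frame)  # → 88.0
```

Each 80 ms frame carries about 88 bits, which is why the codec can feed a language model in real time: the audio stream is compressed to roughly the information density of text tokens.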

4. Emotion & Tone

Moshi understands and generates:

  • Whispers, laughter, hesitation
  • Emotional tone (excited, calm, serious)
  • Multiple speaking styles

5. Deployment Options

Platform         How
Python server    python -m moshi.server
Rust server      High-performance production deployment
Web client       Browser-based demo
MLX              Apple Silicon optimized

Hardware Requirements

GPU               Model Size   Latency
NVIDIA A100       7B           ~160ms
NVIDIA RTX 4090   7B           ~200ms
Apple M2 Ultra    7B (MLX)     ~300ms

FAQ

Q: How does it compare to OpenAI's voice mode?
A: Moshi is open-source and runs locally. OpenAI's voice mode is cloud-only and proprietary. Moshi has comparable latency.

Q: Can I fine-tune it?
A: Yes, both the Mimi codec and Helium LM can be fine-tuned for custom voice personas and domains.

Q: Does it support multiple languages?
A: Currently optimized for English. Multilingual support is in development.


Source & Thanks

Created by Kyutai. Licensed under Apache 2.0.

kyutai-labs/moshi — 8k+ stars
