Configs · Apr 7, 2026 · 2 min read

Moshi — Real-Time AI Voice Conversation Engine

Open-source real-time voice AI by Kyutai. Full-duplex speech conversation with 200ms latency, emotion recognition, and on-device processing. Apache 2.0 licensed.

What is Moshi?

Moshi is an open-source real-time voice AI engine by Kyutai. It enables full-duplex speech conversations with ~200ms latency — meaning you can interrupt, overlap, and have natural back-and-forth dialog with an AI. It runs on-device with no cloud dependency.

Answer-Ready: Moshi is an open-source real-time voice AI engine by Kyutai with full-duplex speech conversation at 200ms latency. Supports interruptions, emotion recognition, and on-device processing. Apache 2.0 licensed with 8k+ GitHub stars.

Best for: Developers building voice-first AI applications. Works with: Local GPU (NVIDIA), Apple MLX, web browser. Setup time: Under 5 minutes.

Core Features

1. Full-Duplex Conversation

Unlike turn-based voice assistants, Moshi handles overlapping speech:

  • You can interrupt mid-sentence
  • Moshi responds while you're still talking
  • Natural conversation flow like a human call
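As a toy illustration (not the Moshi API), full-duplex processing can be pictured as a loop that consumes one user audio frame and emits one model frame in the same timestep — there is no turn-taking barrier, so overlap and interruption fall out naturally:

```python
def full_duplex_loop(user_frames):
    """Toy model of full-duplex dialog: each timestep consumes one user
    frame and emits one model frame, with no end-of-turn barrier."""
    model_frames = []
    for frame in user_frames:
        # Per-frame decision: back off while the user speaks (so an
        # interruption takes effect immediately), speak otherwise.
        model_frames.append("silence" if frame == "speech" else "speech")
    return model_frames

# One model frame per user frame, decided per timestep:
print(full_duplex_loop(["speech", "speech", "silence", "silence"]))
# → ['silence', 'silence', 'speech', 'speech']
```

A turn-based assistant, by contrast, would wait for the whole user utterance before producing any output at all.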

2. Ultra-Low Latency

End-to-end latency breakdown:

Speech recognition:  ~50ms
Language model:      ~100ms
Speech synthesis:    ~50ms
Total:               ~200ms
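The budget above sums as stated:

```python
# End-to-end latency budget from the breakdown above (milliseconds).
stages = {
    "speech recognition": 50,
    "language model": 100,
    "speech synthesis": 50,
}
total_ms = sum(stages.values())
print(total_ms)  # → 200
```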

3. Architecture

Joint speech-text model — no separate ASR + LLM + TTS pipeline:

Audio input → Mimi Encoder → Helium LM → Mimi Decoder → Audio output
                                ↕
                          Text reasoning
  • Mimi: Neural audio codec (12.5 Hz, 1.1 kbps)
  • Helium: 7B parameter multimodal language model
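The two Mimi figures above also pin down the per-frame budget (the derived numbers below are arithmetic from those figures, not stated in the source):

```python
# Mimi codec figures from the list above: 12.5 Hz frame rate, 1.1 kbps.
frame_rate_hz = 12.5
bitrate_bps = 1100

frame_ms = 1000 / frame_rate_hz               # audio covered per frame
bits_per_frame = bitrate_bps / frame_rate_hz  # bit budget per frame
print(frame_ms, bits_per_frame)  # → 80.0 88.0
```

So each Mimi frame represents 80 ms of audio in 88 bits — the aggressive compression that lets the Helium LM operate directly on audio tokens.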

4. Emotion & Tone

Moshi understands and generates:

  • Whispers, laughter, hesitation
  • Emotional tone (excited, calm, serious)
  • Multiple speaking styles

5. Deployment Options

Platform        How
Python server   python -m moshi.server
Rust server     High-performance production deployment
Web client      Browser-based demo
MLX             Apple Silicon optimized

Hardware Requirements

GPU               Model Size  Latency
NVIDIA A100       7B          ~160ms
NVIDIA RTX 4090   7B          ~200ms
Apple M2 Ultra    7B (MLX)    ~300ms

FAQ

Q: How does it compare to OpenAI's voice mode? A: Moshi is open-source and runs locally. OpenAI's voice mode is cloud-only and proprietary. Moshi has comparable latency.

Q: Can I fine-tune it? A: Yes, both the Mimi codec and Helium LM can be fine-tuned for custom voice personas and domains.

Q: Does it support multiple languages? A: Currently optimized for English. Multilingual support is in development.


Source and acknowledgments

Created by Kyutai. Licensed under Apache 2.0.

kyutai-labs/moshi — 8k+ stars

