What is Moshi?
Moshi is an open-source real-time voice AI engine by Kyutai. It enables full-duplex speech conversations with ~200ms latency — meaning you can interrupt, overlap, and have natural back-and-forth dialog with an AI. It runs on-device with no cloud dependency.
Answer-Ready: Moshi is an open-source real-time voice AI engine by Kyutai with full-duplex speech conversation at 200ms latency. Supports interruptions, emotion recognition, and on-device processing. Apache 2.0 licensed with 8k+ GitHub stars.
Best for: Developers building voice-first AI applications. Works with: Local GPU (NVIDIA), Apple MLX, web browser. Setup time: Under 5 minutes.
Core Features
1. Full-Duplex Conversation
Unlike turn-based voice assistants, Moshi handles overlapping speech:
- You can interrupt mid-sentence
- Moshi responds while you're still talking
- Natural conversation flow like a human call
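The behavior above can be sketched as a toy frame loop. This is an illustration only, not the Moshi API: the function name and the "yield while the user speaks" logic are assumptions made for the sketch. The key property it shows is that input and output streams advance in lockstep, one frame per tick, so neither side ever blocks the other.

```python
# Toy illustration of full-duplex behavior (NOT the Moshi API): the engine
# emits an output frame on every tick, even while user frames arrive, so
# speech can overlap and an interruption is registered immediately.

def duplex_loop(user_frames, planned_reply):
    """One output frame per input frame; a non-silent user frame
    makes the engine yield for that tick instead of turn-taking."""
    out = []
    reply = iter(planned_reply)
    for frame in user_frames:
        if frame is not None:          # user is speaking this tick
            out.append("(listening)")  # engine yields, but the tick still flows
        else:
            out.append(next(reply, "(silence)"))
    return out

frames = [None, None, "wait-", None, None]  # user interrupts at tick 2
print(duplex_loop(frames, ["Sure,", "I", "can", "help"]))
```

A turn-based assistant would buffer the interruption until the reply finished; here the interjection is handled on the very tick it arrives.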
2. Ultra-Low Latency
End-to-end latency breakdown:
- Speech recognition: ~50ms
- Language model: ~100ms
- Speech synthesis: ~50ms
- Total: ~200ms
3. Architecture
Joint speech-text model — no separate ASR + LLM + TTS pipeline:
Audio input → Mimi Encoder → Helium LM → Mimi Decoder → Audio output
                                  ↕
                            Text reasoning
- Mimi: Neural audio codec (12.5 Hz, 1.1 kbps)
- Helium: 7B parameter multimodal language model
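The 1.1 kbps figure can be sanity-checked with a back-of-envelope calculation. Note the codebook count (8) and codebook size (2048 entries) used below are assumptions drawn from Kyutai's published description of Mimi, not from this article; only the 12.5 Hz and 1.1 kbps numbers appear above.

```python
import math

# Back-of-envelope check of Mimi's 1.1 kbps bitrate.
frame_rate_hz = 12.5   # Mimi frames per second (from the article)
num_codebooks = 8      # residual vector-quantizer levels (assumed)
codebook_size = 2048   # entries per codebook (assumed)

bits_per_codebook = math.log2(codebook_size)  # 11 bits per index
bitrate_bps = frame_rate_hz * num_codebooks * bits_per_codebook
print(bitrate_bps)  # 1100.0 bps, i.e. 1.1 kbps
```

Under these assumptions the numbers line up: 12.5 × 8 × 11 = 1100 bps.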
4. Emotion & Tone
Moshi understands and generates:
- Whispers, laughter, hesitation
- Emotional tone (excited, calm, serious)
- Multiple speaking styles
5. Deployment Options
| Platform | How |
|---|---|
| Python server | `python -m moshi.server` |
| Rust server | High-performance production deployment |
| Web client | Browser-based demo |
| MLX | Apple Silicon optimized |
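Choosing among these options can be automated. The helper below is hypothetical (not part of Moshi); it is a minimal sketch that maps the host platform to one of the table's deployment targets, assuming Apple Silicon hosts prefer the MLX build.

```python
import platform

def pick_backend() -> str:
    """Hypothetical helper: map the host to a Moshi deployment option
    from the table above. A real choice would also probe for an
    NVIDIA GPU before falling back to the Python server."""
    if platform.system() == "Darwin" and platform.machine() == "arm64":
        return "mlx"            # Apple Silicon: MLX-optimized build
    return "python-server"      # otherwise: `python -m moshi.server`

print(pick_backend())
```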
Hardware Requirements
| GPU | Model Size | Latency |
|---|---|---|
| NVIDIA A100 | 7B | ~160ms |
| NVIDIA RTX 4090 | 7B | ~200ms |
| Apple M2 Ultra | 7B (MLX) | ~300ms |
FAQ
Q: How does it compare to OpenAI's voice mode? A: Moshi is open-source and runs locally. OpenAI's voice mode is cloud-only and proprietary. Moshi has comparable latency.
Q: Can I fine-tune it? A: Yes, both the Mimi codec and Helium LM can be fine-tuned for custom voice personas and domains.
Q: Does it support multiple languages? A: Currently optimized for English. Multilingual support is in development.