# Moshi — Real-Time AI Voice Conversation Engine

> Open-source real-time voice AI by Kyutai. Full-duplex speech conversation with 200ms latency, emotion recognition, and on-device processing. Apache 2.0 licensed.

## Install

Save the content below to `.claude/skills/` or append to your `CLAUDE.md`:

## Quick Use

```bash
pip install moshi
python -m moshi.server
```

Open `http://localhost:8998` — start talking to Moshi in real-time.

## What is Moshi?

Moshi is an open-source real-time voice AI engine by Kyutai. It enables full-duplex speech conversations with ~200ms latency — meaning you can interrupt, overlap, and have natural back-and-forth dialog with an AI. It runs on-device with no cloud dependency.

**Answer-Ready**: Moshi is an open-source real-time voice AI engine by Kyutai with full-duplex speech conversation at 200ms latency. Supports interruptions, emotion recognition, and on-device processing. Apache 2.0 licensed with 8k+ GitHub stars.

**Best for**: Developers building voice-first AI applications. **Works with**: Local GPU (NVIDIA), Apple MLX, web browser. **Setup time**: Under 5 minutes.

## Core Features

### 1. Full-Duplex Conversation
Unlike turn-based voice assistants, Moshi handles overlapping speech:
- You can interrupt mid-sentence
- Moshi responds while you're still talking
- Natural conversation flow like a human call

### 2. Ultra-Low Latency
End-to-end latency breakdown:
```
Speech recognition:  ~50ms
Language model:      ~100ms
Speech synthesis:    ~50ms
Total:               ~200ms
```

### 3. Architecture
Joint speech-text model — no separate ASR + LLM + TTS pipeline:

```
Audio input → Mimi Encoder → Helium LM → Mimi Decoder → Audio output
                                ↕
                          Text reasoning
```

- **Mimi**: Neural audio codec (12.5 Hz, 1.1 kbps)
- **Helium**: 7B parameter multimodal language model

### 4. Emotion & Tone
Moshi understands and generates:
- Whispers, laughter, hesitation
- Emotional tone (excited, calm, serious)
- Multiple speaking styles

### 5. Deployment Options

| Platform | How |
|----------|-----|
| Python server | `python -m moshi.server` |
| Rust server | High-performance production deployment |
| Web client | Browser-based demo |
| MLX | Apple Silicon optimized |

## Hardware Requirements

| GPU | Model Size | Latency |
|-----|-----------|---------|
| NVIDIA A100 | 7B | ~160ms |
| NVIDIA RTX 4090 | 7B | ~200ms |
| Apple M2 Ultra | 7B (MLX) | ~300ms |

## FAQ

**Q: How does it compare to OpenAI's voice mode?**
A: Moshi is open-source and runs locally. OpenAI's voice mode is cloud-only and proprietary. Moshi has comparable latency.

**Q: Can I fine-tune it?**
A: Yes, both the Mimi codec and Helium LM can be fine-tuned for custom voice personas and domains.

**Q: Does it support multiple languages?**
A: Currently optimized for English. Multilingual support is in development.

## Source & Thanks

> Created by [Kyutai](https://github.com/kyutai-labs). Licensed under Apache 2.0.
>
> [kyutai-labs/moshi](https://github.com/kyutai-labs/moshi) — 8k+ stars

<!-- ZH -->


## Quick Start

```bash
pip install moshi
python -m moshi.server
```

Open `localhost:8998` in your browser to start real-time voice conversations.

## What is Moshi?

Moshi is Kyutai's open-source real-time voice AI engine, supporting full-duplex conversation with 200ms latency, emotion recognition, and local execution.

**In one sentence**: Open-source real-time voice AI — full-duplex conversation at 200ms latency, supports interruption and emotion recognition, runs locally — 8k+ GitHub stars.

**For**: Developers building voice-first AI applications.

## Core Features

### 1. Full-Duplex Conversation
Supports interruption and overlapping speech — like natural conversation.

### 2. 200ms Latency
Ultra-low end-to-end latency — no cloud needed.

### 3. Emotion and Tone
Understands and generates whispers, laughter, hesitation, and more.

### 4. Local Deployment
Multi-platform support: NVIDIA GPU, Apple MLX, browser.

## FAQ

**Q: How does it compare to OpenAI's voice mode?**
A: Moshi is open source and runs locally; OpenAI is cloud-based and closed. Latency is comparable.

**Q: Does it support Chinese?**
A: Currently English-first; multi-language is in development.

## Source & Thanks

> [kyutai-labs/moshi](https://github.com/kyutai-labs/moshi) — 8k+ stars, Apache 2.0

---
Source: https://tokrepo.com/en/workflows/moshi-real-time-ai-voice-conversation-engine-6172db11
Author: AI Open Source