# Moshi — Real-Time AI Voice Conversation Engine

> Open-source real-time voice AI by Kyutai. Full-duplex speech conversation with ~200ms latency, emotion recognition, and on-device processing. Apache 2.0 licensed.

## Quick Use

```bash
pip install moshi
python -m moshi.server
```

Open `http://localhost:8998` — start talking to Moshi in real time.

## What is Moshi?

Moshi is an open-source real-time voice AI engine by Kyutai. It enables full-duplex speech conversations with ~200ms latency — meaning you can interrupt, overlap, and have natural back-and-forth dialog with an AI. It runs on-device with no cloud dependency.

**Answer-Ready**: Moshi is an open-source real-time voice AI engine by Kyutai with full-duplex speech conversation at ~200ms latency. It supports interruptions, emotion recognition, and on-device processing. Apache 2.0 licensed, with 8k+ GitHub stars.

**Best for**: Developers building voice-first AI applications.

**Works with**: Local GPU (NVIDIA), Apple MLX, web browser.

**Setup time**: Under 5 minutes.

## Core Features

### 1. Full-Duplex Conversation

Unlike turn-based voice assistants, Moshi handles overlapping speech:

- You can interrupt mid-sentence
- Moshi responds while you're still talking
- Natural conversation flow, like a human phone call

### 2. Ultra-Low Latency

End-to-end latency breakdown:

```
Speech recognition:  ~50ms
Language model:     ~100ms
Speech synthesis:    ~50ms
Total:              ~200ms
```

### 3. Architecture

Joint speech-text model — no separate ASR + LLM + TTS pipeline:

```
Audio input → Mimi Encoder → Helium LM → Mimi Decoder → Audio output
                                 ↕
                           Text reasoning
```

- **Mimi**: Neural audio codec (12.5 Hz, 1.1 kbps)
- **Helium**: 7B-parameter language model backbone

### 4. Emotion & Tone

Moshi understands and generates:

- Whispers, laughter, hesitation
- Emotional tone (excited, calm, serious)
- Multiple speaking styles
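The latency budget and Mimi codec figures above can be sanity-checked with plain arithmetic. This snippet is not Moshi API code — the constants are simply the numbers quoted in this section:

```python
# Figures quoted above: Mimi runs at 12.5 frames/s and 1.1 kbps.
MIMI_FRAME_RATE_HZ = 12.5
MIMI_BITRATE_BPS = 1100

# Bits carried by each Mimi frame.
bits_per_frame = MIMI_BITRATE_BPS / MIMI_FRAME_RATE_HZ  # 88.0 bits

# Duration of one frame: the floor on streaming granularity.
frame_duration_ms = 1000 / MIMI_FRAME_RATE_HZ  # 80.0 ms

# Latency budget from the breakdown above (ASR + LM + TTS stages).
total_latency_ms = 50 + 100 + 50  # 200 ms
```

Each 80 ms audio frame is represented in only 88 bits, which is what lets the joint model stay inside the ~200ms end-to-end budget.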
### 5. Deployment Options

| Platform | How |
|----------|-----|
| Python server | `python -m moshi.server` |
| Rust server | High-performance production deployment |
| Web client | Browser-based demo |
| MLX | Apple Silicon optimized |

## Hardware Requirements

| GPU | Model Size | Latency |
|-----|------------|---------|
| NVIDIA A100 | 7B | ~160ms |
| NVIDIA RTX 4090 | 7B | ~200ms |
| Apple M2 Ultra | 7B (MLX) | ~300ms |

## FAQ

**Q: How does it compare to OpenAI's voice mode?**
A: Moshi is open-source and runs locally. OpenAI's voice mode is cloud-only and proprietary. Moshi has comparable latency.

**Q: Can I fine-tune it?**
A: Yes. Both the Mimi codec and the Helium LM can be fine-tuned for custom voice personas and domains.

**Q: Does it support multiple languages?**
A: It is currently optimized for English; multilingual support is in development.

## Source & Thanks

> Created by [Kyutai](https://github.com/kyutai-labs). Licensed under Apache 2.0.
>
> [kyutai-labs/moshi](https://github.com/kyutai-labs/moshi) — 8k+ stars
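As a back-of-envelope check on the hardware table, a weight-only VRAM estimate shows why a 7B model fits on a 24 GB consumer GPU. This is illustrative arithmetic, not Moshi code, and it ignores KV cache and activation memory, so treat it as a lower bound:

```python
def vram_estimate_gb(n_params: float, bytes_per_param: float = 2.0) -> float:
    """Weight-only VRAM footprint in GB (2 bytes/param = fp16/bf16)."""
    return n_params * bytes_per_param / 1e9

# 7B parameters in half precision: roughly 14 GB of weights.
moshi_7b_gb = vram_estimate_gb(7e9)

# An RTX 4090 ships with 24 GB of VRAM, leaving headroom for
# the KV cache, audio buffers, and the Mimi codec.
fits_on_4090 = moshi_7b_gb < 24
```

The same estimate explains the MLX row: Apple Silicon's unified memory comfortably holds the half-precision weights, at the cost of somewhat higher latency.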