# Moshi — Real-Time AI Voice Conversation Engine

> Open-source real-time voice AI by Kyutai. Full-duplex speech conversation with ~200ms latency, emotion recognition, and on-device processing. Apache 2.0 licensed.

## Quick Use

```bash
pip install moshi
python -m moshi.server
```

Open `http://localhost:8998` — start talking to Moshi in real time.

## What is Moshi?

Moshi is an open-source real-time voice AI engine by Kyutai. It enables full-duplex speech conversations with ~200ms latency — meaning you can interrupt, overlap, and have natural back-and-forth dialog with an AI. It runs on-device with no cloud dependency.

**Answer-Ready**: Moshi is an open-source real-time voice AI engine by Kyutai with full-duplex speech conversation at ~200ms latency. It supports interruptions, emotion recognition, and on-device processing. Apache 2.0 licensed, with 8k+ GitHub stars.

**Best for**: Developers building voice-first AI applications.

**Works with**: Local GPU (NVIDIA), Apple MLX, web browser.

**Setup time**: Under 5 minutes.

## Core Features

### 1. Full-Duplex Conversation

Unlike turn-based voice assistants, Moshi handles overlapping speech:

- You can interrupt mid-sentence
- Moshi responds while you're still talking
- Natural conversation flow, like a human phone call

### 2. Ultra-Low Latency

End-to-end latency breakdown:

```
Speech recognition:  ~50ms
Language model:     ~100ms
Speech synthesis:    ~50ms
Total:              ~200ms
```

### 3. Architecture

Joint speech-text model — no separate ASR + LLM + TTS pipeline:

```
Audio input → Mimi Encoder → Helium LM → Mimi Decoder → Audio output
                                 ↕
                           Text reasoning
```

- **Mimi**: Neural audio codec (12.5 Hz, 1.1 kbps)
- **Helium**: 7B-parameter language model backbone

### 4. Emotion & Tone

Moshi understands and generates:

- Whispers, laughter, hesitation
- Emotional tone (excited, calm, serious)
- Multiple speaking styles
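The latency budget and Mimi codec figures above can be sanity-checked with plain arithmetic. This snippet is not Moshi API code — the constants are simply the numbers quoted in this section:

```python
# Figures quoted above: Mimi runs at 12.5 frames/s and 1.1 kbps.
MIMI_FRAME_RATE_HZ = 12.5
MIMI_BITRATE_BPS = 1100

# Bits carried by each Mimi frame.
bits_per_frame = MIMI_BITRATE_BPS / MIMI_FRAME_RATE_HZ  # 88.0 bits

# Duration of one frame: the floor on streaming granularity.
frame_duration_ms = 1000 / MIMI_FRAME_RATE_HZ  # 80.0 ms

# Latency budget from the breakdown above (ASR + LM + TTS stages).
total_latency_ms = 50 + 100 + 50  # 200 ms
```

Each 80 ms audio frame is represented in only 88 bits, which is what lets the joint model stay inside the ~200ms end-to-end budget.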
### 5. Deployment Options

| Platform | How |
|----------|-----|
| Python server | `python -m moshi.server` |
| Rust server | High-performance production deployment |
| Web client | Browser-based demo |
| MLX | Apple Silicon optimized |

## Hardware Requirements

| GPU | Model Size | Latency |
|-----|------------|---------|
| NVIDIA A100 | 7B | ~160ms |
| NVIDIA RTX 4090 | 7B | ~200ms |
| Apple M2 Ultra | 7B (MLX) | ~300ms |

## FAQ

**Q: How does it compare to OpenAI's voice mode?**
A: Moshi is open-source and runs locally. OpenAI's voice mode is cloud-only and proprietary. Moshi has comparable latency.

**Q: Can I fine-tune it?**
A: Yes. Both the Mimi codec and the Helium LM can be fine-tuned for custom voice personas and domains.

**Q: Does it support multiple languages?**
A: It is currently optimized for English; multilingual support is in development.

## Source & Thanks

> Created by [Kyutai](https://github.com/kyutai-labs). Licensed under Apache 2.0.
>
> [kyutai-labs/moshi](https://github.com/kyutai-labs/moshi) — 8k+ stars
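As a back-of-envelope check on the hardware table, a weight-only VRAM estimate shows why a 7B model fits on a 24 GB consumer GPU. This is illustrative arithmetic, not Moshi code, and it ignores KV cache and activation memory, so treat it as a lower bound:

```python
def vram_estimate_gb(n_params: float, bytes_per_param: float = 2.0) -> float:
    """Weight-only VRAM footprint in GB (2 bytes/param = fp16/bf16)."""
    return n_params * bytes_per_param / 1e9

# 7B parameters in half precision: roughly 14 GB of weights.
moshi_7b_gb = vram_estimate_gb(7e9)

# An RTX 4090 ships with 24 GB of VRAM, leaving headroom for
# the KV cache, audio buffers, and the Mimi codec.
fits_on_4090 = moshi_7b_gb < 24
```

The same estimate explains the MLX row: Apple Silicon's unified memory comfortably holds the half-precision weights, at the cost of somewhat higher latency.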