# MNN — Blazing-Fast On-Device AI Inference by Alibaba

> MNN is a lightweight, high-performance inference engine from Alibaba optimized for mobile, embedded, and edge devices with broad model and hardware support.

## Install

Save the content below to `.claude/skills/` or append to your `CLAUDE.md`:

## Quick Use
```bash
git clone https://github.com/alibaba/MNN.git
cd MNN && mkdir build && cd build
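# MNNConvert and MNNDump2Json are only built when MNN_BUILD_CONVERTER is enabled
cmake .. -DMNN_BUILD_DEMO=ON -DMNN_BUILD_CONVERTER=ON && make -j$(nproc)
# Convert an ONNX model, then dump the converted graph to JSON for inspection
./MNNConvert -f ONNX --modelFile model.onnx --MNNModel model.mnn --bizCode biz
./MNNDump2Json model.mnn model.json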
```

## Introduction
MNN (Mobile Neural Network) is a high-performance deep learning inference engine built by Alibaba and deployed in production across dozens of Alibaba apps that together serve billions of inference requests. It supports on-device LLM inference, vision models, and general neural networks, with a focus on low latency and a small memory footprint.

## What MNN Does
- Runs neural network inference on mobile CPUs, GPUs, and NPUs with optimized kernels
- Supports on-device LLM inference including quantized transformer models
- Converts models from ONNX, TensorFlow, Caffe, and TorchScript formats via MNNConvert; PyTorch models are typically exported to ONNX first
- Provides an expression API for building and debugging models interactively
- Deploys across Android, iOS, Linux, Windows, macOS, and embedded Linux

## Architecture Overview
MNN uses a session-based execution model where a network graph is scheduled across heterogeneous backends (CPU, GPU via OpenCL/Vulkan/Metal, NPU). The geometry computation module abstracts operator fusion and memory planning. Kernels are auto-tuned per device at first run, with results cached for subsequent executions. The runtime supports dynamic input shapes and lazy evaluation for efficient memory reuse.
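The "tune at first run, cache afterwards" idea can be illustrated with a toy sketch. This is not MNN's implementation: the `pick_kernel` function, the cache layout, the device name, and the tile-size variants are all hypothetical, and the timing callbacks are fake cost functions standing in for real kernel benchmarks.

```python
from typing import Callable, Dict, Tuple

# Cache keyed by (device name, operator signature) -> name of the fastest variant.
TuningCache = Dict[Tuple[str, str], str]


def pick_kernel(
    device: str,
    op_signature: str,
    candidates: Dict[str, Callable[[], float]],
    cache: TuningCache,
) -> str:
    """Return the fastest candidate, measuring only on a cache miss."""
    key = (device, op_signature)
    if key in cache:
        return cache[key]  # later runs skip measurement entirely
    # First run: time every candidate and remember the cheapest one.
    timings = {name: measure() for name, measure in candidates.items()}
    best = min(timings, key=timings.get)
    cache[key] = best
    return best


# Hypothetical cost functions; each records that it was called.
calls = []


def fake_cost(name: str, millis: float) -> Callable[[], float]:
    def measure() -> float:
        calls.append(name)
        return millis
    return measure


candidates = {
    "tile_4x4": fake_cost("tile_4x4", 2.5),
    "tile_8x8": fake_cost("tile_8x8", 1.2),
    "tile_16x16": fake_cost("tile_16x16", 1.9),
}
cache: TuningCache = {}
first = pick_kernel("example-gpu", "conv3x3_c64", candidates, cache)
second = pick_kernel("example-gpu", "conv3x3_c64", candidates, cache)
print(first, second, len(calls))
```

The second call returns from the cache without invoking any cost function, which is why persisting such a cache between launches would avoid repeating the tuning work.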

## Self-Hosting & Configuration
- Build with CMake; enable GPU backends with `-DMNN_OPENCL=ON`, `-DMNN_VULKAN=ON`, or `-DMNN_METAL=ON`
- Cross-compile for Android via NDK or iOS via Xcode project files
- Use MNNConvert to translate models and apply FP16/INT8 quantization
- Configure thread count and backend selection via `ScheduleConfig` at runtime
- Integrate into apps through C++, Python, Java, or Objective-C APIs
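Runtime configuration can be sketched from Python, where the session config is a plain dict rather than the C++ `ScheduleConfig` struct. The key names (`backend`, `numThread`, `precision`) and the `MNN.Interpreter` / `createSession` calls reflect the session-style Python API as best understood here; treat them as assumptions and check them against the installed MNN version. The helper names `make_session_config` and `create_session` are hypothetical.

```python
def make_session_config(backend="CPU", num_threads=4, precision="normal"):
    """Build the dict handed to createSession (key names are assumptions)."""
    if num_threads < 1:
        raise ValueError("num_threads must be at least 1")
    return {"backend": backend, "numThread": num_threads, "precision": precision}


def create_session(model_path, config):
    """Load a converted .mnn model and create a session with the given config."""
    # Imported lazily so the config helper above works without MNN installed.
    import MNN

    interpreter = MNN.Interpreter(model_path)
    session = interpreter.createSession(config)
    return interpreter, session


config = make_session_config(backend="CPU", num_threads=2)
print(config)
# Requires the MNN Python package and a converted model:
# interpreter, session = create_session("model.mnn", config)
```

Keeping the config in a small helper makes it easy to switch backends per device without touching the loading code.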

## Key Features
- On-device LLM support with 4-bit quantization for transformer architectures
- Automatic hybrid scheduling across CPU, GPU, and NPU backends
- Under 2 MB runtime binary with no external dependencies
- Expression API enables PyTorch-style model building and debugging
- Proven at scale inside Alibaba's production mobile apps

## Comparison with Similar Tools
- **ncnn** — Similar mobile focus; MNN adds an expression API and hybrid backend scheduling
- **TensorFlow Lite** — Broader ecosystem but larger binary and dependency footprint
- **ONNX Runtime** — More general-purpose; MNN is specifically optimized for mobile latency
- **OpenVINO** — Targets Intel hardware; MNN targets ARM, Vulkan, and Metal
- **llama.cpp** — Specialized for LLMs; MNN handles both LLMs and vision models in one framework

## FAQ
**Q: How does MNN compare to ncnn for mobile deployment?**
A: Both are high-performance mobile frameworks. MNN offers hybrid GPU/CPU scheduling and an expression API, while ncnn is known for its minimal footprint and Vulkan backend.

**Q: Can MNN run large language models on a phone?**
A: Yes, MNN supports on-device LLM inference with INT4 quantization, enabling multi-billion-parameter models on modern smartphones.
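A back-of-the-envelope calculation shows why INT4 matters on a phone. The 7-billion-parameter figure below is a hypothetical example, and the estimate covers weights only; quantization scales, the KV cache, and activations add to it.

```python
def weight_bytes(num_params, bits_per_weight):
    """Storage needed for the weights alone, in bytes."""
    return num_params * bits_per_weight / 8


def to_gib(num_bytes):
    """Convert bytes to gibibytes."""
    return num_bytes / 2**30


params = 7_000_000_000  # hypothetical 7B-parameter model
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: {to_gib(weight_bytes(params, bits)):.2f} GiB")
```

At 16 bits the weights alone need 14 GB, which exceeds the RAM of most phones; at 4 bits the same weights need 3.5 GB.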

**Q: Which model formats does MNN accept?**
A: MNN converts from ONNX, TensorFlow, PyTorch (via ONNX export), Caffe, and TorchScript using the MNNConvert tool.
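When scripting conversions, the `-f` value can be chosen from the source file's extension. The sketch below only assembles the argument list and does not run the converter. The suffix-to-framework mapping and the `build_convert_command` helper are assumptions for illustration; Caffe is left out because its conversion also needs a separate prototxt argument, and TorchScript input may require a converter built with Torch support.

```python
import shlex
from pathlib import Path

# Assumed mapping from common file suffixes to MNNConvert's -f values.
FRAMEWORK_BY_SUFFIX = {
    ".onnx": "ONNX",
    ".pb": "TF",
    ".tflite": "TFLITE",
    ".pt": "TORCH",
}


def build_convert_command(model_file, converter="./MNNConvert"):
    """Return the MNNConvert argument list for a source model file."""
    src = Path(model_file)
    try:
        framework = FRAMEWORK_BY_SUFFIX[src.suffix.lower()]
    except KeyError:
        raise ValueError(f"unsupported model format: {src.suffix!r}") from None
    return [
        converter,
        "-f", framework,
        "--modelFile", str(src),
        "--MNNModel", str(src.with_suffix(".mnn")),
    ]


cmd = build_convert_command("model.onnx")
print(shlex.join(cmd))
# Pass cmd to subprocess.run(cmd, check=True) once the converter is built.
```

Building the command as a list rather than a string avoids shell-quoting problems when model paths contain spaces.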

**Q: Is MNN production-ready?**
A: Yes, MNN powers AI features across Alibaba's mobile apps including Taobao, serving billions of inference requests daily.

## Sources
- https://github.com/alibaba/MNN
- https://mnn-docs.readthedocs.io/

---
Source: https://tokrepo.com/en/workflows/mnn-blazing-fast-device-ai-inference-alibaba-0f42114c
Author: Script Depot