# MNN: Blazing-Fast On-Device AI Inference by Alibaba

> MNN is a lightweight, high-performance inference engine from Alibaba optimized for mobile, embedded, and edge devices with broad model and hardware support.

## Install

Save the content below to `.claude/skills/` or append to your `CLAUDE.md`:

## Quick Use

```bash
git clone https://github.com/alibaba/MNN.git
cd MNN && mkdir build && cd build
# MNN_BUILD_CONVERTER builds the MNNConvert and MNNDump2Json tools used below
cmake .. -DMNN_BUILD_DEMO=ON -DMNN_BUILD_CONVERTER=ON && make -j$(nproc)

# Convert an ONNX model to MNN format, then dump it to JSON for inspection
./MNNConvert -f ONNX --modelFile model.onnx --MNNModel model.mnn --bizCode MNN
./MNNDump2Json model.mnn model.json
```

## Introduction

MNN (Mobile Neural Network) is a high-performance deep learning inference engine built by Alibaba and battle-tested across dozens of Alibaba apps serving billions of requests. It supports on-device LLM inference, vision models, and general neural networks, with a focus on minimal latency and memory footprint.

## What MNN Does

- Runs neural network inference on mobile CPUs, GPUs, and NPUs with optimized kernels
- Supports on-device LLM inference, including quantized transformer models
- Converts models from PyTorch, ONNX, TensorFlow, and Caffe formats via MNNConvert
- Provides an expression API for building and debugging models interactively
- Deploys across Android, iOS, Linux, Windows, macOS, and embedded Linux

## Architecture Overview

MNN uses a session-based execution model in which a network graph is scheduled across heterogeneous backends (CPU, GPU via OpenCL/Vulkan/Metal, NPU). The geometry computation module abstracts operator fusion and memory planning. Kernels are auto-tuned per device at first run, and the results are cached for subsequent executions. The runtime supports dynamic input shapes and lazy evaluation for efficient memory reuse.

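
The first-run tuning and caching described above can be illustrated with a small, framework-independent sketch. It is not MNN's implementation: `tune_once`, the two toy dot-product candidates, and the JSON cache file are hypothetical names chosen for illustration. The only idea carried over is that each candidate kernel is timed once, the fastest is recorded, and later runs read the cached choice instead of timing again.

```python
import json
import os
import tempfile
import time


def tune_once(op_name, candidates, args, cache_path, repeats=3):
    """Pick the fastest candidate for op_name, reusing a cached choice if one exists.

    Returns (chosen_name, from_cache). Hypothetical helper, not an MNN API.
    """
    cache = {}
    if os.path.exists(cache_path):
        with open(cache_path) as f:
            cache = json.load(f)
    if cache.get(op_name) in candidates:
        return cache[op_name], True  # cache hit: skip timing entirely

    # Cache miss: time every candidate on the given arguments.
    timings = {}
    for name, fn in candidates.items():
        start = time.perf_counter()
        for _ in range(repeats):
            fn(*args)
        timings[name] = time.perf_counter() - start

    best = min(timings, key=timings.get)
    cache[op_name] = best
    with open(cache_path, "w") as f:
        json.dump(cache, f)
    return best, False


def dot_loop(a, b):
    """Toy kernel variant 1: explicit accumulation loop."""
    total = 0.0
    for x, y in zip(a, b):
        total += x * y
    return total


def dot_generator(a, b):
    """Toy kernel variant 2: built-in sum over a generator."""
    return sum(x * y for x, y in zip(a, b))


cache_file = os.path.join(tempfile.mkdtemp(), "tune_cache.json")
candidates = {"loop": dot_loop, "generator": dot_generator}
vec = [float(i) for i in range(1000)]

choice, from_cache = tune_once("dot", candidates, (vec, vec), cache_file)
print(choice, from_cache)  # first call: candidates are timed, from_cache is False
choice, from_cache = tune_once("dot", candidates, (vec, vec), cache_file)
print(choice, from_cache)  # second call: same choice read back, from_cache is True
```

Which variant wins depends on the machine, which is the point of tuning per device rather than hard-coding a choice.
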
## Self-Hosting & Configuration

- Build with CMake; use `-DMNN_OPENCL=ON` or `-DMNN_METAL=ON` for GPU backends
- Cross-compile for Android via NDK or iOS via Xcode project files
- Use MNNConvert to translate models and apply FP16/INT8 quantization
- Configure thread count and backend selection via `ScheduleConfig` at runtime
- Integrate into apps through C++, Python, Java, or Objective-C APIs

## Key Features

- On-device LLM support with 4-bit quantization for transformer architectures
- Automatic hybrid scheduling across CPU, GPU, and NPU backends
- Under 2 MB runtime binary with no external dependencies
- Expression API enables PyTorch-style model building and debugging
- Proven at scale inside Alibaba's production mobile apps

## Comparison with Similar Tools

- **ncnn**: Similar mobile focus; MNN adds an expression API and hybrid backend scheduling
- **TensorFlow Lite**: Broader ecosystem but larger binary and dependency footprint
- **ONNX Runtime**: More general-purpose; MNN is specifically optimized for mobile latency
- **OpenVINO**: Targets Intel hardware; MNN targets ARM CPUs and mobile GPUs via OpenCL, Vulkan, and Metal
- **llama.cpp**: Specialized for LLMs; MNN handles both LLMs and vision models in one framework

## FAQ

**Q: How does MNN compare to ncnn for mobile deployment?**
A: Both are high-performance mobile frameworks. MNN offers hybrid GPU/CPU scheduling and an expression API, while ncnn is known for its minimal footprint and Vulkan backend.

**Q: Can MNN run large language models on a phone?**
A: Yes. MNN supports on-device LLM inference with INT4 quantization, which makes multi-billion-parameter models practical on modern smartphones.

**Q: Which model formats does MNN accept?**
A: MNNConvert accepts ONNX, TensorFlow, Caffe, and TorchScript models; PyTorch models are typically converted via ONNX export.

**Q: Is MNN production-ready?**
A: Yes. MNN powers AI features across Alibaba's mobile apps, including Taobao, serving billions of inference requests daily.

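
To put the INT4 answer above in numbers, here is a back-of-the-envelope sketch of weight storage at different precisions. The 7-billion-parameter size is an assumed example, not a model named in this document, and `weight_footprint_gib` is a hypothetical helper. The estimate ignores quantization scales, activations, and the KV cache, so real memory use is higher.

```python
def weight_footprint_gib(num_params, bits_per_weight):
    """Approximate weight storage in GiB: params * bits / 8 bytes, divided by 2**30."""
    return num_params * bits_per_weight / 8 / 2**30


# Assumed example size: 7 billion parameters.
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit: {weight_footprint_gib(7e9, bits):.2f} GiB")
```

Under these assumptions the weights shrink from about 13 GiB at 16 bits to about 3.3 GiB at 4 bits, which is the difference between exceeding and fitting within the RAM of a typical high-end phone.
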
## Sources

- https://github.com/alibaba/MNN
- https://mnn-docs.readthedocs.io/

---

Source: https://tokrepo.com/en/workflows/mnn-blazing-fast-device-ai-inference-alibaba-0f42114c
Author: Script Depot