Introduction
ncnn is a high-performance neural network inference computing framework developed by Tencent and optimized specifically for mobile and embedded platforms. It has zero third-party dependencies and can be cross-compiled for Android, iOS, and various ARM-based boards, making it a go-to choice for on-device AI.
What ncnn Does
- Runs neural network inference on mobile CPUs and GPUs with near-native speed (a minimal usage sketch follows this list)
- Supports ARM NEON, Vulkan GPU, x86 SSE/AVX, and RISC-V vector acceleration
- Converts models from PyTorch, ONNX, TensorFlow, Caffe, and MXNet formats
- Provides quantization tools (INT8) to reduce model size and improve throughput
- Ships a minimal runtime under 1 MB for resource-constrained devices
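As a concrete illustration of the basic inference flow, here is a minimal C++ sketch against the ncnn API. The model file names, image dimensions, mean values, and the blob names "data" and "prob" are placeholders that depend on the model you export; treat this as a template rather than a drop-in program.

```cpp
#include <vector>
#include "net.h"   // ncnn::Net, ncnn::Mat, ncnn::Extractor

int main()
{
    // Load a converted model; the .param/.bin file names are placeholders.
    ncnn::Net net;
    if (net.load_param("squeezenet.param") != 0)
        return -1;
    if (net.load_model("squeezenet.bin") != 0)
        return -1;

    // Wrap an RGB pixel buffer in an ncnn::Mat and resize to the network's
    // expected input size (640x480 source and 227x227 target are assumptions).
    std::vector<unsigned char> rgb(640 * 480 * 3, 0);
    ncnn::Mat in = ncnn::Mat::from_pixels_resize(
        rgb.data(), ncnn::Mat::PIXEL_RGB, 640, 480, 227, 227);

    // Typical preprocessing: subtract per-channel mean (values are placeholders).
    const float mean_vals[3] = {104.f, 117.f, 123.f};
    in.substract_mean_normalize(mean_vals, 0);

    // Run inference; the blob names "data" and "prob" depend on the model.
    ncnn::Extractor ex = net.create_extractor();
    ex.input("data", in);

    ncnn::Mat out;
    ex.extract("prob", out);
    return 0;
}
```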
Architecture Overview
ncnn uses a layer-based computation graph where each operator is hand-optimized with platform-specific SIMD intrinsics. The Vulkan compute backend enables GPU inference on mobile devices without requiring CUDA. Memory is managed via a blob allocator that reuses scratch buffers across layers to minimize peak usage. The framework supports dynamic shapes and can fuse operations at load time for faster execution.
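To make the memory-management point concrete, the sketch below wires pooled allocators into a Net so freed buffers are reused across layers instead of being returned to the system. The allocator classes and Option fields shown are ncnn's public ones, but the thread count and model file names are illustrative assumptions.

```cpp
#include "net.h"
#include "allocator.h"   // ncnn::PoolAllocator, ncnn::UnlockedPoolAllocator

int main()
{
    // Pooled allocators keep freed buffers around for reuse, which lowers
    // peak memory when many layers share scratch space.
    ncnn::UnlockedPoolAllocator blob_pool;   // blob (activation) buffers
    ncnn::PoolAllocator workspace_pool;      // per-layer scratch buffers

    ncnn::Net net;
    net.opt.blob_allocator = &blob_pool;
    net.opt.workspace_allocator = &workspace_pool;
    net.opt.num_threads = 4;                 // CPU worker threads (assumption)

    // Placeholder names for a converted .param/.bin pair.
    net.load_param("model.param");
    net.load_model("model.bin");
    return 0;
}
```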
Self-Hosting & Configuration
- Build from source with CMake; enable -DNCNN_VULKAN=ON for GPU support (a runtime sketch follows this list)
- Cross-compile for Android using the NDK toolchain file shipped in the repo
- Use ncnnoptimize to strip unused layers and fuse operations in exported models
- Convert ONNX models via onnx2ncnn; calibrate INT8 quantization with a small dataset
- Integrate into iOS projects via CocoaPods or by adding the framework directly
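Assuming the library was built with -DNCNN_VULKAN=ON, GPU inference is toggled per Net at runtime. The sketch below checks for a Vulkan device before enabling it and falls back to CPU otherwise; the model file names (a pair processed with ncnnoptimize) are placeholders.

```cpp
#include "net.h"
#if NCNN_VULKAN
#include "gpu.h"   // ncnn::get_gpu_count()
#endif

int main()
{
    ncnn::Net net;

#if NCNN_VULKAN
    // Only enable the Vulkan backend if a GPU is actually available.
    if (ncnn::get_gpu_count() > 0)
        net.opt.use_vulkan_compute = true;
#endif

    // Placeholder names for a model optimized with ncnnoptimize.
    net.load_param("model-opt.param");
    net.load_model("model-opt.bin");
    return 0;
}
```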
Key Features
- Zero external dependencies — pure C++ with optional Vulkan
- Vulkan-based GPU inference works on Android, iOS, macOS, Linux, and Windows
- INT8 and FP16 quantization with calibration tools included (runtime toggles are sketched after this list)
- Extensive model zoo with ready-to-use detection, segmentation, and recognition examples
- Active community with over 23,000 GitHub stars and regular releases
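A sketch of the reduced-precision toggles referenced above, using fields on ncnn::Option. Whether each setting takes effect depends on the target CPU/GPU and on how the model was exported and calibrated; the int8 model file names are placeholders.

```cpp
#include "net.h"

int main()
{
    ncnn::Net net;

    // Reduced-precision options; hardware and model support determine
    // whether each one is actually used at runtime.
    net.opt.use_fp16_packed = true;       // pack fp16 data where supported
    net.opt.use_fp16_storage = true;      // store weights/blobs as fp16
    net.opt.use_fp16_arithmetic = true;   // compute in fp16 where supported
    net.opt.use_int8_inference = true;    // honor int8 quantization info

    // Placeholder names for a model quantized with the calibration tools.
    net.load_param("model-int8.param");
    net.load_model("model-int8.bin");
    return 0;
}
```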
Comparison with Similar Tools
- MNN — Also targets mobile; MNN has a built-in expression API while ncnn focuses on raw inference speed
- TensorFlow Lite — Google's mobile runtime; heavier dependency footprint than ncnn
- ONNX Runtime Mobile — Broader model compatibility but larger binary size
- OpenVINO — Optimized for Intel hardware; ncnn targets ARM and Vulkan GPUs
- MLC-LLM — Focused on LLM deployment; ncnn handles general vision and NLP models
FAQ
Q: Which platforms does ncnn support? A: Android, iOS, macOS, Windows, Linux, and various embedded Linux boards with ARM or RISC-V CPUs.
Q: Can ncnn run large language models? A: ncnn is optimized for vision and lightweight NLP models. For LLM inference on device, consider pairing it with a dedicated LLM runtime.
Q: How do I convert a PyTorch model to ncnn format? A: Export from PyTorch to ONNX first, then use the onnx2ncnn tool included in the repo to produce .param and .bin files.
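As a quick sanity check on the converted files, a sketch like the following loads them and lists the blob names to pass to ncnn::Extractor; the file names are placeholders, and the name-listing methods assume a build with string support enabled (the default).

```cpp
#include <stdio.h>
#include "net.h"

int main()
{
    // Placeholder names for the onnx2ncnn output files.
    ncnn::Net net;
    if (net.load_param("converted.param") != 0 ||
        net.load_model("converted.bin") != 0)
    {
        fprintf(stderr, "failed to load converted model\n");
        return -1;
    }

    // Print the input/output blob names to use with ncnn::Extractor.
    for (const char* name : net.input_names())
        printf("input:  %s\n", name);
    for (const char* name : net.output_names())
        printf("output: %s\n", name);
    return 0;
}
```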
Q: Is GPU inference supported on mobile? A: Yes, ncnn uses Vulkan for GPU compute on both Android and iOS devices that support it.