Introduction
Cactus is an open-source inference engine for running LLMs and speech models on mobile devices and wearables. Written in C with ARM-specific optimizations, it delivers low-latency inference without a cloud connection, so AI features work offline on resource-constrained hardware.
What Cactus Does
- Runs quantized LLMs on iOS and Android devices
- Provides speech recognition and text-to-speech on-device
- Supports GGUF model format for efficient loading
- Delivers sub-second inference latency on modern mobile processors
- Offers native bindings for Swift, Kotlin, and React Native
Architecture Overview
Cactus is written in C to maximize portability and minimize overhead. On ARM processors it uses NEON SIMD instructions to accelerate matrix multiplication, the dominant cost in LLM inference. The engine supports 4-bit and 8-bit quantized models so that weights fit within mobile memory constraints. A thin platform abstraction layer handles iOS and Android integration without sacrificing performance.
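The two techniques named above, quantization and SIMD-accelerated arithmetic, can be sketched in a few lines of C. This is a minimal illustration and not Cactus's actual kernels. The function names `quantize_q8` and `dot_f32` are hypothetical, and the quantizer uses one shared scale for the whole vector, whereas block-based formats keep a scale per small group of weights. The NEON path is compiled only when the compiler defines `__ARM_NEON`; other targets use the scalar loop.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: symmetric 8-bit quantization with one shared scale.
   Each value is stored as round(x / scale), where scale = max|x| / 127.
   Returns the scale needed to dequantize (x ~= q * scale). */
float quantize_q8(const float *x, int8_t *q, size_t n) {
    float amax = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        float v = x[i] < 0.0f ? -x[i] : x[i];
        if (v > amax) amax = v;
    }
    float scale = amax / 127.0f;
    for (size_t i = 0; i < n; ++i) {
        int r = 0;
        if (scale > 0.0f) {
            float t = x[i] / scale;
            r = (int)(t >= 0.0f ? t + 0.5f : t - 0.5f); /* round to nearest */
        }
        if (r > 127) r = 127;    /* clamp to the int8 range */
        if (r < -127) r = -127;
        q[i] = (int8_t)r;
    }
    return scale;
}

/* Hypothetical helper: dot product of two float vectors.
   Processes four lanes at a time with NEON when available,
   then finishes any remaining elements with a scalar loop. */
float dot_f32(const float *a, const float *b, size_t n) {
    size_t i = 0;
    float sum = 0.0f;
#if defined(__ARM_NEON)
    float32x4_t acc = vdupq_n_f32(0.0f);
    for (; i + 4 <= n; i += 4) {
        /* acc += a[i..i+3] * b[i..i+3] */
        acc = vmlaq_f32(acc, vld1q_f32(a + i), vld1q_f32(b + i));
    }
    /* Horizontal add of the four accumulator lanes. */
    float32x2_t half = vadd_f32(vget_low_f32(acc), vget_high_f32(acc));
    sum = vget_lane_f32(vpadd_f32(half, half), 0);
#endif
    for (; i < n; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}
```

A matrix-vector product, the core operation of token generation, is one such dot product per output row, which is why vectorizing this inner loop matters.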
Self-Hosting & Configuration
- Build from source with make or CMake for your target platform
- Use pre-built iOS and Android libraries from releases
- Load GGUF-format models from local storage
- Configure thread count and memory limits for your device
- Integrate via C API, Swift bindings, or Kotlin bindings
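As a rough sketch of how thread count and memory limits might be derived for a given device, the example below picks values from the core count and total RAM. The `engine_config` struct and `make_config` function are hypothetical and are not part of the Cactus API. Leaving one core free for the UI thread is an assumed heuristic, not a documented recommendation.

```c
#include <stdint.h>

/* Hypothetical settings struct; not the Cactus API. */
typedef struct {
    int      n_threads;        /* worker threads used for inference */
    uint64_t max_memory_bytes; /* upper bound on engine allocations */
} engine_config;

/* Derive settings from device properties.
   Assumed heuristic: keep one core free for the UI thread and cap
   memory at a caller-chosen fraction of total RAM. */
engine_config make_config(int cpu_cores, uint64_t ram_bytes, double ram_fraction) {
    engine_config cfg;
    cfg.n_threads = (cpu_cores > 1) ? cpu_cores - 1 : 1;

    /* Keep the fraction within [0, 1]. */
    if (ram_fraction < 0.0) ram_fraction = 0.0;
    if (ram_fraction > 1.0) ram_fraction = 1.0;

    cfg.max_memory_bytes = (uint64_t)((double)ram_bytes * ram_fraction);
    return cfg;
}
```

For an 8-core phone with 4 GiB of RAM and a fraction of 0.5, this yields 7 threads and a 2 GiB memory cap. The right values depend on the device and workload, so treat them as starting points to measure against.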
Key Features
- Optimized for ARM processors with NEON SIMD acceleration
- Supports LLM inference and Whisper-based speech recognition
- Sub-100 ms per-token generation latency on modern mobile chips
- GGUF model format with 4-bit and 8-bit quantization
- Native bindings for iOS (Swift), Android (Kotlin), and React Native
Comparison with Similar Tools
- llama.cpp: desktop-focused; Cactus is optimized for mobile ARM targets
- ExecuTorch: part of the PyTorch ecosystem; Cactus uses GGUF for simpler model deployment
- MLC-LLM: broader in scope; Cactus prioritizes a minimal footprint on phones
- ONNX Runtime Mobile: a general-purpose ML runtime; Cactus specializes in LLM and speech workloads
FAQ
Q: What models can it run? A: Any GGUF-format model, including Llama, Mistral, Phi, and Whisper variants.
Q: Does it need a GPU? A: No. It runs on the CPU using ARM NEON optimizations; GPU acceleration is optional where available.
Q: What is the minimum device requirement? A: A device with at least 2 GB of RAM, running a small quantized model (1-3B parameters).
Q: Can I use it in a React Native app? A: Yes, React Native bindings are provided for cross-platform mobile development.
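The memory guidance in the FAQ follows from simple arithmetic: the weights of a model occupy roughly (parameter count × bits per weight) / 8 bytes. The sketch below works that out. The function names `weight_bytes` and `fits_in_budget` are hypothetical. The estimate covers the weights only, so it is a lower bound: per-block quantization scales, the KV cache, and activations all add to it.

```c
#include <stdint.h>

/* Bytes needed for the weights alone: params * bits / 8, rounded up.
   Ignores quantization block overhead, KV cache, and activations. */
uint64_t weight_bytes(uint64_t n_params, unsigned bits_per_weight) {
    return (n_params * bits_per_weight + 7) / 8;
}

/* Returns 1 if the weights alone fit within budget_bytes, else 0. */
int fits_in_budget(uint64_t n_params, unsigned bits_per_weight,
                   uint64_t budget_bytes) {
    return weight_bytes(n_params, bits_per_weight) <= budget_bytes;
}
```

Under these assumptions a 3B-parameter model needs about 1.5 GB of weights at 4 bits and about 3 GB at 8 bits. On a 2 GB device only the 4-bit variant is even a candidate, and a 1B model leaves more headroom for the operating system and the app itself.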