# Cactus — Low-Latency AI Inference Engine for Mobile Devices

> An open-source C library for running LLM inference on smartphones and wearables with optimized performance for ARM processors and edge hardware.

## Install

Save the commands in the Quick Use section as a script file and run it, or enter them directly in a shell.

## Quick Use
```bash
git clone https://github.com/cactus-compute/cactus.git
cd cactus
make
# For iOS/Android, use the platform-specific build targets
```

## Introduction
Cactus is an open-source inference engine designed specifically for running LLMs and speech models on mobile devices and wearables. Built in C with ARM optimizations, it delivers low-latency inference without requiring a cloud connection, making AI capabilities available offline on resource-constrained hardware.

## What Cactus Does
- Runs quantized LLMs on iOS and Android devices
- Provides speech recognition and text-to-speech on-device
- Supports GGUF model format for efficient loading
- Delivers sub-second inference latency on modern mobile processors
- Offers native bindings for Swift, Kotlin, and React Native

## Architecture Overview
Cactus is written in C to maximize portability and minimize overhead. It uses NEON SIMD instructions on ARM processors for matrix multiplication acceleration. The engine supports 4-bit and 8-bit quantized models to fit within mobile memory constraints. A thin platform abstraction layer provides native iOS and Android integration without sacrificing performance.
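
To make the NEON point concrete, here is a minimal sketch of a dot product and a matrix-vector product that use NEON intrinsics when the compiler targets ARM and fall back to a scalar loop elsewhere. The function names `dot_f32` and `matvec_f32` are hypothetical and are not taken from the Cactus source; the real kernels are likely more elaborate (quantized inputs, tiling, threading).

```c
#include <assert.h>
#include <stddef.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: dot product of two float arrays of length n.
 * Uses NEON when available, otherwise a plain scalar loop. */
float dot_f32(const float *a, const float *b, size_t n)
{
    assert(a != NULL && b != NULL);
    size_t i = 0;
    float sum = 0.0f;

#if defined(__ARM_NEON)
    float32x4_t acc = vdupq_n_f32(0.0f);
    for (; i + 4 <= n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);
        float32x4_t vb = vld1q_f32(b + i);
        acc = vmlaq_f32(acc, va, vb);          /* acc += va * vb, per lane */
    }
    /* Horizontal sum of the four lanes; valid on ARMv7 and AArch64. */
    float32x2_t half = vadd_f32(vget_low_f32(acc), vget_high_f32(acc));
    half = vpadd_f32(half, half);
    sum = vget_lane_f32(half, 0);
#endif

    /* Scalar tail (or the whole loop when NEON is not available). */
    for (; i < n; ++i)
        sum += a[i] * b[i];
    return sum;
}

/* Hypothetical helper: y = W * x, with W stored row-major (rows x cols). */
void matvec_f32(const float *w, const float *x, float *y,
                size_t rows, size_t cols)
{
    for (size_t r = 0; r < rows; ++r)
        y[r] = dot_f32(w + r * cols, x, cols);
}
```

The same source builds on a desktop for testing, because the NEON path is compiled only when `__ARM_NEON` is defined.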

## Self-Hosting & Configuration
- Build from source with make or CMake for your target platform
- Use pre-built iOS and Android libraries from releases
- Load GGUF-format models from local storage
- Configure thread count and memory limits for your device
- Integrate via C API, Swift bindings, or Kotlin bindings
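
Before handing a model file to the engine, it can help to confirm that it really is a GGUF file. The sketch below is not part of the Cactus API; `gguf_peek_version` and `gguf_file_version` are hypothetical helpers. It relies only on the GGUF header starting with the four bytes `GGUF` followed by a 32-bit version, and it assumes the common little-endian layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: returns 1 and stores the version if buf starts
 * with a GGUF header, otherwise returns 0. */
int gguf_peek_version(const unsigned char *buf, size_t len, uint32_t *version)
{
    assert(buf != NULL && version != NULL);
    if (len < 8 || memcmp(buf, "GGUF", 4) != 0)
        return 0;
    /* Version is read as a little-endian uint32 (assumption, see above). */
    *version = (uint32_t)buf[4]
             | ((uint32_t)buf[5] << 8)
             | ((uint32_t)buf[6] << 16)
             | ((uint32_t)buf[7] << 24);
    return 1;
}

/* Hypothetical helper: reads the first 8 bytes of a file from local
 * storage and checks them with gguf_peek_version. */
int gguf_file_version(const char *path, uint32_t *version)
{
    unsigned char hdr[8];
    FILE *f = fopen(path, "rb");
    if (f == NULL)
        return 0;
    size_t got = fread(hdr, 1, sizeof hdr, f);
    fclose(f);
    return gguf_peek_version(hdr, got, version);
}
```

A check like this fails fast on a truncated download or a file in a different format, before any memory is reserved for weights.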

## Key Features
- Optimized for ARM processors with NEON SIMD acceleration
- Supports LLM inference and Whisper-based speech recognition
- Sub-100 ms generation time per token on modern mobile chips
- GGUF model format with 4-bit and 8-bit quantization
- Native bindings for iOS (Swift), Android (Kotlin), and React Native
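
The 4-bit quantization mentioned in the list can be illustrated with a simplified symmetric block scheme: each block of 32 weights shares one float scale, and every weight is stored as a 4-bit integer, two per byte. This is a teaching sketch under those assumptions. It is not the byte layout of any specific GGUF quantization type, and the names `q4_block`, `q4_quantize`, and `q4_dequantize` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define Q4_BLOCK 32

typedef struct {
    float   scale;                 /* one shared scale per block */
    uint8_t packed[Q4_BLOCK / 2];  /* two 4-bit values per byte */
} q4_block;

/* Round to nearest integer and clamp to the symmetric range [-7, 7]. */
static int q4_round_clamp(float v)
{
    int q = (int)(v >= 0.0f ? v + 0.5f : v - 0.5f);
    if (q > 7)  q = 7;
    if (q < -7) q = -7;
    return q;
}

/* Quantize Q4_BLOCK floats from x into one packed block. */
void q4_quantize(const float *x, q4_block *out)
{
    assert(x != NULL && out != NULL);
    float amax = 0.0f;
    for (size_t i = 0; i < Q4_BLOCK; ++i) {
        float v = x[i] < 0.0f ? -x[i] : x[i];
        if (v > amax) amax = v;
    }
    out->scale = amax / 7.0f;
    float inv = out->scale > 0.0f ? 1.0f / out->scale : 0.0f;
    for (size_t i = 0; i < Q4_BLOCK; i += 2) {
        int lo = q4_round_clamp(x[i] * inv) + 8;      /* stored as 1..15 */
        int hi = q4_round_clamp(x[i + 1] * inv) + 8;
        out->packed[i / 2] = (uint8_t)(lo | (hi << 4));
    }
}

/* Expand one packed block back into Q4_BLOCK floats. */
void q4_dequantize(const q4_block *in, float *y)
{
    assert(in != NULL && y != NULL);
    for (size_t i = 0; i < Q4_BLOCK; i += 2) {
        uint8_t b = in->packed[i / 2];
        y[i]     = (float)((int)(b & 0x0F) - 8) * in->scale;
        y[i + 1] = (float)((int)(b >> 4) - 8) * in->scale;
    }
}
```

In this sketch a block costs 16 bytes of packed values plus a 4-byte scale, compared with 128 bytes for the same 32 weights in 32-bit floats.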

## Comparison with Similar Tools
- **llama.cpp** — desktop-focused; Cactus is optimized for mobile ARM targets
- **ExecuTorch** — PyTorch ecosystem; Cactus uses GGUF for simpler model deployment
- **MLC-LLM** — broader scope; Cactus prioritizes minimal footprint on phones
- **ONNX Runtime Mobile** — general ML; Cactus specializes in LLM and speech workloads

## FAQ
**Q: What models can it run?**
A: Any GGUF-format model, including Llama, Mistral, Phi, and Whisper variants.

**Q: Does it need a GPU?**
A: No, it runs on the CPU with ARM NEON optimizations. GPU acceleration is optional where available.

**Q: What is the minimum device requirement?**
A: It runs on devices with 2 GB+ RAM using small quantized models (1-3B parameters).
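
As a rough check on that figure, weight storage is approximately the parameter count times bits per weight, divided by 8. The helper below is a hypothetical back-of-the-envelope calculation, not a Cactus function. It ignores per-block scales, the KV cache, and activation buffers, so real memory use will be higher.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: approximate bytes needed to store the weights. */
uint64_t weight_bytes(uint64_t params, unsigned bits_per_weight)
{
    assert(bits_per_weight > 0 && bits_per_weight <= 32);
    return (params * bits_per_weight + 7) / 8;  /* round up to whole bytes */
}
```

For example, `weight_bytes(3000000000ULL, 4)` gives 1,500,000,000 bytes, so a 3B-parameter model at 4 bits needs about 1.5 GB for weights alone, while a 1B-parameter model at 4 bits needs about 0.5 GB.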

**Q: Can I use it in a React Native app?**
A: Yes, React Native bindings are provided for cross-platform mobile development.

## Sources
- https://github.com/cactus-compute/cactus

---
Source: https://tokrepo.com/en/workflows/asset-00209c52
Author: Script Depot