How do I install Cactus — Low-Latency AI Inference Engine for Mobile Devices?

Visit the asset page on TokRepo and click "Copy for agent" to get the installation instructions. Most assets can be installed with a single command.

Cactus — Low-Latency AI Inference Engine for Mobile Devices

Introduction

Cactus is an open-source inference engine designed specifically for running LLMs and speech models on mobile devices and wearables. Built in C with ARM optimizations, it delivers low-latency inference without requiring a cloud connection, making AI capabilities available offline on resource-constrained hardware.

What Cactus Does

Runs quantized LLMs on iOS and Android devices
Provides speech recognition and text-to-speech on-device
Supports GGUF model format for efficient loading
Delivers sub-second inference latency on modern mobile processors
Offers native bindings for Swift, Kotlin, and React Native
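
As a rough illustration of what loading a GGUF model starts with, the sketch below checks that a buffer begins with the GGUF magic bytes and reads the format version that follows. It relies only on the published GGUF header layout (the four ASCII bytes "GGUF", then a little-endian 32-bit version). The function name gguf_version is hypothetical and is not part of the Cactus API.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, not a Cactus function.
 * Returns the GGUF version if buf starts with a GGUF header,
 * or 0 if the buffer is too short or the magic does not match. */
uint32_t gguf_version(const unsigned char *buf, size_t len) {
    if (len < 8 || memcmp(buf, "GGUF", 4) != 0) {
        return 0;
    }
    /* Assemble the little-endian 32-bit version byte by byte so the
     * result does not depend on host endianness or alignment. */
    return ((uint32_t)buf[4])
         | ((uint32_t)buf[5] << 8)
         | ((uint32_t)buf[6] << 16)
         | ((uint32_t)buf[7] << 24);
}
```

In practice the first 8 bytes of the model file would be read with fread and passed to this check before handing the file to the engine.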

Architecture Overview

Cactus is written in C to maximize portability and minimize overhead. It uses NEON SIMD instructions on ARM processors for matrix multiplication acceleration. The engine supports 4-bit and 8-bit quantized models to fit within mobile memory constraints. A thin platform abstraction layer provides native iOS and Android integration without sacrificing performance.
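
To make the NEON point concrete, here is a minimal sketch of a dot product, the inner loop of matrix multiplication, that uses NEON intrinsics when the compiler targets ARM and falls back to plain C elsewhere. It is not taken from the Cactus source. The function name dot_f32 is an assumption for illustration, and a production kernel would typically work on quantized blocks rather than raw floats.

```c
#include <stddef.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Illustrative only: dot product of two float arrays of length n. */
float dot_f32(const float *a, const float *b, size_t n) {
    size_t i = 0;
    float sum = 0.0f;
#if defined(__ARM_NEON)
    /* Process four floats per iteration in a 128-bit NEON register. */
    float32x4_t acc = vdupq_n_f32(0.0f);
    for (; i + 4 <= n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);
        float32x4_t vb = vld1q_f32(b + i);
        acc = vmlaq_f32(acc, va, vb); /* acc += va * vb, lane by lane */
    }
    /* Reduce the four lanes to one scalar. */
    float32x2_t half = vadd_f32(vget_low_f32(acc), vget_high_f32(acc));
    sum = vget_lane_f32(vpadd_f32(half, half), 0);
#endif
    /* Scalar loop: handles the tail on ARM and everything elsewhere. */
    for (; i < n; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}
```

The preprocessor guard keeps the same source buildable on a desktop for testing, which is one common way a thin platform layer can stay portable without giving up the SIMD path on devices.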

Self-Hosting & Configuration

Build from source with make or CMake for your target platform
Use pre-built iOS and Android libraries from releases
Load GGUF-format models from local storage
Configure thread count and memory limits for your device
Integrate via C API, Swift bindings, or Kotlin bindings
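
For the thread-count setting, one possible approach is to cap the worker count at the number of online cores. The sketch below uses sysconf with _SC_NPROCESSORS_ONLN, which is available on Linux, Android, and Apple platforms. The helper names pick_thread_count and default_thread_count are hypothetical. How the resulting value is passed to Cactus depends on its actual configuration API, which is not shown here.

```c
#include <unistd.h>

/* Hypothetical helper: clamp a requested thread cap to the core count.
 * Falls back to 1 thread if the core count is unknown or invalid. */
int pick_thread_count(long online_cores, int max_threads) {
    if (online_cores < 1) {
        return 1;
    }
    if (max_threads < 1) {
        max_threads = 1;
    }
    return online_cores < max_threads ? (int)online_cores : max_threads;
}

/* Query the OS for online cores; sysconf returns -1 on failure,
 * which pick_thread_count maps to a single thread. */
int default_thread_count(int max_threads) {
    return pick_thread_count(sysconf(_SC_NPROCESSORS_ONLN), max_threads);
}
```

On phones with a mix of fast and slow cores, using every core is not always the fastest choice, so the cap is best treated as a value to tune per device.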

Key Features

Optimized for ARM processors with NEON SIMD acceleration
Supports LLM inference and Whisper-based speech recognition
Sub-100 ms per-token generation on modern mobile chips
GGUF model format with 4-bit and 8-bit quantization
Native bindings for iOS (Swift), Android (Kotlin), and React Native
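
The idea behind 8-bit quantization can be sketched as symmetric scaling: store each weight as a signed 8-bit integer plus one shared floating-point scale. The code below is a simplified illustration under that assumption. Real GGUF quantization types differ in detail (for example, they apply scales per small block of weights), and the function names here are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Round to nearest integer, halves away from zero, without needing libm. */
static int round_nearest(float v) {
    return (int)(v >= 0.0f ? v + 0.5f : v - 0.5f);
}

/* Quantize n floats into int8 values that share one scale.
 * Returns the scale; an all-zero input yields scale 0 and zeros. */
float quantize_q8(const float *x, int8_t *q, size_t n) {
    float max_abs = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        float v = x[i] < 0.0f ? -x[i] : x[i];
        if (v > max_abs) {
            max_abs = v;
        }
    }
    float scale = max_abs / 127.0f;
    for (size_t i = 0; i < n; ++i) {
        int r = (scale > 0.0f) ? round_nearest(x[i] / scale) : 0;
        if (r > 127) r = 127;   /* guard against float rounding overshoot */
        if (r < -127) r = -127;
        q[i] = (int8_t)r;
    }
    return scale;
}

/* Recover an approximate float from a quantized value. */
float dequantize_q8(int8_t q, float scale) {
    return (float)q * scale;
}
```

Each weight then costs one byte instead of four, at the price of a rounding error of at most half the scale per value.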

Comparison with Similar Tools

llama.cpp — desktop-focused; Cactus is optimized for mobile ARM targets
ExecuTorch — PyTorch ecosystem; Cactus uses GGUF for simpler model deployment
MLC-LLM — broader scope; Cactus prioritizes minimal footprint on phones
ONNX Runtime Mobile — general ML; Cactus specializes in LLM and speech workloads

FAQ

Q: What models can it run? A: Any GGUF-format model, including Llama, Mistral, Phi, and Whisper variants.

Q: Does it need a GPU? A: No, it runs on the CPU with ARM NEON optimizations. GPU acceleration is optional where available.

Q: What is the minimum device requirement? A: Devices with at least 2 GB of RAM can run small quantized models (1-3B parameters).
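
A back-of-the-envelope estimate shows why small quantized models are the practical fit for that memory budget. The sketch below computes weight storage only, as parameters times bits per weight divided by 8. It ignores the KV cache, activations, and per-block scale overhead, so real usage will be higher. The function name weight_bytes is hypothetical.

```c
#include <stdint.h>

/* Approximate bytes needed to store model weights alone.
 * params: number of parameters; bits_per_weight: e.g. 4.0 or 8.0. */
uint64_t weight_bytes(uint64_t params, double bits_per_weight) {
    return (uint64_t)((double)params * bits_per_weight / 8.0);
}
```

Under these assumptions, a 3B-parameter model at 4 bits needs about 1.5 GB for weights, which leaves little headroom on a 2 GB device, while a 1B-parameter model at 4 bits needs about 0.5 GB.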

Q: Can I use it in a React Native app? A: Yes, React Native bindings are provided for cross-platform mobile development.

Sources

https://github.com/cactus-compute/cactus
