Scripts · May 26, 2026 · 2 min read

Cactus — Low-Latency AI Inference Engine for Mobile Devices

An open-source C library for running LLM inference on smartphones and wearables with optimized performance for ARM processors and edge hardware.

Agent ready

Ready-to-run agent install

This asset can be installed after the agent chooses its runtime, checks the plan, and runs the matching command.

Native · 98/100 · Policy: allow
Agent surface: Any MCP/CLI agent
Kind: Skill
Install: Single
Trust: Established
Entrypoint: Cactus Overview
Direct install command:
npx -y tokrepo@latest install 00209c52-58dc-11f1-9bc6-00163e2b0d79 --target codex

Run after dry-run confirms the install plan.

Introduction

Cactus is an open-source inference engine designed specifically for running LLMs and speech models on mobile devices and wearables. Built in C with ARM optimizations, it delivers low-latency inference without requiring a cloud connection, making AI capabilities available offline on resource-constrained hardware.

What Cactus Does

  • Runs quantized LLMs on iOS and Android devices
  • Provides speech recognition and text-to-speech on-device
  • Supports GGUF model format for efficient loading
  • Delivers sub-second inference latency on modern mobile processors
  • Offers native bindings for Swift, Kotlin, and React Native

Architecture Overview

Cactus is written in C to maximize portability and minimize overhead. It uses NEON SIMD instructions on ARM processors for matrix multiplication acceleration. The engine supports 4-bit and 8-bit quantized models to fit within mobile memory constraints. A thin platform abstraction layer provides native iOS and Android integration without sacrificing performance.
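
To make the NEON point concrete, here is a minimal sketch of a SIMD dot product, the inner loop of a matrix multiplication. It is an illustration only, not code from Cactus: the function name dot_f32 is hypothetical, and the scalar fallback is included so the same file also builds on non-ARM hosts.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Dot product of two float vectors: the inner loop of a matrix multiply.
 * Hypothetical helper for illustration; not part of the Cactus API. */
static float dot_f32(const float *a, const float *b, size_t n)
{
    assert(a != NULL && b != NULL);
    size_t i = 0;
    float sum = 0.0f;

#if defined(__ARM_NEON)
    /* Process four floats per iteration in one 128-bit register. */
    float32x4_t acc = vdupq_n_f32(0.0f);
    for (; i + 4 <= n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);
        float32x4_t vb = vld1q_f32(b + i);
        acc = vmlaq_f32(acc, va, vb);   /* acc += va * vb, lane-wise */
    }
    /* Horizontal add of the four lanes. */
    float32x2_t half = vadd_f32(vget_low_f32(acc), vget_high_f32(acc));
    sum = vget_lane_f32(vpadd_f32(half, half), 0);
#endif

    /* Scalar tail, or the whole loop when NEON is unavailable. */
    for (; i < n; ++i)
        sum += a[i] * b[i];
    return sum;
}
```

A matrix-vector product calls a routine like this once per output row. Production kernels typically go further (several accumulators, quantized integer paths, cache blocking), so treat this as the simplest form of the idea.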

Self-Hosting & Configuration

  • Build from source with make or CMake for your target platform
  • Use pre-built iOS and Android libraries from releases
  • Load GGUF-format models from local storage
  • Configure thread count and memory limits for your device
  • Integrate via C API, Swift bindings, or Kotlin bindings
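
As a sketch of the thread-count step above, the helper below derives a worker count from the number of online cores. The names pick_thread_count and detect_thread_count and the "leave one core free" policy are assumptions for illustration, not Cactus settings; only sysconf is a real system call here.

```c
#include <assert.h>
#include <unistd.h>

/* Hypothetical helper, not part of Cactus: choose a worker-thread count
 * from the number of online cores, leaving one core for the UI thread. */
static int pick_thread_count(long online_cores, int max_threads)
{
    assert(max_threads >= 1);
    if (online_cores <= 1)
        return 1;                      /* unknown (-1) or single core */
    long n = online_cores - 1;         /* keep one core free */
    return n > max_threads ? max_threads : (int)n;
}

/* Query the OS for online cores and apply the policy above. */
static int detect_thread_count(int max_threads)
{
    return pick_thread_count(sysconf(_SC_NPROCESSORS_ONLN), max_threads);
}
```

The result of detect_thread_count(4) would then be passed to whatever thread setting the chosen binding exposes. On phones that mix performance and efficiency cores, more threads do not always help, so it is worth measuring on the target device.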

Key Features

  • Optimized for ARM processors with NEON SIMD acceleration
  • Supports LLM inference and Whisper-based speech recognition
  • Sub-100 ms per-token generation on modern mobile chips, depending on model size and device
  • GGUF model format with 4-bit and 8-bit quantization
  • Native bindings for iOS (Swift), Android (Kotlin), and React Native

Comparison with Similar Tools

  • llama.cpp — desktop-focused; Cactus is optimized for mobile ARM targets
  • ExecuTorch — PyTorch ecosystem; Cactus uses GGUF for simpler model deployment
  • MLC-LLM — broader scope; Cactus prioritizes minimal footprint on phones
  • ONNX Runtime Mobile — general ML; Cactus specializes in LLM and speech workloads

FAQ

Q: What models can it run? A: GGUF-format models, including Llama, Mistral, Phi, and Whisper variants, provided the quantized model fits in device memory.

Q: Does it need a GPU? A: No, it runs on the CPU with ARM NEON optimizations. GPU acceleration is optional where available.

Q: What is the minimum device requirement? A: Devices with at least 2 GB of RAM can run small quantized models (1-3B parameters).

Q: Can I use it in a React Native app? A: Yes, React Native bindings are provided for cross-platform mobile development.
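
The RAM guidance in the FAQ follows from simple arithmetic: weight storage is roughly parameter count times bits per weight, divided by 8. The helper below is a hypothetical back-of-the-envelope calculator, not a Cactus function, and it counts weights only.

```c
#include <assert.h>
#include <stdint.h>

/* Approximate bytes needed for the weights alone, rounded up.
 * Hypothetical helper for illustration; not part of the Cactus API. */
static uint64_t weight_bytes(uint64_t n_params, unsigned bits_per_weight)
{
    assert(bits_per_weight > 0 && bits_per_weight <= 32);
    return (n_params * bits_per_weight + 7) / 8;
}
```

For example, weight_bytes(3000000000ULL, 4) gives 1,500,000,000 bytes (about 1.5 GB) for a 3B-parameter model at 4 bits, while the same model at 8 bits needs about 3 GB. Per-block scales, the KV cache, and the app's own memory come on top of this, which is why smaller models or lower bit widths are the practical choice on a 2 GB device.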
