Scripts · May 26, 2026 · 2 min read

Cactus — Low-Latency AI Inference Engine for Mobile Devices

An open-source C library for running LLM inference on smartphones and wearables with optimized performance for ARM processors and edge hardware.

Agent-ready

Agent install ready

This asset can be installed after you choose a runtime, review the plan, and run the matching command.

Native · 98/100 · Policy: allow
Agent surface: Any MCP/CLI agent
Type: Skill
Installation: Single
Trust: Established
Entry point: Cactus Overview
Direct install command:
npx -y tokrepo@latest install 00209c52-58dc-11f1-9bc6-00163e2b0d79 --target codex

Run this after confirming the plan in a dry run.

Introduction

Cactus is an open-source inference engine designed specifically for running LLMs and speech models on mobile devices and wearables. Built in C with ARM optimizations, it delivers low-latency inference without requiring a cloud connection, making AI capabilities available offline on resource-constrained hardware.

What Cactus Does

  • Runs quantized LLMs on iOS and Android devices
  • Provides speech recognition and text-to-speech on-device
  • Supports GGUF model format for efficient loading
  • Delivers sub-second inference latency on modern mobile processors
  • Offers native bindings for Swift, Kotlin, and React Native

Architecture Overview

Cactus is written in C to maximize portability and minimize overhead. It uses NEON SIMD instructions on ARM processors for matrix multiplication acceleration. The engine supports 4-bit and 8-bit quantized models to fit within mobile memory constraints. A thin platform abstraction layer provides native iOS and Android integration without sacrificing performance.
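
As an illustration of the kind of NEON kernel described above, here is a minimal sketch of an 8-bit integer dot product, the inner loop of a quantized matrix multiplication. It is not taken from the Cactus source: the function names `dot_i8` and `dot_i8_scalar` are hypothetical. The intrinsics (`vld1_s8`, `vmull_s8`, `vpadalq_s16`) are standard Arm NEON calls, and the code falls back to a scalar loop when NEON is not available.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Scalar reference: sum of a[i] * b[i], accumulated in 32 bits. */
int32_t dot_i8_scalar(const int8_t *a, const int8_t *b, size_t n) {
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        acc += (int32_t)a[i] * (int32_t)b[i];
    }
    return acc;
}

/* Hypothetical kernel name; uses NEON when the compiler targets it. */
int32_t dot_i8(const int8_t *a, const int8_t *b, size_t n) {
    assert(a != NULL && b != NULL);
#if defined(__ARM_NEON)
    int32x4_t acc = vdupq_n_s32(0);
    size_t i = 0;
    /* Process 8 elements per iteration. */
    for (; i + 8 <= n; i += 8) {
        int8x8_t va = vld1_s8(a + i);
        int8x8_t vb = vld1_s8(b + i);
        int16x8_t prod = vmull_s8(va, vb); /* widening multiply: i8 * i8 -> i16 */
        acc = vpadalq_s16(acc, prod);      /* pairwise add, accumulate into i32 lanes */
    }
    /* Horizontal sum of the four 32-bit lanes. */
    int32_t sum = vgetq_lane_s32(acc, 0) + vgetq_lane_s32(acc, 1) +
                  vgetq_lane_s32(acc, 2) + vgetq_lane_s32(acc, 3);
    /* Scalar tail for the remaining n % 8 elements. */
    for (; i < n; i++) {
        sum += (int32_t)a[i] * (int32_t)b[i];
    }
    return sum;
#else
    return dot_i8_scalar(a, b, n);
#endif
}
```

The widening multiply keeps each 8-bit product exact in 16 bits, and the pairwise accumulate moves the running total into 32-bit lanes. A production kernel would typically also tile the matrices and unroll further.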

Self-Hosting & Configuration

  • Build from source with make or CMake for your target platform
  • Use pre-built iOS and Android libraries from releases
  • Load GGUF-format models from local storage
  • Configure thread count and memory limits for your device
  • Integrate via C API, Swift bindings, or Kotlin bindings
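
Before handing a local model file to the engine, an app may want a cheap check that the file is actually GGUF. The sketch below is not part of the Cactus API: `gguf_header_t` and `gguf_parse_header` are hypothetical names. It assumes the standard GGUF v2/v3 header layout (4-byte magic `GGUF`, 32-bit version, 64-bit tensor count, 64-bit metadata key-value count, little-endian) and parses the first 24 bytes of a file that the caller has already read into memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical struct holding the fixed-size part of a GGUF header. */
typedef struct {
    uint32_t version;
    uint64_t tensor_count;
    uint64_t metadata_kv_count;
} gguf_header_t;

/* Read an unsigned little-endian integer of nbytes (at most 8) bytes. */
static uint64_t read_le(const uint8_t *p, size_t nbytes) {
    uint64_t v = 0;
    for (size_t i = 0; i < nbytes; i++) {
        v |= (uint64_t)p[i] << (8 * i);
    }
    return v;
}

/* Returns 0 on success, -1 if the buffer is too short, -2 on a bad magic. */
int gguf_parse_header(const uint8_t *buf, size_t len, gguf_header_t *out) {
    assert(buf != NULL && out != NULL);
    if (len < 24) {
        return -1;
    }
    if (memcmp(buf, "GGUF", 4) != 0) {
        return -2;
    }
    out->version = (uint32_t)read_le(buf + 4, 4);
    out->tensor_count = read_le(buf + 8, 8);
    out->metadata_kv_count = read_le(buf + 16, 8);
    return 0;
}
```

In an app, the buffer would come from reading the first 24 bytes of the model file with `fread`. Rejecting a file here gives a clearer error than a failure deep inside model loading.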

Key Features

  • Optimized for ARM processors with NEON SIMD acceleration
  • Supports LLM inference and Whisper-based speech recognition
  • Sub-100 ms per-token generation on modern mobile chips
  • GGUF model format with 4-bit and 8-bit quantization
  • Native bindings for iOS (Swift), Android (Kotlin), and React Native
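
To make the 4-bit quantization feature above concrete, here is a simplified sketch of block-wise symmetric 4-bit quantization. It is an assumption-laden illustration, not the Cactus implementation and not the exact GGUF on-disk layout: the block size of 32, the struct `q4_block_t`, and the function names are all hypothetical. Each block stores one float scale plus 32 weights packed two per byte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define Q4_BLOCK 32 /* weights per block (illustrative choice) */

/* Hypothetical block layout: one scale plus 32 packed 4-bit values. */
typedef struct {
    float scale;
    uint8_t q[Q4_BLOCK / 2];
} q4_block_t;

/* Round to nearest integer without depending on libm. */
static int round_nearest(float v) {
    return (int)(v >= 0.0f ? v + 0.5f : v - 0.5f);
}

/* Quantize Q4_BLOCK floats into signed levels in [-7, 7], stored with a +8 offset. */
void q4_quantize(const float *x, q4_block_t *out) {
    assert(x != NULL && out != NULL);
    float amax = 0.0f;
    for (size_t i = 0; i < Q4_BLOCK; i++) {
        float a = x[i] < 0.0f ? -x[i] : x[i];
        if (a > amax) amax = a;
    }
    float scale = amax / 7.0f;
    float inv = scale > 0.0f ? 1.0f / scale : 0.0f;
    for (size_t i = 0; i < Q4_BLOCK; i += 2) {
        int lo = round_nearest(x[i] * inv) + 8;     /* low nibble */
        int hi = round_nearest(x[i + 1] * inv) + 8; /* high nibble */
        out->q[i / 2] = (uint8_t)(lo | (hi << 4));
    }
    out->scale = scale;
}

/* Recover approximate floats from a quantized block. */
void q4_dequantize(const q4_block_t *in, float *y) {
    assert(in != NULL && y != NULL);
    for (size_t i = 0; i < Q4_BLOCK; i += 2) {
        y[i] = (float)((in->q[i / 2] & 0x0F) - 8) * in->scale;
        y[i + 1] = (float)((in->q[i / 2] >> 4) - 8) * in->scale;
    }
}
```

With this layout a block takes 20 bytes for 32 weights, or 5 bits per weight once the scale is counted, compared with 32 bits per weight for unquantized floats.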

Comparison with Similar Tools

  • llama.cpp — desktop-focused; Cactus is optimized for mobile ARM targets
  • ExecuTorch — PyTorch ecosystem; Cactus uses GGUF for simpler model deployment
  • MLC-LLM — broader scope; Cactus prioritizes minimal footprint on phones
  • ONNX Runtime Mobile — general ML; Cactus specializes in LLM and speech workloads

FAQ

Q: What models can it run?
A: Any GGUF-format model, including Llama, Mistral, Phi, and Whisper variants.

Q: Does it need a GPU?
A: No, it runs on the CPU with ARM NEON optimizations. GPU acceleration is optional where available.

Q: What is the minimum device requirement?
A: It runs on devices with at least 2 GB of RAM when using small quantized models (1-3B parameters).
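
A rough way to see why 1-3B parameter models are the practical range for a 2 GB device is to estimate weight storage as parameters times bits per weight, divided by 8. The helper names below (`weight_bytes`, `fits_in_budget`) are hypothetical, and the estimate deliberately ignores quantization scales, the KV cache, activations, and memory used by the OS and the app itself, so real headroom is smaller.

```c
#include <assert.h>
#include <stdint.h>

/* Approximate bytes needed for the weights alone, rounded up. */
uint64_t weight_bytes(uint64_t n_params, unsigned bits_per_weight) {
    assert(bits_per_weight > 0 && bits_per_weight <= 32);
    return (n_params * bits_per_weight + 7) / 8;
}

/* Returns 1 if the weights alone fit in the given byte budget, else 0. */
int fits_in_budget(uint64_t n_params, unsigned bits_per_weight,
                   uint64_t budget_bytes) {
    return weight_bytes(n_params, bits_per_weight) <= budget_bytes;
}
```

For example, `weight_bytes(3000000000ULL, 4)` gives 1,500,000,000 bytes, so a 3B model at 4 bits fits under 2 GiB on paper, while the same model at 8 bits (3,000,000,000 bytes) does not.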

Q: Can I use it in a React Native app?
A: Yes, React Native bindings are provided for cross-platform mobile development.
