Scripts · May 26, 2026 · 1 min read

Cactus — Low-Latency AI Inference Engine for Mobile Devices

An open-source C library for running LLM inference on smartphones and wearables with optimized performance for ARM processors and edge hardware.

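Agent-ready

Agents can install this asset directly: the agent first selects the current runtime, checks the install plan, and then runs the matching command.

  • Native · 98/100 · Policy: allowed
  • Agent entry: any MCP/CLI agent
  • Type: Skill
  • Install: Single
  • Trust level: Established
  • Entry: Cactus Overview

Direct install command

npx -y tokrepo@latest install 00209c52-58dc-11f1-9bc6-00163e2b0d79 --target codex

Run a dry run first to confirm the install plan, then run this command.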

Introduction

Cactus is an open-source inference engine designed specifically for running LLMs and speech models on mobile devices and wearables. Built in C with ARM optimizations, it delivers low-latency inference without requiring a cloud connection, making AI capabilities available offline on resource-constrained hardware.

What Cactus Does

  • Runs quantized LLMs on iOS and Android devices
  • Provides speech recognition and text-to-speech on-device
  • Supports GGUF model format for efficient loading
  • Delivers sub-second inference latency on modern mobile processors
  • Offers native bindings for Swift, Kotlin, and React Native

Architecture Overview

Cactus is written in C to maximize portability and minimize overhead. It uses NEON SIMD instructions on ARM processors for matrix multiplication acceleration. The engine supports 4-bit and 8-bit quantized models to fit within mobile memory constraints. A thin platform abstraction layer provides native iOS and Android integration without sacrificing performance.
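To make the 4-bit quantization idea concrete, here is a minimal sketch of symmetric block quantization with one shared scale per block. It is a generic textbook scheme, not Cactus's actual kernel or the exact GGUF layout; the function names (`quantize_block_q4`, `dequantize_block_q4`, `pack_nibbles`) and the block contents are illustrative assumptions.

```python
from typing import List, Tuple


def quantize_block_q4(values: List[float]) -> Tuple[float, List[int]]:
    """Quantize one block of floats to signed 4-bit integers with a shared scale."""
    max_abs = max(abs(v) for v in values)
    # Signed 4-bit integers span [-8, 7]; map the largest magnitude to 7.
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    quants = [max(-8, min(7, round(v / scale))) for v in values]
    return scale, quants


def dequantize_block_q4(scale: float, quants: List[int]) -> List[float]:
    """Recover approximate floats from the scale and the 4-bit integers."""
    return [q * scale for q in quants]


def pack_nibbles(quants: List[int]) -> bytes:
    """Pack pairs of signed 4-bit values into single bytes (low nibble first)."""
    if len(quants) % 2 != 0:
        raise ValueError("block length must be even")
    packed = bytearray()
    for lo, hi in zip(quants[0::2], quants[1::2]):
        packed.append((lo & 0x0F) | ((hi & 0x0F) << 4))
    return bytes(packed)


# Hypothetical block of eight weights, chosen only for illustration.
block = [0.12, -0.70, 0.33, 0.05, -0.41, 0.70, 0.00, -0.09]
scale, quants = quantize_block_q4(block)
restored = dequantize_block_q4(scale, quants)
packed = pack_nibbles(quants)
print(f"scale={scale:.4f} quants={quants} packed_bytes={len(packed)}")
```

Eight weights pack into four bytes plus one scale, which is where the memory saving over 32-bit floats comes from. The cost is a rounding error of at most half a scale step per weight.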

Self-Hosting & Configuration

  • Build from source with make or CMake for your target platform
  • Use pre-built iOS and Android libraries from releases
  • Load GGUF-format models from local storage
  • Configure thread count and memory limits for your device
  • Integrate via C API, Swift bindings, or Kotlin bindings
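Before handing a model file from local storage to the engine, it can help to confirm that it really is a GGUF file. The sketch below reads the fixed-size GGUF header (4-byte magic `GGUF`, a 32-bit version, then 64-bit tensor and metadata counts, little-endian, as used by GGUF version 2 and later). It does not use any Cactus API; `read_gguf_header` is a hypothetical helper, and the header bytes in the demo are synthetic.

```python
import io
import struct
from typing import BinaryIO, Dict

GGUF_MAGIC = b"GGUF"


def read_gguf_header(stream: BinaryIO) -> Dict[str, int]:
    """Read the fixed-size GGUF header (version 2 or later, little-endian)."""
    magic = stream.read(4)
    if magic != GGUF_MAGIC:
        raise ValueError(f"not a GGUF file: magic={magic!r}")
    (version,) = struct.unpack("<I", stream.read(4))
    if version < 2:
        # Version 1 stored 32-bit counts; this sketch does not handle it.
        raise ValueError("GGUF v1 is not handled in this sketch")
    tensor_count, metadata_kv_count = struct.unpack("<QQ", stream.read(16))
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": metadata_kv_count,
    }


# Synthetic header for demonstration: version 3, 291 tensors, 24 metadata entries.
fake_file = io.BytesIO(GGUF_MAGIC + struct.pack("<IQQ", 3, 291, 24))
header = read_gguf_header(fake_file)
print(header)
```

With a real model, the same function can be called on a file opened in binary mode, for example `with open(path, "rb") as f: read_gguf_header(f)`.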

Key Features

  • Optimized for ARM processors with NEON SIMD acceleration
  • Supports LLM inference and Whisper-based speech recognition
  • Sub-100 ms per-token generation on modern mobile chips
  • GGUF model format with 4-bit and 8-bit quantization
  • Native bindings for iOS (Swift), Android (Kotlin), and React Native
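To put the sub-100 ms token generation figure in user-facing terms, the arithmetic below converts a per-token latency into throughput and total reply time. The 100 ms value is the upper bound quoted in the list above, and the 50-token reply length is an assumption for illustration; neither is a measured benchmark.

```python
def tokens_per_second(latency_ms_per_token: float) -> float:
    """Convert per-token latency in milliseconds into tokens per second."""
    if latency_ms_per_token <= 0:
        raise ValueError("latency must be positive")
    return 1000.0 / latency_ms_per_token


def generation_time_s(num_tokens: int, latency_ms_per_token: float) -> float:
    """Total decoding time in seconds, ignoring prompt processing."""
    return num_tokens * latency_ms_per_token / 1000.0


rate = tokens_per_second(100.0)             # 10.0 tokens per second
reply_time = generation_time_s(50, 100.0)   # 5.0 seconds for a 50-token reply
print(f"{rate:.1f} tokens/s, {reply_time:.1f} s for 50 tokens")
```

Prompt processing time is excluded here, so the real time to a complete reply would be somewhat longer.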

Comparison with Similar Tools

  • llama.cpp — desktop-focused; Cactus is optimized for mobile ARM targets
  • ExecuTorch — PyTorch ecosystem; Cactus uses GGUF for simpler model deployment
  • MLC-LLM — broader scope; Cactus prioritizes minimal footprint on phones
  • ONNX Runtime Mobile — general ML; Cactus specializes in LLM and speech workloads

FAQ

Q: What models can it run? A: Any GGUF-format model, including Llama, Mistral, Phi, and Whisper variants.

Q: Does it need a GPU? A: No, it runs on the CPU with ARM NEON optimizations. GPU acceleration is optional where available.

Q: What is the minimum device requirement? A: It runs on devices with 2 GB+ RAM using small quantized models (1-3B parameters).
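The link between the 2 GB figure and 1-3B parameter models follows from simple arithmetic on weight storage. The sketch below estimates weight memory only; it ignores per-block quantization scales, the KV cache, and runtime buffers, so real usage will be higher. The function name `weight_memory_gb` is an illustrative assumption.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate weight storage in decimal gigabytes (weights only)."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


for params, bits in [(1.0, 4), (3.0, 4), (3.0, 8)]:
    size = weight_memory_gb(params, bits)
    print(f"{params:.0f}B parameters at {bits}-bit: about {size:.1f} GB")
```

Under these assumptions a 3B model at 4-bit needs about 1.5 GB for weights, which leaves little headroom on a 2 GB device, while the same model at 8-bit (about 3 GB) would not fit at all.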

Q: Can I use it in a React Native app? A: Yes, React Native bindings are provided for cross-platform mobile development.
