Introduction
PowerInfer is an inference engine for large language models that achieves high speed on consumer-grade hardware by exploiting activation locality. Research shows that LLMs activate only a small fraction of their neurons per token, and that these activations follow a skewed, power-law-like distribution: a small set of neurons fires for most tokens. PowerInfer keeps these hot neurons on the GPU while offloading cold neurons to the CPU, dramatically reducing GPU memory requirements.
What PowerInfer Does
- Runs LLMs on consumer GPUs by splitting computation between GPU and CPU based on neuron activation patterns
- Achieves up to 11x speedup over llama.cpp on mixed CPU/GPU setups
- Uses offline profiling to build activation predictors for each model
- Supports popular model architectures including LLaMA, Falcon, and Mistral
- Provides a llama.cpp-compatible interface for easy migration
Architecture Overview
PowerInfer profiles a model offline to identify which neurons are frequently activated (hot) versus rarely activated (cold). At inference time, hot neurons reside in GPU memory for fast computation while cold neurons stay in CPU RAM. A lightweight predictor determines which neurons to activate per token, skipping the rest. This adaptive neuron-level offloading keeps GPU memory usage low while maintaining generation quality.
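The hot/cold split and predictor-gated computation described above can be sketched in a few lines. This is a conceptual illustration only, not PowerInfer's actual code: the names (`predict_active`, `ffn_forward`), the thresholding predictor, and the 25% hot-neuron fraction are all assumptions made for the example.

```python
# Conceptual sketch of PowerInfer-style neuron-level offloading.
# Hot neurons (frequently activated in offline profiling) are placed on
# the GPU; cold neurons stay on the CPU. A predictor guesses which
# neurons fire for the current token, and the rest are skipped entirely.
# All names here are illustrative, not PowerInfer's real API.
import numpy as np

rng = np.random.default_rng(0)

N_NEURONS, D = 64, 16
W = rng.standard_normal((N_NEURONS, D))     # one FFN layer's weight rows

# "Offline profiling": rank neurons by activation frequency and keep the
# top fraction that fits in GPU memory as hot (here: top 25%).
activation_freq = rng.random(N_NEURONS)
hot = set(np.argsort(activation_freq)[-N_NEURONS // 4:])

def predict_active(x, threshold=0.5):
    """Stand-in for the learned activation predictor: guess which
    neurons will have a pre-activation above a small threshold."""
    return np.flatnonzero(W @ x > threshold)

def ffn_forward(x):
    """Compute only predicted-active neurons, routing each to its device."""
    out = np.zeros(N_NEURONS)
    for i in predict_active(x):
        device = "gpu" if i in hot else "cpu"   # hot rows -> GPU, cold -> CPU
        out[i] = max(0.0, W[i] @ x)             # ReLU activation
    return out

x = rng.standard_normal(D)
dense = np.maximum(0.0, W @ x)   # full, no-skip reference computation
sparse = ffn_forward(x)          # predictor-gated computation
```

On the neurons the predictor selects, the sparse path produces exactly the dense result; quality loss can only come from neurons the predictor misses, which is why predictor recall matters.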
Self-Hosting & Configuration
- Build from source with CMake; supports CUDA for NVIDIA GPUs
- Download pre-converted GGUF models or convert from Hugging Face format
- Run the profiling tool on a calibration dataset to generate neuron activation statistics
- Configure the GPU/CPU split ratio based on available VRAM
- Compatible with llama.cpp model format and most of its command-line options
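Choosing the GPU/CPU split from available VRAM amounts to a greedy placement: put the hottest neurons on the GPU until the memory budget runs out. The helper below is hypothetical (PowerInfer configures the split through its own tooling, not this function); it only illustrates the underlying calculation.

```python
# Hypothetical sketch of GPU/CPU split selection: given per-neuron
# activation frequencies from profiling and a VRAM budget, place the
# hottest neurons on the GPU until the budget is exhausted.
def split_neurons(freqs, bytes_per_neuron, vram_budget_bytes):
    """Return (gpu_ids, cpu_ids), assigning hottest-first until VRAM is full."""
    order = sorted(range(len(freqs)), key=lambda i: freqs[i], reverse=True)
    capacity = vram_budget_bytes // bytes_per_neuron  # neurons that fit in VRAM
    return order[:capacity], order[capacity:]

# Example: 8 neurons at 1 MiB each with 4 MiB of spare VRAM
# -> the 4 most frequently activated neurons land on the GPU.
freqs = [0.9, 0.1, 0.8, 0.05, 0.7, 0.2, 0.6, 0.3]
gpu, cpu = split_neurons(freqs, 1 << 20, 4 << 20)
```

With a larger VRAM budget the same procedure simply moves more of the ranking onto the GPU, which is why re-profiling is unnecessary when only the hardware changes.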
Key Features
- Up to 11x faster than llama.cpp for CPU/GPU hybrid inference on consumer hardware
- Neuron-level offloading preserves model quality while reducing memory footprint
- Offline profiling amortizes analysis cost across many inference runs
- Compatible with GGUF model format and quantization schemes
- Supports batch processing and interactive chat modes
Comparison with Similar Tools
- llama.cpp — general-purpose CPU/GPU inference; PowerInfer adds activation-aware scheduling for faster hybrid execution
- ExLlamaV2 — optimized GPU-only quantized inference; PowerInfer targets scenarios where the model exceeds GPU memory
- vLLM — high-throughput server-grade serving; PowerInfer focuses on single-user consumer hardware
- Ollama — user-friendly LLM runner built on llama.cpp; PowerInfer offers raw performance gains at the cost of setup complexity
- Petals — distributes across multiple machines; PowerInfer maximizes throughput on a single machine
FAQ
Q: Which models benefit most from PowerInfer? A: Models with strong activation locality (most MLP-heavy architectures like LLaMA and Falcon) see the largest speedups. Dense attention layers benefit less.
Q: Do I need to re-profile when changing hardware? A: The activation profiles are model-specific, not hardware-specific. You only need to adjust the GPU/CPU memory split for different hardware configurations.
Q: Does activation skipping affect output quality? A: The predictor achieves over 95% accuracy in neuron activation prediction. In practice, output quality is indistinguishable from full inference.
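The accuracy claim above can be checked empirically by comparing predicted-active neurons against the true activations on sample inputs. The snippet below is a purely illustrative measurement harness (PowerInfer's predictors are small learned networks, not the threshold rule used here); it shows recall, the fraction of truly active neurons the predictor keeps.

```python
# Sketch of measuring an activation predictor's recall: compare the
# predicted-active set against ground-truth ReLU activations.
# Illustrative only; the slack-threshold "predictor" is an assumption.
import numpy as np

rng = np.random.default_rng(1)
W = rng.standard_normal((64, 16))   # one layer's weights
X = rng.standard_normal((100, 16))  # 100 sample inputs

true_active = (X @ W.T) > 0         # ground truth: neuron fires (ReLU > 0)
pred_active = (X @ W.T) > -0.1      # predictor with slack: prefers false
                                    # positives over missed activations

# Recall = fraction of truly active neurons the predictor retained.
recall = (true_active & pred_active).sum() / true_active.sum()
```

A conservative predictor trades a little extra computation (false positives) for recall close to 1, which is what keeps output quality indistinguishable from full inference.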
Q: Can I use PowerInfer for serving multiple users? A: PowerInfer is optimized for single-user latency. For multi-user serving, consider vLLM or TGI with dedicated GPU resources.