Introduction
FlashInfer is a CUDA kernel library built for LLM serving workloads. It provides optimized implementations of attention, KV cache management, and decoding kernels that inference engines such as vLLM, SGLang, and MLC-LLM can use as their compute backend.
What FlashInfer Does
- Provides fused attention kernels for prefill and decode phases
- Implements paged KV cache operations with variable-length sequences
- Supports grouped-query attention (GQA) and multi-query attention (MQA)
- Offers batch-level ragged tensor operations for dynamic batching
- Delivers JIT-compiled kernels adapted to specific hardware and sequence lengths
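To make the GQA and MQA item concrete, here is a minimal sketch of how query heads are commonly grouped onto shared KV heads. It assumes the usual contiguous grouping (query head `h` uses KV head `h // group_size`); the function name `kv_head_for_query_head` is hypothetical and is not part of FlashInfer's API.

```python
def kv_head_for_query_head(q_head: int, num_qo_heads: int, num_kv_heads: int) -> int:
    """Return the KV head shared by a query head (contiguous grouping assumed)."""
    if num_qo_heads % num_kv_heads != 0:
        raise ValueError("num_qo_heads must be a multiple of num_kv_heads")
    group_size = num_qo_heads // num_kv_heads  # query heads per KV head
    return q_head // group_size


# GQA: 8 query heads share 2 KV heads (group size 4).
gqa_mapping = [kv_head_for_query_head(h, 8, 2) for h in range(8)]

# MQA is the special case with a single KV head shared by every query head.
mqa_mapping = [kv_head_for_query_head(h, 8, 1) for h in range(8)]
```

With fewer KV heads than query heads, the KV cache shrinks by the group size, which is why serving kernels treat GQA and MQA as first-class cases.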
Architecture Overview
FlashInfer exposes a Python API backed by JIT-compiled CUDA kernels. At runtime, it selects optimal kernel configurations based on batch size, sequence length, head dimensions, and GPU architecture. The library uses a composable page-table abstraction for KV caches, enabling efficient memory management across variable-length sequences without padding waste.
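The page-table idea described above can be sketched in plain Python. This is an illustration under stated assumptions, not FlashInfer's implementation: the helper `build_page_table`, its in-order page allocator, and the names `indptr`, `indices`, and `last_page_len` are chosen here to show a CSR-style layout in which each sequence owns a variable number of fixed-size pages.

```python
from typing import List, Tuple


def build_page_table(
    seq_lens: List[int], page_size: int
) -> Tuple[List[int], List[int], List[int]]:
    """Assign fixed-size pages to variable-length sequences.

    Returns (indptr, indices, last_page_len):
      indices[indptr[i]:indptr[i + 1]] holds the page ids of sequence i;
      last_page_len[i] is the number of valid tokens in its final page.
    """
    indptr = [0]
    indices: List[int] = []
    last_page_len: List[int] = []
    next_free_page = 0  # hypothetical allocator: hands out page ids in order
    for n in seq_lens:
        if n <= 0:
            raise ValueError("sequence lengths must be positive")
        num_pages = (n + page_size - 1) // page_size  # ceiling division
        indices.extend(range(next_free_page, next_free_page + num_pages))
        next_free_page += num_pages
        indptr.append(len(indices))
        last_page_len.append(n - (num_pages - 1) * page_size)
    return indptr, indices, last_page_len


# Three requests of 5, 16, and 33 tokens with 16-token pages.
indptr, indices, last_page_len = build_page_table([5, 16, 33], page_size=16)
```

Because each sequence only claims the pages it needs, no sequence is padded to the length of the longest one in the batch; the only slack is the unused tail of each final page.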
Self-Hosting & Configuration
- Install pre-built wheels matching your CUDA and PyTorch versions
- Alternatively, build from source with CMake and a CUDA 12.x toolkit
- Requires NVIDIA GPUs with compute capability 8.0+ (Ampere or newer)
- Can be enabled as the attention backend in vLLM and SGLang through each engine's backend selection settings
- No persistent configuration needed; kernel selection is automatic
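Before installing, it can help to check a GPU against the compute capability requirement listed above. The sketch below compares `(major, minor)` tuples; the helper name `meets_minimum_capability` is hypothetical, and the default minimum simply mirrors the 8.0 figure stated in this document.

```python
from typing import Tuple


def meets_minimum_capability(
    capability: Tuple[int, int], minimum: Tuple[int, int] = (8, 0)
) -> bool:
    """Return True if a (major, minor) compute capability meets the minimum."""
    # Tuples compare lexicographically: major first, then minor.
    return tuple(capability) >= tuple(minimum)


# Example capabilities: Ampere A100 is (8, 0); Turing T4 is (7, 5).
a100_ok = meets_minimum_capability((8, 0))
t4_ok = meets_minimum_capability((7, 5))
```

With PyTorch installed, `torch.cuda.get_device_capability()` returns a tuple of this shape for the current device and can be passed in directly.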
Key Features
- Paged attention with zero-copy KV cache access patterns
- Cascade attention for efficient prefix sharing across requests
- FP8 compute support on Hopper and Blackwell architectures
- JIT compilation that builds only the kernel variants a workload actually uses
- Mixture-of-Experts (MoE) dispatch kernels
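Prefix sharing in cascade attention relies on the fact that attention over two disjoint KV chunks can be computed separately and then merged exactly using each chunk's log-sum-exp. The following is a small pure-Python sketch of that merge for a single query; the function names are illustrative assumptions rather than FlashInfer's API, and the usual `1/sqrt(d)` score scaling is omitted for brevity.

```python
import math
from typing import List, Tuple

Vector = List[float]


def partial_attention(
    q: Vector, keys: List[Vector], values: List[Vector]
) -> Tuple[Vector, float]:
    """Softmax attention of one query over one KV chunk.

    Returns the chunk-local output and the log-sum-exp (LSE) of its scores.
    """
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]
    return out, m + math.log(total)


def merge_states(
    out_a: Vector, lse_a: float, out_b: Vector, lse_b: float
) -> Tuple[Vector, float]:
    """Merge two partial attention states into the state over both chunks."""
    m = max(lse_a, lse_b)
    wa = math.exp(lse_a - m)
    wb = math.exp(lse_b - m)
    merged = [(wa * a + wb * b) / (wa + wb) for a, b in zip(out_a, out_b)]
    return merged, m + math.log(wa + wb)


q = [1.0, 0.5]
prefix_k = [[0.2, 0.1], [1.0, -0.3]]  # KV shared by many requests
prefix_v = [[1.0, 0.0], [0.0, 1.0]]
suffix_k = [[-0.5, 0.8], [0.3, 0.3], [0.9, 0.9]]  # KV unique to this request
suffix_v = [[2.0, 2.0], [0.5, -1.0], [-1.0, 0.5]]

# Attention over the concatenated KV, computed in one pass.
full_out, full_lse = partial_attention(q, prefix_k + suffix_k, prefix_v + suffix_v)

# The same result, computed per chunk and then merged.
merged_out, merged_lse = merge_states(
    *partial_attention(q, prefix_k, prefix_v),
    *partial_attention(q, suffix_k, suffix_v),
)
```

Because the merge is exact, the shared-prefix part of the computation can be handled once per group of requests and combined with each request's own suffix afterwards.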
Comparison with Similar Tools
- FlashAttention: general-purpose fused attention; FlashInfer specializes in serving with paged KV caches
- xFormers: training-focused memory-efficient attention; FlashInfer targets inference workloads
- vLLM built-in kernels: the engine's own attention implementations; FlashInfer is an alternative backend that vLLM can use
- TensorRT-LLM kernels: tied to NVIDIA's TensorRT-LLM stack; FlashInfer is open and composable across engines
FAQ
Q: Do I need to use FlashInfer directly? A: Usually not. Most users reach it through vLLM or SGLang, which can use FlashInfer as an attention backend. Direct use mainly matters to people building custom inference engines.
Q: Which GPU architectures are supported? A: Ampere (A100, A10), Hopper (H100, H200), and Blackwell (B100, B200). Ada Lovelace (RTX 4090) is also supported.
Q: Does FlashInfer support speculative decoding? A: Yes. It provides batch-verify kernels used in speculative and Medusa-style decoding.
Q: How much speedup does it provide? A: Compared to naive attention implementations, FlashInfer delivers 2-5x speedup depending on sequence length and batch configuration.