What is FlashInfer?

FlashInfer is a high-performance CUDA kernel library that provides optimized attention kernels for the prefill and decode phases of LLM inference. Inference engines such as vLLM and SGLang use it.

Is FlashInfer free to use?

Yes. FlashInfer is freely available on TokRepo. Check the Source & Thanks section on the asset page for the specific open-source license.

How do I install FlashInfer?

Visit the asset page on TokRepo and click "Copy for agent" to get the installation instructions. Most assets can be installed with a single command.

FlashInfer — Kernel Library for LLM Serving

Introduction

FlashInfer is a CUDA kernel library purpose-built for LLM serving workloads. It provides optimized implementations of attention mechanisms, KV cache management, and decoding kernels that serve as the computational backbone for inference engines such as vLLM, SGLang, and MLC-LLM.

What FlashInfer Does

Provides fused attention kernels for prefill and decode phases
Implements paged KV cache operations with variable-length sequences
Supports grouped-query attention (GQA) and multi-query attention (MQA)
Offers batch-level ragged tensor operations for dynamic batching
Delivers JIT-compiled kernels adapted to specific hardware and sequence lengths
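
To make the grouped-query attention item concrete, here is a minimal pure-Python reference for a single decode step. It is a sketch for illustration only, not FlashInfer's kernel or API: the function name gqa_decode_reference and the nested-list tensor layout are assumptions made for this example. It shows how several query heads share one KV head.

```python
import math


def gqa_decode_reference(q, k_cache, v_cache):
    """Decode-step attention for one request with grouped KV heads.

    q:       [num_qo_heads][head_dim]          (the single new query token)
    k_cache: [seq_len][num_kv_heads][head_dim]
    v_cache: [seq_len][num_kv_heads][head_dim]
    Returns: [num_qo_heads][head_dim]
    """
    num_qo_heads = len(q)
    num_kv_heads = len(k_cache[0])
    head_dim = len(q[0])
    if num_qo_heads % num_kv_heads != 0:
        raise ValueError("num_qo_heads must be a multiple of num_kv_heads")
    group_size = num_qo_heads // num_kv_heads  # 1 = MHA, num_qo_heads = MQA
    scale = 1.0 / math.sqrt(head_dim)

    out = []
    for h in range(num_qo_heads):
        kv_h = h // group_size  # query heads in the same group share a KV head
        scores = [
            scale * sum(a * b for a, b in zip(q[h], k_cache[t][kv_h]))
            for t in range(len(k_cache))
        ]
        # Numerically stable softmax over the cached positions.
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        denom = sum(weights)
        out.append([
            sum(w * v_cache[t][kv_h][d] for t, w in enumerate(weights)) / denom
            for d in range(head_dim)
        ])
    return out
```

A fused GPU kernel computes the same quantity without materializing the score list per head; the reference above is only useful for checking shapes and head mapping on small inputs.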

Architecture Overview

FlashInfer exposes a Python API backed by JIT-compiled CUDA kernels. At runtime, it selects optimal kernel configurations based on batch size, sequence length, head dimensions, and GPU architecture. The library uses a composable page-table abstraction for KV caches, enabling efficient memory management across variable-length sequences without padding waste.
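
The page-table idea can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not FlashInfer's implementation: the helper names build_page_table and locate_token are hypothetical, and the CSR-style triple (indptr, indices, last_page_len) is one common way to describe a paged KV cache for a batch of variable-length sequences.

```python
def build_page_table(seq_lens, page_size, free_pages):
    """Assign physical pages to each sequence, CSR style.

    Returns (indptr, indices, last_page_len) where the pages of request i are
    indices[indptr[i]:indptr[i + 1]] and last_page_len[i] is the number of
    valid tokens in that request's final page.
    """
    free = list(free_pages)
    indptr, indices, last_page_len = [0], [], []
    for n in seq_lens:
        if n <= 0:
            raise ValueError("sequence lengths must be positive")
        num_pages = (n + page_size - 1) // page_size  # ceiling division
        if num_pages > len(free):
            raise RuntimeError("not enough free pages")
        indices.extend(free[:num_pages])
        del free[:num_pages]
        indptr.append(len(indices))
        last_page_len.append(n - (num_pages - 1) * page_size)
    return indptr, indices, last_page_len


def locate_token(indptr, indices, page_size, request, position):
    """Map (request, token position) to (physical page id, slot within page)."""
    page = indices[indptr[request] + position // page_size]
    return page, position % page_size
```

With page_size = 4, sequences of 5, 16, and 1 tokens occupy 2, 4, and 1 pages. Only the last page of each sequence can be partly empty, which is how a paged layout avoids padding every sequence to the longest one in the batch.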

Self-Hosting & Configuration

Install pre-built wheels matching your CUDA and PyTorch versions
Alternatively, build from source with CMake and CUDA Toolkit 12.x
Requires NVIDIA GPUs with compute capability 8.0+ (Ampere or newer)
Integrates as a drop-in attention backend for vLLM and SGLang, selected through each engine's backend configuration
No persistent configuration needed; kernel selection is automatic
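
Before installing, it can help to confirm that the GPU meets the compute capability requirement listed above. The sketch below assumes PyTorch is installed; torch.cuda.is_available and torch.cuda.get_device_capability are standard PyTorch calls, while the helper names meets_min_capability and detect_capability are made up for this example.

```python
def meets_min_capability(capability, minimum=(8, 0)):
    """Compare (major, minor) compute capability tuples lexicographically."""
    return tuple(capability) >= tuple(minimum)


def detect_capability():
    """Return (major, minor) for GPU 0, or None if no CUDA device is usable."""
    try:
        import torch  # optional dependency; assumed present next to the wheels
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(0)
```

Calling meets_min_capability(detect_capability() or (0, 0)) returns False on machines without a visible CUDA device, so the check is safe to run in a setup script.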

Key Features

Paged attention with zero-copy KV cache access patterns
Cascade attention for efficient prefix sharing across requests
FP8 compute support on Hopper and Blackwell architectures
JIT compilation builds only the kernel variants a workload actually uses, avoiding the overhead of unused ones
Supports MoE (Mixture of Experts) dispatch kernels
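
Prefix sharing in cascade attention relies on the fact that attention over two disjoint KV chunks can be combined exactly if each partial result carries its log-sum-exp. The following pure-Python sketch demonstrates that merge rule for a single query vector. The function names are hypothetical, natural logarithms are an assumption of this example, and nothing here reflects FlashInfer's actual kernel interface.

```python
import math


def attention_state(q, keys, values):
    """Attention of one query over a KV chunk; returns (output, lse)."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    denom = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) / denom
           for d in range(dim)]
    return out, m + math.log(denom)  # lse = log(sum(exp(scores)))


def merge_states(o_a, lse_a, o_b, lse_b):
    """Combine two partial attention results computed over disjoint KV chunks."""
    m = max(lse_a, lse_b)
    w_a = math.exp(lse_a - m)  # relative softmax mass of chunk A
    w_b = math.exp(lse_b - m)  # relative softmax mass of chunk B
    total = w_a + w_b
    merged = [(w_a * x + w_b * y) / total for x, y in zip(o_a, o_b)]
    return merged, m + math.log(total)
```

In a prefix-sharing setup, the state for the shared prefix would be computed once for all requests that share it, then merged with each request's state over its own suffix. The merged result equals attention over the concatenated KV up to floating-point rounding.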

Comparison with Similar Tools

FlashAttention — general-purpose fused attention; FlashInfer specializes in serving with paged KV caches
xFormers — training-focused memory-efficient attention; FlashInfer targets inference workloads
vLLM built-in kernels — basic implementations; FlashInfer provides faster alternatives vLLM can use
TensorRT-LLM kernels — proprietary to NVIDIA's stack; FlashInfer is open and composable

FAQ

Q: Do I need to use FlashInfer directly? A: Most users interact with it through vLLM or SGLang, which use FlashInfer as their attention backend. Direct use is for custom inference engine builders.

Q: Which GPU architectures are supported? A: Ampere (A100, A10), Hopper (H100, H200), and Blackwell (B100, B200). Ada Lovelace (RTX 4090) is also supported.

Q: Does FlashInfer support speculative decoding? A: Yes. It provides batch-verify kernels used in speculative and Medusa-style decoding.

Q: How much speedup does it provide? A: Compared with naive attention implementations, FlashInfer is reported to deliver a 2-5x speedup. The actual gain depends on sequence length, batch configuration, and GPU.
