Configs · May 24, 2026 · 1 min read

FlashInfer — Kernel Library for LLM Serving

High-performance CUDA kernel library providing optimized attention, decoding, and prefill operations for LLM inference engines like vLLM and SGLang.

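Agent ready

Agents can install this directly

This asset is installable; the agent first selects the current runtime, checks the install plan, then runs the matching command.

Native · 98/100 · Policy: allowed
Agent entry: any MCP/CLI agent
Type: Skill
Install: Single
Trust level: Established
Entry point: FlashInfer
Direct install command:
npx -y tokrepo@latest install d8dc39ce-57ad-11f1-9bc6-00163e2b0d79 --target codex

Do a dry run to confirm the install plan before running this command.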

Introduction

FlashInfer is a high-performance CUDA kernel library purpose-built for LLM serving workloads. It provides highly optimized implementations of attention mechanisms, KV cache management, and decoding kernels that serve as the computational backbone for inference engines like vLLM, SGLang, and MLC-LLM.

What FlashInfer Does

  • Provides fused attention kernels for prefill and decode phases
  • Implements paged KV cache operations with variable-length sequences
  • Supports grouped-query attention (GQA) and multi-query attention (MQA)
  • Offers batch-level ragged tensor operations for dynamic batching
  • Delivers JIT-compiled kernels adapted to specific hardware and sequence lengths
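
To make the GQA and MQA items above concrete, here is a minimal sketch of the usual head-grouping convention, in which consecutive query heads share one KV head. It is plain Python for illustration only; the function name is hypothetical and is not part of FlashInfer's API.

```python
def kv_head_for_query_head(q_head: int, num_q_heads: int, num_kv_heads: int) -> int:
    """Return the KV head index that a query head reads from (hypothetical helper)."""
    if num_q_heads % num_kv_heads != 0:
        raise ValueError("num_q_heads must be a multiple of num_kv_heads")
    if not 0 <= q_head < num_q_heads:
        raise ValueError("q_head out of range")
    group_size = num_q_heads // num_kv_heads  # query heads per KV head
    return q_head // group_size


# GQA: 32 query heads share 8 KV heads, so each group holds 4 query heads.
gqa_map = [kv_head_for_query_head(h, 32, 8) for h in range(32)]

# MQA is the special case with a single KV head shared by every query head.
mqa_map = [kv_head_for_query_head(h, 32, 1) for h in range(32)]
```

With fewer KV heads than query heads, the KV cache shrinks by the group size, which is why serving kernels treat these layouts as first-class cases.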

Architecture Overview

FlashInfer exposes a Python API backed by JIT-compiled CUDA kernels. At runtime, it selects optimal kernel configurations based on batch size, sequence length, head dimensions, and GPU architecture. The library uses a composable page-table abstraction for KV caches, enabling efficient memory management across variable-length sequences without padding waste.
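
The page-table idea can be sketched in a few lines of plain Python. This is a toy model under stated assumptions, not FlashInfer's implementation: the class `PagedKVTable` and the array names `indptr`, `indices`, and `last_page_len` are chosen for this illustration, and the flattened layout simply follows the familiar CSR pattern for ragged data.

```python
from dataclasses import dataclass, field


@dataclass
class PagedKVTable:
    """Toy page table mapping sequences to fixed-size KV pages (hypothetical)."""
    page_size: int
    num_pages: int
    free_pages: list = field(init=False)
    pages_of: dict = field(default_factory=dict)   # seq_id -> list of page ids
    length_of: dict = field(default_factory=dict)  # seq_id -> cached token count

    def __post_init__(self):
        # Reverse order so that pop() hands out page 0 first, then 1, 2, ...
        self.free_pages = list(range(self.num_pages - 1, -1, -1))

    def append_tokens(self, seq_id, n):
        """Reserve room for n more tokens, allocating pages only when needed."""
        if n <= 0:
            raise ValueError("n must be positive")
        pages = self.pages_of.get(seq_id, [])
        new_len = self.length_of.get(seq_id, 0) + n
        needed = -(-new_len // self.page_size)  # ceiling division
        extra = needed - len(pages)
        if extra > len(self.free_pages):
            raise MemoryError("out of KV cache pages")
        self.pages_of[seq_id] = pages + [self.free_pages.pop() for _ in range(extra)]
        self.length_of[seq_id] = new_len

    def batch_layout(self, seq_ids):
        """Flatten per-sequence page lists into CSR-style arrays for a batch."""
        indptr, indices, last_page_len = [0], [], []
        for seq_id in seq_ids:
            indices.extend(self.pages_of[seq_id])
            indptr.append(len(indices))
            # Number of valid tokens in the sequence's final page.
            last_page_len.append((self.length_of[seq_id] - 1) % self.page_size + 1)
        return indptr, indices, last_page_len


table = PagedKVTable(page_size=4, num_pages=8)
table.append_tokens("a", 6)   # sequence "a" takes two pages
table.append_tokens("b", 4)   # sequence "b" takes one page
table.append_tokens("a", 3)   # "a" grows to 9 tokens and takes a third page
indptr, indices, last_page_len = table.batch_layout(["a", "b"])
```

Sequence "a" ends up on non-contiguous pages, yet no slot is reserved beyond the last partially filled page, which is the padding saving described above.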

Self-Hosting & Configuration

  • Install pre-built wheels matching your CUDA and PyTorch versions
  • Alternatively build from source with CMake and CUDA toolkit 12.x
  • Requires NVIDIA GPUs with compute capability 8.0+ (Ampere or newer)
  • Works as a drop-in attention backend for vLLM and SGLang, selected through each engine's backend configuration
  • No persistent configuration needed; kernel selection is automatic
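
Before installing, it can help to check a machine against the compute capability requirement listed above. The following is a minimal sketch, assuming PyTorch is the installed framework; the helper names are hypothetical, and the 8.0 threshold is taken from this page rather than verified against a specific FlashInfer release.

```python
MIN_CAPABILITY = (8, 0)  # Ampere, per the requirement stated on this page


def meets_requirement(capability, minimum=MIN_CAPABILITY):
    """Compare a (major, minor) compute capability against the minimum."""
    return tuple(capability) >= tuple(minimum)


def local_gpu_capability():
    """Return the first visible GPU's (major, minor), or None if unavailable."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(0)


if __name__ == "__main__":
    cap = local_gpu_capability()
    if cap is None:
        print("No CUDA GPU visible, or PyTorch is not installed")
    else:
        status = "meets" if meets_requirement(cap) else "is below"
        print(f"Compute capability {cap[0]}.{cap[1]} {status} the 8.0 minimum")
```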

Key Features

  • Paged attention with zero-copy KV cache access patterns
  • Cascade attention for efficient prefix sharing across requests
  • FP8 compute support on Hopper and Blackwell architectures
  • JIT compilation builds only the kernel variants a workload actually uses, avoiding the cost of compiling and shipping unused ones
  • Supports MoE (Mixture of Experts) dispatch kernels
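
The prefix-sharing item above relies on a property of softmax attention: attention over a shared prefix and over a per-request suffix can be computed separately and then merged exactly using each part's log-sum-exp. The sketch below demonstrates that identity in plain Python for a single query. It assumes this merge is the mechanism behind cascade attention, which the page does not spell out, and the function names are hypothetical.

```python
import math


def attention_state(q, keys, values):
    """Softmax attention of one query over a KV chunk.

    Returns (output vector, log-sum-exp of the scaled scores).
    """
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    denom = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) / denom for d in range(dim)]
    return out, m + math.log(denom)


def merge_states(out_a, lse_a, out_b, lse_b):
    """Combine two partial attention states into the state over their union."""
    m = max(lse_a, lse_b)
    wa, wb = math.exp(lse_a - m), math.exp(lse_b - m)
    total = wa + wb
    merged = [(wa * a + wb * b) / total for a, b in zip(out_a, out_b)]
    return merged, m + math.log(total)


q = [0.5, -1.0]
prefix_k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # shared across requests
prefix_v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
suffix_k = [[-1.0, 0.5], [2.0, -0.5]]             # unique to one request
suffix_v = [[7.0, 8.0], [9.0, 10.0]]

full_out, full_lse = attention_state(q, prefix_k + suffix_k, prefix_v + suffix_v)
merged_out, merged_lse = merge_states(
    *attention_state(q, prefix_k, prefix_v),
    *attention_state(q, suffix_k, suffix_v),
)
```

Because the merge is exact, the prefix part can be read once for every request that shares it, and only the suffix part is request-specific.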

Comparison with Similar Tools

  • FlashAttention: general-purpose fused attention; FlashInfer specializes in serving with paged KV caches
  • xFormers: training-focused memory-efficient attention; FlashInfer targets inference workloads
  • vLLM built-in kernels: basic implementations; FlashInfer provides faster alternatives vLLM can use
  • TensorRT-LLM kernels: tied to NVIDIA's TensorRT stack; FlashInfer is designed to be composable across engines

FAQ

Q: Do I need to use FlashInfer directly? A: Most users interact with it through vLLM or SGLang, which use FlashInfer as their attention backend. Direct use is for custom inference engine builders.

Q: Which GPU architectures are supported? A: Ampere (A100, A10), Hopper (H100, H200), and Blackwell (B100, B200). Ada Lovelace (RTX 4090) is also supported.

Q: Does FlashInfer support speculative decoding? A: Yes. It provides batch-verify kernels used in speculative and Medusa-style decoding.

Q: How much speedup does it provide? A: It depends on sequence length and batch configuration; the figure quoted here is a 2-5x speedup over naive attention implementations.
