Configs · May 24, 2026 · 1 min read

FlashInfer — Kernel Library for LLM Serving

High-performance CUDA kernel library providing optimized attention, decoding, and prefill operations for LLM inference engines like vLLM and SGLang.

Universal CLI install command
npx tokrepo install d8dc39ce-57ad-11f1-9bc6-00163e2b0d79

Introduction

FlashInfer is a CUDA kernel library built for LLM serving workloads. It provides optimized implementations of attention, KV cache management, and decoding kernels that inference engines such as vLLM, SGLang, and MLC-LLM use as their compute backend.

What FlashInfer Does

  • Provides fused attention kernels for prefill and decode phases
  • Implements paged KV cache operations with variable-length sequences
  • Supports grouped-query attention (GQA) and multi-query attention (MQA)
  • Offers batch-level ragged tensor operations for dynamic batching
  • Delivers JIT-compiled kernels adapted to specific hardware and sequence lengths
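To make the GQA and MQA bullet concrete, here is a minimal pure-Python sketch of single-token decode attention in which several query heads share one KV head. It is a reference illustration only, not FlashInfer's kernel or API; the function name `gqa_decode` and the nested-list tensor layout are assumptions chosen for readability.

```python
import math


def gqa_decode(q, k_cache, v_cache):
    """Decode attention for one new token with grouped-query attention.

    q:       [num_qo_heads][head_dim]           query for the new token
    k_cache: [num_kv_heads][seq_len][head_dim]  cached keys
    v_cache: [num_kv_heads][seq_len][head_dim]  cached values
    Returns  [num_qo_heads][head_dim].
    """
    num_qo_heads = len(q)
    num_kv_heads = len(k_cache)
    if num_qo_heads % num_kv_heads != 0:
        raise ValueError("query heads must be a multiple of KV heads")
    group_size = num_qo_heads // num_kv_heads  # 1 = MHA, num_qo_heads = MQA
    head_dim = len(q[0])
    scale = 1.0 / math.sqrt(head_dim)

    out = []
    for h in range(num_qo_heads):
        kv_h = h // group_size  # query heads in the same group share a KV head
        keys, values = k_cache[kv_h], v_cache[kv_h]
        scores = [scale * sum(a * b for a, b in zip(q[h], key)) for key in keys]
        m = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - m) for s in scores]
        denom = sum(weights)
        out.append([
            sum(w * v[d] for w, v in zip(weights, values)) / denom
            for d in range(head_dim)
        ])
    return out
```

With four query heads and two KV heads, heads 0 and 1 read the first KV head and heads 2 and 3 read the second, so the KV cache is half the size it would be under standard multi-head attention.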

Architecture Overview

FlashInfer exposes a Python API backed by JIT-compiled CUDA kernels. At runtime, it selects optimal kernel configurations based on batch size, sequence length, head dimensions, and GPU architecture. The library uses a composable page-table abstraction for KV caches, enabling efficient memory management across variable-length sequences without padding waste.
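The page-table idea can be sketched in a few lines of plain Python. The example below builds a CSR-style index for a batch of variable-length sequences: a pointer array marking where each sequence's page list starts, a flat list of page ids, and the number of valid tokens in each sequence's last page. The function names and the trivial bump allocator are assumptions for illustration; this is not FlashInfer's implementation.

```python
def build_page_table(seq_lens, page_size):
    """Assign fixed-size pages to variable-length sequences.

    Returns (indptr, indices, last_page_len):
      indptr[i]:indptr[i+1] slices `indices` to the pages of sequence i,
      last_page_len[i] is how many slots of its final page are in use.
    """
    indptr, indices, last_page_len = [0], [], []
    next_free_page = 0  # hypothetical bump allocator over a page pool
    for n in seq_lens:
        if n <= 0:
            raise ValueError("sequence length must be positive")
        num_pages = (n + page_size - 1) // page_size  # ceiling division
        indices.extend(range(next_free_page, next_free_page + num_pages))
        next_free_page += num_pages
        indptr.append(len(indices))
        last_page_len.append(n - (num_pages - 1) * page_size)
    return indptr, indices, last_page_len


def locate_token(indptr, indices, page_size, seq_id, pos):
    """Map (sequence, token position) to (physical page id, slot in page)."""
    page = indices[indptr[seq_id] + pos // page_size]
    return page, pos % page_size
```

For sequence lengths 5, 16, and 1 with a page size of 4, this allocates 7 pages (28 slots), whereas padding every sequence to the longest one would reserve 48 slots.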

Self-Hosting & Configuration

  • Install pre-built wheels matching your CUDA and PyTorch versions
  • Alternatively build from source with CMake and CUDA toolkit 12.x
  • Requires NVIDIA GPUs with compute capability 8.0+ (Ampere or newer)
  • Integrates as a selectable attention backend for vLLM and SGLang, so no application code changes are needed
  • No persistent configuration needed; kernel selection is automatic
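Before installing, it can help to confirm that the local GPU meets the compute-capability requirement listed above. The sketch below assumes PyTorch is the way you query the device; the helper names and the `MIN_CAPABILITY` constant are hypothetical, and the threshold is simply the 8.0 figure stated in this section.

```python
MIN_CAPABILITY = (8, 0)  # assumption: the "8.0+" requirement stated above


def meets_requirement(capability, minimum=MIN_CAPABILITY):
    """Compare a (major, minor) compute capability against the minimum."""
    return tuple(capability) >= tuple(minimum)


def detect_capability():
    """Return (major, minor) for GPU 0, or None if it cannot be queried."""
    try:
        import torch  # optional dependency, only needed for detection
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(0)


if __name__ == "__main__":
    cap = detect_capability()
    if cap is None:
        print("No CUDA device detected (or PyTorch is not installed).")
    elif meets_requirement(cap):
        print(f"Compute capability {cap[0]}.{cap[1]} meets the requirement.")
    else:
        print(f"Compute capability {cap[0]}.{cap[1]} is below the requirement.")
```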

Key Features

  • Paged attention with zero-copy KV cache access patterns
  • Cascade attention for efficient prefix sharing across requests
  • FP8 compute support on Hopper and Blackwell architectures
  • JIT compilation builds only the kernel variants a workload needs, so unused variants add no build-time or binary-size overhead
  • Supports MoE (Mixture of Experts) dispatch kernels
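Prefix sharing of the kind cascade attention exploits rests on a general property of softmax attention: results computed over two disjoint KV segments can be combined exactly if each partial result carries its log-sum-exp. The pure-Python sketch below demonstrates that property for a single query vector. The names `attend` and `merge_states` are illustrative assumptions, and the code is a numerical reference rather than a description of FlashInfer's kernels.

```python
import math


def attend(q, keys, values):
    """Softmax attention of one query over a KV segment.

    Returns (output vector, log-sum-exp of the scaled scores).
    """
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    denom = sum(weights)
    out = [
        sum(w * v[d] for w, v in zip(weights, values)) / denom
        for d in range(len(values[0]))
    ]
    return out, m + math.log(denom)


def merge_states(state_a, state_b):
    """Combine two partial attention states into the state over both segments."""
    (out_a, lse_a), (out_b, lse_b) = state_a, state_b
    m = max(lse_a, lse_b)
    w_a, w_b = math.exp(lse_a - m), math.exp(lse_b - m)
    total = w_a + w_b
    merged = [(w_a * a + w_b * b) / total for a, b in zip(out_a, out_b)]
    return merged, m + math.log(total)
```

In a serving setting the shared-prefix state would be computed once and merged with each request's own suffix state, which avoids re-reading the prefix KV for every request.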

Comparison with Similar Tools

  • FlashAttention — general-purpose fused attention; FlashInfer specializes in serving with paged KV caches
  • xFormers — training-focused memory-efficient attention; FlashInfer targets inference workloads
  • vLLM built-in kernels — basic implementations; FlashInfer provides faster alternatives vLLM can use
  • TensorRT-LLM kernels — proprietary to NVIDIA's stack; FlashInfer is open and composable

FAQ

Q: Do I need to use FlashInfer directly?
A: Most users interact with it through vLLM or SGLang, which use FlashInfer as their attention backend. Direct use is for custom inference engine builders.

Q: Which GPU architectures are supported?
A: Ampere (A100, A10), Hopper (H100, H200), and Blackwell (B100, B200). Ada Lovelace (RTX 4090) is also supported.

Q: Does FlashInfer support speculative decoding?
A: Yes. It provides batch-verify kernels used in speculative and Medusa-style decoding.

Q: How much speedup does it provide?
A: Compared to naive attention implementations, FlashInfer delivers 2-5x speedup depending on sequence length and batch configuration.
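For readers unfamiliar with the speculative decoding answer above, the sketch below shows the greedy acceptance rule that typically follows a batched verification pass: the target model scores all draft positions at once, then the longest matching prefix of the draft is kept. This is a generic illustration of the technique, not FlashInfer's kernel or API, and the function name `verify_draft` is hypothetical.

```python
def verify_draft(draft_tokens, target_argmax):
    """Greedy verification of a drafted token sequence.

    draft_tokens:  tokens proposed by the draft model
    target_argmax: the target model's greedy choice at each draft position,
                   plus one extra choice after the final draft token
                   (so it has len(draft_tokens) + 1 entries)
    Returns the tokens to commit for this step.
    """
    if len(target_argmax) != len(draft_tokens) + 1:
        raise ValueError("target_argmax needs one more entry than the draft")
    accepted = []
    for i, token in enumerate(draft_tokens):
        if token != target_argmax[i]:
            # First mismatch: commit the target's token and stop.
            accepted.append(target_argmax[i])
            return accepted
        accepted.append(token)
    # Every draft token matched, so the extra target token comes for free.
    accepted.append(target_argmax[-1])
    return accepted
```

Each call commits at least one token, so output matches plain greedy decoding with the target model while needing fewer target-model passes when the draft is accurate.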
