Configs · May 24, 2026 · 3 min read

FlashInfer — Kernel Library for LLM Serving

High-performance CUDA kernel library providing optimized attention, decoding, and prefill operations for LLM inference engines like vLLM and SGLang.

Universal CLI command
npx tokrepo install d8dc39ce-57ad-11f1-9bc6-00163e2b0d79

Introduction

FlashInfer is a high-performance CUDA kernel library purpose-built for LLM serving workloads. It provides highly optimized implementations of attention mechanisms, KV cache management, and decoding kernels that serve as the computational backbone for inference engines like vLLM, SGLang, and MLC-LLM.

What FlashInfer Does

  • Provides fused attention kernels for prefill and decode phases
  • Implements paged KV cache operations with variable-length sequences
  • Supports grouped-query attention (GQA) and multi-query attention (MQA)
  • Offers batch-level ragged tensor operations for dynamic batching
  • Delivers JIT-compiled kernels adapted to specific hardware and sequence lengths
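As a rough illustration of the grouped-query attention item above, the following pure-Python reference computes single-token decode attention where several query heads share one KV head. It is a readability sketch only: the function name `gqa_decode` and the nested-list tensor layout are assumptions made for this example, not FlashInfer's API, and a real kernel fuses these loops on the GPU.

```python
import math

def gqa_decode(q, k_cache, v_cache):
    """Reference decode attention with grouped KV heads (illustrative only).

    q:       [num_q_heads][head_dim]          query for the newest token
    k_cache: [kv_len][num_kv_heads][head_dim] cached keys
    v_cache: [kv_len][num_kv_heads][head_dim] cached values
    Returns  [num_q_heads][head_dim].
    """
    num_q_heads = len(q)
    num_kv_heads = len(k_cache[0])
    assert num_q_heads % num_kv_heads == 0, "query heads must divide evenly"
    group = num_q_heads // num_kv_heads  # group == num_q_heads means MQA
    head_dim = len(q[0])
    scale = 1.0 / math.sqrt(head_dim)
    kv_len = len(k_cache)

    out = []
    for h in range(num_q_heads):
        kv_h = h // group  # query heads in the same group read the same KV head
        scores = [
            scale * sum(a * b for a, b in zip(q[h], k_cache[t][kv_h]))
            for t in range(kv_len)
        ]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        norm = sum(weights)
        out.append([
            sum(weights[t] * v_cache[t][kv_h][d] for t in range(kv_len)) / norm
            for d in range(head_dim)
        ])
    return out
```

With four query heads and two KV heads, heads 0 and 1 read KV head 0 while heads 2 and 3 read KV head 1, so the cache holds half as many key and value vectors as full multi-head attention would.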

Architecture Overview

FlashInfer exposes a Python API backed by JIT-compiled CUDA kernels. At runtime, it selects optimal kernel configurations based on batch size, sequence length, head dimensions, and GPU architecture. The library uses a composable page-table abstraction for KV caches, enabling efficient memory management across variable-length sequences without padding waste.
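The page-table idea can be sketched without a GPU. The example below packs variable-length sequences into fixed-size pages and records them in a CSR-style triple (an index pointer, a flat page list, and the fill level of each sequence's last page). The helper names and the sequential page allocator are assumptions for illustration; a serving engine would draw pages from a free pool and pass comparable arrays to the attention kernels.

```python
def build_page_table(seq_lens, page_size):
    """Map each sequence to the KV-cache pages that hold its tokens.

    Returns (indptr, indices, last_page_len):
      indices[indptr[i]:indptr[i + 1]] are the page ids of sequence i,
      last_page_len[i] is how many slots of its final page are in use.
    """
    indptr = [0]
    indices = []
    last_page_len = []
    next_free_page = 0  # hypothetical allocator: hand out pages in order
    for n in seq_lens:
        assert n > 0, "empty sequences are not handled in this sketch"
        num_pages = (n + page_size - 1) // page_size  # ceiling division
        indices.extend(range(next_free_page, next_free_page + num_pages))
        next_free_page += num_pages
        indptr.append(len(indices))
        last_page_len.append(n - (num_pages - 1) * page_size)
    return indptr, indices, last_page_len


def locate_token(indptr, indices, page_size, seq, pos):
    """Return (page id, slot within page) for token `pos` of sequence `seq`."""
    page = indices[indptr[seq] + pos // page_size]
    return page, pos % page_size


# Three requests of different lengths share one pool of 4-token pages.
indptr, indices, last_page_len = build_page_table([5, 16, 1], page_size=4)
```

Only the last page of each sequence can be partially filled, so wasted memory is bounded by one page per sequence instead of growing with the longest sequence in the batch, as it would with padding.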

Self-Hosting & Configuration

  • Install pre-built wheels matching your CUDA and PyTorch versions
  • Alternatively build from source with CMake and CUDA toolkit 12.x
  • Requires NVIDIA GPUs with compute capability 8.0+ (Ampere or newer)
  • Serves as an attention backend for vLLM and SGLang, selected through each engine's backend configuration rather than through code changes
  • No persistent configuration needed; kernel selection is automatic
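Before installing, it can help to confirm that the local GPU meets the compute capability requirement listed above. The sketch below is one way to do that, assuming PyTorch is already installed; the helper names are hypothetical, and the 8.0 threshold is taken from this page rather than queried from FlashInfer.

```python
def meets_compute_capability(major, minor, required=(8, 0)):
    """True if a (major, minor) compute capability is at least `required`."""
    return (major, minor) >= required


def check_local_gpu():
    """Check the current CUDA device.

    Returns True or False, or None when PyTorch or a CUDA device is
    unavailable and the capability cannot be determined.
    """
    try:
        import torch  # optional dependency, imported lazily
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    major, minor = torch.cuda.get_device_capability()
    return meets_compute_capability(major, minor)


if __name__ == "__main__":
    print("GPU meets requirement:", check_local_gpu())
```

Tuples compare element by element, so an Ada Lovelace card reporting (8, 9) passes while a Turing card reporting (7, 5) does not.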

Key Features

  • Paged attention with zero-copy KV cache access patterns
  • Cascade attention for efficient prefix sharing across requests
  • FP8 compute support on Hopper and Blackwell architectures
  • JIT compilation builds only the kernel variants a workload actually uses, avoiding the build time and binary size of unused variants
  • Supports MoE (Mixture of Experts) dispatch kernels
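Prefix sharing works because softmax attention over a long key set can be computed in pieces and recombined exactly. The sketch below shows that recombination for a single query: attention over a shared prefix and over a request-specific suffix are computed separately, each returning its output and log-sum-exp, and then merged. The function names are assumptions for this example, and this is the underlying math rather than FlashInfer's implementation.

```python
import math

def attend(q, keys, values):
    """Softmax attention for one query; returns (output, log-sum-exp)."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)  # stabilise the exponentials
    weights = [math.exp(s - top) for s in scores]
    norm = sum(weights)
    out = [
        sum(w * v[d] for w, v in zip(weights, values)) / norm
        for d in range(len(values[0]))
    ]
    return out, top + math.log(norm)


def merge_states(out_a, lse_a, out_b, lse_b):
    """Combine two partial attention results over disjoint key sets."""
    top = max(lse_a, lse_b)
    w_a = math.exp(lse_a - top)  # relative softmax mass of each part
    w_b = math.exp(lse_b - top)
    merged = [(w_a * a + w_b * b) / (w_a + w_b) for a, b in zip(out_a, out_b)]
    return merged, top + math.log(w_a + w_b)


# Shared prefix (for example a system prompt) and one request's own suffix.
q = [0.3, -1.2]
prefix_k, prefix_v = [[1.0, 0.5], [-0.4, 2.0]], [[1.0, 0.0], [0.0, 1.0]]
suffix_k, suffix_v = [[0.7, -0.9]], [[2.0, 2.0]]

merged, _ = merge_states(*attend(q, prefix_k, prefix_v),
                         *attend(q, suffix_k, suffix_v))
full, _ = attend(q, prefix_k + suffix_k, prefix_v + suffix_v)
```

Here `merged` matches `full` up to floating-point rounding. Because the merge is exact, the keys and values of a shared prefix only need to be stored and read once for all requests that share it.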

Comparison with Similar Tools

  • FlashAttention — general-purpose fused attention; FlashInfer specializes in serving with paged KV caches
  • xFormers — training-focused memory-efficient attention; FlashInfer targets inference workloads
  • vLLM built-in kernels — basic implementations; FlashInfer provides faster alternatives vLLM can use
  • TensorRT-LLM kernels — proprietary to NVIDIA's stack; FlashInfer is open and composable

FAQ

Q: Do I need to use FlashInfer directly? A: Usually not. Most users reach it through vLLM or SGLang, which can use FlashInfer as an attention backend. Direct use is mainly for people building custom inference engines.

Q: Which GPU architectures are supported? A: Ampere (A100, A10), Hopper (H100, H200), and Blackwell (B100, B200). Ada Lovelace (RTX 4090) is also supported.

Q: Does FlashInfer support speculative decoding? A: Yes. It provides batch-verify kernels used in speculative and Medusa-style decoding.

Q: How much speedup does it provide? A: Roughly 2-5x over naive attention implementations, depending on sequence length and batch configuration. That figure is relative to a naive baseline, not to other fused kernels, so benchmark on your own workload.
