Introduction
FlashAttention is a fast, memory-efficient attention algorithm developed at Princeton and Stanford. Instead of materializing the full N x N attention matrix in GPU HBM, it uses tiling and kernel fusion to compute exact attention in SRAM, dramatically reducing memory usage and wall-clock time.
What FlashAttention Does
- Computes exact multi-head self-attention 2-4x faster than standard PyTorch implementations
- Reduces memory footprint from O(N^2) to O(N) in sequence length
- Supports causal masking, variable-length sequences, and grouped-query attention
- Provides both forward and backward pass kernels for training workloads
- Integrates as a drop-in replacement in transformer architectures
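A minimal usage sketch with the flash-attn package's flash_attn_func (tensors are laid out as batch, seqlen, nheads, headdim and must be fp16 or bf16 on a CUDA device; shapes here are arbitrary toy values):

```python
import torch
from flash_attn import flash_attn_func

# Toy shapes: (batch, seqlen, nheads, headdim); flash-attn expects fp16/bf16 on CUDA.
q = torch.randn(2, 1024, 16, 64, dtype=torch.float16, device="cuda")
k = torch.randn(2, 1024, 16, 64, dtype=torch.float16, device="cuda")
v = torch.randn(2, 1024, 16, 64, dtype=torch.float16, device="cuda")

# Exact attention with causal masking; output has the same shape as q.
out = flash_attn_func(q, k, v, dropout_p=0.0, causal=True)
```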
Architecture Overview
FlashAttention tiles the Q, K, V matrices into blocks that fit in GPU SRAM and computes attention block by block using the online softmax trick. This avoids writing the large intermediate attention matrix to slow HBM. FlashAttention-2 further optimizes parallelism across sequence length and reduces non-matmul FLOPs. FlashAttention-3 targets Hopper architecture GPUs with asynchronous operations.
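The core idea can be illustrated with a plain PyTorch sketch of block-wise attention that keeps a running row-max and normalizer (illustrative only; the real implementation does this on SRAM tiles inside fused CUDA kernels, and the block size here is arbitrary):

```python
import torch

def tiled_attention(q, k, v, block_size=128):
    """Exact softmax(q @ k^T / sqrt(d)) @ v computed one key/value block at a time.

    A running row-max and running normalizer (the "online softmax" trick) let the
    result be rescaled as new blocks arrive, so the full N x N score matrix is
    never materialized.
    """
    n, d = q.shape
    scale = d ** -0.5
    out = torch.zeros_like(q)
    row_max = torch.full((n, 1), float("-inf"))
    row_sum = torch.zeros(n, 1)

    for start in range(0, k.shape[0], block_size):
        kb = k[start:start + block_size]           # (B, d) key block
        vb = v[start:start + block_size]           # (B, d) value block
        scores = (q @ kb.T) * scale                # (n, B) scores for this block only

        new_max = torch.maximum(row_max, scores.max(dim=-1, keepdim=True).values)
        correction = torch.exp(row_max - new_max)  # rescale previously accumulated results
        p = torch.exp(scores - new_max)            # block-local unnormalized weights

        row_sum = row_sum * correction + p.sum(dim=-1, keepdim=True)
        out = out * correction + p @ vb
        row_max = new_max

    return out / row_sum                           # normalize once at the end

# Matches a naive reference implementation up to floating-point error.
q, k, v = (torch.randn(512, 64) for _ in range(3))
ref = torch.softmax((q @ k.T) / 64 ** 0.5, dim=-1) @ v
assert torch.allclose(tiled_attention(q, k, v), ref, atol=1e-4)
```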
Self-Hosting & Configuration
- Requires NVIDIA GPU with compute capability 8.0+ (Ampere, Ada, Hopper)
- CUDA 11.6+ and PyTorch 1.12+ needed for compilation
- Installation compiles custom CUDA kernels; use --no-build-isolation with pip
- Supports FP16 and BF16 data types
- Pre-built wheels available for common CUDA/PyTorch version combinations
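A quick pre-install sanity check using standard PyTorch calls (a sketch; the version cutoffs are the ones listed above):

```python
import torch

# flash-attn needs an NVIDIA GPU with compute capability >= 8.0 and a CUDA build of PyTorch.
assert torch.cuda.is_available(), "No CUDA device visible"
major, minor = torch.cuda.get_device_capability()
print(f"Compute capability: {major}.{minor} (need 8.0+)")
print(f"PyTorch {torch.__version__}, CUDA {torch.version.cuda} (need PyTorch 1.12+ / CUDA 11.6+)")
```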
Key Features
- Exact attention (not an approximation) with IO-aware tiling strategy
- Enables training with much longer context lengths on the same hardware
- FlashAttention-2 achieves up to 230 TFLOPs/s on A100 (73% of theoretical peak)
- Natively supports multi-query and grouped-query attention patterns (sketched below)
- Used by Hugging Face Transformers, vLLM, xFormers, and most major LLM frameworks
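For grouped-query attention, the kernels accept fewer key/value heads than query heads, as long as the query-head count is a multiple of the key/value-head count. A minimal sketch, assuming the same flash_attn_func interface as above:

```python
import torch
from flash_attn import flash_attn_func

# 32 query heads share 8 key/value heads (4-way grouped-query attention).
q = torch.randn(1, 2048, 32, 128, dtype=torch.bfloat16, device="cuda")
k = torch.randn(1, 2048, 8, 128, dtype=torch.bfloat16, device="cuda")
v = torch.randn(1, 2048, 8, 128, dtype=torch.bfloat16, device="cuda")

# The kernel broadcasts each KV head across its group of query heads.
out = flash_attn_func(q, k, v, causal=True)
```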
Comparison with Similar Tools
- PyTorch scaled_dot_product_attention — incorporates FlashAttention as one backend; using flash-attn directly gives access to newer features (see the snippet after this list)
- xFormers memory_efficient_attention — Meta's attention library; FlashAttention provides the underlying kernel for xFormers
- Ring Attention — distributes attention across devices for very long sequences; complementary to FlashAttention
- PagedAttention (vLLM) — optimizes KV cache for inference serving; FlashAttention optimizes the attention computation itself
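The first route needs no extra install: PyTorch's fused attention entry point can dispatch to a FlashAttention backend on its own. A sketch using torch.nn.functional.scaled_dot_product_attention (note the layout is batch, nheads, seqlen, headdim here, unlike flash_attn_func):

```python
import torch
import torch.nn.functional as F

# PyTorch's fused attention entry point; layout is (batch, nheads, seqlen, headdim).
q, k, v = (torch.randn(2, 16, 1024, 64, dtype=torch.float16, device="cuda") for _ in range(3))
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# Recent PyTorch releases (2.3+) can pin the backend explicitly:
from torch.nn.attention import SDPBackend, sdpa_kernel
with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```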
FAQ
Q: Does FlashAttention change model outputs? A: No. It computes mathematically exact attention, producing the same results as standard implementations up to floating-point rounding; the order of summation differs, so outputs may not be bitwise identical.
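One way to see this is to compare the kernel's output against a naive reference on the same inputs (a sketch assuming flash_attn_func as above; the small difference reflects fp16 rounding, not an approximation):

```python
import math
import torch
from flash_attn import flash_attn_func

q, k, v = (torch.randn(1, 512, 8, 64, dtype=torch.float16, device="cuda") for _ in range(3))

# Naive reference in fp32: softmax(QK^T / sqrt(d)) V per head.
qf, kf, vf = (t.transpose(1, 2).float() for t in (q, k, v))   # (batch, nheads, seqlen, headdim)
scores = qf @ kf.transpose(-1, -2) / math.sqrt(qf.shape[-1])
ref = (torch.softmax(scores, dim=-1) @ vf).transpose(1, 2)

out = flash_attn_func(q, k, v)                                 # same math, computed tile by tile
print((out.float() - ref).abs().max())                         # fp16-level difference only
```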
Q: Can I use it on consumer GPUs? A: It requires Ampere or newer GPUs (RTX 30xx/40xx, A100, H100). Older architectures are not supported.
Q: Does it work for inference as well as training? A: Yes. Both forward-only inference and forward-backward training are supported and optimized.
Q: Why is installation slow? A: It compiles CUDA kernels from source. Pre-built wheels are available for popular configurations to avoid this.