Introduction
FlashAttention is a fast, memory-efficient attention algorithm developed at Princeton and Stanford. Instead of materializing the full N x N attention matrix in GPU HBM, it uses tiling and kernel fusion to compute exact attention in SRAM, dramatically reducing memory usage and wall-clock time.
What FlashAttention Does
- Computes exact multi-head self-attention 2-4x faster than standard PyTorch implementations
- Reduces memory footprint from O(N^2) to O(N) in sequence length
- Supports causal masking, variable-length sequences, and grouped-query attention
- Provides both forward and backward pass kernels for training workloads
- Integrates as a drop-in replacement in transformer architectures
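A minimal usage sketch with the flash-attn package's flash_attn_func (tensors are laid out as batch, seqlen, nheads, headdim and must be fp16 or bf16 on a CUDA device; shapes here are arbitrary toy values):

```python
import torch
from flash_attn import flash_attn_func

# Toy shapes: (batch, seqlen, nheads, headdim); flash-attn expects fp16/bf16 on CUDA.
q = torch.randn(2, 1024, 16, 64, dtype=torch.float16, device="cuda")
k = torch.randn(2, 1024, 16, 64, dtype=torch.float16, device="cuda")
v = torch.randn(2, 1024, 16, 64, dtype=torch.float16, device="cuda")

# Exact attention with causal masking; output has the same shape as q.
out = flash_attn_func(q, k, v, dropout_p=0.0, causal=True)
```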
Architecture Overview
FlashAttention tiles the Q, K, V matrices into blocks that fit in GPU SRAM and computes attention block by block using the online softmax trick. This avoids writing the large intermediate attention matrix to slow HBM. FlashAttention-2 further optimizes parallelism across sequence length and reduces non-matmul FLOPs. FlashAttention-3 targets Hopper architecture GPUs with asynchronous operations.
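The core idea can be illustrated with a plain PyTorch sketch of block-wise attention that keeps a running row-max and normalizer (illustrative only; the real implementation does this on SRAM tiles inside fused CUDA kernels, and the block size here is arbitrary):

```python
import torch

def tiled_attention(q, k, v, block_size=128):
    """Exact softmax(q @ k^T / sqrt(d)) @ v computed one key/value block at a time.

    A running row-max and running normalizer (the "online softmax" trick) let the
    result be rescaled as new blocks arrive, so the full N x N score matrix is
    never materialized.
    """
    n, d = q.shape
    scale = d ** -0.5
    out = torch.zeros_like(q)
    row_max = torch.full((n, 1), float("-inf"))
    row_sum = torch.zeros(n, 1)

    for start in range(0, k.shape[0], block_size):
        kb = k[start:start + block_size]           # (B, d) key block
        vb = v[start:start + block_size]           # (B, d) value block
        scores = (q @ kb.T) * scale                # (n, B) scores for this block only

        new_max = torch.maximum(row_max, scores.max(dim=-1, keepdim=True).values)
        correction = torch.exp(row_max - new_max)  # rescale previously accumulated results
        p = torch.exp(scores - new_max)            # block-local unnormalized weights

        row_sum = row_sum * correction + p.sum(dim=-1, keepdim=True)
        out = out * correction + p @ vb
        row_max = new_max

    return out / row_sum                           # normalize once at the end

# Matches a naive reference implementation up to floating-point error.
q, k, v = (torch.randn(512, 64) for _ in range(3))
ref = torch.softmax((q @ k.T) / 64 ** 0.5, dim=-1) @ v
assert torch.allclose(tiled_attention(q, k, v), ref, atol=1e-4)
```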
Self-Hosting & Configuration
- Requires NVIDIA GPU with compute capability 8.0+ (Ampere, Ada, Hopper)
- CUDA 11.6+ and PyTorch 1.12+ needed for compilation
- Installation compiles custom CUDA kernels; use --no-build-isolation with pip
- Supports FP16 and BF16 data types
- Pre-built wheels available for common CUDA/PyTorch version combinations
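A quick pre-install sanity check using standard PyTorch calls (a sketch; the version cutoffs are the ones listed above):

```python
import torch

# flash-attn needs an NVIDIA GPU with compute capability >= 8.0 and a CUDA build of PyTorch.
assert torch.cuda.is_available(), "No CUDA device visible"
major, minor = torch.cuda.get_device_capability()
print(f"Compute capability: {major}.{minor} (need 8.0+)")
print(f"PyTorch {torch.__version__}, CUDA {torch.version.cuda} (need PyTorch 1.12+ / CUDA 11.6+)")
```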
Key Features
- Exact attention (not an approximation) with IO-aware tiling strategy
- Enables training with much longer context lengths on the same hardware
- FlashAttention-2 achieves up to 230 TFLOPs/s on A100 (73% of theoretical peak)
- Natively supports multi-query and grouped-query attention patterns (sketched below)
- Used by Hugging Face Transformers, vLLM, xFormers, and most major LLM frameworks
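For grouped-query attention, the kernels accept fewer key/value heads than query heads, as long as the query-head count is a multiple of the key/value-head count. A minimal sketch, assuming the same flash_attn_func interface as above:

```python
import torch
from flash_attn import flash_attn_func

# 32 query heads share 8 key/value heads (4-way grouped-query attention).
q = torch.randn(1, 2048, 32, 128, dtype=torch.bfloat16, device="cuda")
k = torch.randn(1, 2048, 8, 128, dtype=torch.bfloat16, device="cuda")
v = torch.randn(1, 2048, 8, 128, dtype=torch.bfloat16, device="cuda")

# The kernel broadcasts each KV head across its group of query heads.
out = flash_attn_func(q, k, v, causal=True)
```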
Comparison with Similar Tools
- PyTorch scaled_dot_product_attention — incorporates FlashAttention as one backend; using flash-attn directly gives access to newer features (see the snippet after this list)
- xFormers memory_efficient_attention — Meta's attention library; FlashAttention provides the underlying kernel for xFormers
- Ring Attention — distributes attention across devices for very long sequences; complementary to FlashAttention
- PagedAttention (vLLM) — optimizes KV cache for inference serving; FlashAttention optimizes the attention computation itself
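The first route needs no extra install: PyTorch's fused attention entry point can dispatch to a FlashAttention backend on its own. A sketch using torch.nn.functional.scaled_dot_product_attention (note the layout is batch, nheads, seqlen, headdim here, unlike flash_attn_func):

```python
import torch
import torch.nn.functional as F

# PyTorch's fused attention entry point; layout is (batch, nheads, seqlen, headdim).
q, k, v = (torch.randn(2, 16, 1024, 64, dtype=torch.float16, device="cuda") for _ in range(3))
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# Recent PyTorch releases (2.3+) can pin the backend explicitly:
from torch.nn.attention import SDPBackend, sdpa_kernel
with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```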
FAQ
Q: Does FlashAttention change model outputs? A: No. It computes mathematically exact attention, producing the same results as standard implementations up to floating-point rounding; the order of summation differs, so outputs may not be bitwise identical.
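One way to see this is to compare the kernel's output against a naive reference on the same inputs (a sketch assuming flash_attn_func as above; the small difference reflects fp16 rounding, not an approximation):

```python
import math
import torch
from flash_attn import flash_attn_func

q, k, v = (torch.randn(1, 512, 8, 64, dtype=torch.float16, device="cuda") for _ in range(3))

# Naive reference in fp32: softmax(QK^T / sqrt(d)) V per head.
qf, kf, vf = (t.transpose(1, 2).float() for t in (q, k, v))   # (batch, nheads, seqlen, headdim)
scores = qf @ kf.transpose(-1, -2) / math.sqrt(qf.shape[-1])
ref = (torch.softmax(scores, dim=-1) @ vf).transpose(1, 2)

out = flash_attn_func(q, k, v)                                 # same math, computed tile by tile
print((out.float() - ref).abs().max())                         # fp16-level difference only
```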
Q: Can I use it on consumer GPUs? A: It requires Ampere or newer GPUs (RTX 30xx/40xx, A100, H100). Older architectures are not supported.
Q: Does it work for inference as well as training? A: Yes. Both forward-only inference and forward-backward training are supported and optimized.
Q: Why is installation slow? A: It compiles CUDA kernels from source. Pre-built wheels are available for popular configurations to avoid this.