# Flash Attention — Fast Memory-Efficient Exact Attention for Transformers

> Flash Attention is a CUDA kernel library that computes exact scaled dot-product attention 2-4x faster and with up to 20x less memory than standard implementations by using IO-aware tiling to minimize GPU memory reads and writes.

## Install

```bash
pip install flash-attn --no-build-isolation
```

## Quick Use

Verify the install and the main entry point:

```bash
python -c "from flash_attn import flash_attn_func; print('Flash Attention ready')"
```

## Introduction

Flash Attention is a fast and memory-efficient implementation of exact attention for transformer models. Developed by Tri Dao and collaborators, it rethinks the attention computation to minimize data movement between GPU high-bandwidth memory (HBM) and on-chip SRAM, achieving significant speedups without any approximation.

## What Flash Attention Does

- Computes exact scaled dot-product attention 2-4x faster than PyTorch native attention
- Reduces memory usage from quadratic to linear in sequence length via tiling
- Supports causal masking, variable-length sequences, and multi-query/grouped-query attention
- Provides fused kernels for the forward and backward pass in training
- Enables training with much longer context windows on the same hardware

## Architecture Overview

Flash Attention tiles the Q, K, and V matrices into blocks that fit in GPU SRAM and computes attention incrementally using the online softmax trick, rescaling partial results as each new block of keys and values arrives. By never materializing the full N x N attention matrix in HBM, it reduces memory IO by an order of magnitude. Flash Attention 2 further optimizes parallelism across sequence length and attention heads, and Flash Attention 3 adds asynchronous pipelining on Hopper GPUs.
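To make the tiling concrete, here is a minimal single-head sketch of the online-softmax computation in plain PyTorch. It is illustrative only: the real library does this blocking inside fused CUDA kernels, and the block size, function name, and single-head layout here are arbitrary choices.

```python
import torch

def tiled_attention(q, k, v, block_size=128):
    """Illustrative online-softmax attention for one head.

    q, k, v: (seq_len, head_dim) tensors. This mirrors the idea behind
    Flash Attention in plain PyTorch: keys/values are consumed one block
    at a time, so no (seq_len, seq_len) score matrix is ever stored.
    """
    seq_len, head_dim = q.shape
    scale = head_dim ** -0.5
    out = torch.zeros_like(q)  # running, not-yet-normalized output
    row_max = torch.full((seq_len, 1), float("-inf"),
                         dtype=q.dtype, device=q.device)  # running row max
    row_sum = torch.zeros(seq_len, 1,
                          dtype=q.dtype, device=q.device)  # running denominator

    for start in range(0, seq_len, block_size):
        kb = k[start:start + block_size]              # key block
        vb = v[start:start + block_size]              # value block
        scores = (q @ kb.T) * scale                   # (seq_len, block) logits

        block_max = scores.max(dim=-1, keepdim=True).values
        new_max = torch.maximum(row_max, block_max)
        # Rescale previously accumulated output and denominator to the new max.
        correction = torch.exp(row_max - new_max)
        p = torch.exp(scores - new_max)               # stabilized block weights
        row_sum = row_sum * correction + p.sum(dim=-1, keepdim=True)
        out = out * correction + p @ vb
        row_max = new_max

    return out / row_sum                              # normalize once at the end
```

For small inputs the result matches `torch.softmax((q @ k.T) * head_dim**-0.5, dim=-1) @ v` up to floating-point rounding, which is the sense in which Flash Attention is exact: no block of the N x N score matrix needs to persist beyond a single loop iteration.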
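For direct use of the library, a minimal call might look like the following. `flash_attn_func` takes query, key, and value tensors in `(batch, seqlen, nheads, headdim)` layout in FP16 or BF16 on a supported GPU; the shapes, dtype, and `causal=True` flag here are example choices, not requirements of the API.

```python
import torch
from flash_attn import flash_attn_func

# Example shapes: batch 2, sequence length 4096, 16 heads of dim 64.
q = torch.randn(2, 4096, 16, 64, dtype=torch.float16, device="cuda")
k = torch.randn(2, 4096, 16, 64, dtype=torch.float16, device="cuda")
v = torch.randn(2, 4096, 16, 64, dtype=torch.float16, device="cuda")

# Exact causal attention; the full N x N score matrix is never written to HBM.
out = flash_attn_func(q, k, v, causal=True)  # -> (2, 4096, 16, 64)
```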
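Most users reach these kernels through a framework rather than calling them directly. Below is a sketch of the Hugging Face Transformers route described in the configuration notes and FAQ that follow; the model id is a placeholder.

```python
import torch
from transformers import AutoModelForCausalLM

# "meta-llama/Llama-2-7b-hf" is a placeholder; any model with
# Flash Attention 2 support in Transformers works the same way.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.float16,                # the kernels require FP16/BF16
    attn_implementation="flash_attention_2",
).to("cuda")
```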
## Self-Hosting & Configuration

- Install via pip with `pip install flash-attn --no-build-isolation` (requires CUDA 11.6+ and a compatible GPU)
- Supported on NVIDIA Ampere (A100), Ada (RTX 4090), and Hopper (H100) architectures
- Drop-in replacement for `torch.nn.functional.scaled_dot_product_attention`
- Hugging Face Transformers integrates Flash Attention via `attn_implementation="flash_attention_2"` (see the loading sketch above)
- Build from source for custom CUDA architectures or development

## Key Features

- Exact computation with no approximation or accuracy loss
- IO-aware tiling eliminates the quadratic memory bottleneck
- Fused backward-pass kernels for efficient training
- Supports head dimensions up to 256 and FP16/BF16 datatypes
- Widely adopted as the default attention in major LLM training frameworks

## Comparison with Similar Tools

- **PyTorch SDPA** — built-in scaled dot-product attention; uses Flash Attention as one of its backends
- **xFormers** — Meta's library with memory-efficient attention; Flash Attention is often faster for standard cases
- **FlashInfer** — optimized for inference serving with PagedAttention; complementary to Flash Attention
- **Triton kernels** — custom attention written in the Triton language; more flexible but typically slower
- **Ring Attention** — distributes attention across devices for very long sequences; an orthogonal optimization

## FAQ

**Q: Does Flash Attention change model outputs?**
A: No. It computes exact attention; outputs match standard attention up to floating-point rounding.

**Q: Which GPUs are supported?**
A: NVIDIA Ampere (SM 80), Ada Lovelace (SM 89), and Hopper (SM 90) architectures. Older GPUs such as the V100 are not supported.

**Q: Can I use it for inference only?**
A: Yes. Flash Attention speeds up both training and inference, and many serving frameworks use it by default.

**Q: How do I enable it in Hugging Face Transformers?**
A: Pass `attn_implementation="flash_attention_2"` when loading a model with `AutoModelForCausalLM.from_pretrained()`, as in the sketch above.

## Sources

- https://github.com/Dao-AILab/flash-attention
- https://tridao.me/publications/flash2/flash2.pdf