# xFormers — Flexible and Efficient Transformers Library

> A modular PyTorch library by Meta for building and optimizing transformer models. xFormers provides memory-efficient attention kernels, composable building blocks, and performance primitives used across major AI projects.

## Install

Save the content below to `.claude/skills/` or append to your `CLAUDE.md`:

## Quick Use
```bash
pip install xformers
python -c "
import torch
from xformers.ops import memory_efficient_attention

# Inputs use the [batch, seq_len, heads, head_dim] layout and live on the GPU.
q = torch.randn(1, 128, 8, 64, device='cuda', dtype=torch.float16)
k = torch.randn(1, 128, 8, 64, device='cuda', dtype=torch.float16)
v = torch.randn(1, 128, 8, 64, device='cuda', dtype=torch.float16)

out = memory_efficient_attention(q, k, v)
print(out.shape)  # same shape as q
"
```

## Introduction
xFormers is Meta's open-source library of optimized transformer components. It provides memory-efficient attention implementations, fused operations, and composable building blocks that let researchers mix and match transformer parts for custom architectures without sacrificing performance.

## What xFormers Does
- Provides memory-efficient attention that avoids materializing the full attention matrix, so memory use is sub-quadratic in sequence length
- Offers fused linear layers, dropout, layer norm, and SwiGLU operations
- Supports building custom transformer variants from composable blocks
- Includes a range of attention patterns (block-sparse, causal, sliding window)
- Delivers optimized CUDA kernels for both training and inference
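
To make the "sub-quadratic memory" point concrete, here is a minimal pure-Python sketch of the general idea behind memory-efficient attention: process keys and values in chunks and keep a running softmax, so only one chunk of scores exists at a time. This is an illustration of the technique for a single query vector, not xFormers' actual kernel; the function names and the `chunk_size` parameter are made up for this example.

```python
import math


def naive_attention(q, keys, values):
    """Reference version: compute every score for one query, then softmax."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


def chunked_attention(q, keys, values, chunk_size=2):
    """Same result, but at most chunk_size scores are held at any time."""
    scale = 1.0 / math.sqrt(len(q))
    dim = len(values[0])
    running_max = float("-inf")  # largest score seen so far
    running_sum = 0.0            # softmax denominator, relative to running_max
    acc = [0.0] * dim            # weighted sum of values, relative to running_max

    for start in range(0, len(keys), chunk_size):
        k_chunk = keys[start:start + chunk_size]
        v_chunk = values[start:start + chunk_size]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_chunk]

        # Rescale the previous partial results to the new maximum.
        new_max = max(running_max, max(scores))
        correction = math.exp(running_max - new_max)
        weights = [math.exp(s - new_max) for s in scores]

        running_sum = running_sum * correction + sum(weights)
        acc = [a * correction + sum(w * v[d] for w, v in zip(weights, v_chunk))
               for d, a in enumerate(acc)]
        running_max = new_max

    return [a / running_sum for a in acc]
```

Both functions return the same vector up to floating-point rounding; the chunked version simply never stores the full row of scores.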

## Architecture Overview
xFormers is organized around a factory pattern where attention mechanisms, feedforward networks, and positional encodings are interchangeable components. The memory-efficient attention dispatcher automatically selects the best available kernel (FlashAttention, CUTLASS, or Triton-based) depending on GPU architecture, data type, and tensor layout. Fused kernels combine multiple operations to reduce memory bandwidth overhead.
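
The dispatch idea can be sketched as a priority-ordered list of backends, each with a predicate saying whether it supports a given request. Everything below is hypothetical: the class names, the `select_backend` function, and especially the support rules are invented for illustration and do not reflect the real constraints xFormers checks.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass(frozen=True)
class AttentionRequest:
    compute_capability: Tuple[int, int]  # e.g. (8, 0)
    dtype: str                           # "float16", "bfloat16" or "float32"
    head_dim: int


@dataclass(frozen=True)
class Backend:
    name: str
    supports: Callable[[AttentionRequest], bool]


# Hypothetical rules in priority order (fastest first). Not the real ones.
BACKENDS = (
    Backend("flash", lambda r: r.compute_capability >= (8, 0)
            and r.dtype in ("float16", "bfloat16")
            and r.head_dim <= 128),
    Backend("cutlass", lambda r: r.compute_capability >= (5, 0)
            and r.head_dim <= 128),
    Backend("triton", lambda r: r.compute_capability >= (8, 0)),
)


def select_backend(request: AttentionRequest,
                   backends: Sequence[Backend] = BACKENDS) -> str:
    """Return the name of the first backend that accepts the request."""
    for backend in backends:
        if backend.supports(request):
            return backend.name
    raise NotImplementedError(f"no backend supports {request}")
```

With these made-up rules, an FP16 request on a `(8, 0)` device picks `flash`, while an FP32 request on a `(7, 0)` device falls through to `cutlass`.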

## Self-Hosting & Configuration
- Requires PyTorch 2.0+ and an NVIDIA GPU with CUDA 11.4+
- Pre-built wheels available for common PyTorch/CUDA combinations
- Supports building from source for custom CUDA architectures
- Configuration is via Python API; no config files needed
- Works with FP16, BF16, and FP32 data types
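
Because pre-built wheels are tied to specific PyTorch/CUDA combinations, it can help to confirm what is installed before debugging anything else. The helper below is a small sketch using only the standard library; `installed_versions` is a name chosen for this example, not part of xFormers.

```python
from importlib import metadata


def installed_versions(packages=("torch", "xformers")):
    """Map each package name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

For a fuller report, including which attention operators were built, xFormers also ships a diagnostic entry point: `python -m xformers.info`.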

## Key Features
- Automatic kernel dispatch selects the fastest attention backend available
- Memory-efficient attention fits longer sequences on the same GPU (roughly 2x, depending on model and hardware)
- Fused operations reduce kernel launch overhead and memory traffic
- Block-sparse attention patterns for structured sparsity research
- Used by projects such as Stable Diffusion, LLaMA, and Detectron2, both inside and outside Meta
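
A quick back-of-the-envelope calculation shows why skipping the full attention matrix matters for sequence length. The sketch below only does arithmetic; the function names and the example sizes (1 batch, 8 heads, head dimension 64, FP16) are assumptions chosen for illustration.

```python
def score_matrix_bytes(batch, heads, seq_len, bytes_per_element=2):
    """Size of a fully materialized [batch, heads, seq_len, seq_len] score tensor."""
    return batch * heads * seq_len * seq_len * bytes_per_element


def qkv_bytes(batch, heads, seq_len, head_dim, bytes_per_element=2):
    """Combined size of the q, k and v inputs."""
    return 3 * batch * heads * seq_len * head_dim * bytes_per_element


MIB = 1024 ** 2

if __name__ == "__main__":
    for seq_len in (4096, 8192):
        scores = score_matrix_bytes(1, 8, seq_len) / MIB
        inputs = qkv_bytes(1, 8, seq_len, 64) / MIB
        print(f"seq_len={seq_len}: scores {scores:.0f} MiB, q/k/v {inputs:.0f} MiB")
```

At a sequence length of 4096 the score tensor alone takes 256 MiB against 12 MiB for the inputs, and doubling the sequence length quadruples the score tensor to 1024 MiB while the inputs only double.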

## Comparison with Similar Tools
- **FlashAttention** — provides the attention kernel; xFormers wraps it with dispatch logic and adds fused non-attention ops
- **PyTorch SDPA** — PyTorch's built-in attention function; xFormers offers additional fused ops and attention patterns beyond SDPA
- **DeepSpeed** — distributed training framework; xFormers focuses on single-device operator optimization
- **Triton** — GPU programming language for writing kernels; xFormers provides ready-to-use optimized kernels
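
Since xFormers and PyTorch SDPA overlap on plain attention, a project can use whichever is available. The wrapper below is a sketch under a few assumptions: inputs are in the xFormers `[batch, seq_len, heads, head_dim]` layout, xFormers is only tried for CUDA tensors, and `attention_bmhk` is a name invented here. SDPA expects `[batch, heads, seq_len, head_dim]`, hence the transposes.

```python
def attention_bmhk(q, k, v, causal=False):
    """Attention over tensors shaped [batch, seq_len, heads, head_dim]."""
    if q.is_cuda:
        try:
            import xformers.ops as xops
        except ImportError:
            xops = None  # xFormers not installed; fall back to SDPA below
        if xops is not None:
            bias = xops.LowerTriangularMask() if causal else None
            return xops.memory_efficient_attention(q, k, v, attn_bias=bias)

    # Fallback: PyTorch's built-in attention (requires PyTorch 2.0+).
    import torch.nn.functional as F

    out = F.scaled_dot_product_attention(
        q.transpose(1, 2),  # -> [batch, heads, seq_len, head_dim]
        k.transpose(1, 2),
        v.transpose(1, 2),
        is_causal=causal,
    )
    return out.transpose(1, 2)  # back to [batch, seq_len, heads, head_dim]
```

Imports are deferred into the function so that the module loads even when neither library's GPU path is usable; the output always has the same layout as the inputs.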

## FAQ
**Q: Is xFormers only for Meta's models?**
A: No. It is a general-purpose library used by Hugging Face Diffusers, Stable Diffusion, and many community projects.

**Q: Does it work on AMD GPUs?**
A: Primarily NVIDIA. ROCm support is experimental and depends on underlying kernel compatibility.

**Q: How does it relate to PyTorch's native SDPA?**
A: PyTorch SDPA uses FlashAttention and other backends. xFormers provides additional fused operations, attention patterns, and optimizations beyond attention.

**Q: Can I use xFormers for inference only?**
A: Yes. The memory-efficient attention and fused operations benefit both training and inference workloads.

## Sources
- https://github.com/facebookresearch/xformers
- https://facebookresearch.github.io/xformers/

---
Source: https://tokrepo.com/en/workflows/xformers-flexible-efficient-transformers-library-dc389a36
Author: Script Depot