# xFormers: Flexible and Efficient Transformers Library

> A modular PyTorch library by Meta for building and optimizing transformer models. xFormers provides memory-efficient attention kernels, composable building blocks, and performance primitives used across major AI projects.

## Install

Save the content below to `.claude/skills/` or append to your `CLAUDE.md`:

## Quick Use

```bash
pip install xformers
python -c "
import torch
from xformers.ops import memory_efficient_attention

# Inputs use the [batch, seq_len, num_heads, head_dim] layout.
q = torch.randn(1, 128, 8, 64, device='cuda', dtype=torch.float16)
k = torch.randn(1, 128, 8, 64, device='cuda', dtype=torch.float16)
v = torch.randn(1, 128, 8, 64, device='cuda', dtype=torch.float16)

out = memory_efficient_attention(q, k, v)
print(out.shape)
"
```

## Introduction

xFormers is Meta's open-source library of optimized transformer components. It provides memory-efficient attention implementations, fused operations, and composable building blocks that let researchers mix and match transformer parts for custom architectures without sacrificing performance.

## What xFormers Does

- Provides memory-efficient attention with sub-quadratic memory usage
- Offers fused linear layers, dropout, layer norm, and SwiGLU operations
- Supports building custom transformer variants from composable blocks
- Includes heterogeneous attention patterns (block-sparse, causal, sliding window)
- Delivers optimized CUDA kernels for both training and inference

## Architecture Overview

xFormers is organized around a factory pattern in which attention mechanisms, feedforward networks, and positional encodings are interchangeable components. The memory-efficient attention dispatcher automatically selects the best available kernel (FlashAttention, CUTLASS, or Triton-based) depending on GPU architecture, data type, and tensor layout. Fused kernels combine multiple operations into a single launch to reduce memory bandwidth overhead.
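
The dispatch idea described above can be sketched in a few lines of plain Python. This is only an illustration of the pattern (try kernels in priority order, pick the first whose constraints match the inputs); the names `AttentionInputs`, `Kernel`, and `select_kernel`, as well as the specific constraints, are hypothetical and are not xFormers' actual API or selection rules.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class AttentionInputs:
    """Properties a dispatcher might inspect (hypothetical, simplified)."""
    compute_capability: Tuple[int, int]  # e.g. (8, 0) for an Ampere-class GPU
    dtype: str                           # "float16", "bfloat16", or "float32"
    head_dim: int                        # size of each attention head


@dataclass(frozen=True)
class Kernel:
    name: str
    supports: Callable[[AttentionInputs], bool]


# Assumed constraints for illustration only; a real dispatcher checks far
# more (tensor layout, attention bias type, alignment, and so on).
KERNELS_BY_PRIORITY: List[Kernel] = [
    Kernel(
        "flash",
        lambda x: x.compute_capability >= (8, 0)
        and x.dtype in ("float16", "bfloat16")
        and x.head_dim <= 256,
    ),
    Kernel("cutlass", lambda x: x.compute_capability >= (5, 0)),
]


def select_kernel(inputs: AttentionInputs) -> str:
    """Return the name of the first kernel whose constraints are satisfied."""
    for kernel in KERNELS_BY_PRIORITY:
        if kernel.supports(inputs):
            return kernel.name
    raise NotImplementedError(f"no kernel supports {inputs}")
```

Ordering the candidates by expected speed and falling through to a more permissive kernel keeps the caller's code identical across GPUs and data types, which is the main appeal of a single entry point such as `memory_efficient_attention`.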

## Self-Hosting & Configuration

- Requires PyTorch 2.0+ and an NVIDIA GPU with CUDA 11.4+
- Pre-built wheels are available for common PyTorch/CUDA combinations
- Supports building from source for custom CUDA architectures
- Configuration is done through the Python API; no config files are needed
- Works with FP16, BF16, and FP32 data types

## Key Features

- Automatic kernel dispatch selects the fastest attention backend available
- Memory-efficient attention enables roughly 2x longer sequences on the same GPU, depending on model and hardware
- Fused operations reduce kernel launch overhead and memory traffic
- Block-sparse attention patterns for structured sparsity research
- Used by Stable Diffusion, LLaMA, Detectron2, and other projects inside and outside Meta

## Comparison with Similar Tools

- **FlashAttention**: provides the attention kernel; xFormers wraps it with dispatch logic and adds fused non-attention ops
- **PyTorch SDPA**: PyTorch's built-in attention function (`torch.nn.functional.scaled_dot_product_attention`); xFormers offers additional fused ops and attention patterns beyond SDPA
- **DeepSpeed**: distributed training framework; xFormers focuses on single-device operator optimization
- **Triton**: GPU programming language for writing kernels; xFormers provides ready-to-use optimized kernels

## FAQ

**Q: Is xFormers only for Meta's models?**
A: No. It is a general-purpose library used by Hugging Face Diffusers, Stable Diffusion, and many community projects.

**Q: Does it work on AMD GPUs?**
A: It primarily targets NVIDIA GPUs. ROCm support is experimental and depends on underlying kernel compatibility.

**Q: How does it relate to PyTorch's native SDPA?**
A: PyTorch SDPA uses FlashAttention and other backends. xFormers provides additional fused operations, attention patterns, and optimizations beyond attention.

**Q: Can I use xFormers for inference only?**
A: Yes. The memory-efficient attention and fused operations benefit both training and inference workloads.
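
To make the memory saving behind memory-efficient attention concrete, here is a minimal sketch of the general technique (an online, chunked softmax) in plain Python for a single query. It is not xFormers' kernel code; the function names and the `chunk_size` parameter are assumptions chosen for illustration. The point is that the chunked version never holds more than `chunk_size` attention scores at once, yet produces the same result as the naive version.

```python
import math
from typing import List

Vector = List[float]


def naive_attention(q: Vector, keys: List[Vector], values: List[Vector]) -> Vector:
    """Attention for one query, materializing every score at once."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def chunked_attention(
    q: Vector, keys: List[Vector], values: List[Vector], chunk_size: int = 2
) -> Vector:
    """Same result, but only `chunk_size` scores are live at any time."""
    scale = 1.0 / math.sqrt(len(q))
    running_max = -math.inf          # largest score seen so far
    running_sum = 0.0                # softmax denominator, relative to running_max
    acc = [0.0] * len(values[0])     # weighted sum of values, same reference point

    for start in range(0, len(keys), chunk_size):
        k_chunk = keys[start:start + chunk_size]
        v_chunk = values[start:start + chunk_size]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_chunk]

        # Rescale everything accumulated so far to the new running maximum.
        new_max = max(running_max, max(scores))
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]

        for s, v in zip(scores, v_chunk):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max

    return [a / running_sum for a in acc]
```

Optimized GPU kernels apply the same idea to tiles of queries and keys held in on-chip memory, which is why peak memory no longer grows with the square of the sequence length.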

## Sources

- https://github.com/facebookresearch/xformers
- https://facebookresearch.github.io/xformers/

---

Source: https://tokrepo.com/en/workflows/xformers-flexible-efficient-transformers-library-dc389a36

Author: Script Depot