Introduction
xFormers is Meta's open-source library of optimized transformer components. It provides memory-efficient attention implementations, fused operations, and composable building blocks that let researchers mix and match transformer parts for custom architectures without sacrificing performance.
What xFormers Does
- Provides memory-efficient attention with sub-quadratic memory usage (see the sketch after this list)
- Offers fused linear layers, dropout, layer norm, and SwiGLU operations
- Supports building custom transformer variants from composable blocks
- Includes heterogeneous attention patterns (block-sparse, causal, sliding window)
- Delivers optimized CUDA kernels for both training and inference
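A minimal sketch of the core entry point, xformers.ops.memory_efficient_attention, which takes tensors in [batch, seq_len, heads, head_dim] layout; the sizes, device, and dtype below are illustrative:

```python
import torch
import xformers.ops as xops

# Illustrative sizes: batch 2, sequence length 1024, 8 heads, head dim 64.
q = torch.randn(2, 1024, 8, 64, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Drop-in replacement for naive softmax(q @ k^T / sqrt(d)) @ v attention;
# memory use scales sub-quadratically with sequence length.
out = xops.memory_efficient_attention(q, k, v)  # -> [2, 1024, 8, 64]
```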
Architecture Overview
xFormers is organized around a factory pattern in which attention mechanisms, feedforward networks, and positional encodings are interchangeable components. The memory-efficient attention dispatcher automatically selects the best available kernel (FlashAttention, CUTLASS, or Triton-based) depending on GPU architecture, data type, and tensor layout. Fused kernels combine multiple operations to reduce memory bandwidth overhead.
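In practice the dispatcher is driven by the arguments alone. A causal mask, for instance, is passed as a structured attn_bias object rather than a dense tensor, and a specific backend can be pinned through the op argument. A sketch, assuming the fmha submodule of your installed build exposes the CUTLASS ops under these names:

```python
import torch
import xformers.ops as xops

q = torch.randn(2, 1024, 8, 64, device="cuda", dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)

# Causal attention: the mask is a structured object, not a materialized
# [seq, seq] tensor, so the selected kernel can skip masked blocks.
out = xops.memory_efficient_attention(q, k, v, attn_bias=xops.LowerTriangularMask())

# Pin a backend explicitly instead of relying on automatic dispatch
# (assumption: fmha.cutlass.FwOp / BwOp exist under these names in your build).
from xformers.ops import fmha
out = xops.memory_efficient_attention(q, k, v, op=(fmha.cutlass.FwOp, fmha.cutlass.BwOp))
```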
Self-Hosting & Configuration
- Requires PyTorch 2.0+ and an NVIDIA GPU with CUDA 11.4+ (see the install check after this list)
- Pre-built wheels available for common PyTorch/CUDA combinations
- Supports building from source for custom CUDA architectures
- Configuration is via the Python API; no config files needed
- Works with FP16, BF16, and FP32 data types
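After installing, `python -m xformers.info` lists which kernels the build can actually use; a minimal Python-side smoke test might look like this (sizes are illustrative):

```python
import torch
import xformers
import xformers.ops as xops

print(xformers.__version__)       # confirms the wheel imported cleanly
print(torch.version.cuda)         # CUDA version PyTorch was built against
print(torch.cuda.is_available())  # a CUDA-capable GPU is required

# Smoke test: run the attention op once on a tiny FP16 input.
q = torch.randn(1, 16, 4, 32, device="cuda", dtype=torch.float16)
print(xops.memory_efficient_attention(q, q, q).shape)  # [1, 16, 4, 32]
```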
Key Features
- Automatic kernel dispatch selects the fastest attention backend available
- Memory-efficient attention enables 2x longer sequences on the same GPU
- Fused operations reduce kernel launch overhead and memory traffic
- Block-sparse attention patterns for structured sparsity research
- Used by Stable Diffusion, Hugging Face Diffusers, and other community projects, and internally in Meta projects such as LLaMA and Detectron2
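In Hugging Face Diffusers, for example, enabling xFormers attention is a one-line switch. A sketch, assuming diffusers is installed; the checkpoint id is illustrative:

```python
import torch
from diffusers import StableDiffusionPipeline

# Illustrative checkpoint; any Diffusers pipeline with the hook works.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Route the pipeline's attention layers through xFormers.
pipe.enable_xformers_memory_efficient_attention()

image = pipe("a photo of an astronaut riding a horse").images[0]
```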
Comparison with Similar Tools
- FlashAttention — provides the attention kernel; xFormers wraps it with dispatch logic and adds fused non-attention ops
- PyTorch SDPA — PyTorch's built-in attention function; xFormers offers additional fused ops and attention patterns beyond SDPA (see the layout note after this list)
- DeepSpeed — distributed training framework; xFormers focuses on single-device operator optimization
- Triton — GPU programming language for writing kernels; xFormers provides ready-to-use optimized kernels
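One practical difference when moving between the two attention entry points: PyTorch SDPA expects [batch, heads, seq, head_dim] while xFormers expects [batch, seq, heads, head_dim]. A sketch of the equivalent calls:

```python
import torch
import torch.nn.functional as F
import xformers.ops as xops

b, h, s, d = 2, 8, 1024, 64
q = torch.randn(b, h, s, d, device="cuda", dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)

# PyTorch built-in: [batch, heads, seq, head_dim] layout.
out_sdpa = F.scaled_dot_product_attention(q, k, v)

# xFormers: [batch, seq, heads, head_dim] layout, hence the transposes.
out_xf = xops.memory_efficient_attention(
    q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2)
).transpose(1, 2)

# Both compute the same attention, up to numerical tolerance in FP16.
torch.testing.assert_close(out_sdpa, out_xf, atol=2e-3, rtol=2e-3)
```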
FAQ
Q: Is xFormers only for Meta's models? A: No. It is a general-purpose library used by Hugging Face Diffusers, Stable Diffusion, and many community projects.
Q: Does it work on AMD GPUs? A: Primarily NVIDIA. ROCm support is experimental and depends on underlying kernel compatibility.
Q: How does it relate to PyTorch's native SDPA? A: PyTorch SDPA uses FlashAttention and other backends. xFormers provides additional fused operations, attention patterns, and optimizations beyond attention.
Q: Can I use xFormers for inference only? A: Yes. The memory-efficient attention and fused operations benefit both training and inference workloads.
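A minimal inference-only sketch; under torch.inference_mode() no autograd state is recorded, so only the forward kernel runs:

```python
import torch
import xformers.ops as xops

q = torch.randn(1, 2048, 8, 64, device="cuda", dtype=torch.float16)

with torch.inference_mode():
    # Only the forward pass executes; no activations are kept for backward.
    out = xops.memory_efficient_attention(q, q, q)
```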