
xFormers — Flexible and Efficient Transformers Library

A modular PyTorch library by Meta for building and optimizing transformer models. xFormers provides memory-efficient attention kernels, composable building blocks, and performance primitives used across major AI projects.

Introduction

xFormers is Meta's open-source library of optimized transformer components. It provides memory-efficient attention implementations, fused operations, and composable building blocks that let researchers mix and match transformer parts for custom architectures without sacrificing performance.
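A minimal usage sketch, assuming a CUDA device and a recent xFormers release: memory_efficient_attention is the library's documented entry point, and tensors use its (batch, seq_len, num_heads, head_dim) layout.

```python
import torch
import xformers.ops as xops

# Query/key/value in xFormers' (batch, seq_len, num_heads, head_dim) layout.
q = torch.randn(1, 1024, 8, 64, device="cuda", dtype=torch.float16)
k = torch.randn(1, 1024, 8, 64, device="cuda", dtype=torch.float16)
v = torch.randn(1, 1024, 8, 64, device="cuda", dtype=torch.float16)

# The dispatcher picks the best available kernel for this GPU, dtype, and layout.
out = xops.memory_efficient_attention(q, k, v)  # same shape as q
```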

What xFormers Does

  • Provides memory-efficient attention with sub-quadratic memory usage
  • Offers fused linear layers, dropout, layer norm, and SwiGLU operations
  • Supports building custom transformer variants from composable blocks
  • Includes heterogeneous attention patterns (block-sparse, causal, sliding window); a causal example follows this list
  • Delivers optimized CUDA kernels for both training and inference
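As a short sketch of the attention-pattern support (again assuming a CUDA device), causal masking is expressed as an attn_bias object rather than a dense mask tensor:

```python
import torch
import xformers.ops as xops

q = torch.randn(1, 512, 8, 64, device="cuda", dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)

# LowerTriangularMask requests a causal kernel without materializing
# a (seq_len x seq_len) mask in memory.
out = xops.memory_efficient_attention(q, k, v, attn_bias=xops.LowerTriangularMask())
```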

Architecture Overview

xFormers is organized around a factory pattern where attention mechanisms, feedforward networks, and positional encodings are interchangeable components. The memory-efficient attention dispatcher automatically selects the best available kernel (FlashAttention, CUTLASS, or Triton-based) depending on GPU architecture, data type, and tensor layout. Fused kernels combine multiple operations to reduce memory bandwidth overhead.
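The dispatch can also be overridden by the caller. A hedged sketch of pinning a specific forward/backward kernel pair via the op argument; note that the exact backend module paths (e.g. xops.fmha.flash) can shift between releases:

```python
import torch
import xformers.ops as xops

q = torch.randn(2, 2048, 16, 64, device="cuda", dtype=torch.bfloat16)
k, v = torch.randn_like(q), torch.randn_like(q)

# Default: automatic dispatch picks the fastest compatible backend.
out_auto = xops.memory_efficient_attention(q, k, v)

# Explicit: pin the FlashAttention forward/backward kernels.
out_flash = xops.memory_efficient_attention(
    q, k, v, op=(xops.fmha.flash.FwOp, xops.fmha.flash.BwOp)
)
```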

Self-Hosting & Configuration

  • Requires PyTorch 2.0+ and an NVIDIA GPU with CUDA 11.4+
  • Pre-built wheels available for common PyTorch/CUDA combinations
  • Supports building from source for custom CUDA architectures
  • Configuration is via Python API; no config files needed
  • Works with FP16, BF16, and FP32 data types (see the precision sketch after this list)
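Because configuration happens entirely at the call site, switching precision is just a dtype change. A small sketch, assuming a CUDA device; whether FP32 hits a fused kernel depends on which backends are installed:

```python
import torch
import xformers.ops as xops

# No config files: precision and dropout are plain call arguments.
for dtype in (torch.float16, torch.bfloat16, torch.float32):
    q = torch.randn(1, 256, 4, 64, device="cuda", dtype=dtype)
    out = xops.memory_efficient_attention(q, q.clone(), q.clone(), p=0.1)  # p = attention dropout
    assert out.dtype == dtype
```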

Key Features

  • Automatic kernel dispatch selects the fastest attention backend available
  • Memory-efficient attention enables 2x longer sequences on the same GPU
  • Fused operations reduce kernel launch overhead and memory traffic
  • Block-sparse attention patterns for structured sparsity research
  • Used by Stable Diffusion and Hugging Face Diffusers, and internally by Meta projects such as LLaMA and Detectron2 (see the Diffusers example after this list)
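In Diffusers, the integration is a one-line opt-in through its documented enable_xformers_memory_efficient_attention hook; the model id below is just an illustrative choice:

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # any Stable Diffusion checkpoint works here
    torch_dtype=torch.float16,
).to("cuda")

# Swap Diffusers' attention for xFormers' memory-efficient kernels.
pipe.enable_xformers_memory_efficient_attention()

image = pipe("a watercolor fox").images[0]
```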

Comparison with Similar Tools

  • FlashAttention — provides the attention kernel; xFormers wraps it with dispatch logic and adds fused non-attention ops
  • PyTorch SDPA — PyTorch's built-in attention function; xFormers offers additional fused ops and attention patterns beyond SDPA (see the equivalence check after this list)
  • DeepSpeed — distributed training framework; xFormers focuses on single-device operator optimization
  • Triton — GPU programming language for writing kernels; xFormers provides ready-to-use optimized kernels
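The two attention entry points compute the same result but expect different layouts: SDPA takes (batch, heads, seq, dim) while xFormers takes (batch, seq, heads, dim). A small equivalence check, with tolerances loose enough for FP16:

```python
import torch
import torch.nn.functional as F
import xformers.ops as xops

q = torch.randn(1, 128, 8, 64, device="cuda", dtype=torch.float16)  # (B, M, H, K)
k, v = torch.randn_like(q), torch.randn_like(q)

out_xf = xops.memory_efficient_attention(q, k, v)

# SDPA expects (B, H, M, K), so transpose heads and sequence around the call.
out_pt = F.scaled_dot_product_attention(
    q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2)
).transpose(1, 2)

torch.testing.assert_close(out_xf, out_pt, atol=2e-3, rtol=2e-3)
```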

FAQ

Q: Is xFormers only for Meta's models? A: No. It is a general-purpose library used by Hugging Face Diffusers, Stable Diffusion, and many community projects.

Q: Does it work on AMD GPUs? A: Primarily NVIDIA. ROCm support is experimental and depends on underlying kernel compatibility.

Q: How does it relate to PyTorch's native SDPA? A: PyTorch SDPA uses FlashAttention and other backends. xFormers provides additional fused operations, attention patterns, and optimizations beyond attention.

Q: Can I use xFormers for inference only? A: Yes. The memory-efficient attention and fused operations benefit both training and inference workloads.
