xFormers — Flexible and Efficient Transformers Library

A modular PyTorch library by Meta for building and optimizing transformer models. xFormers provides memory-efficient attention kernels, composable building blocks, and performance primitives used across major AI projects.

Introduction

xFormers is Meta's open-source library of optimized transformer components. It provides memory-efficient attention implementations, fused operations, and composable building blocks that let researchers mix and match transformer parts for custom architectures without sacrificing performance.

What xFormers Does

  • Provides memory-efficient attention with sub-quadratic memory usage (a minimal usage sketch follows this list)
  • Offers fused linear layers, dropout, layer norm, and SwiGLU operations
  • Supports building custom transformer variants from composable blocks
  • Includes heterogeneous attention patterns (block-sparse, causal, sliding window)
  • Delivers optimized CUDA kernels for both training and inference
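
The core entry point is xformers.ops.memory_efficient_attention. A minimal sketch, assuming a CUDA GPU and the library's [batch, seq_len, num_heads, head_dim] tensor layout:

import torch
import xformers.ops as xops

# Query/key/value in xFormers' [batch, seq_len, num_heads, head_dim] layout
q = torch.randn(2, 1024, 8, 64, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# The dispatcher picks the fastest available kernel for this GPU, dtype, and layout
out = xops.memory_efficient_attention(q, k, v, p=0.0)  # p is the dropout probability
print(out.shape)  # torch.Size([2, 1024, 8, 64])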

Architecture Overview

xFormers is organized around a factory pattern in which attention mechanisms, feedforward networks, and positional encodings are interchangeable components. The memory-efficient attention dispatcher automatically selects the best available kernel (FlashAttention, CUTLASS, or Triton-based) depending on GPU architecture, data type, and tensor layout. Fused kernels combine multiple operations to reduce memory bandwidth overhead.
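
As a sketch of how structured attention-bias objects feed into that dispatch, causal attention can be requested symbolically instead of with a dense mask tensor (shapes and dtype below are illustrative):

import torch
import xformers.ops as xops

q = torch.randn(1, 2048, 16, 64, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# LowerTriangularMask describes causal attention symbolically, so the dispatcher
# can pick a kernel that never materializes the full 2048 x 2048 mask
out = xops.memory_efficient_attention(q, k, v, attn_bias=xops.LowerTriangularMask())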

Self-Hosting & Configuration

  • Requires PyTorch 2.0+ and an NVIDIA GPU with CUDA 11.4+
  • Pre-built wheels are available for common PyTorch/CUDA combinations (a post-install check is sketched after this list)
  • Supports building from source for custom CUDA architectures
  • Configuration is via Python API; no config files needed
  • Works with FP16, BF16, and FP32 data types
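
A quick post-install sanity check, assuming a pre-built wheel was installed with pip install xformers and a CUDA GPU is present:

import torch
import xformers
import xformers.ops as xops

print(xformers.__version__)

# Run a tiny FP16 attention call to confirm a memory-efficient kernel is usable
q = torch.randn(1, 128, 4, 64, device="cuda", dtype=torch.float16)
out = xops.memory_efficient_attention(q, q, q)
print(out.dtype)  # torch.float16

Running python -m xformers.info from a shell prints a fuller report of available kernels and build flags.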

Key Features

  • Automatic kernel dispatch selects the fastest attention backend available
  • Memory-efficient attention avoids materializing the full attention matrix, allowing substantially longer sequences on the same GPU
  • Fused operations reduce kernel launch overhead and memory traffic
  • Block-sparse attention patterns for structured sparsity research
  • Used by Stable Diffusion, Hugging Face Diffusers, and Meta projects such as LLaMA and Detectron2 (see the example after this list)
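
As an illustration of that downstream use, Hugging Face Diffusers exposes a one-line switch that routes a pipeline's attention through xFormers (the model ID below is only an example):

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Replaces the pipeline's attention with xFormers' memory-efficient implementation
pipe.enable_xformers_memory_efficient_attention()

image = pipe("a photo of an astronaut riding a horse").images[0]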

Comparison with Similar Tools

  • FlashAttention — provides the attention kernel; xFormers wraps it with dispatch logic and adds fused non-attention ops
  • PyTorch SDPA — PyTorch's built-in attention function; xFormers offers additional fused ops and attention patterns beyond SDPA (compared in the snippet after this list)
  • DeepSpeed — distributed training framework; xFormers focuses on single-device operator optimization
  • Triton — GPU programming language for writing kernels; xFormers provides ready-to-use optimized kernels
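
A small sketch of the SDPA comparison: both calls compute the same attention, but note the layout difference (xFormers expects [batch, seq, heads, dim], while SDPA expects [batch, heads, seq, dim]):

import torch
import torch.nn.functional as F
import xformers.ops as xops

q = torch.randn(2, 512, 8, 64, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

out_xf = xops.memory_efficient_attention(q, k, v)
out_pt = F.scaled_dot_product_attention(
    q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2)
).transpose(1, 2)

print(torch.allclose(out_xf, out_pt, atol=1e-3))  # results agree within FP16 tolerance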

FAQ

Q: Is xFormers only for Meta's models? A: No. It is a general-purpose library used by Hugging Face Diffusers, Stable Diffusion, and many community projects.

Q: Does it work on AMD GPUs? A: Primarily NVIDIA. ROCm support is experimental and depends on underlying kernel compatibility.

Q: How does it relate to PyTorch's native SDPA? A: PyTorch SDPA uses FlashAttention and other backends. xFormers provides additional fused operations, attention patterns, and optimizations beyond attention.

Q: Can I use xFormers for inference only? A: Yes. The memory-efficient attention and fused operations benefit both training and inference workloads.
