Configs · May 12, 2026 · 3 min read

FairScale — PyTorch Extensions for Large-Scale Training

FairScale provides PyTorch extensions for high-performance and large-scale training, including fully sharded data parallelism, pipeline parallelism, and memory-efficient optimizers.

Introduction

FairScale is a PyTorch library from Meta AI that provides primitives for training models that are too large to fit on a single GPU. It introduced Fully Sharded Data Parallel (FSDP), which shards model parameters, gradients, and optimizer states across GPUs to dramatically reduce per-device memory usage while maintaining training speed.

What FairScale Does

  • Shards model parameters across GPUs with Fully Sharded Data Parallel
  • Splits model layers across devices with pipeline parallelism (see the Pipe sketch after this list)
  • Reduces memory usage via activation checkpointing and offloading
  • Provides optimizer utilities such as OSS (optimizer state sharding) and AdaScale (adaptive learning-rate scaling for large batches)
  • Enables training of billion-parameter models on commodity hardware
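The pipeline-parallelism bullet above corresponds to FairScale's Pipe wrapper. The sketch below is a minimal illustration assuming two visible GPUs; the toy model and the balance (layers per stage) and chunks (micro-batches per batch) values are placeholders, and the exact Pipe signature has shifted between FairScale releases, so check the version you install.

```python
import torch
import torch.nn as nn
from fairscale.nn import Pipe  # pipeline-parallelism wrapper

# A toy sequential model; Pipe splits an nn.Sequential into stages.
model = nn.Sequential(
    nn.Linear(1024, 4096),
    nn.ReLU(),
    nn.Linear(4096, 4096),
    nn.ReLU(),
    nn.Linear(4096, 1024),
)

# Illustrative split: first three layers on GPU 0, last two on GPU 1,
# with each batch cut into 8 micro-batches to keep both devices busy.
model = Pipe(model, balance=[3, 2], devices=[0, 1], chunks=8)

x = torch.randn(32, 1024).cuda(0)  # input lives on the first stage's device
y = model(x)                        # output comes back on the last stage's device
loss = y.float().sum()
loss.backward()
```

Micro-batching is what keeps the pipeline full: while the second stage runs micro-batch k, the first stage can already start micro-batch k+1.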

Architecture Overview

FairScale wraps PyTorch modules with parallelism strategies. FSDP partitions each parameter tensor across the data-parallel group, gathering shards on demand for forward and backward passes. Pipeline parallelism splits sequential model stages across devices and uses micro-batching to keep all GPUs active. Both strategies compose with standard PyTorch training loops.
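To make the composition with a standard training loop concrete, here is a minimal FSDP sketch. It assumes torch.distributed has already been initialized with one process per GPU (for example via torchrun); the toy model, learning rate, and `loader` are placeholders, not part of FairScale's API.

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from fairscale.nn import FullyShardedDataParallel as FSDP

# Assumes dist.init_process_group("nccl") has already run,
# with one process per GPU (e.g. launched via torchrun).
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 10)).cuda()

# Wrapping shards parameters, gradients, and optimizer state across the
# data-parallel group; shards are all-gathered on demand in forward/backward.
model = FSDP(model)

# Build the optimizer *after* wrapping so it sees the sharded parameters.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

for batch, target in loader:  # `loader` is a placeholder DataLoader
    optimizer.zero_grad()
    loss = nn.functional.cross_entropy(model(batch.cuda()), target.cuda())
    loss.backward()
    optimizer.step()
```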

Self-Hosting & Configuration

  • Install via pip alongside PyTorch 1.8 or later
  • Wrap your model with FullyShardedDataParallel for memory-efficient training (see the configuration sketch after this list)
  • Set auto_wrap_policy to automatically shard at Transformer block boundaries
  • Use Pipe for pipeline parallelism with manual or automatic stage splitting
  • Combine with activation checkpointing to further reduce memory
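Putting these configuration points together, a minimal sketch might look as follows. It uses FairScale's enable_wrap/auto_wrap helpers and checkpoint_wrapper; the build_transformer() model, its .blocks attribute, and the keyword values are hypothetical, and the wrapping helpers' arguments have varied across FairScale versions, so treat this as a shape rather than a recipe.

```python
import torch
from fairscale.nn import FullyShardedDataParallel as FSDP
from fairscale.nn import checkpoint_wrapper
from fairscale.nn.wrap import enable_wrap, auto_wrap

# `build_transformer()` and its `.blocks` submodules are hypothetical stand-ins
# for your own model code.
model = build_transformer()

# Activation checkpointing: recompute activations in the backward pass instead
# of storing them, trading compute for memory.
for i, block in enumerate(model.blocks):
    model.blocks[i] = checkpoint_wrapper(block)

fsdp_params = dict(
    mixed_precision=True,      # keep sharded params in fp32, compute in fp16
    flatten_parameters=True,   # flatten each shard into one contiguous tensor
)

# auto_wrap recursively wraps large submodules (e.g. each Transformer block)
# in their own FSDP instance, so only one block's full parameters need to be
# materialized at a time.
with enable_wrap(wrapper_cls=FSDP, **fsdp_params):
    model = auto_wrap(model)

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
```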

Key Features

  • FSDP enables training models that exceed single-GPU memory
  • Optimizer State Sharding (OSS) reduces optimizer-state memory by roughly a factor of the world size (see the sketch after this list)
  • Pipeline parallelism with asynchronous scheduling for high throughput
  • Activation checkpointing trades compute for memory savings
  • Drop-in compatibility with existing PyTorch training scripts
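The OSS bullet can be made concrete with a short sketch: OSS wraps an ordinary torch optimizer and shards its state across ranks, and ShardedDataParallel sends each gradient only to the rank that owns the matching shard. It assumes a process group is already initialized; the toy model, learning rate, and loader are placeholders.

```python
import torch
import torch.nn as nn
from fairscale.optim.oss import OSS
from fairscale.nn.data_parallel import ShardedDataParallel as ShardedDDP

# Assumes torch.distributed.init_process_group(...) has already run.
model = nn.Linear(4096, 4096).cuda()

# OSS partitions the Adam state (exp_avg, exp_avg_sq) across ranks, so each
# GPU stores roughly 1/world_size of the optimizer memory.
optimizer = OSS(params=model.parameters(), optim=torch.optim.Adam, lr=1e-3)

# ShardedDataParallel reduces each gradient only to the rank that owns the
# corresponding optimizer shard, instead of all-reducing to every rank.
model = ShardedDDP(model, optimizer)

for x, y in loader:  # `loader` is a placeholder DataLoader
    optimizer.zero_grad()
    loss = ((model(x.cuda()) - y.cuda()) ** 2).mean()
    loss.backward()
    optimizer.step()
```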

Comparison with Similar Tools

  • PyTorch FSDP (native) — FairScale's FSDP was upstreamed into PyTorch core; FairScale remains useful for experimental features
  • DeepSpeed — broader optimization suite with ZeRO stages; FairScale is lighter and more PyTorch-native
  • Megatron-LM — designed for massive GPU clusters; FairScale is more accessible for smaller teams
  • Accelerate — higher-level API that can use FSDP internally; FairScale provides the lower-level primitives

FAQ

Q: Should I use FairScale or PyTorch native FSDP? A: For stable production use, prefer PyTorch native FSDP (torch.distributed.fsdp). FairScale is useful for experimental features not yet upstreamed.

Q: How much memory does FSDP save? A: FSDP can reduce per-GPU memory for parameters, gradients, and optimizer state by roughly a factor of the world size. With 8 GPUs, each device holds roughly 1/8th of the model state.
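A back-of-envelope illustration of that claim, assuming fp32 training with Adam (which stores two extra values per parameter); activations and communication buffers are ignored, so real savings are somewhat smaller:

```python
params = 1_000_000_000      # a 1B-parameter model, for illustration
bytes_per_value = 4         # fp32
world_size = 8              # GPUs in the data-parallel group

# Per-parameter training state: weight + gradient + two Adam moments = 4 values.
full_state_gb = params * 4 * bytes_per_value / 1e9   # ~16 GB unsharded
sharded_gb = full_state_gb / world_size              # ~2 GB per GPU with FSDP
print(full_state_gb, sharded_gb)
```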

Q: Can I use FairScale with a single GPU? A: Some features like activation checkpointing work on a single GPU, but FSDP and pipeline parallelism require multiple GPUs.

Q: Does FairScale support mixed precision? A: Yes. FairScale FSDP works with PyTorch AMP for FP16 or BF16 mixed-precision training.
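One way this can look in practice is sketched below, using bf16 autocast (which needs no gradient scaler). The model, shapes, and loader are placeholders; for fp16 you would additionally pair autocast with a gradient scaler that understands sharded gradients.

```python
import torch
import torch.nn as nn
from fairscale.nn import FullyShardedDataParallel as FSDP

# Assumes a process group is initialized and `loader` is a placeholder
# DataLoader, as in the earlier sketches.
model = FSDP(nn.Linear(1024, 1024).cuda())
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

for x, y in loader:
    optimizer.zero_grad()
    # bf16 autocast runs matmuls in bfloat16 while keeping master weights in fp32.
    with torch.cuda.amp.autocast(dtype=torch.bfloat16):
        loss = nn.functional.mse_loss(model(x.cuda()), y.cuda())
    loss.backward()
    optimizer.step()
```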
