Introduction
FairScale is a PyTorch library from Meta AI that provides primitives for training models that are too large to fit on a single GPU. It introduced Fully Sharded Data Parallel (FSDP), which shards model parameters, gradients, and optimizer states across GPUs to dramatically reduce per-device memory usage while maintaining training speed.
What FairScale Does
- Shards model parameters across GPUs with Fully Sharded Data Parallel
- Splits model layers across devices with pipeline parallelism
- Reduces memory usage via activation checkpointing and offloading
- Provides optimizer utilities such as OSS (Optimizer State Sharding), which shards optimizer state across ranks, and AdaScale for large-batch training
- Enables training of billion-parameter models on commodity hardware
Architecture Overview
FairScale wraps PyTorch modules with parallelism strategies. FSDP partitions each parameter tensor across the data-parallel group, gathering shards on demand for forward and backward passes. Pipeline parallelism splits sequential model stages across devices and uses micro-batching to keep all GPUs active. Both strategies compose with standard PyTorch training loops.
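As a concrete sketch of the pipeline side, the snippet below splits a sequential model across two GPUs with fairscale.nn.Pipe. It is a minimal illustration assuming two visible CUDA devices; the layer balance and micro-batch count are arbitrary choices, and the exact Pipe keyword arguments may differ between FairScale versions.

```python
import torch
import torch.nn as nn
from fairscale.nn import Pipe

# A purely sequential model can be cut into pipeline stages.
model = nn.Sequential(
    nn.Linear(512, 2048), nn.ReLU(),   # stage 0
    nn.Linear(2048, 2048), nn.ReLU(),  # stage 1
    nn.Linear(2048, 512),
)

# balance assigns consecutive layers to each device; chunks is the number of
# micro-batches kept in flight so both GPUs stay busy.
model = Pipe(model, balance=[2, 3], devices=[0, 1], chunks=4)

inputs = torch.randn(32, 512).to("cuda:0")  # input lives on the first stage's device
output = model(inputs)                      # output comes back on the last device
output.sum().backward()
```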
Installation & Configuration
- Install via pip alongside PyTorch 1.8 or later
- Wrap your model with FullyShardedDataParallel for memory-efficient training (see the sketch after this list)
- Use the auto-wrap utilities (enable_wrap/auto_wrap with a wrap policy) to shard automatically at submodule boundaries such as Transformer blocks
- Use Pipe for pipeline parallelism with manual or automatic stage splitting
- Combine with activation checkpointing to further reduce memory
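A minimal sketch of the basic FSDP wrapping described above, assuming the script is launched with torchrun and a NCCL process group has already been initialized; the layer sizes and hyperparameters are placeholders.

```python
import torch
import torch.nn as nn
from fairscale.nn import FullyShardedDataParallel as FSDP

# Assumes torch.distributed.init_process_group("nccl") has run and each rank
# has selected its CUDA device (e.g. via torchrun).
model = nn.Sequential(
    nn.Linear(1024, 4096), nn.ReLU(),
    nn.Linear(4096, 1024),
).cuda()

# Wrapping shards parameters, gradients, and optimizer state across the
# data-parallel group; shards are gathered on demand for forward/backward.
model = FSDP(model)

# For large Transformers, fairscale.nn also provides enable_wrap/auto_wrap
# helpers so each block becomes its own FSDP unit; see the FairScale docs
# for the policy arguments, which vary by version.

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# A standard training step works unchanged.
inputs = torch.randn(8, 1024, device="cuda")
loss = model(inputs).sum()
loss.backward()
optimizer.step()
```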
Key Features
- FSDP enables training models that exceed single-GPU memory
- Optimizer State Sharding (OSS) cuts per-rank optimizer memory by roughly a factor of the world size (see the sketch after this list)
- Pipeline parallelism with asynchronous scheduling for high throughput
- Activation checkpointing trades compute for memory savings
- Drop-in compatibility with existing PyTorch training scripts
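For the OSS feature, here is a minimal sketch assuming a process group is already initialized (e.g. via torchrun); the model, layer sizes, and learning rate are placeholders. FairScale also offers ShardedDataParallel to pair with OSS for sharded gradient reduction; its exact usage is covered in the FairScale docs.

```python
import torch
import torch.nn as nn
from fairscale.optim.oss import OSS

# Assumes torch.distributed is initialized; OSS shards optimizer state
# (e.g. Adam's moment tensors) across the data-parallel ranks.
model = nn.Linear(1024, 1024).cuda()

optimizer = OSS(
    params=model.parameters(),
    optim=torch.optim.Adam,   # the optimizer class, not an instance
    lr=1e-4,
)

loss = model(torch.randn(8, 1024, device="cuda")).sum()
loss.backward()
optimizer.step()              # each rank updates its shard, then parameters are synced
```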
Comparison with Similar Tools
- PyTorch FSDP (native) — FairScale's FSDP was upstreamed into PyTorch core; FairScale remains useful for experimental features
- DeepSpeed — broader optimization suite with ZeRO stages; FairScale is lighter and more PyTorch-native
- Megatron-LM — designed for massive GPU clusters; FairScale is more accessible for smaller teams
- Accelerate — higher-level API that can use FSDP internally; FairScale provides the lower-level primitives
FAQ
Q: Should I use FairScale or PyTorch native FSDP? A: For stable production use, prefer PyTorch native FSDP (torch.distributed.fsdp). FairScale is useful for experimental features not yet upstreamed.
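For reference, the two imports look like this; the native path requires PyTorch 1.11 or later, and the wrapped model here is a toy placeholder.

```python
# Upstreamed implementation, recommended for production (PyTorch >= 1.11):
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# FairScale implementation, home of features not yet upstreamed:
# from fairscale.nn import FullyShardedDataParallel as FSDP

import torch.nn as nn
model = FSDP(nn.Linear(1024, 1024).cuda())  # assumes an initialized process group
```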
Q: How much memory does FSDP save? A: FSDP shards parameters, gradients, and optimizer state, so the memory they consume drops by roughly a factor of the world size. With 8 GPUs, each device holds roughly 1/8th of those tensors; activations are not sharded and still have to fit on each device.
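A back-of-the-envelope sketch for a hypothetical 1-billion-parameter model trained with Adam in FP32 (illustrative numbers, not a measurement):

```python
params = 1e9                       # hypothetical model size
weight_bytes = 4                   # fp32 weights
grad_bytes = 4                     # fp32 gradients
adam_bytes = 8                     # exp_avg + exp_avg_sq, fp32 each

total_gib = params * (weight_bytes + grad_bytes + adam_bytes) / 2**30
per_gpu_gib = total_gib / 8        # 8-way FSDP shards all three kinds of state

print(f"unsharded: {total_gib:.1f} GiB, per GPU: {per_gpu_gib:.1f} GiB")
# -> unsharded: ~14.9 GiB, per GPU: ~1.9 GiB; activations are extra and not sharded
```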
Q: Can I use FairScale with a single GPU? A: Some features like activation checkpointing work on a single GPU, but FSDP and pipeline parallelism require multiple GPUs.
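For example, activation checkpointing needs no process group. A single-GPU sketch using fairscale.nn.checkpoint_wrapper (layer sizes are arbitrary):

```python
import torch
import torch.nn as nn
from fairscale.nn import checkpoint_wrapper

# The wrapped block discards its activations in forward and recomputes them
# in backward, trading compute for memory; no distributed setup is needed.
model = nn.Sequential(
    nn.Linear(1024, 1024),
    checkpoint_wrapper(nn.Sequential(
        nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024),
    )),
    nn.Linear(1024, 10),
).cuda()

out = model(torch.randn(16, 1024, device="cuda"))
out.sum().backward()
```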
Q: Does FairScale support mixed precision? A: Yes. FairScale FSDP works with PyTorch AMP for FP16 or BF16 mixed-precision training.
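A sketch of one FP16 training step combining FairScale FSDP with PyTorch AMP; it assumes an initialized process group plus an FSDP-wrapped `model`, an `optimizer`, and a `loader` yielding (inputs, targets) batches, all placeholders here. FairScale's ShardedGradScaler is used because gradients are sharded; check your version's docs for whether the plain torch.cuda.amp.GradScaler also suffices for your setup.

```python
import torch
from fairscale.optim.grad_scaler import ShardedGradScaler

scaler = ShardedGradScaler()   # gradient scaler aware of sharded gradients

for inputs, targets in loader:                          # `loader` is a placeholder
    optimizer.zero_grad()
    with torch.cuda.amp.autocast(dtype=torch.float16):  # or torch.bfloat16
        loss = torch.nn.functional.cross_entropy(model(inputs), targets)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```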