Introduction
FairScale is a PyTorch library from Meta AI that provides primitives for training models that are too large to fit on a single GPU. It introduced Fully Sharded Data Parallel (FSDP), which shards model parameters, gradients, and optimizer states across GPUs to dramatically reduce per-device memory usage while maintaining training speed.
What FairScale Does
- Shards model parameters across GPUs with Fully Sharded Data Parallel
- Splits model layers across devices with pipeline parallelism
- Reduces memory usage via activation checkpointing and offloading
- Provides optimizer utilities such as OSS (Optimizer State Sharding), which shards optimizer state across ranks, and AdaScale for large-batch training
- Enables training of billion-parameter models on commodity hardware
Architecture Overview
FairScale wraps PyTorch modules with parallelism strategies. FSDP partitions each parameter tensor across the data-parallel group, gathering shards on demand for forward and backward passes. Pipeline parallelism splits sequential model stages across devices and uses micro-batching to keep all GPUs active. Both strategies compose with standard PyTorch training loops.
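As a concrete sketch of the pipeline side, the snippet below splits a sequential model across two GPUs with fairscale.nn.Pipe. It is a minimal illustration assuming two visible CUDA devices; the layer balance and micro-batch count are arbitrary choices, and the exact Pipe keyword arguments may differ between FairScale versions.

```python
import torch
import torch.nn as nn
from fairscale.nn import Pipe

# A purely sequential model can be cut into pipeline stages.
model = nn.Sequential(
    nn.Linear(512, 2048), nn.ReLU(),   # stage 0
    nn.Linear(2048, 2048), nn.ReLU(),  # stage 1
    nn.Linear(2048, 512),
)

# balance assigns consecutive layers to each device; chunks is the number of
# micro-batches kept in flight so both GPUs stay busy.
model = Pipe(model, balance=[2, 3], devices=[0, 1], chunks=4)

inputs = torch.randn(32, 512).to("cuda:0")  # input lives on the first stage's device
output = model(inputs)                      # output comes back on the last device
output.sum().backward()
```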
Installation & Configuration
- Install via pip alongside PyTorch 1.8 or later
- Wrap your model with FullyShardedDataParallel for memory-efficient training (see the sketch after this list)
- Use the auto-wrap utilities (enable_wrap/auto_wrap with a wrap policy) to shard automatically at submodule boundaries such as Transformer blocks
- Use Pipe for pipeline parallelism with manual or automatic stage splitting
- Combine with activation checkpointing to further reduce memory
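A minimal sketch of the basic FSDP wrapping described above, assuming the script is launched with torchrun and a NCCL process group has already been initialized; the layer sizes and hyperparameters are placeholders.

```python
import torch
import torch.nn as nn
from fairscale.nn import FullyShardedDataParallel as FSDP

# Assumes torch.distributed.init_process_group("nccl") has run and each rank
# has selected its CUDA device (e.g. via torchrun).
model = nn.Sequential(
    nn.Linear(1024, 4096), nn.ReLU(),
    nn.Linear(4096, 1024),
).cuda()

# Wrapping shards parameters, gradients, and optimizer state across the
# data-parallel group; shards are gathered on demand for forward/backward.
model = FSDP(model)

# For large Transformers, fairscale.nn also provides enable_wrap/auto_wrap
# helpers so each block becomes its own FSDP unit; see the FairScale docs
# for the policy arguments, which vary by version.

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# A standard training step works unchanged.
inputs = torch.randn(8, 1024, device="cuda")
loss = model(inputs).sum()
loss.backward()
optimizer.step()
```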
Key Features
- FSDP enables training models that exceed single-GPU memory
- Optimizer State Sharding (OSS) cuts per-rank optimizer memory by roughly a factor of the world size (see the sketch after this list)
- Pipeline parallelism with asynchronous scheduling for high throughput
- Activation checkpointing trades compute for memory savings
- Drop-in compatibility with existing PyTorch training scripts
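For the OSS feature, here is a minimal sketch assuming a process group is already initialized (e.g. via torchrun); the model, layer sizes, and learning rate are placeholders. FairScale also offers ShardedDataParallel to pair with OSS for sharded gradient reduction; its exact usage is covered in the FairScale docs.

```python
import torch
import torch.nn as nn
from fairscale.optim.oss import OSS

# Assumes torch.distributed is initialized; OSS shards optimizer state
# (e.g. Adam's moment tensors) across the data-parallel ranks.
model = nn.Linear(1024, 1024).cuda()

optimizer = OSS(
    params=model.parameters(),
    optim=torch.optim.Adam,   # the optimizer class, not an instance
    lr=1e-4,
)

loss = model(torch.randn(8, 1024, device="cuda")).sum()
loss.backward()
optimizer.step()              # each rank updates its shard, then parameters are synced
```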
Comparison with Similar Tools
- PyTorch FSDP (native) — FairScale's FSDP was upstreamed into PyTorch core; FairScale remains useful for experimental features
- DeepSpeed — broader optimization suite with ZeRO stages; FairScale is lighter and more PyTorch-native
- Megatron-LM — designed for massive GPU clusters; FairScale is more accessible for smaller teams
- Accelerate — higher-level API that can use FSDP internally; FairScale provides the lower-level primitives
FAQ
Q: Should I use FairScale or PyTorch native FSDP? A: For stable production use, prefer PyTorch native FSDP (torch.distributed.fsdp). FairScale is useful for experimental features not yet upstreamed.
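For reference, the two imports look like this; the native path requires PyTorch 1.11 or later, and the wrapped model here is a toy placeholder.

```python
# Upstreamed implementation, recommended for production (PyTorch >= 1.11):
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# FairScale implementation, home of features not yet upstreamed:
# from fairscale.nn import FullyShardedDataParallel as FSDP

import torch.nn as nn
model = FSDP(nn.Linear(1024, 1024).cuda())  # assumes an initialized process group
```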
Q: How much memory does FSDP save? A: FSDP shards parameters, gradients, and optimizer state, so the memory they consume drops by roughly a factor of the world size. With 8 GPUs, each device holds roughly 1/8th of those tensors; activations are not sharded and still have to fit on each device.
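A back-of-the-envelope sketch for a hypothetical 1-billion-parameter model trained with Adam in FP32 (illustrative numbers, not a measurement):

```python
params = 1e9                       # hypothetical model size
weight_bytes = 4                   # fp32 weights
grad_bytes = 4                     # fp32 gradients
adam_bytes = 8                     # exp_avg + exp_avg_sq, fp32 each

total_gib = params * (weight_bytes + grad_bytes + adam_bytes) / 2**30
per_gpu_gib = total_gib / 8        # 8-way FSDP shards all three kinds of state

print(f"unsharded: {total_gib:.1f} GiB, per GPU: {per_gpu_gib:.1f} GiB")
# -> unsharded: ~14.9 GiB, per GPU: ~1.9 GiB; activations are extra and not sharded
```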
Q: Can I use FairScale with a single GPU? A: Some features like activation checkpointing work on a single GPU, but FSDP and pipeline parallelism require multiple GPUs.
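For example, activation checkpointing needs no process group. A single-GPU sketch using fairscale.nn.checkpoint_wrapper (layer sizes are arbitrary):

```python
import torch
import torch.nn as nn
from fairscale.nn import checkpoint_wrapper

# The wrapped block discards its activations in forward and recomputes them
# in backward, trading compute for memory; no distributed setup is needed.
model = nn.Sequential(
    nn.Linear(1024, 1024),
    checkpoint_wrapper(nn.Sequential(
        nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024),
    )),
    nn.Linear(1024, 10),
).cuda()

out = model(torch.randn(16, 1024, device="cuda"))
out.sum().backward()
```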
Q: Does FairScale support mixed precision? A: Yes. FairScale FSDP works with PyTorch AMP for FP16 or BF16 mixed-precision training.
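A sketch of one FP16 training step combining FairScale FSDP with PyTorch AMP; it assumes an initialized process group plus an FSDP-wrapped `model`, an `optimizer`, and a `loader` yielding (inputs, targets) batches, all placeholders here. FairScale's ShardedGradScaler is used because gradients are sharded; check your version's docs for whether the plain torch.cuda.amp.GradScaler also suffices for your setup.

```python
import torch
from fairscale.optim.grad_scaler import ShardedGradScaler

scaler = ShardedGradScaler()   # gradient scaler aware of sharded gradients

for inputs, targets in loader:                          # `loader` is a placeholder
    optimizer.zero_grad()
    with torch.cuda.amp.autocast(dtype=torch.float16):  # or torch.bfloat16
        loss = torch.nn.functional.cross_entropy(model(inputs), targets)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```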