# FairScale — PyTorch Extensions for Large-Scale Training

> FairScale provides PyTorch extensions for high-performance and large-scale training, including fully sharded data parallelism, pipeline parallelism, and memory-efficient optimizers.

## Quick Use

```bash
pip install fairscale
python -c "
import torch
import torch.distributed as dist
from fairscale.nn import FullyShardedDataParallel as FSDP

# FSDP requires an initialized process group, even on a single process
dist.init_process_group('gloo', init_method='tcp://127.0.0.1:29500', rank=0, world_size=1)

model = torch.nn.Linear(1024, 1024)
fsdp_model = FSDP(model)
optimizer = torch.optim.Adam(fsdp_model.parameters(), lr=1e-3)

x = torch.randn(32, 1024)
loss = fsdp_model(x).sum()
loss.backward()
optimizer.step()
print('FSDP training step complete')
"
```

## Introduction

FairScale is a PyTorch library from Meta AI that provides primitives for training models that are too large to fit on a single GPU. It introduced Fully Sharded Data Parallel (FSDP), which shards model parameters, gradients, and optimizer states across GPUs to dramatically reduce per-device memory usage while maintaining training speed.

## What FairScale Does

- Shards model parameters across GPUs with Fully Sharded Data Parallel
- Splits model layers across devices with pipeline parallelism
- Reduces memory usage via activation checkpointing and offloading
- Provides optimizer tooling such as AdaScale (learning-rate scaling for large batches) and OSS (Optimizer State Sharding)
- Enables training of billion-parameter models on commodity hardware

## Architecture Overview

FairScale wraps PyTorch modules with parallelism strategies. FSDP partitions each parameter tensor across the data-parallel group, gathering shards on demand for forward and backward passes. Pipeline parallelism splits sequential model stages across devices and uses micro-batching to keep all GPUs active. Both strategies compose with standard PyTorch training loops.

## Self-Hosting & Configuration

- Install via pip alongside PyTorch 1.8 or later
- Wrap your model with FullyShardedDataParallel for memory-efficient training (see the sketches after this list)
- Set an auto_wrap_policy to automatically shard at Transformer block boundaries
- Use Pipe for pipeline parallelism with manual or automatic stage splitting
- Combine with activation checkpointing to further reduce memory
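For per-block sharding, FairScale ships wrapping helpers. The following is a minimal sketch, assuming the `enable_wrap` and `wrap` helpers from `fairscale.nn.wrap` (present in recent FairScale versions) and an already-initialized process group; `Block` and `build_model` are hypothetical names used only for illustration, and `auto_wrap` can replace the manual `wrap` calls when a policy-driven approach is preferred:

```python
import torch.nn as nn
from fairscale.nn import FullyShardedDataParallel as FSDP
from fairscale.nn.wrap import enable_wrap, wrap

class Block(nn.Module):
    """Hypothetical transformer-style block, for illustration only."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        return x + self.ff(self.norm(x))

def build_model(dim=256, depth=4):
    # Inside enable_wrap, wrap(module) returns FSDP(module); outside the
    # context it is a no-op, so the same model code runs with or without FSDP.
    # Assumes torch.distributed is already initialized, as in Quick Use.
    with enable_wrap(wrapper_cls=FSDP):
        blocks = [wrap(Block(dim)) for _ in range(depth)]
    # The outer FSDP instance shards whatever the inner wraps did not claim.
    return FSDP(nn.Sequential(*blocks))
```

Wrapping at block boundaries keeps only one block's full parameters materialized at a time during forward and backward, which is where most of FSDP's memory savings come from.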
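`Pipe` takes the model as a single `nn.Sequential` plus a `balance` list assigning consecutive layers to each stage. A minimal single-process sketch, assuming two visible CUDA devices (stage-to-device defaults are an assumption here):

```python
import torch
import torch.nn as nn
from fairscale.nn import Pipe

model = nn.Sequential(
    nn.Linear(1024, 4096), nn.ReLU(),
    nn.Linear(4096, 1024), nn.ReLU(),
)

# balance=[2, 2] puts the first two layers on the first device and the last
# two on the second; chunks=8 splits each batch into 8 micro-batches so both
# stages stay busy.
pipe = Pipe(model, balance=[2, 2], chunks=8)

x = torch.randn(32, 1024).cuda(0)  # input goes to the first stage's device
out = pipe(x)                      # output lands on the last stage's device
```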
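Activation checkpointing works standalone, including on a single GPU or CPU; a minimal sketch using `checkpoint_wrapper`:

```python
import torch
import torch.nn as nn
from fairscale.nn import checkpoint_wrapper

# The wrapped block drops its intermediate activations during forward and
# recomputes them during backward, trading extra compute for memory.
block = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))
block = checkpoint_wrapper(block, offload_to_cpu=False)

x = torch.randn(8, 1024, requires_grad=True)
block(x).sum().backward()  # activations inside the block are recomputed here
```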
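Optimizer state sharding follows the same wrap-style pattern. A minimal OSS sketch, again assuming an initialized process group; note that OSS takes the optimizer class, not an instance:

```python
import torch
from fairscale.optim import OSS

model = torch.nn.Linear(1024, 1024)

# Each rank keeps only its shard of the Adam moment buffers, so optimizer
# memory per GPU drops roughly by a factor of the world size.
# Assumes torch.distributed is already initialized, as in Quick Use.
optimizer = OSS(params=model.parameters(), optim=torch.optim.Adam, lr=1e-3)

loss = model(torch.randn(32, 1024)).sum()
loss.backward()
optimizer.step()  # each rank updates its shard, then syncs updated params
```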
## Key Features

- FSDP enables training models that exceed single-GPU memory
- Optimizer State Sharding (OSS) reduces optimizer memory by a factor of the world size
- Pipeline parallelism with asynchronous scheduling for high throughput
- Activation checkpointing trades compute for memory savings
- Drop-in compatibility with existing PyTorch training scripts

## Comparison with Similar Tools

- **PyTorch FSDP (native)** — FairScale's FSDP was upstreamed into PyTorch core; FairScale remains useful for experimental features
- **DeepSpeed** — broader optimization suite with ZeRO stages; FairScale is lighter and more PyTorch-native
- **Megatron-LM** — designed for massive GPU clusters; FairScale is more accessible for smaller teams
- **Accelerate** — higher-level API that can use FSDP internally; FairScale provides the lower-level primitives

## FAQ

**Q: Should I use FairScale or PyTorch native FSDP?**
A: For stable production use, prefer PyTorch native FSDP (torch.distributed.fsdp). FairScale is useful for experimental features not yet upstreamed.

**Q: How much memory does FSDP save?**
A: FSDP reduces the per-GPU memory needed for parameters, gradients, and optimizer state roughly in proportion to the number of GPUs. With 8 GPUs, each device holds roughly 1/8th of the model parameters.

**Q: Can I use FairScale with a single GPU?**
A: Some features like activation checkpointing work on a single GPU, but FSDP and pipeline parallelism require multiple GPUs.

**Q: Does FairScale support mixed precision?**
A: Yes. FairScale FSDP works with PyTorch AMP for FP16 or BF16 mixed-precision training.

## Sources

- https://github.com/facebookresearch/fairscale
- https://fairscale.readthedocs.io/