# FairScale — PyTorch Extensions for Large-Scale Training

> FairScale provides PyTorch extensions for high-performance and large-scale training, including fully sharded data parallelism, pipeline parallelism, and memory-efficient optimizers.

## Quick Use

```bash
pip install fairscale
python -c "
import torch
import torch.distributed as dist
from fairscale.nn import FullyShardedDataParallel as FSDP

# FSDP requires an initialized process group, even on a single process
dist.init_process_group('gloo', init_method='tcp://127.0.0.1:29500', rank=0, world_size=1)

model = torch.nn.Linear(1024, 1024)
fsdp_model = FSDP(model)
optimizer = torch.optim.Adam(fsdp_model.parameters(), lr=1e-3)

x = torch.randn(32, 1024)
loss = fsdp_model(x).sum()
loss.backward()
optimizer.step()
print('FSDP training step complete')
"
```

## Introduction

FairScale is a PyTorch library from Meta AI that provides primitives for training models that are too large to fit on a single GPU. It introduced Fully Sharded Data Parallel (FSDP), which shards model parameters, gradients, and optimizer states across GPUs to dramatically reduce per-device memory usage while maintaining training speed.

## What FairScale Does

- Shards model parameters across GPUs with Fully Sharded Data Parallel
- Splits model layers across devices with pipeline parallelism
- Reduces memory usage via activation checkpointing and offloading
- Provides optimizer tooling such as AdaScale (learning-rate scaling for large batches) and OSS (Optimizer State Sharding)
- Enables training of billion-parameter models on commodity hardware

## Architecture Overview

FairScale wraps PyTorch modules with parallelism strategies. FSDP partitions each parameter tensor across the data-parallel group, gathering shards on demand for forward and backward passes. Pipeline parallelism splits sequential model stages across devices and uses micro-batching to keep all GPUs active. Both strategies compose with standard PyTorch training loops.

## Self-Hosting & Configuration

- Install via pip alongside PyTorch 1.8 or later
- Wrap your model with FullyShardedDataParallel for memory-efficient training (see the sketches after this list)
- Set an auto_wrap_policy to automatically shard at Transformer block boundaries
- Use Pipe for pipeline parallelism with manual or automatic stage splitting
- Combine with activation checkpointing to further reduce memory
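For per-block sharding, FairScale ships wrapping helpers. The following is a minimal sketch, assuming the `enable_wrap` and `wrap` helpers from `fairscale.nn.wrap` (present in recent FairScale versions) and an already-initialized process group; `Block` and `build_model` are hypothetical names used only for illustration, and `auto_wrap` can replace the manual `wrap` calls when a policy-driven approach is preferred:

```python
import torch.nn as nn
from fairscale.nn import FullyShardedDataParallel as FSDP
from fairscale.nn.wrap import enable_wrap, wrap

class Block(nn.Module):
    """Hypothetical transformer-style block, for illustration only."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        return x + self.ff(self.norm(x))

def build_model(dim=256, depth=4):
    # Inside enable_wrap, wrap(module) returns FSDP(module); outside the
    # context it is a no-op, so the same model code runs with or without FSDP.
    # Assumes torch.distributed is already initialized, as in Quick Use.
    with enable_wrap(wrapper_cls=FSDP):
        blocks = [wrap(Block(dim)) for _ in range(depth)]
    # The outer FSDP instance shards whatever the inner wraps did not claim.
    return FSDP(nn.Sequential(*blocks))
```

Wrapping at block boundaries keeps only one block's full parameters materialized at a time during forward and backward, which is where most of FSDP's memory savings come from.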
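`Pipe` takes the model as a single `nn.Sequential` plus a `balance` list assigning consecutive layers to each stage. A minimal single-process sketch, assuming two visible CUDA devices (stage-to-device defaults are an assumption here):

```python
import torch
import torch.nn as nn
from fairscale.nn import Pipe

model = nn.Sequential(
    nn.Linear(1024, 4096), nn.ReLU(),
    nn.Linear(4096, 1024), nn.ReLU(),
)

# balance=[2, 2] puts the first two layers on the first device and the last
# two on the second; chunks=8 splits each batch into 8 micro-batches so both
# stages stay busy.
pipe = Pipe(model, balance=[2, 2], chunks=8)

x = torch.randn(32, 1024).cuda(0)  # input goes to the first stage's device
out = pipe(x)                      # output lands on the last stage's device
```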
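Activation checkpointing works standalone, including on a single GPU or CPU; a minimal sketch using `checkpoint_wrapper`:

```python
import torch
import torch.nn as nn
from fairscale.nn import checkpoint_wrapper

# The wrapped block drops its intermediate activations during forward and
# recomputes them during backward, trading extra compute for memory.
block = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))
block = checkpoint_wrapper(block, offload_to_cpu=False)

x = torch.randn(8, 1024, requires_grad=True)
block(x).sum().backward()  # activations inside the block are recomputed here
```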
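Optimizer state sharding follows the same wrap-style pattern. A minimal OSS sketch, again assuming an initialized process group; note that OSS takes the optimizer class, not an instance:

```python
import torch
from fairscale.optim import OSS

model = torch.nn.Linear(1024, 1024)

# Each rank keeps only its shard of the Adam moment buffers, so optimizer
# memory per GPU drops roughly by a factor of the world size.
# Assumes torch.distributed is already initialized, as in Quick Use.
optimizer = OSS(params=model.parameters(), optim=torch.optim.Adam, lr=1e-3)

loss = model(torch.randn(32, 1024)).sum()
loss.backward()
optimizer.step()  # each rank updates its shard, then syncs updated params
```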
## Key Features

- FSDP enables training models that exceed single-GPU memory
- Optimizer State Sharding (OSS) reduces optimizer memory by a factor of the world size
- Pipeline parallelism with asynchronous scheduling for high throughput
- Activation checkpointing trades compute for memory savings
- Drop-in compatibility with existing PyTorch training scripts

## Comparison with Similar Tools

- **PyTorch FSDP (native)** — FairScale's FSDP was upstreamed into PyTorch core; FairScale remains useful for experimental features
- **DeepSpeed** — broader optimization suite with ZeRO stages; FairScale is lighter and more PyTorch-native
- **Megatron-LM** — designed for massive GPU clusters; FairScale is more accessible for smaller teams
- **Accelerate** — higher-level API that can use FSDP internally; FairScale provides the lower-level primitives

## FAQ

**Q: Should I use FairScale or PyTorch native FSDP?**
A: For stable production use, prefer PyTorch native FSDP (torch.distributed.fsdp). FairScale is useful for experimental features not yet upstreamed.

**Q: How much memory does FSDP save?**
A: FSDP reduces the per-GPU memory needed for parameters, gradients, and optimizer state roughly in proportion to the number of GPUs. With 8 GPUs, each device holds roughly 1/8th of the model parameters.

**Q: Can I use FairScale with a single GPU?**
A: Some features like activation checkpointing work on a single GPU, but FSDP and pipeline parallelism require multiple GPUs.

**Q: Does FairScale support mixed precision?**
A: Yes. FairScale FSDP works with PyTorch AMP for FP16 or BF16 mixed-precision training.

## Sources

- https://github.com/facebookresearch/fairscale
- https://fairscale.readthedocs.io/