Skills · May 12, 2026 · 3 min read

FairScale — PyTorch Extensions for Large-Scale Training

FairScale provides PyTorch extensions for high-performance and large-scale training, including fully sharded data parallelism, pipeline parallelism, and memory-efficient optimizers.

Agent-ready

This asset can be read and installed directly by agents

TokRepo exposes a universal CLI command, an install contract, JSON metadata, an adapter-specific plan, and the raw content to help agents assess fit, risk, and next steps.

Native · 98/100 · Policy: allow
Agent surface: Any MCP/CLI agent
Type: Skill
Installation: Single
Trust: Established
Entry point: FairScale Training
Universal CLI command:
npx tokrepo install ea054e01-4ddd-11f1-9bc6-00163e2b0d79

Introduction

FairScale is a PyTorch library from Meta AI that provides primitives for training models that are too large to fit on a single GPU. It introduced Fully Sharded Data Parallel (FSDP), which shards model parameters, gradients, and optimizer states across GPUs to dramatically reduce per-device memory usage while maintaining training speed.

What FairScale Does

  • Shards model parameters across GPUs with Fully Sharded Data Parallel
  • Splits model layers across devices with pipeline parallelism
  • Reduces memory usage via activation checkpointing and offloading
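  • Provides optimizer wrappers such as OSS (optimizer state sharding, which cuts per-GPU optimizer memory) and AdaScale (learning-rate scaling for large-batch training)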
  • Enables training of billion-parameter models on commodity hardware

Architecture Overview

FairScale wraps PyTorch modules with parallelism strategies. FSDP partitions each parameter tensor across the data-parallel group, gathering shards on demand for forward and backward passes. Pipeline parallelism splits sequential model stages across devices and uses micro-batching to keep all GPUs active. Both strategies compose with standard PyTorch training loops.
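
The shard-and-gather idea described above can be illustrated with a toy, single-process simulation. This is only a sketch of the concept, not FairScale's implementation: the helper names `shard` and `all_gather` are hypothetical, plain Python lists stand in for tensors, and no real communication takes place.

```python
from typing import List


def shard(flat_params: List[float], world_size: int) -> List[List[float]]:
    """Split a flat parameter list into equal-sized shards, zero-padding the tail."""
    shard_len = -(-len(flat_params) // world_size)  # ceiling division
    padding = [0.0] * (shard_len * world_size - len(flat_params))
    padded = flat_params + padding
    return [padded[r * shard_len:(r + 1) * shard_len] for r in range(world_size)]


def all_gather(shards: List[List[float]], numel: int) -> List[float]:
    """Concatenate every rank's shard and drop the padding again."""
    full = [value for rank_shard in shards for value in rank_shard]
    return full[:numel]


# Each simulated "rank" stores only its own shard between passes.
params = [0.1 * i for i in range(10)]
shards = shard(params, world_size=4)

# Before a forward or backward pass, the full parameters are rebuilt on demand.
rebuilt = all_gather(shards, numel=len(params))
assert rebuilt == params
```

In a real FSDP run the gathered copy is freed again after use, which is where the memory saving comes from; the simulation above only shows the bookkeeping.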

Self-Hosting & Configuration

  • Install with pip (pip install fairscale) alongside PyTorch 1.8 or later
  • Wrap your model with FullyShardedDataParallel for memory-efficient training
  • Use enable_wrap and auto_wrap from fairscale.nn.wrap, optionally with an auto_wrap_policy, to shard submodules automatically (for example at Transformer block boundaries)
  • Use Pipe for pipeline parallelism with manual or automatic stage splitting
  • Combine with activation checkpointing to further reduce memory
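
As a rough illustration of the wrapping step, here is a minimal sketch of a training function that wraps a model in FairScale's FullyShardedDataParallel. It assumes one process per GPU started by a distributed launcher such as torchrun, an NCCL backend, and CUDA devices. The function name `train_fsdp`, the toy model, and the hyperparameters are hypothetical and chosen only for illustration.

```python
import os


def train_fsdp(steps: int = 3) -> None:
    """Run a few FSDP training steps; call this from a script started by a distributed launcher."""
    # Imported inside the function so the file can be loaded on machines
    # that do not have PyTorch or FairScale installed.
    import torch
    import torch.distributed as dist
    import torch.nn as nn
    from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP

    # The launcher is assumed to set RANK, WORLD_SIZE and LOCAL_RANK.
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", 0)))

    # Hypothetical toy model standing in for a real network.
    model = nn.Sequential(
        nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024)
    ).cuda()
    model = FSDP(model)  # parameters are sharded across the process group

    # Create the optimizer after wrapping so it sees the sharded parameters.
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    for _ in range(steps):
        batch = torch.randn(8, 1024, device="cuda")
        loss = model(batch).pow(2).mean()  # dummy objective
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    dist.destroy_process_group()
```

Apart from the wrapping line, the loop is an ordinary PyTorch training loop.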

Key Features

  • FSDP enables training models that exceed single-GPU memory
  • Optimizer State Sharding (OSS) cuts per-GPU optimizer state memory by roughly a factor of the world size
  • Pipeline parallelism with asynchronous scheduling for high throughput
  • Activation checkpointing trades compute for memory savings
  • Drop-in compatibility with existing PyTorch training scripts
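
For the optimizer state sharding feature, a minimal sketch might look like the following. It assumes the process group is already initialized and the model already sits on the right device. The helper name `build_sharded_optimizer` and the choice of Adam are illustrative assumptions.

```python
def build_sharded_optimizer(model, lr: float = 1e-3):
    """Wrap Adam in OSS and the model in ShardedDataParallel.

    Assumes torch.distributed has already been initialized.
    """
    # Imported lazily so this helper can be defined without FairScale installed.
    import torch
    from fairscale.nn.data_parallel import ShardedDataParallel as ShardedDDP
    from fairscale.optim.oss import OSS

    # OSS takes the optimizer class plus its keyword arguments and keeps
    # only this rank's slice of the optimizer state.
    optimizer = OSS(params=model.parameters(), optim=torch.optim.Adam, lr=lr)

    # ShardedDDP reduces each gradient to the rank that owns its optimizer state.
    model = ShardedDDP(model, optimizer)
    return model, optimizer
```

The returned pair is then used like a regular model and optimizer in the training loop.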

Comparison with Similar Tools

  • PyTorch FSDP (native) — FairScale's FSDP was upstreamed into PyTorch core; FairScale remains useful for experimental features
  • DeepSpeed — broader optimization suite with ZeRO stages; FairScale is lighter and more PyTorch-native
  • Megatron-LM — designed for massive GPU clusters; FairScale is more accessible for smaller teams
  • Accelerate — higher-level API that can use FSDP internally; FairScale provides the lower-level primitives

FAQ

Q: Should I use FairScale or PyTorch native FSDP? A: For stable production use, prefer PyTorch native FSDP (torch.distributed.fsdp). FairScale is useful for experimental features not yet upstreamed.

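Q: How much memory does FSDP save? A: FSDP shards parameters, gradients, and optimizer state, so the per-GPU memory for that model state shrinks roughly in proportion to the number of GPUs. With 8 GPUs, each device holds roughly 1/8th of the model parameters. Activation memory is not sharded and still depends on batch size and model depth.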
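
A back-of-the-envelope calculation makes the scaling concrete. The sketch below assumes mixed-precision Adam at about 16 bytes of model state per parameter: 2 for FP16 weights, 2 for FP16 gradients, and 12 for FP32 master weights, momentum, and variance. That figure is an assumption, not a FairScale constant, and the function name `fsdp_state_gib` is hypothetical. Activations, buffers, and fragmentation are ignored.

```python
def fsdp_state_gib(num_params: float, world_size: int, bytes_per_param: int = 16) -> float:
    """Per-GPU memory in GiB for fully sharded model state (activations excluded)."""
    return num_params * bytes_per_param / world_size / 2**30


# A 1-billion-parameter model on 1 GPU versus 8 GPUs.
for n_gpus in (1, 8):
    print(f"{n_gpus} GPU(s): {fsdp_state_gib(1e9, n_gpus):.2f} GiB per device")
# prints:
# 1 GPU(s): 14.90 GiB per device
# 8 GPU(s): 1.86 GiB per device
```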

Q: Can I use FairScale with a single GPU? A: Some features, such as activation checkpointing, work on a single GPU, but FSDP and pipeline parallelism only pay off with multiple GPUs.
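
For the single-GPU case, activation checkpointing might be applied as in the sketch below, which uses FairScale's checkpoint_wrapper. The helper name `checkpointed_mlp`, the layer sizes, and the depth are hypothetical.

```python
def checkpointed_mlp(hidden: int = 1024, depth: int = 4):
    """Build a small MLP whose blocks recompute activations during backward."""
    # Imported lazily so this helper can be defined without FairScale installed.
    import torch.nn as nn
    from fairscale.nn.checkpoint import checkpoint_wrapper

    blocks = [
        # Each wrapped block drops its intermediate activations after the
        # forward pass and recomputes them when gradients are needed.
        checkpoint_wrapper(nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU()))
        for _ in range(depth)
    ]
    return nn.Sequential(*blocks)
```

The resulting module trains like any other nn.Module, at the cost of one extra forward computation per wrapped block.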

Q: Does FairScale support mixed precision? A: Yes. FairScale FSDP works with PyTorch AMP for FP16 or BF16 mixed-precision training.
