Introduction
GPT-NeoX is EleutherAI's distributed training framework built on top of Megatron-LM and DeepSpeed. It was designed to make training billion-parameter language models accessible to the open-source research community, and it was used to train GPT-NeoX-20B and the Pythia model suite.
What GPT-NeoX Does
- Trains autoregressive transformer language models at scales from millions to tens of billions of parameters
- Combines Megatron-style tensor parallelism with DeepSpeed ZeRO for efficient distributed training
- Supports rotary positional embeddings, parallel attention-FFN, and other modern LLM architecture choices
- Provides YAML-based configuration for full control over model architecture and training hyperparameters (see the config sketch after this list)
- Includes evaluation harness integration for benchmarking trained models
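To give a feel for the configuration surface, the fragment below sketches the kinds of keys a GPT-NeoX-style YAML config uses to pin down the model architecture and training hyperparameters. It is a minimal, illustrative sketch, not a verbatim config: exact key names, formats, and defaults vary between releases, and all values shown are placeholders.

```yaml
# Illustrative fragment of a GPT-NeoX-style config (placeholder values;
# check the configs/ directory of the release you actually use).

# Model architecture
num-layers: 12
hidden-size: 768
num-attention-heads: 12
seq-length: 2048
max-position-embeddings: 2048
pos-emb: rotary            # rotary positional embeddings
no-weight-tying: true

# Training hyperparameters
optimizer:
  type: Adam
  params:
    lr: 0.0006
    betas: [0.9, 0.95]
train-iters: 320000
lr-decay-style: cosine
warmup: 0.01
train_micro_batch_size_per_gpu: 4
gradient_accumulation_steps: 1
```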
Architecture Overview
GPT-NeoX fuses NVIDIA Megatron-LM's tensor and pipeline parallelism with Microsoft DeepSpeed's ZeRO optimizer stages. The training engine distributes model parameters, gradients, and optimizer states across GPUs, enabling models that exceed single-GPU memory. Model architecture and training settings are specified through composable YAML configs that override defaults hierarchically.
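To make the division of labor concrete: the total GPU count factors into data-, tensor- (model-), and pipeline-parallel degrees, and ZeRO shards optimizer state (and, at higher stages, gradients and parameters) across the data-parallel replicas. The fragment below is a hedged sketch of how those degrees are typically expressed in a config; the key names follow the pattern of published example configs but may differ by version, and the GPU counts are hypothetical.

```yaml
# Hypothetical layout for a 32-GPU job:
# data-parallel degree = 32 / (model-parallel * pipe-parallel) = 32 / (2 * 4) = 4
model-parallel-size: 2     # Megatron-style tensor parallelism within each layer
pipe-parallel-size: 4      # pipeline parallelism across groups of layers

# DeepSpeed ZeRO settings (passed through to DeepSpeed)
zero_optimization:
  stage: 1                 # shard optimizer states across data-parallel ranks
  allgather_partitions: true
  reduce_scatter: true
```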
Self-Hosting & Configuration
- Requires Python 3.8+, PyTorch 1.8+, and NVIDIA GPUs with NCCL
- Multi-node training uses SSH or a cluster scheduler like SLURM
- All architecture and training options are set via YAML config files
- Pre-built Docker images are available for reproducible environments
- Data preprocessing scripts convert raw text to tokenized binary shards, which the training config then points to (see the sketch after this list)
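For a self-hosted run, the config also records where the preprocessed data shards, tokenizer files, and checkpoints live, and, for multi-node jobs, the hostfile listing worker nodes. The sketch below is loosely modeled on the repository's local-setup example; every path and key shown is a placeholder, so verify the names against the version you deploy.

```yaml
# Paths produced by the data preprocessing step (tokenized binary shards)
data-path: data/mydataset/mydataset_text_document
vocab-file: data/tokenizer/vocab.json
merge-file: data/tokenizer/merges.txt

# Where checkpoints and logs are written to and resumed from
save: checkpoints
load: checkpoints
log-dir: logs
tensorboard-dir: tensorboard

# Multi-node launch: a DeepSpeed-style hostfile listing worker nodes
hostfile: /path/to/hostfile
```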
Key Features
- Scales from a single GPU to hundreds of GPUs with model and data parallelism
- YAML-driven configuration makes experiments reproducible and easy to iterate
- Produced the Pythia model suite used in hundreds of research papers
- Supports FlashAttention, fused kernels, and mixed-precision training (see the config sketch after this list)
- Evaluation pipeline integrates with EleutherAI's lm-evaluation-harness
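As a rough illustration of how those performance features are switched on, the fragment below sketches the relevant config keys. The names follow published example configs (such as the 20B config) but should be checked against the release you run; the layer count in attention_config and the loss-scaling values are placeholders.

```yaml
# Attention implementation: FlashAttention for all 24 layers (placeholder count)
attention_config:
  - [["flash"], 24]

# Fused CUDA kernels
scaled-upper-triang-masked-softmax-fusion: true
bias-gelu-fusion: true

# Mixed precision (DeepSpeed fp16 block; bf16 is configured similarly)
fp16:
  enabled: true
  loss_scale: 0            # 0 selects dynamic loss scaling
  initial_scale_power: 12
```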
Comparison with Similar Tools
- Megatron-LM — NVIDIA's training framework; GPT-NeoX adds DeepSpeed integration and simpler configuration
- DeepSpeed — optimization library; GPT-NeoX provides the full model definition and training loop on top of DeepSpeed
- LitGPT — Lightning-based GPT training; simpler setup but less flexibility at very large scale
- llm.c — minimal C/CUDA implementation; GPT-NeoX targets production-scale distributed training
FAQ
Q: Can I train a model from scratch with GPT-NeoX? A: Yes. It supports full pre-training from raw text data, including tokenization, data sharding, and distributed training.
Q: What models were trained with GPT-NeoX? A: GPT-NeoX-20B and the Pythia suite (70M to 12B parameters) were trained directly with it; downstream models such as Dolly 2.0 build on Pythia checkpoints.
Q: How many GPUs do I need? A: A small model can be trained on a single GPU. The original GPT-NeoX-20B training run used 96 A100 GPUs.
Q: Is GPT-NeoX still actively developed? A: The core codebase is stable. EleutherAI continues to use and maintain it for new research projects.