Configs · May 2, 2026 · 3 min read

GPT-NeoX — Open-Source Large Language Model Training Library

A GPU-optimized library by EleutherAI for training large-scale autoregressive language models. GPT-NeoX powered the training of GPT-NeoX-20B and Pythia, providing the open-source community with tools for billion-parameter model training.

Introduction

GPT-NeoX is EleutherAI's distributed training framework built on top of Megatron-LM and DeepSpeed. It was designed to make training billion-parameter language models accessible to the open-source research community, and it produced the GPT-NeoX-20B and Pythia model suites.

What GPT-NeoX Does

  • Trains autoregressive transformer language models at scales from millions to tens of billions of parameters
  • Combines Megatron-style tensor parallelism with DeepSpeed ZeRO for efficient distributed training
  • Supports rotary positional embeddings, parallel attention-FFN, and other modern LLM architecture choices
  • Provides YAML-based configuration for full control over model architecture and training hyperparameters (see the config sketch after this list)
  • Includes evaluation harness integration for benchmarking trained models
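To make the configuration style concrete, here is a minimal sketch of what a GPT-NeoX-style YAML config looks like, loaded with PyYAML so its structure is easy to inspect. The key names mirror the small example configs shipped in the repository's configs/ directory, but the specific values are illustrative assumptions, not a recommended recipe.

```python
# Minimal sketch of a GPT-NeoX-style YAML config; key names follow the small
# configs bundled with the repo, values are illustrative only.
import yaml

config_text = """
# Model architecture
num_layers: 12
hidden_size: 768
num_attention_heads: 12
seq_length: 2048
max_position_embeddings: 2048
pos_emb: rotary                      # rotary positional embeddings

# Parallelism layout
pipe_parallel_size: 1
model_parallel_size: 1

# Training hyperparameters
train_micro_batch_size_per_gpu: 4
gradient_accumulation_steps: 8
train_iters: 320000
optimizer:
  type: Adam
  params:
    lr: 0.0006
    betas: [0.9, 0.95]

# Mixed precision and ZeRO settings, passed through to DeepSpeed
fp16:
  enabled: true
zero_optimization:
  stage: 1
"""

config = yaml.safe_load(config_text)
print(config["hidden_size"], config["pos_emb"])
```

In practice several such files are passed to the training launcher together and merged, which is what makes the hierarchical overrides described below possible.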

Architecture Overview

GPT-NeoX fuses NVIDIA Megatron-LM's tensor and pipeline parallelism with Microsoft DeepSpeed's ZeRO optimizer stages. The training engine distributes model parameters, gradients, and optimizer states across GPUs, enabling models that exceed single-GPU memory. Model architecture and training settings are specified through composable YAML configs that override defaults hierarchically.
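To make the layout concrete, here is a small sketch of how a GPU count decomposes under Megatron/DeepSpeed-style 3D parallelism. The decomposition rule (data-parallel degree = world size divided by tensor times pipeline parallelism) is standard; the function name and example sizes are illustrative assumptions, not GPT-NeoX's actual bookkeeping.

```python
# Sketch of 3D-parallelism bookkeeping: the data-parallel degree is whatever
# remains after tensor and pipeline parallelism have carved up the world size.

def parallel_layout(world_size: int, tensor_parallel: int, pipeline_parallel: int) -> dict:
    """Return the implied data-parallel degree for a given layout."""
    model_parallel = tensor_parallel * pipeline_parallel
    if world_size % model_parallel != 0:
        raise ValueError("world size must be divisible by tensor * pipeline parallelism")
    return {
        "tensor_parallel": tensor_parallel,
        "pipeline_parallel": pipeline_parallel,
        "data_parallel": world_size // model_parallel,
    }

# 96 GPUs split 2-way tensor x 4-way pipeline leaves 12 data-parallel replicas;
# DeepSpeed ZeRO then shards optimizer states across those replicas.
print(parallel_layout(world_size=96, tensor_parallel=2, pipeline_parallel=4))
```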

Self-Hosting & Configuration

  • Requires Python 3.8+, PyTorch 1.8+, and NVIDIA GPUs with NCCL
  • Multi-node training uses SSH or a cluster scheduler like SLURM
  • All architecture and training options are set via YAML config files
  • Pre-built Docker containers available for reproducible environments
  • Data preprocessing scripts convert raw text to tokenized binary shards
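As a hedged sketch of that last step, the snippet below drives the bundled Megatron-style preprocessing script from Python. The flag names (--input, --output-prefix, --tokenizer-type, and so on) follow the script shipped under tools/ in the GPT-NeoX repository, but the script's path and options have moved between releases, so check your checkout; the file names are placeholders.

```python
# Hedged sketch: convert raw JSONL text into tokenized .bin/.idx shards by
# invoking the repo's Megatron-style preprocessing script. Verify the script
# path and flag names against your GPT-NeoX checkout.
import subprocess

subprocess.run(
    [
        "python", "tools/preprocess_data.py",   # location may differ by version
        "--input", "data/mydataset.jsonl",      # one JSON document per line
        "--output-prefix", "data/mydataset",    # produces .bin/.idx shards
        "--tokenizer-type", "HFTokenizer",
        "--vocab-file", "data/tokenizer.json",
        "--append-eod",                         # append an end-of-document token
        "--workers", "8",
    ],
    check=True,
)
```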

Key Features

  • Scales from a single GPU to hundreds of GPUs with model and data parallelism
  • YAML-driven configuration makes experiments reproducible and easy to iterate
  • Produced the Pythia model suite used in hundreds of research papers
  • Supports FlashAttention, fused kernels, and mixed-precision training
  • Evaluation pipeline integrates with EleutherAI's lm-evaluation-harness
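As one illustrative route to that last point, the sketch below scores a released Pythia checkpoint with the standalone lm-evaluation-harness Python API (0.4.x-style simple_evaluate). GPT-NeoX integrates the same harness for checkpoints it trains; the model name, tasks, and batch size here are assumptions chosen for illustration.

```python
# Hedged sketch: evaluate a public Pythia checkpoint with EleutherAI's
# lm-evaluation-harness via its Python API.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-160m",
    tasks=["lambada_openai", "piqa"],
    batch_size=8,
)
print(results["results"])  # per-task metrics such as accuracy and perplexity
```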

Comparison with Similar Tools

  • Megatron-LM — NVIDIA's training framework; GPT-NeoX adds DeepSpeed integration and simpler configuration
  • DeepSpeed — optimization library; GPT-NeoX provides the full model definition and training loop on top of DeepSpeed
  • LitGPT — Lightning-based GPT training; simpler setup but less flexibility at very large scale
  • llm.c — minimal C/CUDA implementation; GPT-NeoX targets production-scale distributed training

FAQ

Q: Can I train a model from scratch with GPT-NeoX? A: Yes. It supports full pre-training from raw text data, including tokenization, data sharding, and distributed training.

Q: What models were trained with GPT-NeoX? A: GPT-NeoX-20B, the Pythia suite (70M to 12B parameters), and derivatives such as Dolly 2.0, which was fine-tuned from Pythia-12B.

Q: How many GPUs do I need? A: A small model can train on a single GPU. Reproducing GPT-NeoX-20B used 96 A100 GPUs.
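A rough memory estimate helps explain those GPU counts. The sketch below uses the common rule of thumb of about 16 bytes per parameter for fp16 weights, fp16 gradients, and fp32 Adam states, and ignores activations entirely, so it is a lower bound rather than GPT-NeoX's actual accounting.

```python
# Lower-bound GPU count from optimizer-state memory alone:
# ~16 bytes/param = fp16 weights (2) + fp16 grads (2) + fp32 master weights (4)
# + Adam first and second moments (4 + 4). Activations come on top.
import math

def min_gpus(n_params: float, gpu_memory_gb: float = 40.0, bytes_per_param: int = 16) -> int:
    """Minimum GPUs needed just to hold weights, gradients, and Adam states."""
    state_gb = n_params * bytes_per_param / 1e9
    return math.ceil(state_gb / gpu_memory_gb)

print(min_gpus(160e6))  # -> 1: a Pythia-160M-sized model fits on a single GPU
print(min_gpus(20e9))   # -> 8: ~320 GB of states before counting activations
```

The real GPT-NeoX-20B run used 96 A100s rather than this floor because activation memory, batch size, and training throughput all matter beyond simply fitting the parameters.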

Q: Is GPT-NeoX still actively developed? A: The core codebase is stable. EleutherAI continues to use and maintain it for new research projects.
