
nanoGPT — The Simplest Repository for Training Medium-Sized GPTs

A minimal, readable codebase for training and fine-tuning GPT-class models from scratch. Written by Andrej Karpathy, nanoGPT strips away framework complexity so you can understand every line of the training loop.

Introduction

nanoGPT is a minimal PyTorch reimplementation of GPT training created by Andrej Karpathy. It is designed to be the simplest, most readable code for training and fine-tuning medium-sized GPT models, making transformer internals accessible to anyone who can read Python.

What nanoGPT Does

  • Trains GPT-2 scale models from scratch on custom datasets
  • Reproduces GPT-2 (124M) on OpenWebText in about 4 days on 8x A100 GPUs
  • Supports character-level and BPE tokenization
  • Provides data preparation scripts for Shakespeare, OpenWebText, and custom corpora (see the sketch after this list)
  • Enables fine-tuning pre-trained GPT-2 checkpoints on new data
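
Both tokenization paths end up in the same place: a flat array of uint16 token ids written to .bin files that training later reads back through np.memmap. A condensed sketch in the spirit of the prepare.py scripts, where input.txt and the output filenames are placeholders:

```python
import numpy as np
import tiktoken  # GPT-2 BPE tokenizer, the same library nanoGPT's prep scripts use

text = open("input.txt", "r", encoding="utf-8").read()

# BPE path: encode with the GPT-2 vocabulary (~50k tokens)
enc = tiktoken.get_encoding("gpt2")
ids = enc.encode_ordinary(text)

# Character-level path (alternative): the vocabulary is just the unique characters
# chars = sorted(set(text)); stoi = {ch: i for i, ch in enumerate(chars)}
# ids = [stoi[ch] for ch in text]

# 90/10 train/val split, stored as raw uint16 so training can memory-map the files
n = int(0.9 * len(ids))
np.array(ids[:n], dtype=np.uint16).tofile("train.bin")
np.array(ids[n:], dtype=np.uint16).tofile("val.bin")
```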

Architecture Overview

The entire model is defined in a single model.py file: a standard decoder-only transformer with causal self-attention, GELU activations, and optional Flash Attention. Training logic lives in train.py, which uses PyTorch DDP for multi-GPU runs and mixed precision via torch.amp. Configuration is handled by plain Python files that override train.py's defaults.
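
A minimal sketch of that attention block, assuming PyTorch 2.0: a fused QKV projection, per-head reshaping, and torch.nn.functional.scaled_dot_product_attention with is_causal=True, which dispatches to a Flash Attention kernel when one is available. The class below is a simplified stand-in rather than the exact code in model.py, though the hyperparameter names (n_embd, n_head) follow nanoGPT's conventions.

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

class CausalSelfAttention(nn.Module):
    """Simplified causal self-attention in the style of nanoGPT's model.py."""

    def __init__(self, n_embd: int, n_head: int, dropout: float = 0.0):
        super().__init__()
        assert n_embd % n_head == 0
        self.n_head = n_head
        self.c_attn = nn.Linear(n_embd, 3 * n_embd)   # fused q, k, v projection
        self.c_proj = nn.Linear(n_embd, n_embd)       # output projection
        self.dropout = dropout

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        q, k, v = self.c_attn(x).split(C, dim=2)
        # reshape to (B, n_head, T, head_dim)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        # PyTorch 2.0 fused kernel; is_causal masks out future positions
        y = F.scaled_dot_product_attention(
            q, k, v,
            dropout_p=self.dropout if self.training else 0.0,
            is_causal=True,
        )
        y = y.transpose(1, 2).contiguous().view(B, T, C)  # re-assemble the heads
        return self.c_proj(y)
```

Stacking a handful of these blocks with MLPs, LayerNorms, and token/position embeddings is essentially the whole of model.py.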

Self-Hosting & Configuration

  • Requires Python 3.10+ and PyTorch 2.0+
  • Data preparation scripts convert raw text into memory-mapped binary token arrays
  • Config files set model size, learning rate, batch size, and target device (see the example after this list)
  • Supports single GPU, multi-GPU DDP, and Apple MPS backends
  • Weights & Biases integration is optional via the --wandb_log flag

Key Features

  • Model definition (model.py) and training loop (train.py) are each roughly 300 lines of Python
  • Reproduces published GPT-2 results at research-grade quality
  • Flash Attention support via PyTorch 2.0 scaled_dot_product_attention
  • Sampling script generates text from trained checkpoints immediately (see the sketch after this list)
  • Clean separation of data prep, training, and inference stages
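
The sampling loop itself is short: repeatedly feed the current context through the model, scale the last position's logits by a temperature, optionally keep only the top-k candidates, and sample the next token. A condensed sketch in the style of nanoGPT's generate method, assuming a model whose forward pass returns logits of shape (batch, time, vocab_size):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def generate(model, idx, max_new_tokens, block_size, temperature=0.8, top_k=200):
    """Autoregressive sampling sketch; idx is a (B, T) tensor of prompt token ids."""
    model.eval()
    for _ in range(max_new_tokens):
        # crop the context to the model's block size
        idx_cond = idx[:, -block_size:]
        logits = model(idx_cond)              # (B, T, vocab_size) -- assumed signature
        logits = logits[:, -1, :] / temperature
        if top_k is not None:
            # drop everything below the k-th largest logit
            v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
            logits[logits < v[:, [-1]]] = -float("inf")
        probs = F.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)  # one new token per sequence
        idx = torch.cat((idx, next_id), dim=1)
    return idx
```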

Comparison with Similar Tools

  • Hugging Face Transformers — full-featured library with thousands of models; nanoGPT is purpose-built for learning and small experiments
  • Megatron-LM — NVIDIA's large-scale training framework; far more complex, targets multi-node clusters
  • LitGPT — Lightning-based GPT training; adds configuration abstractions nanoGPT deliberately avoids
  • minGPT — Karpathy's earlier project; nanoGPT is the faster, more optimized successor

FAQ

Q: Can I train a production-grade LLM with nanoGPT? A: It is optimized for learning and reproducing GPT-2. For production-scale training, frameworks like Megatron-LM or LLaMA-Factory are more appropriate.

Q: What hardware do I need? A: A single consumer GPU with 8 GB VRAM can train the character-level Shakespeare model. Reproducing GPT-2 124M requires multiple A100s.

Q: Does it support LoRA or adapter-based fine-tuning? A: Not natively. The codebase does full-parameter fine-tuning. Community forks add PEFT methods.

Q: Is the code actively maintained? A: The repository is intentionally minimal and stable. Updates are infrequent by design.
