Introduction
nanoGPT is a minimal PyTorch reimplementation of GPT training created by Andrej Karpathy. It is designed to be the simplest, most readable code for training and fine-tuning medium-sized GPT models, making transformer internals accessible to anyone who can read Python.
What nanoGPT Does
- Trains GPT-2 scale models from scratch on custom datasets
- Reproduces GPT-2 (124M) on OpenWebText in about 4 days on 8x A100 GPUs
- Supports character-level and BPE tokenization
- Provides data preparation scripts for Shakespeare, OpenWebText, and custom corpora
- Enables fine-tuning pre-trained GPT-2 checkpoints on new data
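The character-level tokenization mentioned above can be sketched in a few lines. This mirrors what data/shakespeare_char/prepare.py does (the stoi/itos names follow the repo; the surrounding code is simplified):

```python
# Sketch of a character-level tokenizer, as built by prepare.py
# for the Shakespeare character dataset (simplified).
text = "hello world"
chars = sorted(set(text))                      # vocabulary: the unique characters
stoi = {ch: i for i, ch in enumerate(chars)}   # character -> integer id
itos = {i: ch for ch, i in stoi.items()}       # integer id -> character

def encode(s):
    return [stoi[c] for c in s]

def decode(ids):
    return "".join(itos[i] for i in ids)
```

BPE datasets use OpenAI's tiktoken GPT-2 encoder instead, but the encode/decode contract is the same.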
Architecture Overview
The entire model is defined in a single file, model.py, which implements a standard decoder-only transformer with causal self-attention, GELU activations, and optional Flash Attention. Training logic lives in train.py, using PyTorch DistributedDataParallel (DDP) for multi-GPU runs and mixed precision via torch.amp. Configuration is handled by plain Python files whose variables override the defaults in train.py.
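The causal self-attention block at the heart of model.py can be sketched as follows. This is a simplified illustration, not the repo's exact code; the c_attn/c_proj names follow nanoGPT's convention, and the Flash Attention path uses PyTorch 2.0's scaled_dot_product_attention:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    """Minimal sketch of a decoder-only attention block (illustrative)."""

    def __init__(self, n_embd: int, n_head: int):
        super().__init__()
        assert n_embd % n_head == 0
        self.n_head = n_head
        self.c_attn = nn.Linear(n_embd, 3 * n_embd)  # fused q, k, v projection
        self.c_proj = nn.Linear(n_embd, n_embd)      # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        q, k, v = self.c_attn(x).split(C, dim=2)
        # reshape to (B, n_head, T, head_dim) for multi-head attention
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        # Flash Attention kernel; the causal mask is applied internally
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.c_proj(y.transpose(1, 2).reshape(B, T, C))
```

The `is_causal=True` argument replaces the explicit lower-triangular mask that older PyTorch attention implementations had to materialize.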
Self-Hosting & Configuration
- Requires Python 3.10+ and PyTorch 2.0+
- Data preparation scripts convert raw text into memory-mapped binary token arrays
- Config files set model size, learning rate, batch size, and device count
- Supports single GPU, multi-GPU DDP, and Apple MPS backends
- Weights & Biases integration is optional via the wandb_log config flag
Key Features
- The training loop (train.py) and the model definition (model.py) each fit in roughly 300 lines of Python
- Reproduces published GPT-2 results at research-grade quality
- Flash Attention support via PyTorch 2.0 scaled_dot_product_attention
- A sampling script (sample.py) generates text from trained checkpoints
- Clean separation of data prep, training, and inference stages
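The sampling stage can be illustrated with the top-k trick commonly used at generation time. This is a simplified numpy sketch, not the repo's code (sample.py operates on torch tensors):

```python
import numpy as np

def sample_top_k(logits, k, rng, temperature=1.0):
    """Draw one token id, restricted to the k highest-logit candidates."""
    logits = np.asarray(logits, dtype=np.float64) / temperature
    kth = np.sort(logits)[-k]                         # k-th largest logit
    logits = np.where(logits < kth, -np.inf, logits)  # mask everything below it
    probs = np.exp(logits - logits.max())             # stable softmax over survivors
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))
```

Lower temperatures sharpen the distribution toward the argmax; higher ones flatten it toward uniform sampling over the top k.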
Comparison with Similar Tools
- Hugging Face Transformers — full-featured library with thousands of models; nanoGPT is purpose-built for learning and small experiments
- Megatron-LM — NVIDIA's large-scale training framework; far more complex, targets multi-node clusters
- LitGPT — Lightning-based GPT training; adds configuration abstractions nanoGPT deliberately avoids
- minGPT — Karpathy's earlier project; nanoGPT is the faster, more optimized successor
FAQ
Q: Can I train a production-grade LLM with nanoGPT? A: It is optimized for learning and reproducing GPT-2. For production-scale training, frameworks like Megatron-LM or LLaMA-Factory are more appropriate.
Q: What hardware do I need? A: A single consumer GPU with 8 GB VRAM can train the character-level Shakespeare model. Reproducing GPT-2 124M requires multiple A100s.
Q: Does it support LoRA or adapter-based fine-tuning? A: Not natively. The codebase does full-parameter fine-tuning. Community forks add PEFT methods.
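Full-parameter fine-tuning is driven by a config file passed to train.py (the repo ships examples such as config/finetune_shakespeare.py). A sketch of such a config, with illustrative values rather than the repo's exact defaults:

```python
# Hypothetical fine-tuning config, run as: python train.py config/finetune_example.py
# (values below are illustrative; init_from='gpt2' is the repo's mechanism for
# loading pretrained GPT-2 weights from Hugging Face)
init_from = 'gpt2'        # start from the pretrained 124M checkpoint
dataset = 'shakespeare'   # must have train.bin/val.bin prepared
batch_size = 1
learning_rate = 3e-5      # small LR: nudge, don't overwrite, the pretrained weights
max_iters = 2000
```

Because configs are plain Python, any variable can also be overridden on the command line, e.g. `--learning_rate=1e-5`.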
Q: Is the code actively maintained? A: The repository is intentionally minimal and stable. Updates are infrequent by design.