Introduction
llm.c is a project by Andrej Karpathy that implements GPT-2 training in raw C and CUDA, with no dependency on PyTorch, TensorFlow, or any other framework. It serves as both an educational resource and a surprisingly performant training pipeline.
What llm.c Does
- Trains GPT-2 (124M) from scratch using only C and CUDA kernels
- Achieves performance competitive with PyTorch on equivalent hardware
- Provides a pure C CPU implementation alongside the CUDA GPU version
- Includes data preparation scripts for converting text to binary token format
- Supports multi-GPU training via MPI and NCCL
Architecture Overview
The codebase implements forward and backward passes for every transformer layer as explicit C/CUDA functions: embedding lookups, layer normalization, matrix multiplications, GELU, softmax, and cross-entropy loss. Memory is managed manually with a single large allocation. Gradient computation is hand-derived and fused into efficient CUDA kernels.
Self-Hosting & Configuration
- Requires a C compiler (gcc/clang) and CUDA toolkit for GPU builds
- CPU-only build works with just make train_gpt2
- Data preparation uses Python scripts to tokenize and serialize text
- Model hyperparameters are set via command-line flags
- Multi-GPU requires OpenMPI and NCCL libraries installed
Key Features
- Entire GPT-2 training loop in roughly 1,000 lines of C
- No framework dependency eliminates version conflicts and overhead
- Hand-written CUDA kernels for layernorm, GELU, softmax, and the other non-matmul operations, with cuBLAS handling the matrix multiplies
- Mixed precision (FP32/BF16) support for modern GPUs
- Direct checkpoint compatibility with Hugging Face GPT-2 weights
Comparison with Similar Tools
- nanoGPT — Karpathy's Python implementation; llm.c goes one level lower to raw C/CUDA
- PyTorch — general-purpose framework; llm.c trades flexibility for transparency and minimal dependencies
- Megatron-LM — enterprise-grade multi-node training; llm.c prioritizes simplicity over scale
- tinygrad — minimal deep learning framework in Python; llm.c eliminates even the framework abstraction
FAQ
Q: Is llm.c suitable for production training? A: It is primarily educational, but its performance is competitive. Production use cases typically still benefit from framework ecosystem features.
Q: Can I train models larger than GPT-2? A: The architecture supports scaling up, but the project focuses on GPT-2 as a reference implementation.
Q: Does it support ARM or Apple Silicon? A: The CPU path compiles on ARM. GPU acceleration requires NVIDIA CUDA hardware.
Q: How does performance compare to PyTorch? A: On a single A100, llm.c matches PyTorch nanoGPT throughput while using less host memory.