Scripts · May 2, 2026 · 2 min read

llm.c — LLM Training in Simple Raw C/CUDA

Train large language models in pure C and CUDA without any deep learning framework. Created by Andrej Karpathy, llm.c demonstrates that GPT-2 training can be expressed in roughly 1,000 lines of C code.

Introduction

llm.c is a project by Andrej Karpathy that implements GPT-2 training in raw C and CUDA, with no dependency on PyTorch, TensorFlow, or any other framework. It serves as both an educational resource and a surprisingly performant training pipeline.

What llm.c Does

  • Trains GPT-2 (124M) from scratch using only C and CUDA kernels
  • Achieves performance competitive with PyTorch on equivalent hardware
  • Provides a pure C CPU implementation alongside the CUDA GPU version
  • Includes data preparation scripts for converting text to binary token format
  • Supports multi-GPU training via MPI and NCCL

Architecture Overview

The codebase implements forward and backward passes for every transformer layer as explicit C/CUDA functions: embedding lookups, layer normalization, matrix multiplications, GELU, softmax, and cross-entropy loss. Memory is managed manually with a single large allocation. Gradient computation is hand-derived and fused into efficient CUDA kernels.

Self-Hosting & Configuration

  • Requires a C compiler (gcc/clang) and CUDA toolkit for GPU builds
  • CPU-only build works with just make train_gpt2
  • Data preparation uses Python scripts to tokenize and serialize text
  • Model hyperparameters are set via command-line flags
  • Multi-GPU requires OpenMPI and NCCL libraries installed

Key Features

  • Entire GPT-2 training loop in roughly 1,000 lines of C
  • No framework dependency eliminates version conflicts and overhead
  • Hand-written CUDA kernels for most operations, with cuBLAS handling the large matrix multiplies
  • Mixed precision (FP32/BF16) support for modern GPUs
  • Direct checkpoint compatibility with Hugging Face GPT-2 weights

Comparison with Similar Tools

  • nanoGPT — Karpathy's Python implementation; llm.c goes one level lower to raw C/CUDA
  • PyTorch — general-purpose framework; llm.c trades flexibility for transparency and minimal dependencies
  • Megatron-LM — enterprise-grade multi-node training; llm.c prioritizes simplicity over scale
  • tinygrad — minimal deep learning framework in Python; llm.c eliminates even the framework abstraction

FAQ

Q: Is llm.c suitable for production training? A: It is primarily educational, but its performance is competitive. Production workloads typically benefit from the checkpointing, monitoring, and ecosystem features a full framework provides.

Q: Can I train models larger than GPT-2? A: The architecture supports scaling up, but the project focuses on GPT-2 as a reference implementation.

Q: Does it support ARM or Apple Silicon? A: The CPU path compiles on ARM. GPU acceleration requires NVIDIA CUDA hardware.

Q: How does performance compare to PyTorch? A: On a single A100, llm.c matches PyTorch nanoGPT throughput while using less host memory.
