# llm.c: LLM Training in Simple Raw C/CUDA

> Train large language models in pure C and CUDA without any deep learning framework. Created by Andrej Karpathy, llm.c demonstrates that GPT-2 training (the CPU, FP32 reference version) can be expressed in roughly 1,000 lines of C code.

## Install

Save the content below to `.claude/skills/` or append it to your `CLAUDE.md`.

## Quick Use

```bash
git clone https://github.com/karpathy/llm.c.git && cd llm.c
pip install -r requirements.txt
# Tokenize the Tiny Shakespeare dataset into llm.c's binary format
python dev/data/tinyshakespeare.py
# Export the GPT-2 (124M) checkpoint files that the C/CUDA trainers load at startup
python train_gpt2.py
make train_gpt2cu
./train_gpt2cu
```

## Introduction

llm.c is a project by Andrej Karpathy that implements GPT-2 training in raw C and CUDA, with no dependency on PyTorch, TensorFlow, or any other deep learning framework. It serves as both an educational resource and a surprisingly performant training pipeline.

## What llm.c Does

- Trains GPT-2 (124M) from scratch using only C and CUDA kernels
- Achieves performance competitive with PyTorch on equivalent hardware
- Provides a pure C CPU implementation alongside the CUDA GPU version
- Includes data preparation scripts for converting text to binary token format
- Supports multi-GPU training via MPI and NCCL

## Architecture Overview

The codebase implements the forward and backward pass of every transformer layer as an explicit C/CUDA function: embedding lookups, layer normalization, matrix multiplications, GELU, softmax, and cross-entropy loss. Memory is managed manually: each group of tensors (parameters, gradients, activations) lives in one large allocation, and the individual tensors are pointers into it. Gradients are derived by hand rather than by automatic differentiation, and on the GPU path they are fused into efficient CUDA kernels.
## Self-Hosting & Configuration

- Requires a C compiler (gcc or clang); GPU builds also need the CUDA toolkit
- The CPU-only build works with just `make train_gpt2`
- Data preparation uses Python scripts to tokenize and serialize text
- Training hyperparameters are set via command-line flags
- Multi-GPU training requires the OpenMPI and NCCL libraries to be installed

## Key Features

- Entire GPT-2 training loop (CPU, FP32) in roughly 1,000 lines of C
- No framework dependency, which avoids framework version conflicts and runtime overhead
- Hand-written CUDA kernels for most layers, with matrix multiplications delegated to cuBLAS/cuBLASLt
- Mixed precision (FP32/BF16) support for modern GPUs
- Hugging Face GPT-2 weights can be exported to llm.c's binary checkpoint format via `train_gpt2.py`

## Comparison with Similar Tools

- **nanoGPT**: Karpathy's Python implementation; llm.c goes one level lower to raw C/CUDA
- **PyTorch**: general-purpose framework; llm.c trades flexibility for transparency and minimal dependencies
- **Megatron-LM**: enterprise-grade multi-node training; llm.c prioritizes simplicity over scale
- **tinygrad**: minimal deep learning framework in Python; llm.c eliminates even the framework abstraction

## FAQ

**Q: Is llm.c suitable for production training?**
A: It is primarily educational, although its performance is competitive. Production use cases typically benefit from the ecosystem features that a framework provides.

**Q: Can I train models larger than GPT-2?**
A: The architecture supports scaling up, but the project focuses on GPT-2 as a reference implementation.

**Q: Does it support ARM or Apple Silicon?**
A: The CPU path compiles on ARM. GPU acceleration requires NVIDIA CUDA hardware.

**Q: How does performance compare to PyTorch?**
A: On a single A100, llm.c reportedly matches PyTorch nanoGPT throughput while using less host memory. Exact numbers depend on hardware, precision, and configuration.

## Sources

- https://github.com/karpathy/llm.c
- https://github.com/karpathy/llm.c/blob/master/README.md

---

Source: https://tokrepo.com/en/workflows/llm-c-llm-training-simple-raw-c-cuda-8cefedb2
Author: Script Depot