# llm.c — LLM Training in Simple Raw C/CUDA

> Train large language models in pure C and CUDA without any deep learning framework. Created by Andrej Karpathy, llm.c demonstrates that GPT-2 training can be expressed compactly: the CPU reference implementation (`train_gpt2.c`) is roughly 1,000 lines of C code.

## Install

Save the content below to `.claude/skills/` or append to your `CLAUDE.md`:

## Quick Use
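```bash
git clone https://github.com/karpathy/llm.c.git && cd llm.c
pip install -r requirements.txt
# Download and tokenize the tinyshakespeare dataset
python dev/data/tinyshakespeare.py
# Export the GPT-2 (124M) starting checkpoint that the C/CUDA trainers load
python train_gpt2.py
# Build and run the CUDA trainer (requires an NVIDIA GPU and the CUDA toolkit)
make train_gpt2cu
./train_gpt2cu
```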

## Introduction
llm.c is a project by Andrej Karpathy that implements GPT-2 training in raw C and CUDA, with no dependency on PyTorch, TensorFlow, or any other framework. It serves as both an educational resource and a surprisingly performant training pipeline.

## What llm.c Does
- Trains GPT-2 (124M) from scratch using only C and CUDA kernels
- Achieves performance competitive with PyTorch on equivalent hardware
- Provides a pure C CPU implementation alongside the CUDA GPU version
- Includes data preparation scripts for converting text to binary token format
- Supports multi-GPU training via MPI and NCCL
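
The binary token format mentioned above can be pictured with a small C sketch. It assumes a headerless file that holds nothing but host-endian `uint16_t` token IDs; that layout is an assumption made for illustration, and the files written by the project's own scripts may carry extra metadata that this sketch ignores. The helper names `write_tokens` and `read_tokens` are hypothetical.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: write n token IDs to an open binary stream.
   GPT-2's vocabulary (50,257 entries) fits in 16 bits, so uint16_t suffices.
   Returns the number of tokens actually written. */
size_t write_tokens(FILE *f, const uint16_t *tokens, size_t n) {
    return fwrite(tokens, sizeof(uint16_t), n, f);
}

/* Hypothetical helper: read every token from a seekable binary stream.
   Assumes the stream contains only uint16_t token IDs (no header).
   Returns a malloc'd array the caller must free, or NULL on failure. */
uint16_t *read_tokens(FILE *f, size_t *out_n) {
    if (fseek(f, 0, SEEK_END) != 0) return NULL;
    long bytes = ftell(f);
    if (bytes < 0 || fseek(f, 0, SEEK_SET) != 0) return NULL;

    size_t n = (size_t)bytes / sizeof(uint16_t);
    /* Allocate at least one element so an empty file still yields a valid pointer. */
    uint16_t *tokens = malloc((n > 0 ? n : 1) * sizeof(uint16_t));
    if (tokens == NULL) return NULL;

    if (fread(tokens, sizeof(uint16_t), n, f) != n) {
        free(tokens);
        return NULL;
    }
    *out_n = n;
    return tokens;
}
```

Storing tokens as fixed-width integers lets a trainer load a batch with a single read and index it directly, with no tokenizer in the training loop.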

## Architecture Overview
The codebase implements the forward and backward pass of every transformer layer as an explicit C/CUDA function: embedding lookups, layer normalization, matrix multiplications, attention, GELU, softmax, and cross-entropy loss. Memory is managed manually: parameters and activations each live in one large allocation, with per-tensor pointers set into it. Gradients are derived by hand rather than by autograd, and the CUDA version fuses several operations into single kernels.

## Self-Hosting & Configuration
- Requires a C compiler (gcc/clang) and CUDA toolkit for GPU builds
- CPU-only build works with just `make train_gpt2`
- Data preparation uses Python scripts to tokenize and serialize text
- Model hyperparameters are set via command-line flags
- Multi-GPU requires OpenMPI and NCCL libraries installed

## Key Features
- The CPU reference training loop (`train_gpt2.c`) fits in roughly 1,000 lines of C
- No framework dependency eliminates version conflicts and overhead
- Hand-written CUDA kernels for layers such as layer normalization and GELU, with matrix multiplications delegated to cuBLAS/cuBLASLt
- Mixed precision (FP32/BF16) support for modern GPUs
- Checkpoint interoperability with Hugging Face GPT-2 weights through the `train_gpt2.py` export script

## Comparison with Similar Tools
- **nanoGPT** — Karpathy's Python implementation; llm.c goes one level lower to raw C/CUDA
- **PyTorch** — general-purpose framework; llm.c trades flexibility for transparency and minimal dependencies
- **Megatron-LM** — enterprise-grade multi-node training; llm.c prioritizes simplicity over scale
- **tinygrad** — minimal deep learning framework in Python; llm.c eliminates even the framework abstraction

## FAQ
**Q: Is llm.c suitable for production training?**
A: It is primarily educational, although its performance is competitive. Production workloads typically benefit from the features of a framework ecosystem.

**Q: Can I train models larger than GPT-2?**
A: The architecture supports scaling up, but the project focuses on GPT-2 as a reference implementation.
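
As a rough guide to what scaling up means, the sketch below counts the parameters of a GPT-2-style model from its hyperparameters. The struct and function names are hypothetical, and the count assumes the standard GPT-2 layout: learned position embeddings, biases on every linear layer, a 4x MLP expansion, and an output head that shares the token embedding matrix. It uses the unpadded vocabulary size, so an implementation that pads the vocabulary would report a slightly larger number.

```c
#include <stddef.h>

/* Hypothetical config struct for illustration. */
typedef struct {
    size_t vocab_size;   /* number of token IDs */
    size_t max_seq_len;  /* number of learned positions */
    size_t num_layers;   /* transformer blocks */
    size_t channels;     /* model width C */
} GPT2Config;

/* Count trainable parameters for a GPT-2-style decoder. */
size_t gpt2_param_count(GPT2Config c) {
    size_t C = c.channels;
    size_t wte = c.vocab_size * C;   /* token embeddings (shared with output head) */
    size_t wpe = c.max_seq_len * C;  /* position embeddings */
    size_t per_layer =
        2 * C                        /* layernorm 1: weight + bias */
        + C * 3 * C + 3 * C          /* fused QKV projection */
        + C * C + C                  /* attention output projection */
        + 2 * C                      /* layernorm 2: weight + bias */
        + C * 4 * C + 4 * C          /* MLP up-projection */
        + 4 * C * C + C;             /* MLP down-projection */
    size_t final_ln = 2 * C;         /* final layernorm */
    return wte + wpe + c.num_layers * per_layer + final_ln;
}
```

With `{50257, 1024, 12, 768}` the function returns 124,439,808, the familiar "124M" figure. Per-layer cost grows with the square of the width, so moving to 48 layers and 1,600 channels gives 1,557,611,200 parameters.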

**Q: Does it support ARM or Apple Silicon?**
A: The CPU path compiles on ARM. GPU acceleration requires NVIDIA CUDA hardware.

**Q: How does performance compare to PyTorch?**
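A: The project README reports llm.c running slightly faster than PyTorch Nightly (by about 7%) at the time of writing. Actual results depend on the GPU, the CUDA version, and the training configuration.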

## Sources
- https://github.com/karpathy/llm.c
- https://github.com/karpathy/llm.c/blob/master/README.md

---
Source: https://tokrepo.com/en/workflows/llm-c-llm-training-simple-raw-c-cuda-8cefedb2
Author: Script Depot