Scripts · May 2, 2026 · 2 min read

llm.c — LLM Training in Simple Raw C/CUDA

Train large language models in pure C and CUDA without any deep learning framework. Created by Andrej Karpathy, llm.c demonstrates that GPT-2 training can be expressed in roughly 1,000 lines of C code.

Introduction

llm.c is a project by Andrej Karpathy that implements GPT-2 training in raw C and CUDA, with no dependency on PyTorch, TensorFlow, or any other framework. It serves as both an educational resource and a surprisingly performant training pipeline.

What llm.c Does

  • Trains GPT-2 (124M) from scratch using only C and CUDA kernels
  • Achieves performance competitive with PyTorch on equivalent hardware
  • Provides a pure C CPU implementation alongside the CUDA GPU version
  • Includes data preparation scripts for converting text to binary token format
  • Supports multi-GPU training via MPI and NCCL

Architecture Overview

The codebase implements forward and backward passes for every transformer layer as explicit C/CUDA functions: embedding lookups, layer normalization, matrix multiplications, GELU, softmax, and cross-entropy loss. Memory is managed manually with a single large allocation. Gradient computation is hand-derived and fused into efficient CUDA kernels.

Self-Hosting & Configuration

  • Requires a C compiler (gcc/clang) and CUDA toolkit for GPU builds
  • CPU-only build works with just make train_gpt2
  • Data preparation uses Python scripts to tokenize and serialize text
  • Model hyperparameters are set via command-line flags
  • Multi-GPU requires OpenMPI and NCCL libraries installed

Key Features

  • Entire GPT-2 training loop in roughly 1,000 lines of C
  • No framework dependency eliminates version conflicts and overhead
  • Hand-written CUDA kernels for most operations, with cuBLAS handling the large matrix multiplies
  • Mixed precision (FP32/BF16) support for modern GPUs
  • Direct checkpoint compatibility with Hugging Face GPT-2 weights

Comparison with Similar Tools

  • nanoGPT — Karpathy's Python implementation; llm.c goes one level lower to raw C/CUDA
  • PyTorch — general-purpose framework; llm.c trades flexibility for transparency and minimal dependencies
  • Megatron-LM — enterprise-grade multi-node training; llm.c prioritizes simplicity over scale
  • tinygrad — minimal deep learning framework in Python; llm.c eliminates even the framework abstraction

FAQ

Q: Is llm.c suitable for production training? A: It is primarily educational, but its performance is competitive. Production workloads typically benefit from the checkpointing, monitoring, and ecosystem features a full framework provides.

Q: Can I train models larger than GPT-2? A: The architecture supports scaling up, but the project focuses on GPT-2 as a reference implementation.

Q: Does it support ARM or Apple Silicon? A: The CPU path compiles on ARM. GPU acceleration requires NVIDIA CUDA hardware.

Q: How does performance compare to PyTorch? A: On a single A100, llm.c matches PyTorch nanoGPT throughput while using less host memory.
