Scripts · May 2, 2026 · 2 min read

llm.c — LLM Training in Simple Raw C/CUDA

Train large language models in pure C and CUDA without any deep learning framework. Created by Andrej Karpathy, llm.c demonstrates that GPT-2 training can be expressed in roughly 1,000 lines of C code.

Introduction

llm.c is a project by Andrej Karpathy that implements GPT-2 training in raw C and CUDA, with no dependency on PyTorch, TensorFlow, or any other framework. It serves as both an educational resource and a surprisingly performant training pipeline.

What llm.c Does

  • Trains GPT-2 (124M) from scratch using only C and CUDA kernels
  • Achieves performance competitive with PyTorch on equivalent hardware
  • Provides a pure C CPU implementation alongside the CUDA GPU version
  • Includes data preparation scripts for converting text to a binary token format (see the loading sketch after this list)
  • Supports multi-GPU training via MPI and NCCL
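
The binary token format mentioned above can be illustrated with a small loader. This is a sketch under an assumed layout, raw uint16 GPT-2 token IDs with no header, rather than the exact file format written by the project's data scripts, and the file name train.bin is a placeholder.

  /* Minimal sketch: load a tokenized dataset into memory.
   * Assumes a simplified layout of raw uint16 token IDs with no header;
   * the actual files written by llm.c's data scripts may differ. */
  #include <stdio.h>
  #include <stdint.h>
  #include <stdlib.h>

  uint16_t *load_tokens(const char *path, size_t *num_tokens) {
      FILE *f = fopen(path, "rb");
      if (!f) { perror("fopen"); exit(1); }
      fseek(f, 0, SEEK_END);
      long nbytes = ftell(f);
      fseek(f, 0, SEEK_SET);
      *num_tokens = (size_t)nbytes / sizeof(uint16_t);
      uint16_t *tokens = malloc(*num_tokens * sizeof(uint16_t));
      if (fread(tokens, sizeof(uint16_t), *num_tokens, f) != *num_tokens) {
          fprintf(stderr, "short read on %s\n", path);
          exit(1);
      }
      fclose(f);
      return tokens;
  }

  int main(void) {
      size_t n;
      /* "train.bin" is a placeholder name, not a path from the repository */
      uint16_t *tokens = load_tokens("train.bin", &n);
      if (n > 0) printf("loaded %zu tokens, first token id = %u\n", n, tokens[0]);
      free(tokens);
      return 0;
  }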

Architecture Overview

The codebase implements forward and backward passes for every transformer layer as explicit C/CUDA functions: embedding lookups, layer normalization, matrix multiplications, GELU, softmax, and cross-entropy loss. Memory is managed manually, with parameters, gradients, and activations carved out of a handful of large, up-front allocations rather than per-tensor buffers. Gradient computation is hand-derived and fused into efficient CUDA kernels.
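
To make that concrete, here is a simplified sketch of one such layer function, the tanh-approximation GELU used by GPT-2, written in the same explicit-loop style over flat float arrays. It is illustrative only, not code copied from the repository.

  #include <math.h>

  /* GPT-2's tanh approximation of GELU; a simplified sketch of the kind of
   * layer function llm.c implements, not the repository's exact code. */
  #define GELU_SCALE 0.7978845608f  /* sqrt(2/pi) */

  void gelu_forward(float *out, const float *inp, int n) {
      for (int i = 0; i < n; i++) {
          float x = inp[i];
          float u = GELU_SCALE * (x + 0.044715f * x * x * x);
          out[i] = 0.5f * x * (1.0f + tanhf(u));
      }
  }

  /* Hand-derived gradient of the same approximation: accumulate dL/dinp
   * given dL/dout, mirroring how the backward passes are chained. */
  void gelu_backward(float *dinp, const float *inp, const float *dout, int n) {
      for (int i = 0; i < n; i++) {
          float x = inp[i];
          float u = GELU_SCALE * (x + 0.044715f * x * x * x);
          float t = tanhf(u);
          float sech2 = 1.0f - t * t;                        /* d tanh(u)/du */
          float du_dx = GELU_SCALE * (1.0f + 3.0f * 0.044715f * x * x);
          float dgelu_dx = 0.5f * (1.0f + t) + 0.5f * x * sech2 * du_dx;
          dinp[i] += dgelu_dx * dout[i];
      }
  }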

Self-Hosting & Configuration

  • Requires a C compiler (gcc/clang) and CUDA toolkit for GPU builds
  • CPU-only build works with just make train_gpt2
  • Data preparation uses Python scripts to tokenize and serialize text
  • Model hyperparameters are set via command-line flags (see the parsing sketch after this list)
  • Multi-GPU training requires OpenMPI and NCCL to be installed
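
As a sketch of what flag handling without a framework looks like, the following parses a few hyperparameters straight from argv. The flag names (-b, -t, -l, -i) and defaults here are hypothetical and may not match the trainer's actual options.

  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>

  /* Hypothetical flags for illustration only; shows plain argv parsing
   * with no framework or config-file machinery. */
  int main(int argc, char **argv) {
      int batch_size = 4;
      int seq_len = 1024;
      float learning_rate = 3e-4f;
      const char *data_path = "train.bin";  /* placeholder path */

      for (int i = 1; i < argc; i += 2) {
          if (i + 1 >= argc) { fprintf(stderr, "missing value for %s\n", argv[i]); return 1; }
          if (strcmp(argv[i], "-b") == 0) batch_size = atoi(argv[i + 1]);
          else if (strcmp(argv[i], "-t") == 0) seq_len = atoi(argv[i + 1]);
          else if (strcmp(argv[i], "-l") == 0) learning_rate = (float)atof(argv[i + 1]);
          else if (strcmp(argv[i], "-i") == 0) data_path = argv[i + 1];
          else { fprintf(stderr, "unknown flag %s\n", argv[i]); return 1; }
      }
      printf("batch=%d seq_len=%d lr=%g data=%s\n",
             batch_size, seq_len, learning_rate, data_path);
      return 0;
  }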

Key Features

  • Entire GPT-2 training loop in roughly 1,000 lines of C
  • No framework dependency eliminates version conflicts and overhead
  • Hand-written, fused CUDA kernels for non-matmul operations; matrix multiplications go through cuBLAS/cuBLASLt
  • Mixed precision (FP32/BF16) support for modern GPUs (see the conversion sketch after this list)
  • Loads the pretrained GPT-2 weights published on Hugging Face, exported to llm.c's binary checkpoint format by the bundled Python scripts
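
BF16 keeps FP32's 8-bit exponent and drops the low 16 mantissa bits, which is why it suits mixed-precision training. The conversion below is a generic CPU-side sketch, not the project's GPU implementation; it truncates rather than rounding to nearest-even as production kernels typically do.

  #include <stdint.h>
  #include <stdio.h>
  #include <string.h>

  /* Generic FP32 <-> BF16 conversion sketch, not code from the repository. */
  typedef uint16_t bf16;

  static bf16 float_to_bf16(float f) {
      uint32_t bits;
      memcpy(&bits, &f, sizeof(bits));   /* avoid strict-aliasing issues */
      return (bf16)(bits >> 16);         /* truncate the low 16 bits */
  }

  static float bf16_to_float(bf16 h) {
      uint32_t bits = (uint32_t)h << 16; /* restore position, low bits zero */
      float f;
      memcpy(&f, &bits, sizeof(f));
      return f;
  }

  int main(void) {
      float x = 3.14159265f;
      bf16 h = float_to_bf16(x);
      printf("fp32 %.8f -> bf16 -> fp32 %.8f\n", x, bf16_to_float(h));
      return 0;
  }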

Comparison with Similar Tools

  • nanoGPT — Karpathy's Python implementation; llm.c goes one level lower to raw C/CUDA
  • PyTorch — general-purpose framework; llm.c trades flexibility for transparency and minimal dependencies
  • Megatron-LM — enterprise-grade multi-node training; llm.c prioritizes simplicity over scale
  • tinygrad — minimal deep learning framework in Python; llm.c eliminates even the framework abstraction

FAQ

Q: Is llm.c suitable for production training? A: It is primarily educational, but its performance is competitive; production workloads usually benefit from the broader ecosystem of a full framework.

Q: Can I train models larger than GPT-2? A: The architecture supports scaling up, but the project focuses on GPT-2 as a reference implementation.

Q: Does it support ARM or Apple Silicon? A: The CPU path compiles on ARM. GPU acceleration requires NVIDIA CUDA hardware.

Q: How does performance compare to PyTorch? A: On a single A100, llm.c matches PyTorch nanoGPT throughput while using less host memory.
