Introduction
minimind is an open-source educational project that demystifies LLM training by providing a complete pipeline to train a 64M-parameter language model from scratch in approximately two hours on a single consumer GPU. It covers pretraining, supervised fine-tuning, and DPO alignment.
What minimind Does
- Trains a compact language model from scratch with full pretraining on a text corpus
- Implements supervised fine-tuning (SFT) for instruction-following capabilities
- Includes DPO (Direct Preference Optimization) for basic alignment; the objective it optimizes is sketched after this list
- Provides an interactive web demo for chatting with the trained model
- Documents every training stage with clear explanations in both Chinese and English
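For readers new to DPO, the objective is compact enough to show directly. The following is a generic PyTorch sketch of the DPO loss, not minimind's actual code; the function name, argument names, and the default `beta` are illustrative.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Generic DPO loss (illustrative, not minimind's implementation).

    Each argument is a tensor of per-sequence log-probabilities (summed over
    tokens) for the chosen/rejected responses under the policy being trained
    and under a frozen reference model.
    """
    # How much more the policy prefers each response than the reference does
    chosen_margin = policy_chosen_logps - ref_chosen_logps
    rejected_margin = policy_rejected_logps - ref_rejected_logps
    # Push the chosen margin above the rejected margin, scaled by beta
    logits = beta * (chosen_margin - rejected_margin)
    return -F.logsigmoid(logits).mean()
```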
Architecture Overview
minimind implements a decoder-only transformer architecture with rotary position embeddings, grouped query attention, and SwiGLU activation. The model uses a custom tokenizer trained on the same corpus. The training pipeline is built with PyTorch and supports distributed training via DistributedDataParallel (DDP), though a single GPU is sufficient for the default 64M configuration.
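To make one of those architectural pieces concrete, here is a minimal PyTorch sketch of a SwiGLU feed-forward block of the kind described above; the dimensions in the usage comment are illustrative, not minimind's exact values.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """SwiGLU feed-forward block: down(SiLU(gate(x)) * up(x))."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.gate = nn.Linear(dim, hidden_dim, bias=False)
        self.up = nn.Linear(dim, hidden_dim, bias=False)
        self.down = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The SiLU-activated gate branch modulates the linear up branch
        return self.down(F.silu(self.gate(x)) * self.up(x))

# e.g. ffn = SwiGLU(dim=512, hidden_dim=1408)  # dims are illustrative
```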
Self-Hosting & Configuration
- Requires Python 3.9+ with PyTorch and a CUDA GPU (minimum 8GB VRAM)
- Pretraining data is included or can be replaced with custom text corpora
- Training configs control model size (26M to 218M parameters), learning rate, and batch size; a hypothetical config sketch follows this list
- The web demo runs locally with Gradio, accessible through a browser
- Full training from scratch completes in about 2 hours on an RTX 3090
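As a rough picture of what such a configuration looks like, here is a hypothetical dataclass in the spirit of the knobs listed above. The field names and values are illustrative assumptions, not minimind's actual presets; consult the repository for the real configuration interface.

```python
from dataclasses import dataclass

@dataclass
class TrainConfig:
    # Model-size knobs (illustrative values, not minimind's exact presets)
    dim: int = 512          # hidden size
    n_layers: int = 8       # number of transformer blocks
    n_heads: int = 8        # query heads
    n_kv_heads: int = 2     # fewer KV heads -> grouped query attention
    # Optimization knobs
    learning_rate: float = 3e-4
    batch_size: int = 32
    max_seq_len: int = 512

cfg = TrainConfig()  # smaller dim/n_layers lands near the ~26M end of the range
```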
Key Features
- End-to-end LLM training in minimal, readable code with extensive documentation
- Multiple model sizes from 26M to 218M parameters for different hardware budgets
- Complete pipeline covering tokenizer training, pretraining, SFT, and DPO alignment
- Bilingual documentation (Chinese and English) making it accessible to a global audience
- Modular design allows swapping components like attention mechanisms and position encodings, as the grouped query attention sketch below illustrates
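As one example of a swappable component, the sketch below shows the idea behind grouped query attention in generic PyTorch: a small number of key/value heads is shared across groups of query heads, shrinking the KV projections and cache. It is a simplified illustration, not minimind's module, and omits rotary position embeddings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroupedQueryAttention(nn.Module):
    """Simplified GQA: n_kv_heads KV heads serve n_heads query heads."""
    def __init__(self, dim: int, n_heads: int, n_kv_heads: int):
        super().__init__()
        assert n_heads % n_kv_heads == 0
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.head_dim = dim // n_heads
        self.wq = nn.Linear(dim, n_heads * self.head_dim, bias=False)
        self.wk = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.wv = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.wo = nn.Linear(n_heads * self.head_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, _ = x.shape
        q = self.wq(x).view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.wk(x).view(B, T, self.n_kv_heads, self.head_dim).transpose(1, 2)
        v = self.wv(x).view(B, T, self.n_kv_heads, self.head_dim).transpose(1, 2)
        # Each KV head is repeated to serve n_heads // n_kv_heads query heads
        rep = self.n_heads // self.n_kv_heads
        k = k.repeat_interleave(rep, dim=1)
        v = v.repeat_interleave(rep, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.wo(out.transpose(1, 2).reshape(B, T, -1))
```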
Comparison with Similar Tools
- nanochat — Karpathy's chat-focused trainer; minimind focuses on the full pretraining pipeline with smaller models
- nanoGPT — pretraining only; minimind adds SFT and DPO stages for a complete chat model
- LitGPT — production fine-tuning toolkit; minimind prioritizes educational clarity over feature completeness
- Axolotl — advanced fine-tuning; minimind teaches fundamentals with a from-scratch approach
FAQ
Q: Can the trained model actually hold conversations? A: Yes. The 64M model handles simple conversations. Larger configs (218M) produce noticeably better results.
Q: What GPU is required? A: An 8GB VRAM GPU (e.g., RTX 3060) works for the smallest model. 16GB+ recommended for larger configs.
Q: Is this useful beyond education? A: The codebase serves as a starting point for custom small model development and domain-specific training experiments.
Q: How does it compare to fine-tuning a pretrained model? A: Training from scratch yields weaker models but gives a complete, end-to-end view of the LLM pipeline. For production use, fine-tuning a pretrained model is more practical.