Introduction
minimind is an open-source educational project that demystifies LLM training by providing a complete pipeline to train a 64M-parameter language model from scratch in approximately two hours on a single consumer GPU. It covers pretraining, supervised fine-tuning, and DPO alignment.
What minimind Does
- Trains a compact language model from scratch with full pretraining on a text corpus
- Implements supervised fine-tuning (SFT) for instruction-following capabilities
- Includes DPO (Direct Preference Optimization) for basic alignment; the objective it optimizes is sketched after this list
- Provides an interactive web demo for chatting with the trained model
- Documents every training stage with clear explanations in both Chinese and English
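For readers new to DPO, the objective is compact enough to show directly. The following is a generic PyTorch sketch of the DPO loss, not minimind's actual code; the function name, argument names, and the default `beta` are illustrative.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Generic DPO loss (illustrative, not minimind's implementation).

    Each argument is a tensor of per-sequence log-probabilities (summed over
    tokens) for the chosen/rejected responses under the policy being trained
    and under a frozen reference model.
    """
    # How much more the policy prefers each response than the reference does
    chosen_margin = policy_chosen_logps - ref_chosen_logps
    rejected_margin = policy_rejected_logps - ref_rejected_logps
    # Push the chosen margin above the rejected margin, scaled by beta
    logits = beta * (chosen_margin - rejected_margin)
    return -F.logsigmoid(logits).mean()
```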
Architecture Overview
minimind implements a decoder-only transformer architecture with rotary position embeddings, grouped query attention, and SwiGLU activation. The model uses a custom tokenizer trained on the same corpus. The training pipeline is built with PyTorch and supports distributed training via DistributedDataParallel (DDP), though a single GPU is sufficient for the default 64M configuration.
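To make one of those architectural pieces concrete, here is a minimal PyTorch sketch of a SwiGLU feed-forward block of the kind described above; the dimensions in the usage comment are illustrative, not minimind's exact values.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """SwiGLU feed-forward block: down(SiLU(gate(x)) * up(x))."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.gate = nn.Linear(dim, hidden_dim, bias=False)
        self.up = nn.Linear(dim, hidden_dim, bias=False)
        self.down = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The SiLU-activated gate branch modulates the linear up branch
        return self.down(F.silu(self.gate(x)) * self.up(x))

# e.g. ffn = SwiGLU(dim=512, hidden_dim=1408)  # dims are illustrative
```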
Self-Hosting & Configuration
- Requires Python 3.9+ with PyTorch and a CUDA GPU (minimum 8GB VRAM)
- Pretraining data is included or can be replaced with custom text corpora
- Training configs control model size (26M to 218M parameters), learning rate, and batch size; a hypothetical config sketch follows this list
- The web demo runs locally with Gradio, accessible through a browser
- Full training from scratch completes in about 2 hours on an RTX 3090
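As a rough picture of what such a configuration looks like, here is a hypothetical dataclass in the spirit of the knobs listed above. The field names and values are illustrative assumptions, not minimind's actual presets; consult the repository for the real configuration interface.

```python
from dataclasses import dataclass

@dataclass
class TrainConfig:
    # Model-size knobs (illustrative values, not minimind's exact presets)
    dim: int = 512          # hidden size
    n_layers: int = 8       # number of transformer blocks
    n_heads: int = 8        # query heads
    n_kv_heads: int = 2     # fewer KV heads -> grouped query attention
    # Optimization knobs
    learning_rate: float = 3e-4
    batch_size: int = 32
    max_seq_len: int = 512

cfg = TrainConfig()  # smaller dim/n_layers lands near the ~26M end of the range
```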
Key Features
- End-to-end LLM training in minimal, readable code with extensive documentation
- Multiple model sizes from 26M to 218M parameters for different hardware budgets
- Complete pipeline covering tokenizer training, pretraining, SFT, and DPO alignment
- Bilingual documentation (Chinese and English) making it accessible to a global audience
- Modular design allows swapping components like attention mechanisms and position encodings, as the grouped query attention sketch below illustrates
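As one example of a swappable component, the sketch below shows the idea behind grouped query attention in generic PyTorch: a small number of key/value heads is shared across groups of query heads, shrinking the KV projections and cache. It is a simplified illustration, not minimind's module, and omits rotary position embeddings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroupedQueryAttention(nn.Module):
    """Simplified GQA: n_kv_heads KV heads serve n_heads query heads."""
    def __init__(self, dim: int, n_heads: int, n_kv_heads: int):
        super().__init__()
        assert n_heads % n_kv_heads == 0
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.head_dim = dim // n_heads
        self.wq = nn.Linear(dim, n_heads * self.head_dim, bias=False)
        self.wk = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.wv = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.wo = nn.Linear(n_heads * self.head_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, _ = x.shape
        q = self.wq(x).view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.wk(x).view(B, T, self.n_kv_heads, self.head_dim).transpose(1, 2)
        v = self.wv(x).view(B, T, self.n_kv_heads, self.head_dim).transpose(1, 2)
        # Each KV head is repeated to serve n_heads // n_kv_heads query heads
        rep = self.n_heads // self.n_kv_heads
        k = k.repeat_interleave(rep, dim=1)
        v = v.repeat_interleave(rep, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.wo(out.transpose(1, 2).reshape(B, T, -1))
```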
Comparison with Similar Tools
- nanochat — Karpathy's chat-focused trainer; minimind focuses on the full pretraining pipeline with smaller models
- nanoGPT — pretraining only; minimind adds SFT and DPO stages for a complete chat model
- LitGPT — production fine-tuning toolkit; minimind prioritizes educational clarity over feature completeness
- Axolotl — advanced fine-tuning; minimind teaches fundamentals with a from-scratch approach
FAQ
Q: Can the trained model actually hold conversations? A: Yes. The 64M model handles simple conversations. Larger configs (218M) produce noticeably better results.
Q: What GPU is required? A: An 8GB VRAM GPU (e.g., RTX 3060) works for the smallest model. 16GB+ recommended for larger configs.
Q: Is this useful beyond education? A: The codebase serves as a starting point for custom small model development and domain-specific training experiments.
Q: How does it compare to fine-tuning a pretrained model? A: Training from scratch yields weaker models but gives a complete, end-to-end view of the LLM pipeline. For production use, fine-tuning a pretrained model is more practical.