Configs · May 13, 2026 · 3 min read

minimind — Train a 64M-Parameter LLM from Scratch in 2 Hours

An open-source educational project that lets you train a small but functional language model from scratch on consumer hardware in about two hours, covering the full LLM training pipeline.

Introduction

minimind is an open-source educational project that demystifies LLM training by providing a complete pipeline to train a 64M-parameter language model from scratch in approximately two hours on a single consumer GPU. It covers pretraining, supervised fine-tuning, and DPO alignment.

What minimind Does

  • Trains a compact language model from scratch with full pretraining on a text corpus
  • Implements supervised fine-tuning (SFT) for instruction-following capabilities
  • Includes DPO (Direct Preference Optimization) for basic alignment (see the loss sketch after this list)
  • Provides an interactive web demo for chatting with the trained model
  • Documents every training stage with clear explanations in both Chinese and English
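Of these stages, DPO is the least self-explanatory. The snippet below is a minimal PyTorch sketch of the DPO objective for orientation only; it is not minimind's own code, and the argument names and the `beta` value are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Direct Preference Optimization loss (illustrative sketch).

    Each argument is the log-probability of a chosen/rejected response
    under the policy model or the frozen reference model.
    """
    # Implicit reward: log-ratio of policy vs. reference, scaled by beta
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Push the chosen reward above the rejected one via a logistic loss
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

In practice each log-probability is summed over the tokens of the response span, and only the policy model is updated while the reference model stays frozen.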

Architecture Overview

minimind implements a decoder-only transformer architecture with rotary position embeddings, grouped query attention, and SwiGLU activation. The model uses a custom tokenizer trained on the same corpus. The training pipeline is built with PyTorch and supports distributed training via DDP, though a single GPU is sufficient for the default 64M configuration.
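To make the terms above concrete, here is a rough PyTorch sketch of two of those building blocks: a SwiGLU feed-forward layer and the key/value head expansion used by grouped query attention. It is an illustrative approximation under assumed tensor shapes, not minimind's actual modules.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    """SwiGLU MLP: silu(x @ W_gate) * (x @ W_up), then project back to dim."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.gate = nn.Linear(dim, hidden_dim, bias=False)
        self.up = nn.Linear(dim, hidden_dim, bias=False)
        self.down = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.gate(x)) * self.up(x))

def repeat_kv(kv: torch.Tensor, n_rep: int) -> torch.Tensor:
    """Grouped query attention: share each KV head across n_rep query heads.

    kv has shape (batch, n_kv_heads, seq_len, head_dim); the result has
    shape (batch, n_kv_heads * n_rep, seq_len, head_dim).
    """
    if n_rep == 1:
        return kv
    b, h, s, d = kv.shape
    return kv[:, :, None, :, :].expand(b, h, n_rep, s, d).reshape(b, h * n_rep, s, d)
```

Keeping fewer key/value heads than query heads shrinks the KV cache and parameter count, which matters at these small model sizes.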

Self-Hosting & Configuration

  • Requires Python 3.9+ with PyTorch and a CUDA GPU (minimum 8GB VRAM)
  • Pretraining data is included or can be replaced with custom text corpora
  • Training configs control model size (26M to 218M parameters), learning rate, and batch size (an example config is sketched below)
  • The web demo runs locally with Gradio, accessible through a browser
  • Full training from scratch completes in about 2 hours on an RTX 3090
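As a sketch of what such a config might look like, the dataclass below collects the knobs mentioned above. Every field name and value is a hypothetical placeholder chosen for illustration, not minimind's actual preset.

```python
from dataclasses import dataclass

@dataclass
class TrainConfig:
    # Model size: roughly 26M/64M/218M parameters depending on width and depth
    # (the values below are illustrative, not minimind's real defaults)
    dim: int = 512          # hidden size
    n_layers: int = 8       # transformer blocks
    n_heads: int = 8        # query heads
    n_kv_heads: int = 2     # KV heads for grouped query attention
    max_seq_len: int = 512  # context length

    # Optimization
    batch_size: int = 32
    learning_rate: float = 3e-4
    epochs: int = 1
    device: str = "cuda"    # an 8GB GPU is enough at this scale

config = TrainConfig()
```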

Key Features

  • End-to-end LLM training in minimal, readable code with extensive documentation
  • Multiple model sizes from 26M to 218M parameters for different hardware budgets
  • Complete pipeline covering tokenizer training (sketched after this list), pretraining, SFT, and DPO alignment
  • Bilingual documentation (Chinese and English) making it accessible to a global audience
  • Modular design allows swapping components like attention mechanisms and position encodings
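For the tokenizer-training step mentioned above, a byte-level BPE tokenizer can be trained on the pretraining corpus with the Hugging Face tokenizers library, roughly as follows. The vocabulary size, special tokens, and file paths are placeholder assumptions rather than minimind's actual settings.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers, decoders

# Byte-level BPE tokenizer trained on the same corpus used for pretraining
tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=6400,                          # placeholder vocabulary size
    special_tokens=["<unk>", "<s>", "</s>"],  # placeholder special tokens
)
tokenizer.train(files=["pretrain_corpus.txt"], trainer=trainer)  # hypothetical path
tokenizer.save("tokenizer.json")
```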

Comparison with Similar Tools

  • nanochat — Karpathy's chat-focused trainer; minimind focuses on the full pretraining pipeline with smaller models
  • nanoGPT — pretraining only; minimind adds SFT and DPO stages for a complete chat model
  • LitGPT — production fine-tuning toolkit; minimind prioritizes educational clarity over feature completeness
  • Axolotl — advanced fine-tuning; minimind teaches fundamentals with a from-scratch approach

FAQ

Q: Can the trained model actually hold conversations? A: Yes. The 64M model handles simple conversations. Larger configs (218M) produce noticeably better results.

Q: What GPU is required? A: An 8GB VRAM GPU (e.g., RTX 3060) works for the smallest model. 16GB+ recommended for larger configs.

Q: Is this useful beyond education? A: The codebase serves as a starting point for custom small model development and domain-specific training experiments.

Q: How does it compare to fine-tuning a pretrained model? A: Training from scratch produces weaker models but provides complete understanding of the LLM pipeline. For production, fine-tuning is more practical.
