Introduction
torchtune is the official PyTorch library for authoring, fine-tuning, and experimenting with LLMs. It provides a clean, modular codebase with no trainer abstractions, giving users full control over the training loop while handling the complexity of modern fine-tuning methods.
What torchtune Does
- Fine-tunes LLMs using LoRA, QLoRA, full parameter tuning, and DoRA
- Supports alignment methods including DPO and PPO
- Provides recipes for single-GPU and multi-GPU distributed training
- Downloads and converts model weights from Hugging Face Hub
- Includes dataset utilities for instruction tuning, chat, and preference data (see the sketch after this list)
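A minimal sketch of the dataset utilities, assuming Llama 2 weights and tokenizer have already been downloaded (the path below is a placeholder):

```python
# Minimal sketch: building an instruction-tuning dataset with torchtune.
from torchtune.models.llama2 import llama2_tokenizer
from torchtune.datasets import alpaca_dataset

# Placeholder path -- point this at the tokenizer.model from your download.
tokenizer = llama2_tokenizer("/tmp/llama2/tokenizer.model")

# Builds a tokenized Alpaca-style instruction dataset.
ds = alpaca_dataset(tokenizer)
print(ds[0])  # a dict of token IDs and labels for one example
```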
Architecture Overview
torchtune is built around recipes (complete training scripts) and configs (YAML-based hyperparameter files). Model definitions are pure PyTorch nn.Modules with no framework abstractions. LoRA and quantization are applied as composable transforms on the model layers. The library uses PyTorch Distributed for multi-GPU training and integrates with torchao for quantization-aware training.
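For example, the LoRA model builders return the same pure-PyTorch module with low-rank adapters injected into the requested layers. A hedged sketch, following torchtune's published Llama 2 builders (the rank and alpha values here are arbitrary):

```python
# LoRA as a composable transform: same architecture, adapters injected.
from torchtune.models.llama2 import llama2_7b, lora_llama2_7b

# A plain PyTorch nn.Module with no framework wrapper.
base_model = llama2_7b()

# The LoRA variant adds low-rank adapters to the attention projections;
# rank and alpha are illustrative values, not recommendations.
lora_model = lora_llama2_7b(
    lora_attn_modules=["q_proj", "v_proj"],
    lora_rank=8,
    lora_alpha=16,
)
```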
Self-Hosting & Configuration
- Requires Python 3.9+ and PyTorch 2.4+
- Install via pip; no custom CUDA compilation needed
- YAML configs control model, dataset, optimizer, and training parameters
- The tune CLI handles downloads, training, evaluation, and quantization (see the workflow sketch after this list)
- A single consumer GPU (24 GB) is sufficient for LoRA fine-tuning of 7B models, and QLoRA lowers the requirement further
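A hedged sketch of the end-to-end CLI workflow (the Hugging Face repo ID, paths, and override values are placeholders; recipe and config names follow torchtune's packaged examples):

```bash
# Download weights from the Hugging Face Hub (repo ID and paths are placeholders).
tune download meta-llama/Llama-2-7b-hf --output-dir /tmp/llama2

# Copy a packaged config for local editing.
tune cp llama2/7B_lora_single_device my_lora_config.yaml

# Launch the recipe; trailing key=value pairs override YAML fields.
tune run lora_finetune_single_device --config my_lora_config.yaml \
    batch_size=4 gradient_accumulation_steps=8
```

Because any YAML field can be overridden on the command line, a single config stays reusable across experiments.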
Key Features
- No hidden trainer class; recipes are readable end-to-end training scripts
- Supports Llama 2/3, Mistral, Gemma, Phi, and Qwen model families
- Memory-efficient training via LoRA, QLoRA, activation checkpointing, and gradient accumulation (see the config sketch after this list)
- Integrated with Weights & Biases and TensorBoard for experiment tracking
- Quantization support via torchao for 4-bit and 8-bit fine-tuning
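As a sketch of how these features are switched on, they map to plain YAML fields in a recipe config. The keys below follow common torchtune recipe configs, but verify them against your recipe's packaged YAML:

```yaml
# Hedged excerpt of a single-device LoRA config; key names follow common
# torchtune recipes -- check your recipe's packaged config for exact keys.
batch_size: 2
gradient_accumulation_steps: 8    # trades optimizer steps for memory
enable_activation_checkpointing: True
dtype: bf16

# Experiment tracking; the logger namespace varies across torchtune versions.
metric_logger:
  _component_: torchtune.training.metric_logging.WandBLogger
  project: my-finetune-run        # placeholder project name
```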
Comparison with Similar Tools
- Hugging Face TRL — higher-level trainer API; torchtune gives more control with explicit training loops
- Axolotl — config-driven fine-tuning; torchtune uses transparent recipes instead of a monolithic trainer
- LLaMA-Factory — broad model and method support; torchtune prioritizes PyTorch-native composability
- Unsloth — focuses on inference and training speed hacks; torchtune focuses on correctness and modularity
FAQ
Q: Which models does torchtune support? A: Llama 2, Llama 3, Llama 3.2, Mistral, Gemma, Phi-3, Qwen2.5, and more. New models are added regularly.
Q: Can I use torchtune for pre-training? A: It is designed for fine-tuning. Pre-training recipes are experimental.
Q: How much VRAM do I need for QLoRA on a 7B model? A: Approximately 10-12 GB: the 4-bit base weights occupy roughly 3.5 GB (7B parameters at 0.5 bytes each), and the rest goes to LoRA adapters, activations, and optimizer state. This fits on a single RTX 3080 (12 GB variant) or RTX 4090.
Q: Does torchtune support multi-node training? A: Yes, via PyTorch Distributed (FSDP). Multi-node recipes are provided.
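A hedged launch sketch for the distributed case, using torchtune's packaged recipe and config names; tune run wraps torchrun, so standard torchrun flags such as --nproc_per_node pass through:

```bash
# Single-node, 4-GPU full fine-tune; scale out by adding torchrun's
# multi-node flags (e.g. --nnodes) to the same command.
tune run --nproc_per_node 4 full_finetune_distributed \
    --config llama2/7B_full
```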