TRL — Post-Training LLMs with RLHF & DPO
TRL is a Hugging Face library for post-training foundation models. 17.9K+ GitHub stars. SFT, GRPO, DPO, reward modeling. Scales from single GPU to multi-node. Apache 2.0.
What it is
TRL (Transformer Reinforcement Learning) is a Hugging Face library for post-training foundation models. It implements supervised fine-tuning (SFT), direct preference optimization (DPO), group relative policy optimization (GRPO), reward modeling, and reinforcement learning from human feedback (RLHF). It scales from a single consumer GPU to multi-node clusters.
TRL targets ML engineers and researchers fine-tuning language models to follow instructions, align with preferences, or specialize for specific tasks.
How it saves time or tokens
TRL wraps complex training loops (reward modeling, PPO, DPO) into clean trainer classes. Instead of implementing RLHF from scratch with custom loss functions and rollout buffers, you configure a DPOTrainer or GRPOTrainer with a few parameters and call .train().
Integration with Hugging Face Transformers and PEFT means you can fine-tune with LoRA or QLoRA to fit large models on consumer hardware.
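As a sketch of that PEFT integration, TRL trainers accept a peft_config argument; the LoRA hyperparameters below are illustrative, not recommendations:

```python
from peft import LoraConfig

# Illustrative LoRA settings -- tune rank and target modules for your model.
peft_config = LoraConfig(
    r=16,                                 # adapter rank
    lora_alpha=32,                        # scaling factor
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    task_type="CAUSAL_LM",
)

# Passed to a trainer, e.g.:
# trainer = DPOTrainer(model=model, args=training_args,
#                      train_dataset=dataset, processing_class=tokenizer,
#                      peft_config=peft_config)
```

With a config like this, only the small adapter matrices are trained while the base weights stay frozen.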
How to use
- Install TRL: pip install trl
- Prepare your dataset in the required format (prompt-chosen-rejected for DPO)
- Configure the trainer with model, dataset, and training arguments
- Run training and push the result to Hugging Face Hub
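For the dataset-preparation step, a single record in DPO's preference format looks like this (the text content is illustrative, not taken from an actual dataset):

```python
# One preference pair: a prompt plus a preferred ("chosen") and
# dispreferred ("rejected") response.
example = {
    "prompt": "Explain what a binary search does.",
    "chosen": "Binary search halves a sorted range each step until it finds the target.",
    "rejected": "It checks every element one by one.",
}

required_keys = {"prompt", "chosen", "rejected"}
assert required_keys <= example.keys()
```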
Example
from trl import DPOTrainer, DPOConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset

model = AutoModelForCausalLM.from_pretrained('meta-llama/Meta-Llama-3-8B')
tokenizer = AutoTokenizer.from_pretrained('meta-llama/Meta-Llama-3-8B')

# Preference pairs: each row holds a prompt, a chosen and a rejected response.
dataset = load_dataset('trl-lib/ultrafeedback_binarized', split='train')

training_args = DPOConfig(
    output_dir='./dpo-llama3',
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,  # effective batch size of 8 per device
    learning_rate=5e-7,             # alignment LRs sit far below SFT-scale
    num_train_epochs=1,
)

trainer = DPOTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    processing_class=tokenizer,
)
trainer.train()
trainer.push_to_hub()
Related on TokRepo
- Coding tools -- ML development tools
- Research tools -- AI research frameworks
Common pitfalls
- DPO requires paired preference data (chosen vs rejected); poor-quality preference labels produce models that optimize for the wrong signal
- GRPO is memory-intensive due to group sampling; reduce the group size if you hit OOM errors
- Learning rates for alignment fine-tuning are much smaller than for SFT (typically 1e-7 to 5e-6); using SFT-scale learning rates causes catastrophic forgetting
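For the GRPO memory pitfall above, the group size is the main knob. A minimal sketch, assuming a recent TRL release where GRPOConfig exposes num_generations (values are illustrative):

```python
from trl import GRPOConfig

# Shrinking the group cuts the completions sampled per prompt,
# which is the dominant memory cost in GRPO.
training_args = GRPOConfig(
    output_dir='./grpo-run',
    num_generations=4,  # group size; lower this first when you hit OOM
)
```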
Frequently Asked Questions
What is the difference between RLHF and DPO?
RLHF trains a separate reward model and uses PPO to optimize the policy. DPO skips the reward model and directly optimizes the policy from preference pairs. DPO is simpler, more stable, and requires less compute. TRL supports both approaches.
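The DPO objective just described can be sketched in a few lines of plain Python. Given per-sequence log-probabilities under the policy and the frozen reference model, the per-pair loss is -log sigmoid of the scaled margin (function name and numbers are illustrative):

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: push the policy to prefer 'chosen' over
    'rejected' relative to the reference model, scaled by beta."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_ratio - rejected_ratio)
    return -math.log(1.0 / (1.0 + math.exp(-logits)))  # -log(sigmoid(logits))
```

When the policy already prefers the chosen response more than the reference does, the margin is positive and the loss is small; a flipped preference drives the loss up.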
Can I use TRL with LoRA or QLoRA?
Yes. TRL integrates with PEFT (Parameter-Efficient Fine-Tuning). Pass a PEFT config to the trainer to fine-tune with LoRA or QLoRA. This allows training large models on consumer GPUs by only updating a small fraction of parameters.
What hardware do I need?
For SFT and DPO with LoRA on a 7-8B model, a single GPU with 24 GB VRAM (RTX 3090, A5000) is sufficient. Full-parameter training of larger models requires multi-GPU setups with DeepSpeed or FSDP.
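The 24 GB figure can be sanity-checked with back-of-envelope arithmetic; this ignores activations, optimizer state, and DPO's frozen reference model, so treat it as a floor, not an estimate:

```python
# Weights alone for an 8B-parameter model loaded in bf16 (2 bytes/param).
params = 8e9
weight_gb = params * 2 / 1024**3  # roughly 15 GB before anything else
```

That leaves only a few GB of headroom on a 24 GB card, which is why LoRA (small trainable adapters) or QLoRA (4-bit base weights) is what makes single-GPU training practical at this scale.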
What is GRPO?
Group Relative Policy Optimization samples multiple completions for each prompt, scores them, and uses the relative rankings within the group as the training signal. It avoids the need for a separate reward model while being more sample-efficient than DPO for some tasks.
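The "relative rankings within the group" signal amounts to normalizing each completion's reward against its group's statistics. A minimal sketch in plain Python (function name is illustrative):

```python
def group_relative_advantages(rewards, eps=1e-8):
    """Score each completion against its own group: subtract the group
    mean and divide by the group standard deviation."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```

Completions scoring above their group's mean get positive advantages (reinforced), those below get negative ones, with no learned reward baseline needed.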
Does TRL support distributed training?
Yes. TRL builds on Hugging Face Accelerate, which supports multi-GPU and multi-node training via DeepSpeed, FSDP, and Megatron-LM. Configure your accelerate config and TRL handles the distributed training loop.
Citations (3)
- TRL GitHub -- TRL is a Hugging Face library for post-training with 17.9K+ GitHub stars
- arXiv -- Direct Preference Optimization paper
- Hugging Face Blog -- RLHF training methodology for language models
Source & Thanks
Created by Hugging Face. Licensed under Apache 2.0. huggingface/trl — 17,900+ GitHub stars
Related Assets
NAPI-RS — Build Node.js Native Addons in Rust
Write high-performance Node.js native modules in Rust with automatic TypeScript type generation and cross-platform prebuilt binaries.
Mamba — Fast Cross-Platform Package Manager
A drop-in conda replacement written in C++ that resolves environments in seconds instead of minutes.
Plasmo — The Browser Extension Framework
Build, test, and publish browser extensions for Chrome, Firefox, and Edge using React or Vue with hot-reload and automatic manifest generation.