Scripts · Mar 31, 2026 · 2 min read

TRL — Post-Training LLMs with RLHF & DPO

TRL is a Hugging Face library for post-training foundation models. 17.9K+ GitHub stars. SFT, GRPO, DPO, reward modeling. Scales from single GPU to multi-node. Apache 2.0.

TL;DR
TRL provides SFT, DPO, GRPO, and reward modeling for post-training LLMs at any scale.
§01

What it is

TRL (Transformer Reinforcement Learning) is a Hugging Face library for post-training foundation models. It implements supervised fine-tuning (SFT), direct preference optimization (DPO), group relative policy optimization (GRPO), reward modeling, and reinforcement learning from human feedback (RLHF). It scales from a single consumer GPU to multi-node clusters.

TRL targets ML engineers and researchers fine-tuning language models to follow instructions, align with preferences, or specialize for specific tasks.

§02

How it saves time or tokens

TRL wraps complex training loops (reward modeling, PPO, DPO) into clean trainer classes. Instead of implementing RLHF from scratch with custom loss functions and rollout buffers, you configure a DPOTrainer or GRPOTrainer with a few parameters and call .train().

Integration with Hugging Face Transformers and PEFT means you can fine-tune with LoRA or QLoRA to fit large models on consumer hardware.
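
As a minimal sketch, this is what LoRA-based DPO looks like, reusing the model, dataset, and training-argument names from the example in §04 below; the rank, alpha, and target modules are illustrative choices, not tuned defaults:

from peft import LoraConfig

peft_config = LoraConfig(
    r=16,                                  # adapter rank (illustrative)
    lora_alpha=32,                         # scaling factor (illustrative)
    target_modules=['q_proj', 'v_proj'],   # attention projections to adapt
    task_type='CAUSAL_LM',
)

# With peft_config set, only the adapter weights are trained; the frozen
# base model also serves as the DPO reference model, saving memory.
trainer = DPOTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    processing_class=tokenizer,
    peft_config=peft_config,
)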

§03

How to use

  1. Install TRL: pip install trl
  2. Prepare your dataset in the required format (prompt, chosen, rejected columns for DPO; see the sketch after this list)
  3. Configure the trainer with model, dataset, and training arguments
  4. Run training and push the result to Hugging Face Hub
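
For step 2, DPOTrainer expects each record to carry a prompt plus a preferred and a dispreferred completion. A sketch of one record in the standard (non-conversational) format, with invented placeholder text:

# One preference pair in the column layout DPOTrainer expects.
example = {
    'prompt': 'Explain what a hash table is.',
    'chosen': 'A hash table maps keys to values via a hash function, giving O(1) average lookups.',
    'rejected': 'A hash table is a kind of binary tree.',
}

Conversational datasets (lists of chat messages) are also accepted; TRL applies the tokenizer's chat template to them.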
§04

Example

from trl import DPOTrainer, DPOConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset

# Load the policy model and tokenizer (gated repo: accept the license on the Hub first)
model = AutoModelForCausalLM.from_pretrained('meta-llama/Meta-Llama-3-8B')
tokenizer = AutoTokenizer.from_pretrained('meta-llama/Meta-Llama-3-8B')
tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers ship without a pad token

# Preference pairs with prompt/chosen/rejected columns
dataset = load_dataset('trl-lib/ultrafeedback_binarized', split='train')

training_args = DPOConfig(
    output_dir='./dpo-llama3',
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,   # effective batch of 8 per device
    learning_rate=5e-7,              # alignment LRs sit well below SFT-scale
    num_train_epochs=1,
)

trainer = DPOTrainer(
    model=model,               # TRL clones a frozen reference model when none is given
    args=training_args,
    train_dataset=dataset,
    processing_class=tokenizer,
)
trainer.train()
trainer.push_to_hub()  # upload the fine-tuned model to the Hugging Face Hub
§05

Common pitfalls

  • DPO requires paired preference data (chosen vs rejected); poor-quality preference labels produce models that optimize for the wrong signal
  • GRPO is memory-intensive due to group sampling; reduce the group size (num_generations in GRPOConfig) if you hit OOM errors
  • Learning rates for alignment fine-tuning are much smaller than for SFT (typically 1e-7 to 5e-6); using SFT-scale learning rates causes catastrophic forgetting

Frequently Asked Questions

What is the difference between DPO and RLHF?

RLHF trains a separate reward model and uses PPO to optimize the policy. DPO skips the reward model and directly optimizes the policy from preference pairs. DPO is simpler, more stable, and requires less compute. TRL supports both approaches.
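
To make the contrast concrete, here is a sketch of the core DPO objective from the Rafailov et al. paper, not TRL's internal implementation: push up the policy's log-ratio on chosen completions relative to rejected ones, measured against a frozen reference model.

import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Log-ratios measure how far the policy has moved from the reference
    chosen = policy_chosen_logps - ref_chosen_logps
    rejected = policy_rejected_logps - ref_rejected_logps
    # beta controls how strongly the policy may deviate from the reference
    return -F.logsigmoid(beta * (chosen - rejected)).mean()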

Can I use TRL with LoRA or QLoRA?

Yes. TRL integrates with PEFT (Parameter Efficient Fine-Tuning). Pass a PEFT config to the trainer to fine-tune with LoRA or QLoRA. This allows training large models on consumer GPUs by only updating a small fraction of parameters.

What hardware do I need for TRL?

For SFT and DPO with LoRA on a 7-8B model, a single GPU with 24 GB VRAM (RTX 3090, A5000) is sufficient. Full-parameter training of larger models requires multi-GPU setups with DeepSpeed or FSDP.
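
A rough back-of-envelope behind the 24 GB figure, as a comment-only sketch (approximate; actual usage varies with sequence length, batch size, and caching):

# Frozen base weights: 8e9 params * 2 bytes (bf16) ≈ 16 GB.
# LoRA adapters, their gradients, and Adam states: typically well under 1 GB.
# The remaining ~7 GB of a 24 GB card absorbs activations and buffers,
# helped by gradient checkpointing and small per-device batch sizes.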

What is GRPO?

Group Relative Policy Optimization samples multiple completions for each prompt, scores them, and uses the relative rankings within the group as the training signal. It avoids the need for a separate reward model while being more sample-efficient than DPO for some tasks.
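
A minimal sketch of a GRPO run with a toy length-based reward, loosely following the TRL quickstart; the model, dataset, and reward here are placeholders, and num_generations is the per-prompt group size:

from trl import GRPOTrainer, GRPOConfig
from datasets import load_dataset

# Toy reward: prefer completions close to 50 characters. Real setups use
# task-specific scoring such as verifiers or a learned reward model.
def reward_len(completions, **kwargs):
    return [-abs(50 - len(c)) for c in completions]

dataset = load_dataset('trl-lib/tldr', split='train')

training_args = GRPOConfig(
    output_dir='./grpo-demo',
    num_generations=8,   # completions sampled per prompt; lower it on OOM
)
trainer = GRPOTrainer(
    model='Qwen/Qwen2-0.5B-Instruct',
    reward_funcs=reward_len,
    args=training_args,
    train_dataset=dataset,
)
trainer.train()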

Does TRL support multi-node training?

Yes. TRL builds on Hugging Face Accelerate, which supports multi-GPU and multi-node training via DeepSpeed, FSDP, and Megatron-LM. Configure your accelerate config and TRL handles the distributed training loop.
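
As a sketch, assuming the DPO script above is saved as train_dpo.py (a placeholder filename):

accelerate config                # pick DeepSpeed/FSDP, GPU count, and node layout once
accelerate launch train_dpo.py   # run the unchanged script under that configuration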

Citations (3)
  • TRL GitHub — TRL is a Hugging Face library for post-training with 17.9K+ GitHub stars
  • arXiv — Direct Preference Optimization paper
  • Hugging Face Blog — RLHF training methodology for language models
🙏

Source & Thanks

Created by Hugging Face. Licensed under Apache 2.0. huggingface/trl — 17,900+ GitHub stars
