Configs · Apr 21, 2026 · 3 min read

PEFT — Parameter-Efficient Fine-Tuning for Large Language Models

PEFT is a Hugging Face library for adapting large pre-trained models using parameter-efficient methods like LoRA, QLoRA, prompt tuning, and prefix tuning. It enables fine-tuning billion-parameter models on consumer hardware by updating only a small fraction of weights.

Introduction

PEFT (Parameter-Efficient Fine-Tuning) is a Hugging Face library that makes it practical to fine-tune large language models without requiring massive GPU clusters. By updating only a tiny subset of model parameters, PEFT methods like LoRA achieve results close to full fine-tuning while using a fraction of the memory and compute.

What PEFT Does

  • Applies LoRA (Low-Rank Adaptation) to inject trainable rank-decomposition matrices into model layers
  • Supports QLoRA for fine-tuning quantized 4-bit models on a single GPU
  • Implements prompt tuning, prefix tuning, and P-tuning for soft-prompt based adaptation
  • Provides adapter methods including IA3 and AdaLoRA for different efficiency-accuracy tradeoffs
  • Integrates seamlessly with Hugging Face Transformers, Diffusers, and Accelerate

Architecture Overview

PEFT wraps a pre-trained model by injecting small trainable modules while freezing the original weights. For LoRA, this means adding low-rank matrices A and B to attention layers such that the effective weight becomes W + BA. During training only A and B are updated, reducing trainable parameters by 99%+. The adapter weights are saved separately and can be merged back into the base model for deployment.
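The low-rank update described above can be sketched in plain NumPy. This is a toy illustration of the math, not PEFT's internals; it follows the standard LoRA formulation, where the update is scaled by alpha/r and B is zero-initialized so training starts from the unmodified base weight:

```python
import numpy as np

d, r, alpha = 2048, 16, 32              # hidden size, LoRA rank, scaling numerator
rng = np.random.default_rng(0)

W = rng.standard_normal((d, d))         # frozen pre-trained weight
A = rng.standard_normal((r, d)) * 0.01  # trainable down-projection (r x d)
B = np.zeros((d, r))                    # trainable up-projection, zero-init

# Effective weight used in the forward pass: W + (alpha/r) * BA.
# Because B starts at zero, W_eff == W before any training step.
W_eff = W + (alpha / r) * B @ A

# Only A and B are trainable: 2*r*d parameters versus d*d in the full layer.
trainable = A.size + B.size
total = W.size
print(f"trainable fraction: {trainable / total:.4%}")
```

For d=2048 and r=16 the trainable fraction of this one layer is about 1.6%; at real hidden sizes (4096+) and with only a subset of layers adapted, the model-wide fraction drops well below 1%.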

Self-Hosting & Configuration

  • Install via pip: pip install peft alongside transformers and accelerate
  • LoRA config requires choosing r (rank), lora_alpha (scaling), and target_modules (which layers to adapt)
  • QLoRA setup needs bitsandbytes for 4-bit quantization: pip install bitsandbytes
  • Save adapters with model.save_pretrained() — only adapter weights are saved (typically 10-50 MB)
  • Load and merge: PeftModel.from_pretrained(base_model, adapter_path).merge_and_unload()
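Putting the configuration steps above together, a minimal LoRA setup looks roughly like the following. The model name, rank, and target module names are illustrative placeholders (target module names vary by architecture), not a prescribed recipe:

```python
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

# Placeholder base model; any causal LM from the Hub works the same way.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

config = LoraConfig(
    r=16,                                  # rank of the update matrices
    lora_alpha=32,                         # scaling: effective update is (alpha/r) * BA
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    lora_dropout=0.05,
    task_type=TaskType.CAUSAL_LM,
)

model = get_peft_model(base, config)
model.print_trainable_parameters()         # reports trainable vs. total params

# After training, save only the adapter weights (typically tens of MB):
model.save_pretrained("my-lora-adapter")
```

Since this is essentially a configuration fragment around a large model download, treat it as a template to adapt rather than a script to run verbatim.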

Key Features

  • Train 7B+ parameter models on a single consumer GPU with QLoRA
  • Multiple adapter methods: LoRA, AdaLoRA, IA3, prompt tuning, prefix tuning
  • Adapter composition allows stacking and combining multiple fine-tuned adapters
  • Native integration with Hugging Face Hub for sharing and loading community adapters
  • Supports multi-adapter inference for switching between tasks without reloading models
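The multi-adapter switching mentioned above can be sketched as follows; the adapter paths and names are hypothetical placeholders:

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")

# Load a first adapter under a name, then attach a second one.
model = PeftModel.from_pretrained(
    base, "path/to/summarization-adapter", adapter_name="summarize"
)
model.load_adapter("path/to/translation-adapter", adapter_name="translate")

# Switch tasks without reloading the 7B base model:
model.set_adapter("summarize")
# ... run summarization inference ...
model.set_adapter("translate")
# ... run translation inference ...
```

Only the small adapter tensors change when switching; the frozen base weights stay resident in GPU memory.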

Comparison with Similar Tools

  • Full Fine-Tuning — updates all parameters for best accuracy but requires 4-10x more GPU memory
  • Unsloth — optimized LoRA training with 2x speed gains but narrower model support
  • LLaMA-Factory — GUI-driven fine-tuning with PEFT methods built in but less flexible for custom setups
  • Axolotl — config-driven fine-tuning wrapper that uses PEFT under the hood
  • OpenDelta — alternative PEFT library from Tsinghua with similar methods but smaller community

FAQ

Q: How much memory does LoRA save compared to full fine-tuning? A: LoRA typically reduces trainable parameters by over 99% and GPU memory by 60-80%. A 7B model that needs roughly 60 GB for full fine-tuning can be trained in about 16 GB with QLoRA.
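The 99%+ figure can be checked with back-of-envelope arithmetic. The numbers below are illustrative for a LLaMA-7B-like architecture (32 layers, hidden size 4096) with LoRA rank 16 applied to two attention projections per layer:

```python
layers, d, r = 32, 4096, 16
adapted_per_layer = 2                 # e.g. q_proj and v_proj
# Each adapted linear layer gets A (r x d) and B (d x r): 2*r*d params.
lora_params = layers * adapted_per_layer * 2 * r * d
total_params = 7_000_000_000          # ~7B base parameters

fraction = lora_params / total_params
print(f"LoRA params: {lora_params:,} ({fraction:.4%} of the base model)")
```

Roughly 8.4 million trainable parameters against 7 billion total, i.e. about 0.12% of the model, which is where the "99%+ reduction" claim comes from.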

Q: Does LoRA fine-tuning match full fine-tuning quality? A: For most downstream tasks, LoRA with rank 16-64 achieves 95-100% of full fine-tuning performance. Tasks requiring broad knowledge changes may benefit from higher rank.

Q: Can I combine multiple LoRA adapters? A: Yes, PEFT supports adapter composition via add_adapter() and weighted merging to combine skills from different fine-tunes.
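A weighted merge of two already-loaded LoRA adapters can be sketched as below; the adapter names and weights are hypothetical, and `model` is assumed to be a `PeftModel` with both adapters attached:

```python
# Assumes `model` is a PeftModel with LoRA adapters loaded under the
# placeholder names "style" and "domain".
model.add_weighted_adapter(
    adapters=["style", "domain"],
    weights=[0.7, 0.3],
    adapter_name="style_plus_domain",
    combination_type="linear",
)
model.set_adapter("style_plus_domain")
```

Other combination types (e.g. concatenation-based merges) exist; which one behaves best is task-dependent, so the weights above are a starting point rather than a recommendation.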

Q: What models work with PEFT? A: Any Hugging Face Transformers or Diffusers model. This includes LLaMA, Mistral, Gemma, Stable Diffusion, and hundreds more.
