Unsloth — Fine-Tune LLMs 2x Faster with 80% Less Memory
Fine-tune Llama, Mistral, Gemma, and Qwen models 2x faster using 80% less VRAM. Open-source with no accuracy loss. Train on a single GPU what used to need four.
Ready-to-run agent install
This asset can be installed after the agent chooses its runtime, checks the plan, and runs the matching command.
npx -y tokrepo@latest install 149b2641-d550-4faf-96de-3c6aee66ec58 --target codex
Run after dry-run confirms the install plan.
What it is
Unsloth is an open-source library that accelerates LLM fine-tuning by 2x while reducing VRAM usage by up to 80%. It supports popular model families including Llama, Mistral, Gemma, and Qwen.
The tool targets ML engineers and hobbyists who want to fine-tune large language models on consumer-grade hardware. What previously required four GPUs can now run on a single card without sacrificing model accuracy.
The project is actively maintained and suitable for both individual developers and teams looking to integrate it into their existing toolchain. Documentation and community support are available for onboarding.
How it saves time or tokens
Unsloth's custom CUDA kernels and memory-efficient training loops cut fine-tuning time in half. The 80% VRAM reduction means you can train a 7B model on a single 24GB GPU instead of needing multi-GPU setups. This translates directly to lower cloud compute costs and faster iteration cycles. The estimated token budget for this workflow is around 4,100 tokens.
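As a rough illustration of why 4-bit loading matters for a 7B model, the sketch below estimates the memory taken by the weights alone. It assumes 2 bytes per parameter for 16-bit weights and 0.5 bytes for 4-bit weights, and it ignores quantization metadata, activations, gradients, and optimizer state, so real usage will be higher. The function name is hypothetical and is not part of Unsloth.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for model weights only, in GB (10**9 bytes)."""
    bytes_total = num_params * bits_per_param / 8
    return bytes_total / 1e9


# Assumption: a "7B" model has exactly 7e9 parameters.
params_7b = 7e9
fp16_gb = weight_memory_gb(params_7b, 16)  # 16-bit weights
int4_gb = weight_memory_gb(params_7b, 4)   # 4-bit quantized weights

print(f"16-bit weights: {fp16_gb:.1f} GB")
print(f"4-bit weights:  {int4_gb:.1f} GB")
```

Under these assumptions the 4-bit weights take a quarter of the space of the 16-bit weights, which leaves most of a 24GB card free for activations and adapter training state.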
For teams evaluating multiple tools in the same category, the clear documentation and active community reduce the time spent on research and troubleshooting. Getting started takes minutes rather than hours of configuration.
How to use
- Install Unsloth via pip with your preferred PyTorch and CUDA versions.
- Load a base model (Llama 3, Mistral, Gemma, or Qwen) using the Unsloth fast loading path.
- Apply LoRA adapters with your chosen rank and target modules.
- Train using the standard Hugging Face Trainer or SFTTrainer with your dataset.
- Save or merge the adapter weights back into the base model for deployment.
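Before the first step, it can help to confirm which PyTorch and CUDA versions are actually installed and how much VRAM the GPU reports. The following is a minimal sketch, not an official Unsloth check: the helper name and the 8 GB threshold constant are illustrative (the threshold is taken from the FAQ on this page), and the function returns a dictionary rather than raising when PyTorch is missing.

```python
import importlib.util

MIN_VRAM_GB = 8  # assumption: minimum VRAM figure quoted in this page's FAQ


def describe_gpu_environment() -> dict:
    """Collect PyTorch/CUDA details without failing if PyTorch is absent."""
    info = {
        "torch_installed": False,
        "torch_version": None,
        "cuda_version": None,
        "cuda_available": False,
        "vram_gb": None,
    }
    if importlib.util.find_spec("torch") is None:
        return info

    import torch

    info["torch_installed"] = True
    info["torch_version"] = torch.__version__
    info["cuda_version"] = torch.version.cuda  # None on CPU-only builds
    info["cuda_available"] = torch.cuda.is_available()
    if info["cuda_available"]:
        props = torch.cuda.get_device_properties(0)
        info["vram_gb"] = props.total_memory / 1e9
    return info


env = describe_gpu_environment()
print(env)
if env["vram_gb"] is not None and env["vram_gb"] < MIN_VRAM_GB:
    print(f"Warning: GPU reports less than {MIN_VRAM_GB} GB of VRAM")
```

The reported versions can then be compared against the compatibility information in the Unsloth documentation before installing.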
Example
from unsloth import FastLanguageModel

# Load model with 4-bit quantization
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name='unsloth/llama-3-8b-bnb-4bit',
    max_seq_length=2048,
    load_in_4bit=True,
)

# Add LoRA adapters
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=['q_proj', 'k_proj', 'v_proj', 'o_proj'],
    lora_alpha=16,
    lora_dropout=0,
)
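To give a sense of how small the adapters in the example are, the sketch below counts the trainable parameters LoRA adds for a given rank. LoRA attaches two low-rank matrices to each adapted linear layer, so the added parameter count is rank times the sum of the layer's input and output sizes. The layer shapes and layer count are assumptions based on the published Llama 3 8B configuration (hidden size 4096, key/value projection size 1024, 32 layers), and the helper names are hypothetical.

```python
def lora_param_count(in_features: int, out_features: int, rank: int) -> int:
    """LoRA adds A (rank x in) and B (out x rank) to one linear layer."""
    return rank * (in_features + out_features)


# Assumed Llama 3 8B attention shapes: (in_features, out_features)
ATTENTION_SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
}
NUM_LAYERS = 32  # assumed number of transformer blocks


def total_lora_params(rank: int) -> int:
    """Total adapter parameters across all layers for the four modules."""
    per_layer = sum(
        lora_param_count(i, o, rank) for i, o in ATTENTION_SHAPES.values()
    )
    return per_layer * NUM_LAYERS


for r in (16, 64):
    print(f"r={r}: {total_lora_params(r):,} trainable adapter parameters")
```

The count scales linearly with the rank, so quadrupling r from 16 to 64 quadruples the adapter size along with the gradient and optimizer memory that goes with it.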
Related on TokRepo
- Local LLM Tools — Compare inference engines that run fine-tuned models locally after training.
- AI Tools for Coding — Explore how fine-tuned models integrate into coding workflows.
Common pitfalls
- Using a CUDA version incompatible with Unsloth's kernels causes silent fallback to slower paths. Check the compatibility matrix before installing.
- Setting LoRA rank too high negates memory savings. Start with r=16 and increase only if quality metrics demand it.
- Forgetting to merge adapters before deployment means the inference stack has to load the base model and the adapter weights separately, which complicates the setup.
- Applying the skill without reading the documentation first hurts the quality of results, because each skill has specific prerequisites and configuration requirements.
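The merge mentioned in the pitfalls can be illustrated with plain arithmetic. In the LoRA formulation, the adapted layer computes W x + (alpha / r) B A x, and merging folds the update into a single matrix W' = W + (alpha / r) B A. The sketch below uses a toy 2x3 weight in pure Python to show that the merged and unmerged forms give the same output; the matrices and helper names are made up for illustration and do not reflect how Unsloth implements merging internally.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Element-wise sum of two same-shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def scale(a, s):
    """Multiply every entry of a matrix by a scalar."""
    return [[x * s for x in row] for row in a]


def merge_lora(W, A, B, alpha, rank):
    """Fold the adapter into the base weight: W' = W + (alpha / rank) * B @ A."""
    return add(W, scale(matmul(B, A), alpha / rank))


# Toy shapes: out=2, in=3, rank=1 (illustrative values only)
W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]  # base weight, out x in
A = [[0.5, -1.0, 2.0]]                   # LoRA A, rank x in
B = [[2.0], [4.0]]                       # LoRA B, out x rank
alpha, rank = 2, 1
x = [[1.0], [2.0], [3.0]]                # input column vector

# Unmerged: base output plus scaled adapter output
unmerged = add(matmul(W, x), scale(matmul(B, matmul(A, x)), alpha / rank))
# Merged: a single matrix multiply with the folded weight
merged = matmul(merge_lora(W, A, B, alpha, rank), x)

print("unmerged:", unmerged)
print("merged:  ", merged)
```

Because the two forms agree, a merged model can be served as a single set of weights with no adapter-specific loading step.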
Frequently Asked Questions
Which models does Unsloth support?
Unsloth supports Llama 2 and 3, Mistral, Gemma, Qwen, and other popular model families. The library provides pre-configured fast-loading paths for common model variants, including 4-bit quantized versions.
Does Unsloth work without an NVIDIA GPU?
No. Unsloth relies on custom CUDA kernels for its speed and memory optimizations. A CUDA-compatible NVIDIA GPU with at least 8GB VRAM is required. AMD GPUs are not supported.
How does Unsloth reduce VRAM usage?
Unsloth uses gradient checkpointing, custom Triton kernels, and memory-efficient attention implementations to reduce peak VRAM usage. Combined with 4-bit quantization (QLoRA), this allows 7B models to be fine-tuned on a single 24GB card.
Can Unsloth be used with the Hugging Face trainers?
Yes. Unsloth models are compatible with the standard Hugging Face SFTTrainer and Trainer classes. You prepare your model with Unsloth's fast loading, then pass it to the trainer as usual.
Does Unsloth reduce model accuracy?
No. Unsloth's optimizations are mathematically equivalent to standard training. The speedup comes from kernel-level optimizations, not from approximations that would reduce model quality.
Source & Thanks
Created by Unsloth AI. Licensed under Apache 2.0.
unslothai/unsloth — 25k+ stars
Related Assets
LitGPT — Fine-Tune and Deploy AI Models Simply
Lightning AI's framework for fine-tuning and serving 20+ LLM families. LitGPT supports LoRA, QLoRA, full fine-tuning with one-command training on consumer hardware.
Unsloth — 2x Faster Local LLM Training & Inference
Unsloth is a unified local interface for running and training AI models. 58.7K+ GitHub stars. 2x faster training with 70% less VRAM across 500+ models including Qwen, DeepSeek, Llama, Gemma. Web UI wi
LLaMA-Factory — Fine-Tune 100+ LLMs with a Unified Interface
LLaMA-Factory provides a web UI and CLI to fine-tune large language models including LLaMA, Mistral, Qwen, and more using LoRA, QLoRA, and full-parameter methods without writing training scripts.
QLoRA — Memory-Efficient Fine-Tuning for Quantized LLMs
QLoRA enables fine-tuning of large language models on consumer GPUs by backpropagating gradients through a frozen 4-bit quantized model into Low-Rank Adapters. It reduces memory requirements enough to fine-tune a 65B parameter model on a single 48GB GPU while preserving full 16-bit performance.