Prompts · Apr 7, 2026 · 2 min read

Unsloth — Fine-Tune LLMs 2x Faster with 80% Less Memory

Fine-tune Llama, Mistral, Gemma, and Qwen models 2x faster using 80% less VRAM. Open-source with no accuracy loss. Train on a single GPU what used to need four.

Prompt Lab · Community
Quick Use

Use it first, then decide how deep to go

This block gives you everything to copy first: install Unsloth, load a 4-bit quantized model, and attach LoRA adapters.

pip install unsloth

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.3-70B-Instruct-bnb-4bit",
    max_seq_length=4096,
    load_in_4bit=True,
)

model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_alpha=16,
    lora_dropout=0,
)

What is Unsloth?

Unsloth is an open-source library that makes LLM fine-tuning 2x faster while using 80% less GPU memory — with zero accuracy loss. It achieves this through custom CUDA kernels, intelligent memory management, and optimized backpropagation. Fine-tune a 70B model on a single 48GB GPU.

Answer-Ready: Unsloth is an open-source LLM fine-tuning library that achieves 2x speed and 80% VRAM reduction with no accuracy loss. Supports Llama, Mistral, Gemma, Qwen, and Phi models. Fine-tune 70B models on a single GPU. 25k+ GitHub stars.

Best for: ML engineers fine-tuning open-source LLMs with limited GPU resources. Works with: Llama 3, Mistral, Gemma 2, Qwen 2.5, Phi-3. Setup time: Under 5 minutes.

Core Features

1. 2x Faster Training

Custom Triton kernels optimize attention, RoPE, and cross-entropy:

Standard fine-tuning: 8 hours on A100
Unsloth fine-tuning:  4 hours on A100 (same accuracy)

2. 80% Less VRAM

Model         Standard  Unsloth
Llama 3 8B    24GB      6GB
Llama 3 70B   160GB     48GB
Mistral 7B    20GB      5GB
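The Unsloth column roughly tracks what you'd expect from 4-bit weights plus runtime overhead. A back-of-envelope estimate (the 1.4x overhead factor is an assumption covering activations, LoRA optimizer state, and CUDA buffers, not an Unsloth-published number):

```python
# 4-bit quantization stores ~0.5 bytes per parameter for the frozen weights.
def vram_gb_4bit(n_params_billion: float, overhead: float = 1.4) -> float:
    weights_gb = n_params_billion * 0.5
    return weights_gb * overhead

print(f"8B  model: ~{vram_gb_4bit(8):.1f} GB")   # ~5.6 GB, near the 6GB above
print(f"70B model: ~{vram_gb_4bit(70):.1f} GB")  # ~49 GB, near the 48GB above
```

Longer sequence lengths and larger batch sizes grow the activation share, so treat the overhead factor as a floor, not a guarantee.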

3. Full Training Pipeline

from unsloth import FastLanguageModel
from trl import SFTTrainer
from datasets import load_dataset

# Load model
model, tokenizer = FastLanguageModel.from_pretrained(
    "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",  # Llama 3.3 has no 8B size
    max_seq_length=2048,
    load_in_4bit=True,
)

# Add LoRA adapters
model = FastLanguageModel.get_peft_model(model, r=16, lora_alpha=16)

# Train
dataset = load_dataset("your-dataset", split="train")
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    max_seq_length=2048,
    dataset_text_field="text",
)
trainer.train()

# Save LoRA adapters and tokenizer
model.save_pretrained("my-finetuned-model")
tokenizer.save_pretrained("my-finetuned-model")
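`SFTTrainer` reads raw strings from `dataset_text_field="text"`, so instruction/response pairs usually need flattening into one prompt string first. The Alpaca-style template below is an illustrative assumption, not an Unsloth requirement; match whatever chat template your base model expects.

```python
# Flatten an instruction/response pair into a single training string.
ALPACA_TEMPLATE = "### Instruction:\n{instruction}\n\n### Response:\n{response}"

def to_text(example: dict) -> dict:
    return {"text": ALPACA_TEMPLATE.format(**example)}

row = {"instruction": "Translate to French: hello", "response": "bonjour"}
print(to_text(row)["text"])

# With Hugging Face datasets, apply it to every row:
#   dataset = dataset.map(to_text)
```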

4. Export Formats

# Save as GGUF for Ollama
model.save_pretrained_gguf("model", tokenizer, quantization_method="q4_k_m")

# Save merged 16-bit
model.save_pretrained_merged("model-merged", tokenizer)

# Push to Hugging Face
model.push_to_hub("username/my-model")
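The q4_k_m GGUF above can be served locally with Ollama via a Modelfile. The file path below is an assumption; check the directory that save_pretrained_gguf actually writes for the real filename.

```
# Modelfile (path assumed from the export step above)
FROM ./model/unsloth.Q4_K_M.gguf
```

Then register and run it:

```
ollama create my-finetune -f Modelfile
ollama run my-finetune "Summarize Unsloth in one sentence."
```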

5. Supported Models

Model Family     Sizes
Llama 3/3.1/3.3  1B, 3B, 8B, 70B
Mistral/Mixtral  7B, 8x7B
Gemma 2          2B, 9B, 27B
Qwen 2.5         0.5B-72B
Phi-3/3.5        3.8B, 14B

Google Colab

Free notebook to fine-tune on Colab's T4 GPU:

  • Llama 3 8B fine-tuning in 30 minutes
  • Uses only 7GB VRAM (fits T4's 15GB)

FAQ

Q: Does it affect model quality? A: No, Unsloth produces mathematically identical results. The speedup comes from kernel optimization, not approximation.
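"Kernel optimization, not approximation" is easy to illustrate with cross-entropy, one of the ops Unsloth fuses. A fused log-softmax + NLL computes the same value as the naive two-step version, just in fewer passes and with better numerical stability. The toy 1-D example below is a plain-Python illustration of that principle, not Unsloth's actual Triton kernel.

```python
import math

def naive_ce(logits, target):
    # Two passes: materialize softmax probabilities, then take -log.
    exps = [math.exp(x) for x in logits]
    probs = [e / sum(exps) for e in exps]
    return -math.log(probs[target])

def fused_ce(logits, target):
    # One pass: -log softmax via a max-shifted log-sum-exp.
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return lse - logits[target]

logits = [2.0, 0.5, -1.0]
assert abs(naive_ce(logits, 0) - fused_ce(logits, 0)) < 1e-12
```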

Q: Can I use it for full fine-tuning (not LoRA)? A: Yes, but LoRA with Unsloth gives the best speed/memory ratio.

Q: Does it work on consumer GPUs? A: Yes, fine-tune 8B models on RTX 3060 (12GB) or even Google Colab T4.


Source & Thanks

Created by Unsloth AI. Licensed under Apache 2.0.

unslothai/unsloth — 25k+ stars
