Unsloth — 2x Faster Local LLM Training & Inference
What it is
Unsloth is a unified local interface for running and training AI models. It provides up to 2x faster training with 70% less VRAM usage across 500+ models including Qwen, DeepSeek, Llama, and Gemma. It includes a web UI with one-click fine-tuning, a CLI for automated workflows, and full compatibility with the Hugging Face ecosystem.
It targets ML engineers and developers who want to fine-tune LLMs on consumer GPUs without expensive cloud compute, and researchers who need faster iteration cycles.
How it saves time or tokens
Unsloth's memory optimizations let you fine-tune models that would otherwise require multiple expensive GPUs on a single consumer GPU. A model that needs 48GB VRAM with standard training may need only 14GB with Unsloth. This means you can fine-tune on an RTX 4090 instead of renting an A100, saving significant compute costs.
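As a back-of-the-envelope illustration (my own arithmetic, not Unsloth's published accounting), most of the gap comes from holding the base weights in 4-bit instead of 16-bit precision:

```python
def weight_vram_gb(n_params_billion: float, bits: int) -> float:
    """Rough VRAM needed just to hold the model weights, in decimal GB."""
    # 1e9 params * (bits / 8) bytes each = n_params_billion * bits / 8 GB
    return n_params_billion * bits / 8

fp16 = weight_vram_gb(7, 16)  # 14.0 GB for a 7B model, before activations
nf4 = weight_vram_gb(7, 4)    # 3.5 GB when quantized to 4-bit for QLoRA
```

Real training adds activations, gradients, and optimizer state on top of this, but since QLoRA only trains small LoRA adapters, those extra buffers stay small too.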
How to use
- Install:
curl -fsSL https://unsloth.ai/install.sh | sh
Or via pip:
pip install unsloth
- Fine-tune a model:
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name='unsloth/Llama-3.2-3B-Instruct',
    max_seq_length=2048,
    load_in_4bit=True,  # QLoRA: quantize base weights to 4-bit
)
model = FastLanguageModel.get_peft_model(
    model, r=16, lora_alpha=16,
    target_modules=['q_proj', 'k_proj', 'v_proj', 'o_proj'],
)
# Train with your dataset using the standard Hugging Face Trainer
- Or use the web UI for no-code fine-tuning.
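The training step itself can be sketched with trl's SFTTrainer, which Unsloth's documentation pairs with models loaded this way. The dataset name, the `text` column, and the hyperparameters below are illustrative assumptions, not tuned recommendations, and a CUDA GPU is required:

```python
# Sketch only: assumes the unsloth, trl, and datasets packages and a CUDA GPU.
from unsloth import FastLanguageModel
from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-3B-Instruct",
    max_seq_length=2048,
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model, r=16, lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

dataset = load_dataset("yahma/alpaca-cleaned", split="train")  # example dataset

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",          # assumes a pre-formatted 'text' column
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,  # effective batch size of 8
        max_steps=60,
        learning_rate=2e-4,
        output_dir="outputs",
    ),
)
trainer.train()
```

Most datasets need a formatting step to produce that single text column from instruction/response fields before training.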
Example
| Metric | Standard training | Unsloth |
|---|---|---|
| Training speed | 1x | 2x |
| VRAM usage | 100% | 30% |
| RTX 4090 max model | 7B | 20B+ |
| Total cost (cloud A100 at $3/hr) | 1x | ~0.5x (same hourly rate, roughly half the wall-clock time) |
Related on TokRepo
- Local LLM tools -- tools for running LLMs locally
- AI tools for coding -- developer tools for AI
Common pitfalls
- Unsloth optimizations are specific to certain model architectures. Check compatibility before starting a training run with a new model.
- 4-bit training (QLoRA) reduces VRAM usage further but may slightly affect model quality compared to full-precision LoRA.
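To see why 4-bit can cost a little fidelity, here is a toy absmax 4-bit round trip. This illustrates quantization error in general; it is not the NF4 scheme bitsandbytes actually uses:

```python
import numpy as np

def quantize_absmax_4bit(x):
    # Map floats onto 16 signed integer levels (-8..7), scaled per tensor.
    scale = np.abs(x).max() / 7.0
    q = np.clip(np.round(x / scale), -8, 7).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
w = rng.standard_normal(1024).astype(np.float32)   # stand-in for a weight tensor
q, scale = quantize_absmax_4bit(w)
w_hat = q.astype(np.float32) * scale               # dequantize
mean_err = float(np.abs(w - w_hat).mean())         # small but nonzero error
```

The residual `mean_err` is the per-weight noise QLoRA accepts in exchange for a 4x smaller weight footprint; the LoRA adapters, trained in higher precision, compensate for much of it.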
- The web UI is convenient for getting started, but the Python API provides more control for advanced training configurations.
Frequently Asked Questions
What hardware does Unsloth support?
Unsloth supports NVIDIA GPUs with CUDA (RTX 3060 and newer are recommended). Apple Silicon support is available through the MLX backend. AMD GPUs have experimental support via ROCm. The VRAM savings are most impactful on consumer GPUs like the RTX 4090 where memory is limited.
Does Unsloth support LoRA and QLoRA?
Yes. Unsloth fully supports LoRA (Low-Rank Adaptation) and QLoRA (Quantized LoRA) training methods. QLoRA combines 4-bit quantization with LoRA to minimize VRAM usage. Both methods produce models compatible with the standard Hugging Face ecosystem.
Can I export trained models to GGUF?
Yes. Unsloth can export trained models to GGUF format for use with llama.cpp, Ollama, and other inference engines. This lets you train with Unsloth and deploy with your preferred serving solution. The export handles quantization and format conversion automatically.
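Export is a single call on the trained model. This is a sketch that assumes `model` and `tokenizer` come from a completed fine-tuning run; `q4_k_m` is one of llama.cpp's standard quantization presets:

```python
# Sketch: assumes `model` and `tokenizer` from a finished Unsloth training run.
# Writes a quantized .gguf file into the target directory.
model.save_pretrained_gguf("my_model_gguf", tokenizer, quantization_method="q4_k_m")
```

The resulting .gguf file can then be loaded directly by llama.cpp, or registered with Ollama via a Modelfile that points at it.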
Is Unsloth free to use?
Unsloth has an open-source version that is free for personal and commercial use. A Pro version offers additional features like longer context support, more model architectures, and priority support. The free version covers most common fine-tuning use cases.
How does Unsloth achieve its speed and memory gains?
Unsloth uses custom CUDA kernels optimized for transformer attention patterns, intelligent memory management that reduces fragmentation, and efficient gradient checkpointing. These optimizations are applied automatically when you load a model through Unsloth's API. No manual tuning is needed.
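Gradient checkpointing, one of the techniques mentioned, trades compute for memory: activations are recomputed during the backward pass instead of being stored in the forward pass. A minimal PyTorch illustration of the mechanism itself (not Unsloth's custom implementation):

```python
import torch
from torch.utils.checkpoint import checkpoint

# A small block whose intermediate activations we choose not to store.
layer = torch.nn.Sequential(
    torch.nn.Linear(16, 64), torch.nn.ReLU(), torch.nn.Linear(64, 16)
)
x = torch.randn(4, 16, requires_grad=True)

# Inside `checkpoint`, intermediates are discarded after the forward pass
# and recomputed on demand during backward, cutting peak activation memory.
y = checkpoint(layer, x, use_reentrant=False)
y.sum().backward()
```

The gradients come out identical to ordinary backprop; only the memory/compute trade-off changes, which is why frameworks can enable it automatically.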
Citations (3)
- Unsloth GitHub -- Unsloth training optimization framework
- Unsloth Docs -- Unsloth documentation and installation
- QLoRA Paper (arXiv) -- QLoRA quantized fine-tuning method
Source & Thanks
Created by Unsloth AI. Licensed under Apache 2.0 / AGPL-3.0. unslothai/unsloth — 58,700+ GitHub stars