ExLlamaV2 — Fast Quantized LLM Inference
ExLlamaV2 runs quantized LLMs on consumer GPUs with optimized CUDA kernels. EXL2/GPTQ/HQQ, PagedAttention, speculative decoding.
What it is
ExLlamaV2 is an inference engine for running quantized large language models on consumer NVIDIA GPUs. It uses optimized CUDA kernels to deliver high throughput with formats including EXL2, GPTQ, and HQQ. Features include PagedAttention and speculative decoding for additional speed.
ExLlamaV2 targets hobbyists and developers who want to run 7B-70B parameter models on hardware with 8-24GB VRAM. It extracts maximum performance from quantized models that other frameworks leave on the table.
The project is actively maintained, with documentation in the GitHub repository and its wiki covering installation, quantization, and the Python API.
How it saves time or tokens
ExLlamaV2's custom CUDA kernels process quantized weights faster than general-purpose frameworks. The EXL2 format allows per-layer quantization (mixing 2-bit and 6-bit) to balance quality and memory. PagedAttention reduces memory waste from KV cache, letting you run longer contexts. Speculative decoding speeds up generation by 1.5-2x for compatible models.
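As a rough sanity check on those memory/quality tradeoffs, weight memory scales linearly with bits per weight. The helper below is illustrative, not part of the ExLlamaV2 API:

```python
def weight_vram_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate VRAM for model weights alone, in decimal GB.
    Excludes KV cache and activation overhead."""
    return n_params * bits_per_weight / 8 / 1e9

# A 70B model at 4.0 bpw needs about 35 GB just for weights:
print(weight_vram_gb(70e9, 4.0))  # → 35.0

# A 7B model at 4.0 bpw fits comfortably in an 8 GB card:
print(weight_vram_gb(7e9, 4.0))  # → 3.5
```

Real usage adds a few GB on top of this for the KV cache and activations, which is why per-layer bit mixing matters on tight VRAM budgets.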
How to use
- Install ExLlamaV2 from pip or build from source with CUDA support.
- Download a quantized model in EXL2 or GPTQ format from Hugging Face.
- Load the model and run inference using the Python API or the built-in chat server.
- Configure batch size, context length, and speculative decoding based on your GPU VRAM.
Example
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2StreamingGenerator, ExLlamaV2Sampler

# Load a quantized model (a lazy cache is required for load_autosplit)
config = ExLlamaV2Config('models/Llama-3-8B-EXL2-4.0bpw/')
model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, max_seq_len=4096, lazy=True)
model.load_autosplit(cache)
tokenizer = ExLlamaV2Tokenizer(config)

# Stream tokens
generator = ExLlamaV2StreamingGenerator(model, cache, tokenizer)
generator.set_stop_conditions([tokenizer.eos_token_id])
settings = ExLlamaV2Sampler.Settings()

input_ids = tokenizer.encode('Explain transformers in 3 sentences.')
generator.begin_stream_ex(input_ids, settings)
while True:
    res = generator.stream_ex()
    print(res['chunk'], end='', flush=True)
    if res['eos']:
        break
Related on TokRepo
- Local LLM Runners — Compare ExLlamaV2 with other local inference engines.
- AI Tools for Coding — Use locally-run models for AI-assisted coding.
Common pitfalls
- Choosing a model too large for your VRAM. A 70B model at 4-bit quantization needs about 35GB VRAM. Check the model card for VRAM requirements before downloading.
- Using GPTQ when EXL2 is available. EXL2 offers better quality-per-bit through per-layer quantization. Prefer EXL2 models when they exist for your target model.
- Not setting max_seq_len appropriately. Higher context lengths consume more VRAM for KV cache. Set it to the maximum you actually need, not the model's theoretical maximum.
- Not reading the changelog before upgrading. Breaking changes between versions can cause unexpected failures in production. Pin your version and review release notes.
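The KV-cache pitfall is easy to quantify: an FP16 cache stores one K and one V vector per KV head, per layer, per token. A back-of-the-envelope estimator (illustrative, not an ExLlamaV2 function) shows why max_seq_len should match what you actually need:

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 seq_len: int, bytes_per_elem: int = 2) -> float:
    """Approximate KV cache size in GiB: 2 tensors (K and V) per layer,
    one head_dim vector per KV head per token, FP16 by default."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem / 1024**3

# Llama-3-8B-style geometry (32 layers, 8 KV heads via GQA, head_dim 128)
# at 4096 tokens of context with an FP16 cache:
print(kv_cache_gib(32, 8, 128, 4096))  # → 0.5
```

Doubling the context to 8192 doubles that figure, so oversizing max_seq_len silently eats VRAM you could spend on a higher-bpw quantization.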
Frequently Asked Questions
What GPU do I need to run ExLlamaV2?
Any NVIDIA GPU with CUDA support works. For 7B models at 4-bit, you need at least 6GB VRAM. For 70B models, you need 24-48GB VRAM depending on quantization level. AMD GPUs are not supported.
What is the EXL2 format?
EXL2 is ExLlamaV2's native quantization format. It allows mixed-precision quantization where different layers use different bit widths (2-8 bits). This preserves quality in sensitive layers while compressing others aggressively.
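The quoted bpw of an EXL2 quant (e.g. "4.0bpw") is a parameter-weighted average over those per-layer bit widths. A toy calculation with hypothetical layer sizes makes this concrete:

```python
# Hypothetical per-layer (parameter_count, bit_width) assignments;
# sensitive layers get more bits, bulky ones fewer:
layers = [(4e9, 2.5), (3e9, 4.0), (1e9, 6.0)]

total_params = sum(n for n, _ in layers)
avg_bpw = sum(n * b for n, b in layers) / total_params
print(avg_bpw)  # → 3.5
```

Two models with the same average bpw can thus differ in quality depending on which layers got the extra bits.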
How does ExLlamaV2 compare to llama.cpp?
ExLlamaV2 is GPU-only and uses custom CUDA kernels for maximum GPU throughput. llama.cpp supports CPU inference and broader hardware (Apple Silicon, AMD). ExLlamaV2 is faster on NVIDIA GPUs; llama.cpp is more portable.
Can ExLlamaV2 serve multiple users at once?
Yes. ExLlamaV2 supports batched inference with PagedAttention, allowing multiple concurrent requests to share GPU memory efficiently. This is useful for running a local API server.
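The memory-sharing claim can be illustrated with a toy page allocator in the spirit of PagedAttention. This is a sketch of the idea, not ExLlamaV2's implementation: each sequence grabs fixed-size cache pages on demand instead of reserving max_seq_len worth of cache up front.

```python
PAGE_TOKENS = 256  # tokens per cache page (illustrative value)

class PagePool:
    """Toy page allocator: sequences share one pool of cache pages."""
    def __init__(self, n_pages: int):
        self.free = list(range(n_pages))
        self.tables = {}  # seq_id -> list of page indices

    def append_token(self, seq_id: str, pos: int):
        # Allocate a new page only when a sequence crosses a page boundary
        if pos % PAGE_TOKENS == 0:
            self.tables.setdefault(seq_id, []).append(self.free.pop())

pool = PagePool(n_pages=64)

# Two concurrent sequences of very different lengths:
for pos in range(1000):
    pool.append_token("long", pos)
for pos in range(100):
    pool.append_token("short", pos)

# The short sequence holds 1 page, not a full max_seq_len reservation:
print(len(pool.tables["long"]), len(pool.tables["short"]))  # → 4 1
```

Because the short request only pins one page, the freed headroom can serve other concurrent requests, which is the core win over contiguous per-sequence cache allocation.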
Can I fine-tune models with ExLlamaV2?
No. ExLlamaV2 is an inference-only engine. For fine-tuning, use tools like Unsloth, Axolotl, or the Hugging Face Trainer. After fine-tuning, quantize the model to EXL2 format for fast inference with ExLlamaV2.
Citations (3)
- ExLlamaV2 GitHub — Fast quantized LLM inference with custom CUDA kernels
- ExLlamaV2 Wiki — EXL2 mixed-precision quantization format
- vLLM PagedAttention Paper — PagedAttention for efficient KV cache management