Scripts · Apr 1, 2026 · 1 min read

ExLlamaV2 — Fast Quantized LLM Inference

ExLlamaV2 runs quantized LLMs on consumer GPUs with optimized CUDA kernels. EXL2/GPTQ/HQQ, PagedAttention, speculative decoding.

TL;DR
ExLlamaV2 runs quantized LLMs on consumer GPUs using optimized CUDA kernels with EXL2, GPTQ, and HQQ formats.
§01

What it is

ExLlamaV2 is an inference engine for running quantized large language models on consumer NVIDIA GPUs. It uses optimized CUDA kernels to deliver high throughput with formats including EXL2, GPTQ, and HQQ. Features include PagedAttention and speculative decoding for additional speed.

ExLlamaV2 targets hobbyists and developers who want to run 7B-70B parameter models on hardware with 8-24GB of VRAM, squeezing out performance from quantized models that more general-purpose frameworks leave on the table.

The project is actively maintained, and documentation and community support make onboarding straightforward for individual developers and for teams integrating it into an existing toolchain.

§02

How it saves time or tokens

ExLlamaV2's custom CUDA kernels process quantized weights faster than general-purpose frameworks. The EXL2 format allows per-layer quantization (mixing 2-bit and 6-bit) to balance quality and memory. PagedAttention reduces memory waste from KV cache, letting you run longer contexts. Speculative decoding speeds up generation by 1.5-2x for compatible models.
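
To make those numbers concrete, here is a back-of-the-envelope sketch in plain Python (no ExLlamaV2 API involved; the parameter counts are illustrative) estimating weight memory at a given average bits per weight:

def weight_vram_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate VRAM for model weights alone (excludes KV cache and overhead)."""
    return n_params * bits_per_weight / 8 / 1e9

# A 70B model at a uniform 4.0 bits per weight:
print(weight_vram_gb(70e9, 4.0))  # ~35.0 GB, the figure cited under Common pitfalls
# The same model with EXL2 mixing 2-bit and 6-bit layers to an average of 3.0 bpw:
print(weight_vram_gb(70e9, 3.0))  # ~26.25 GB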

§03

How to use

  1. Install ExLlamaV2 with pip, or build from source with CUDA support.
  2. Download a quantized model in EXL2 or GPTQ format from Hugging Face (see the download sketch after this list).
  3. Load the model and run inference using the Python API or the built-in chat server.
  4. Configure batch size, context length, and speculative decoding based on your GPU VRAM.
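
As a sketch of step 2, the model can be fetched with huggingface_hub's snapshot_download; the repository name below is hypothetical, and EXL2 repos often keep each bitrate on its own branch:

from huggingface_hub import snapshot_download

# Hypothetical repo id; substitute a real EXL2 quantization of your target model
snapshot_download(
    repo_id='turboderp/Llama-3-8B-exl2',
    revision='4.0bpw',  # many EXL2 repos store each bitrate on a separate branch
    local_dir='models/Llama-3-8B-EXL2-4.0bpw',
)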
§04

Example

from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2StreamingGenerator, ExLlamaV2Sampler

# Load a quantized model; the cache is created lazily so load_autosplit
# can allocate it while splitting layers across the available GPUs
config = ExLlamaV2Config('models/Llama-3-8B-EXL2-4.0bpw/')
model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, max_seq_len=4096, lazy=True)
model.load_autosplit(cache)
tokenizer = ExLlamaV2Tokenizer(config)

# Stream tokens until the model emits EOS
generator = ExLlamaV2StreamingGenerator(model, cache, tokenizer)
generator.set_stop_conditions([tokenizer.eos_token_id])

input_ids = tokenizer.encode('Explain transformers in 3 sentences.')
generator.begin_stream_ex(input_ids, ExLlamaV2Sampler.Settings())
while True:
    res = generator.stream_ex()
    print(res['chunk'], end='', flush=True)
    if res['eos']:
        break
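
To try the speculative decoding mentioned in §02, the dynamic generator can pair the main model with a small draft model that shares its vocabulary. A sketch building on the example above; the draft-model path is hypothetical, and the draft_model/draft_cache parameters follow the dynamic-generator examples in the ExLlamaV2 repository (verify against your installed version):

from exllamav2.generator import ExLlamaV2DynamicGenerator

# The draft model must share the target model's tokenizer and vocabulary
draft_config = ExLlamaV2Config('models/Llama-3.2-1B-EXL2-4.0bpw/')  # hypothetical path
draft_model = ExLlamaV2(draft_config)
draft_cache = ExLlamaV2Cache(draft_model, max_seq_len=4096, lazy=True)
draft_model.load_autosplit(draft_cache)

# model, cache, and tokenizer come from the streaming example above
generator = ExLlamaV2DynamicGenerator(
    model=model, cache=cache, tokenizer=tokenizer,
    draft_model=draft_model, draft_cache=draft_cache,
)
print(generator.generate(prompt='Explain transformers in 3 sentences.', max_new_tokens=200))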
§05

Common pitfalls

  • Choosing a model too large for your VRAM. A 70B model at 4-bit quantization needs about 35GB VRAM. Check the model card for VRAM requirements before downloading.
  • Using GPTQ when EXL2 is available. EXL2 offers better quality-per-bit through per-layer quantization. Prefer EXL2 models when they exist for your target model.
  • Not setting max_seq_len appropriately. Higher context lengths consume more VRAM for KV cache (see the sizing sketch after this list). Set it to the maximum you actually need, not the model's theoretical maximum.
  • Not reading the changelog before upgrading. Breaking changes between versions can cause unexpected failures in production. Pin your version and review release notes.
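
To see why max_seq_len matters, a rough FP16 KV-cache sizing sketch; the architecture numbers below are Llama-3-8B's published shape (32 layers, 8 KV heads via GQA, head dimension 128), and the formula is the standard keys-plus-values count, not an ExLlamaV2 API:

def kv_cache_gb(n_layers, n_kv_heads, head_dim, max_seq_len,
                batch_size=1, bytes_per_elem=2):
    """FP16 KV cache: 2 tensors (K and V) per layer, per head, per position."""
    return (2 * n_layers * n_kv_heads * head_dim
            * max_seq_len * batch_size * bytes_per_elem) / 1e9

print(kv_cache_gb(32, 8, 128, 4096))   # ~0.54 GB at 4k context
print(kv_cache_gb(32, 8, 128, 32768))  # ~4.3 GB at 32k, per sequence in the batch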

Frequently Asked Questions

What GPU do I need for ExLlamaV2?

Any NVIDIA GPU with CUDA support works. For 7B models at 4-bit, you need at least 6GB VRAM. For 70B models, you need 24-48GB VRAM depending on quantization level. AMD GPUs are not supported.

What is EXL2 quantization?

EXL2 is ExLlamaV2's native quantization format. It allows mixed-precision quantization where different layers use different bit widths (2-8 bits). This preserves quality in sensitive layers while compressing others aggressively.

How does ExLlamaV2 compare to llama.cpp?

ExLlamaV2 is GPU-only and uses custom CUDA kernels for maximum GPU throughput. llama.cpp supports CPU inference and broader hardware (Apple Silicon, AMD). ExLlamaV2 is faster on NVIDIA GPUs; llama.cpp is more portable.

Does ExLlamaV2 support batched inference?

Yes. ExLlamaV2 supports batched inference with PagedAttention, allowing multiple concurrent requests to share GPU memory efficiently. This is useful for running a local API server.
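
A minimal batched-generation sketch, assuming a generator built with ExLlamaV2DynamicGenerator as in the speculative-decoding sketch above (list-of-prompts input follows the batching examples in the repository; treat this as a sketch rather than the definitive API):

prompts = [
    'Summarize PagedAttention in one sentence.',
    'What does 4.0 bpw mean?',
    'Name one benefit of speculative decoding.',
]
# The dynamic generator schedules the requests together and shares KV pages
outputs = generator.generate(prompt=prompts, max_new_tokens=100)
for out in outputs:
    print(out)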

Can I fine-tune models with ExLlamaV2?

No. ExLlamaV2 is an inference-only engine. For fine-tuning, use tools like Unsloth, Axolotl, or the Hugging Face Trainer. After fine-tuning, quantize the model to EXL2 format for fast inference with ExLlamaV2.
