ExLlamaV2 — Fast Quantized LLM Inference
ExLlamaV2 runs quantized LLMs on consumer GPUs with optimized CUDA kernels. EXL2/GPTQ/HQQ, PagedAttention, speculative decoding.
What it is
ExLlamaV2 is an inference engine for running quantized large language models on consumer NVIDIA GPUs. It uses optimized CUDA kernels to deliver high throughput with formats including EXL2, GPTQ, and HQQ. Features include PagedAttention and speculative decoding for additional speed.
ExLlamaV2 targets hobbyists and developers who want to run 7B-70B parameter models on hardware with 8-24GB VRAM. It extracts maximum performance from quantized models that other frameworks leave on the table.
The project is actively maintained, with documentation in the GitHub repository and its wiki covering installation, quantization, and the Python API.
How it saves time or tokens
ExLlamaV2's custom CUDA kernels process quantized weights faster than general-purpose frameworks. The EXL2 format allows per-layer quantization (mixing 2-bit and 6-bit) to balance quality and memory. PagedAttention reduces memory waste from KV cache, letting you run longer contexts. Speculative decoding speeds up generation by 1.5-2x for compatible models.
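As a rough sanity check on those memory/quality tradeoffs, weight memory scales linearly with bits per weight. The helper below is illustrative, not part of the ExLlamaV2 API:

```python
def weight_vram_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate VRAM for model weights alone, in decimal GB.
    Excludes KV cache and activation overhead."""
    return n_params * bits_per_weight / 8 / 1e9

# A 70B model at 4.0 bpw needs about 35 GB just for weights:
print(weight_vram_gb(70e9, 4.0))  # → 35.0

# A 7B model at 4.0 bpw fits comfortably in an 8 GB card:
print(weight_vram_gb(7e9, 4.0))  # → 3.5
```

Real usage adds a few GB on top of this for the KV cache and activations, which is why per-layer bit mixing matters on tight VRAM budgets.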
How to use
- Install ExLlamaV2 from pip or build from source with CUDA support.
- Download a quantized model in EXL2 or GPTQ format from Hugging Face.
- Load the model and run inference using the Python API or the built-in chat server.
- Configure batch size, context length, and speculative decoding based on your GPU VRAM.
Example
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2StreamingGenerator, ExLlamaV2Sampler

# Load a quantized model (a lazy cache is required for load_autosplit)
config = ExLlamaV2Config('models/Llama-3-8B-EXL2-4.0bpw/')
model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, max_seq_len=4096, lazy=True)
model.load_autosplit(cache)
tokenizer = ExLlamaV2Tokenizer(config)

# Stream tokens
generator = ExLlamaV2StreamingGenerator(model, cache, tokenizer)
generator.set_stop_conditions([tokenizer.eos_token_id])
settings = ExLlamaV2Sampler.Settings()

input_ids = tokenizer.encode('Explain transformers in 3 sentences.')
generator.begin_stream_ex(input_ids, settings)
while True:
    res = generator.stream_ex()
    print(res['chunk'], end='', flush=True)
    if res['eos']:
        break
Related on TokRepo
- Local LLM Runners — Compare ExLlamaV2 with other local inference engines.
- AI Tools for Coding — Use locally-run models for AI-assisted coding.
Common pitfalls
- Choosing a model too large for your VRAM. A 70B model at 4-bit quantization needs about 35GB VRAM. Check the model card for VRAM requirements before downloading.
- Using GPTQ when EXL2 is available. EXL2 offers better quality-per-bit through per-layer quantization. Prefer EXL2 models when they exist for your target model.
- Not setting max_seq_len appropriately. Higher context lengths consume more VRAM for KV cache. Set it to the maximum you actually need, not the model's theoretical maximum.
- Not reading the changelog before upgrading. Breaking changes between versions can cause unexpected failures in production. Pin your version and review release notes.
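The KV-cache pitfall is easy to quantify: an FP16 cache stores one K and one V vector per KV head, per layer, per token. A back-of-the-envelope estimator (illustrative, not an ExLlamaV2 function) shows why max_seq_len should match what you actually need:

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 seq_len: int, bytes_per_elem: int = 2) -> float:
    """Approximate KV cache size in GiB: 2 tensors (K and V) per layer,
    one head_dim vector per KV head per token, FP16 by default."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem / 1024**3

# Llama-3-8B-style geometry (32 layers, 8 KV heads via GQA, head_dim 128)
# at 4096 tokens of context with an FP16 cache:
print(kv_cache_gib(32, 8, 128, 4096))  # → 0.5
```

Doubling the context to 8192 doubles that figure, so oversizing max_seq_len silently eats VRAM you could spend on a higher-bpw quantization.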
Frequently Asked Questions
What GPU do I need to run ExLlamaV2?
Any NVIDIA GPU with CUDA support works. For 7B models at 4-bit, you need at least 6GB VRAM. For 70B models, you need 24-48GB VRAM depending on quantization level. AMD GPUs are not supported.
What is the EXL2 format?
EXL2 is ExLlamaV2's native quantization format. It allows mixed-precision quantization where different layers use different bit widths (2-8 bits). This preserves quality in sensitive layers while compressing others aggressively.
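The quoted bpw of an EXL2 quant (e.g. "4.0bpw") is a parameter-weighted average over those per-layer bit widths. A toy calculation with hypothetical layer sizes makes this concrete:

```python
# Hypothetical per-layer (parameter_count, bit_width) assignments;
# sensitive layers get more bits, bulky ones fewer:
layers = [(4e9, 2.5), (3e9, 4.0), (1e9, 6.0)]

total_params = sum(n for n, _ in layers)
avg_bpw = sum(n * b for n, b in layers) / total_params
print(avg_bpw)  # → 3.5
```

Two models with the same average bpw can thus differ in quality depending on which layers got the extra bits.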
How does ExLlamaV2 compare to llama.cpp?
ExLlamaV2 is GPU-only and uses custom CUDA kernels for maximum GPU throughput. llama.cpp supports CPU inference and broader hardware (Apple Silicon, AMD). ExLlamaV2 is faster on NVIDIA GPUs; llama.cpp is more portable.
Can ExLlamaV2 serve multiple users at once?
Yes. ExLlamaV2 supports batched inference with PagedAttention, allowing multiple concurrent requests to share GPU memory efficiently. This is useful for running a local API server.
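The memory-sharing claim can be illustrated with a toy page allocator in the spirit of PagedAttention. This is a sketch of the idea, not ExLlamaV2's implementation: each sequence grabs fixed-size cache pages on demand instead of reserving max_seq_len worth of cache up front.

```python
PAGE_TOKENS = 256  # tokens per cache page (illustrative value)

class PagePool:
    """Toy page allocator: sequences share one pool of cache pages."""
    def __init__(self, n_pages: int):
        self.free = list(range(n_pages))
        self.tables = {}  # seq_id -> list of page indices

    def append_token(self, seq_id: str, pos: int):
        # Allocate a new page only when a sequence crosses a page boundary
        if pos % PAGE_TOKENS == 0:
            self.tables.setdefault(seq_id, []).append(self.free.pop())

pool = PagePool(n_pages=64)

# Two concurrent sequences of very different lengths:
for pos in range(1000):
    pool.append_token("long", pos)
for pos in range(100):
    pool.append_token("short", pos)

# The short sequence holds 1 page, not a full max_seq_len reservation:
print(len(pool.tables["long"]), len(pool.tables["short"]))  # → 4 1
```

Because the short request only pins one page, the freed headroom can serve other concurrent requests, which is the core win over contiguous per-sequence cache allocation.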
Can I fine-tune models with ExLlamaV2?
No. ExLlamaV2 is an inference-only engine. For fine-tuning, use tools like Unsloth, Axolotl, or the Hugging Face Trainer. After fine-tuning, quantize the model to EXL2 format for fast inference with ExLlamaV2.
Citations (3)
- ExLlamaV2 GitHub — Fast quantized LLM inference with custom CUDA kernels
- ExLlamaV2 Wiki — EXL2 mixed-precision quantization format
- vLLM PagedAttention Paper — PagedAttention for efficient KV cache management