# ExLlamaV2 — Fast Quantized LLM Inference

> ExLlamaV2 runs quantized LLMs on consumer GPUs with optimized CUDA kernels. EXL2/GPTQ/HQQ quantization, PagedAttention, speculative decoding.

## Install

```bash
pip install exllamav2
```

---

## Intro

ExLlamaV2 is a high-performance inference library for running quantized LLMs on consumer NVIDIA GPUs. It provides optimized CUDA kernels for fast token generation, EXL2/GPTQ/HQQ quantization, PagedAttention, dynamic batching, speculative decoding, and a built-in chat server. It is widely used as a backend in text-generation-webui.

**Best for**: Users running quantized LLMs on consumer GPUs

**Works with**: Claude Code, OpenAI Codex, Cursor, Gemini CLI, Windsurf

---

## Key Features

- Optimized CUDA kernels
- EXL2, GPTQ, HQQ quantization
- PagedAttention for memory efficiency
- Dynamic batching and speculative decoding
- Built-in chat server
- text-generation-webui backend

---

## FAQ

**Q: What is ExLlamaV2?**

A: A fast inference library for quantized LLMs on consumer NVIDIA GPUs, built around optimized CUDA kernels, EXL2/GPTQ/HQQ quantization, and PagedAttention.

**Q: How do I install it?**

A: `pip install exllamav2`. Requires an NVIDIA GPU.

---

## Source & Thanks

> [turboderp/exllamav2](https://github.com/turboderp/exllamav2)

---

Source: https://tokrepo.com/en/workflows/556eded4-26f7-4c21-a701-b6c6a117852b
Author: TokRepo Picks
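Speculative decoding, one of the features listed above, can be illustrated with a generic toy sketch: a cheap draft model proposes a short run of tokens, and the target model verifies them, keeping the longest agreeing prefix. This is an illustrative pure-Python sketch of the greedy variant of the technique, not ExLlamaV2's actual implementation; the `target`/`draft` callables are hypothetical stand-ins for real models.

```python
def speculative_decode(target, draft, prompt, n_tokens, k=4):
    """Toy greedy speculative decoding.

    `target` and `draft` map a token sequence to the next token
    (hypothetical stand-ins for real models). The draft model proposes
    `k` tokens; the target model verifies them, keeps the longest
    matching prefix, and appends its own correction on a mismatch.
    """
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        # Draft model proposes k tokens cheaply.
        proposal, ctx = [], list(out)
        for _ in range(k):
            t = draft(ctx)
            proposal.append(t)
            ctx.append(t)
        # Target model checks each proposed token.
        accepted = 0
        for i, t in enumerate(proposal):
            if target(out + proposal[:i]) == t:
                accepted += 1
            else:
                break
        out.extend(proposal[:accepted])
        if accepted < k:
            # Mismatch: take the target model's own token instead.
            out.append(target(out))
    return out[len(prompt):][:n_tokens]

# Toy demo: the target emits (last + 1) mod 10; the draft agrees
# except it errs after a 7, forcing a correction step.
target = lambda seq: (seq[-1] + 1) % 10
draft = lambda seq: 0 if seq[-1] == 7 else (seq[-1] + 1) % 10
print(speculative_decode(target, draft, [0], 12))
# → [1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2]
```

When the draft agrees with the target (as it mostly does here), several tokens are committed per target-model call, which is where the speedup comes from in real systems.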
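The memory-efficiency benefit of PagedAttention can likewise be sketched with a toy allocator: the KV cache is split into fixed-size pages, and a sequence acquires a page only when its current one fills, instead of reserving the full context length up front. This is a minimal sketch of the general idea; the class name, page size, and methods are illustrative, not ExLlamaV2's API.

```python
class PagedKVCache:
    """Toy paged KV-cache allocator, in the spirit of PagedAttention.

    Pages are fixed-size; each sequence holds a list of page indices
    and allocates a new page only when the current one is full.
    Names and sizes here are illustrative, not ExLlamaV2's.
    """

    def __init__(self, num_pages, page_size):
        self.page_size = page_size
        self.free_pages = list(range(num_pages))
        self.page_table = {}   # seq_id -> list of page indices
        self.lengths = {}      # seq_id -> tokens cached so far

    def append_token(self, seq_id):
        """Record one more cached token for `seq_id`, paging as needed."""
        pages = self.page_table.setdefault(seq_id, [])
        n = self.lengths.get(seq_id, 0)
        if n % self.page_size == 0:      # current page full, or none yet
            if not self.free_pages:
                raise MemoryError("KV cache exhausted")
            pages.append(self.free_pages.pop())
        self.lengths[seq_id] = n + 1

    def free_sequence(self, seq_id):
        """Return a finished sequence's pages to the free pool."""
        self.free_pages.extend(self.page_table.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

cache = PagedKVCache(num_pages=8, page_size=16)
for _ in range(33):                      # 33 tokens -> ceil(33/16) = 3 pages
    cache.append_token("seq-a")
print(len(cache.page_table["seq-a"]), len(cache.free_pages))
# → 3 5
```

Freeing a finished sequence returns its pages to the pool immediately, which is what lets a paged cache serve many concurrent sequences without fragmenting GPU memory.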