# ExLlamaV2 — Fast Quantized LLM Inference

> ExLlamaV2 runs quantized LLMs on consumer GPUs with optimized CUDA kernels. EXL2/GPTQ/HQQ quantization, PagedAttention, speculative decoding.

## Install

```bash
pip install exllamav2
```

---

## Intro

ExLlamaV2 is a high-performance inference library for running quantized LLMs on consumer NVIDIA GPUs. It provides optimized CUDA kernels for fast token generation, EXL2/GPTQ/HQQ quantization, PagedAttention, dynamic batching, speculative decoding, and a built-in chat server. It is widely used as a backend in text-generation-webui.

**Best for**: Users running quantized LLMs on consumer GPUs

**Works with**: Claude Code, OpenAI Codex, Cursor, Gemini CLI, Windsurf

---

## Key Features

- Optimized CUDA kernels
- EXL2, GPTQ, HQQ quantization
- PagedAttention for memory efficiency
- Dynamic batching and speculative decoding
- Built-in chat server
- text-generation-webui backend

---

## FAQ

**Q: What is ExLlamaV2?**

A: A fast inference library for quantized LLMs on consumer NVIDIA GPUs, built around optimized CUDA kernels, EXL2/GPTQ/HQQ quantization, and PagedAttention.

**Q: How do I install it?**

A: `pip install exllamav2`. Requires an NVIDIA GPU.

---

## Source & Thanks

> [turboderp/exllamav2](https://github.com/turboderp/exllamav2)

---

Source: https://tokrepo.com/en/workflows/556eded4-26f7-4c21-a701-b6c6a117852b
Author: TokRepo Picks
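Speculative decoding, one of the features listed above, can be illustrated with a generic toy sketch: a cheap draft model proposes a short run of tokens, and the target model verifies them, keeping the longest agreeing prefix. This is an illustrative pure-Python sketch of the greedy variant of the technique, not ExLlamaV2's actual implementation; the `target`/`draft` callables are hypothetical stand-ins for real models.

```python
def speculative_decode(target, draft, prompt, n_tokens, k=4):
    """Toy greedy speculative decoding.

    `target` and `draft` map a token sequence to the next token
    (hypothetical stand-ins for real models). The draft model proposes
    `k` tokens; the target model verifies them, keeps the longest
    matching prefix, and appends its own correction on a mismatch.
    """
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        # Draft model proposes k tokens cheaply.
        proposal, ctx = [], list(out)
        for _ in range(k):
            t = draft(ctx)
            proposal.append(t)
            ctx.append(t)
        # Target model checks each proposed token.
        accepted = 0
        for i, t in enumerate(proposal):
            if target(out + proposal[:i]) == t:
                accepted += 1
            else:
                break
        out.extend(proposal[:accepted])
        if accepted < k:
            # Mismatch: take the target model's own token instead.
            out.append(target(out))
    return out[len(prompt):][:n_tokens]

# Toy demo: the target emits (last + 1) mod 10; the draft agrees
# except it errs after a 7, forcing a correction step.
target = lambda seq: (seq[-1] + 1) % 10
draft = lambda seq: 0 if seq[-1] == 7 else (seq[-1] + 1) % 10
print(speculative_decode(target, draft, [0], 12))
# → [1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2]
```

When the draft agrees with the target (as it mostly does here), several tokens are committed per target-model call, which is where the speedup comes from in real systems.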
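The memory-efficiency benefit of PagedAttention can likewise be sketched with a toy allocator: the KV cache is split into fixed-size pages, and a sequence acquires a page only when its current one fills, instead of reserving the full context length up front. This is a minimal sketch of the general idea; the class name, page size, and methods are illustrative, not ExLlamaV2's API.

```python
class PagedKVCache:
    """Toy paged KV-cache allocator, in the spirit of PagedAttention.

    Pages are fixed-size; each sequence holds a list of page indices
    and allocates a new page only when the current one is full.
    Names and sizes here are illustrative, not ExLlamaV2's.
    """

    def __init__(self, num_pages, page_size):
        self.page_size = page_size
        self.free_pages = list(range(num_pages))
        self.page_table = {}   # seq_id -> list of page indices
        self.lengths = {}      # seq_id -> tokens cached so far

    def append_token(self, seq_id):
        """Record one more cached token for `seq_id`, paging as needed."""
        pages = self.page_table.setdefault(seq_id, [])
        n = self.lengths.get(seq_id, 0)
        if n % self.page_size == 0:      # current page full, or none yet
            if not self.free_pages:
                raise MemoryError("KV cache exhausted")
            pages.append(self.free_pages.pop())
        self.lengths[seq_id] = n + 1

    def free_sequence(self, seq_id):
        """Return a finished sequence's pages to the free pool."""
        self.free_pages.extend(self.page_table.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

cache = PagedKVCache(num_pages=8, page_size=16)
for _ in range(33):                      # 33 tokens -> ceil(33/16) = 3 pages
    cache.append_token("seq-a")
print(len(cache.page_table["seq-a"]), len(cache.free_pages))
# → 3 5
```

Freeing a finished sequence returns its pages to the pool immediately, which is what lets a paged cache serve many concurrent sequences without fragmenting GPU memory.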