Scripts · Apr 1, 2026 · 1 min read

ExLlamaV2 — Fast Quantized LLM Inference

ExLlamaV2 runs quantized LLMs on consumer GPUs with optimized CUDA kernels, supporting EXL2/GPTQ/HQQ quantization, PagedAttention, and speculative decoding.

Introduction

ExLlamaV2 is a high-performance inference library for running quantized LLMs on consumer NVIDIA GPUs. It provides optimized CUDA kernels for fast token generation, support for EXL2, GPTQ, and HQQ quantization, PagedAttention, dynamic batching, speculative decoding, and a built-in chat server. It is widely used as a backend in text-generation-webui.

Best for: Users running quantized LLMs on consumer GPUs
Works with: Claude Code, OpenAI Codex, Cursor, Gemini CLI, Windsurf


Key Features

  • Optimized CUDA kernels
  • EXL2, GPTQ, HQQ quantization
  • PagedAttention for memory efficiency
  • Dynamic batching and speculative decoding
  • Built-in chat server
  • text-generation-webui backend
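
The features above are exposed through the library's Python API. A minimal single-prompt sketch using the dynamic generator might look like the following; the model path is a placeholder, and an NVIDIA GPU plus a locally downloaded EXL2-quantized model are assumed:

```python
# Minimal ExLlamaV2 generation sketch (assumes an NVIDIA GPU and an
# EXL2-quantized model directory; the path below is a placeholder).
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2DynamicGenerator

config = ExLlamaV2Config("/path/to/exl2-model")   # placeholder model directory
model = ExLlamaV2(config)

# Lazy cache + autosplit spreads the model across available GPU memory.
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)

tokenizer = ExLlamaV2Tokenizer(config)

# The dynamic generator handles batching/paging internally.
generator = ExLlamaV2DynamicGenerator(model=model, cache=cache, tokenizer=tokenizer)
output = generator.generate(prompt="Once upon a time,", max_new_tokens=128)
print(output)
```

The same generator object can accept multiple concurrent jobs, which is where the dynamic batching and paged cache pay off.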

FAQ

Q: What is ExLlamaV2? A: A fast inference library for running quantized LLMs on consumer NVIDIA GPUs, built on optimized CUDA kernels with EXL2/GPTQ/HQQ quantization and PagedAttention.

Q: How do I install it? A: Run pip install exllamav2. An NVIDIA GPU with a working CUDA setup is required.


