Scripts · April 1, 2026 · 1 min read

ExLlamaV2 — Fast Quantized LLM Inference

ExLlamaV2 runs quantized LLMs on consumer GPUs with optimized CUDA kernels. It supports EXL2/GPTQ/HQQ quantization, PagedAttention, and speculative decoding.

Introduction

ExLlamaV2 is a high-performance inference library for running quantized LLMs on consumer NVIDIA GPUs. It provides optimized CUDA kernels for fast token generation, EXL2/GPTQ/HQQ quantization, PagedAttention, dynamic batching, speculative decoding, and a built-in chat server, and it is widely used as a backend in text-generation-webui.
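To see why speculative decoding speeds up generation, here is a toy pure-Python sketch of the idea (this is a conceptual illustration with greedy acceptance, not ExLlamaV2's actual implementation; the two lambdas stand in for real models): a cheap draft model proposes several tokens, the full target model verifies them in one pass, and the longest matching prefix is kept, so multiple tokens can be accepted per expensive target-model step.

```python
def speculative_step(target, draft, prefix, k):
    """Return the tokens accepted in one speculative step (greedy rule)."""
    # Draft model proposes k tokens autoregressively (cheap).
    proposed = []
    ctx = list(prefix)
    for _ in range(k):
        t = draft(ctx)
        proposed.append(t)
        ctx.append(t)

    # Target model verifies: keep proposals while they match what the
    # target would have produced; on the first mismatch, emit the
    # target's own token instead and stop.
    accepted = []
    ctx = list(prefix)
    for t in proposed:
        expected = target(ctx)
        if t == expected:
            accepted.append(t)
            ctx.append(t)
        else:
            accepted.append(expected)
            break
    else:
        # All k proposals matched; the target contributes one bonus token.
        accepted.append(target(ctx))
    return accepted

# Demo: the target counts up from the last token; the draft agrees
# everywhere except after token 2, where it guesses wrong.
target = lambda ctx: ctx[-1] + 1
draft = lambda ctx: ctx[-1] + 1 if ctx[-1] != 2 else 99

print(speculative_step(target, draft, [0], 4))  # → [1, 2, 3]
```

Output is always identical to what the target model alone would produce; the draft model only changes how many target-model passes are needed.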

Best for: users running quantized LLMs on consumer GPUs
Works with: Claude Code, OpenAI Codex, Cursor, Gemini CLI, Windsurf


Key Features

  • Optimized CUDA kernels
  • EXL2, GPTQ, HQQ quantization
  • PagedAttention for memory efficiency
  • Dynamic batching and speculative decoding
  • Built-in chat server
  • text-generation-webui backend
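The PagedAttention feature above stores the KV cache in fixed-size blocks indexed by a per-sequence block table, so memory grows on demand instead of being reserved up front for the maximum sequence length. A toy sketch of that bookkeeping (assumed semantics for illustration, not ExLlamaV2's CUDA implementation):

```python
BLOCK = 4  # tokens per physical cache block

class PagedKVCache:
    """Toy block-table bookkeeping in the style of PagedAttention."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))   # pool of physical blocks
        self.tables = {}                      # seq_id -> list of physical blocks
        self.lengths = {}                     # seq_id -> tokens stored

    def append_token(self, seq_id):
        table = self.tables.setdefault(seq_id, [])
        n = self.lengths.get(seq_id, 0)
        if n % BLOCK == 0:                    # current block is full (or first token):
            table.append(self.free.pop())     # allocate one more physical block
        self.lengths[seq_id] = n + 1

    def location(self, seq_id, pos):
        """Map a logical token position to a physical (block, offset)."""
        return self.tables[seq_id][pos // BLOCK], pos % BLOCK

cache = PagedKVCache(num_blocks=8)
for _ in range(6):
    cache.append_token("seq0")
# 6 tokens with BLOCK=4 occupy only 2 physical blocks.
print(len(cache.tables["seq0"]))  # → 2
print(cache.location("seq0", 5))  # token 5 lands in the second block, offset 1
```

The point is that a sequence's cache no longer needs to be contiguous, so short sequences in a batch stop wasting memory reserved for the longest one.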

FAQ

Q: What is ExLlamaV2? A: A fast inference library for quantized LLMs on consumer NVIDIA GPUs, built on optimized CUDA kernels with EXL2/GPTQ/HQQ quantization support and PagedAttention.

Q: How do I install it? A: Run pip install exllamav2. An NVIDIA GPU is required.
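After installing, a minimal generation script looks roughly like the sketch below, based on the dynamic-generator examples in the ExLlamaV2 repository. The model path is a placeholder for a local quantized (e.g. EXL2) model directory, and the API can shift between versions, so treat this as a starting point and check the repo's current examples.

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2DynamicGenerator

# Placeholder path to a quantized model directory (e.g. an EXL2 export).
config = ExLlamaV2Config("/path/to/exl2-model")
model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)            # split layers across available GPUs
tokenizer = ExLlamaV2Tokenizer(config)

generator = ExLlamaV2DynamicGenerator(model=model, cache=cache, tokenizer=tokenizer)
print(generator.generate(
    prompt="Explain PagedAttention in one sentence.",
    max_new_tokens=64,
))
```

Running this requires an NVIDIA GPU and downloaded model weights, so it is shown here as an untested sketch.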

