Scripts · April 1, 2026 · 1 min read

ExLlamaV2 — Fast Quantized LLM Inference

ExLlamaV2 runs quantized LLMs on consumer GPUs with optimized CUDA kernels, supporting EXL2/GPTQ/HQQ quantization, PagedAttention, and speculative decoding.

TokRepo Picks · Community
## Quick Start

Use it first, then decide whether to dig deeper.

This section should tell both users and agents what to copy first, what to install, and where it goes.

```bash
pip install exllamav2
```

---
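Once installed, loading a model and generating text looks roughly like the sketch below. This is a hedged example based on the library's dynamic generator API; the model directory path is a placeholder, it requires an NVIDIA GPU plus downloaded quantized weights, and exact signatures may vary between versions.

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2DynamicGenerator

# Placeholder path: point this at a local EXL2/GPTQ model directory
config = ExLlamaV2Config("/path/to/quantized-model")

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)  # allocate cache during autosplit load
model.load_autosplit(cache)               # split layers across available GPUs

tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2DynamicGenerator(model=model, cache=cache, tokenizer=tokenizer)

output = generator.generate(prompt="Hello, my name is", max_new_tokens=50)
print(output)
```

Treat this as a starting point and check the repository's examples for the API of the version you install.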
## Introduction
ExLlamaV2 is a high-performance inference library for running quantized LLMs on consumer NVIDIA GPUs. It provides optimized CUDA kernels for fast token generation, EXL2/GPTQ/HQQ quantization, PagedAttention, dynamic batching, speculative decoding, and a built-in chat server. It is widely used as a backend in text-generation-webui.

**Best for**: Users running quantized LLMs on consumer GPUs
**Works with**: Claude Code, OpenAI Codex, Cursor, Gemini CLI, Windsurf

---
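PagedAttention, mentioned above, keeps the KV cache in fixed-size physical blocks allocated on demand rather than reserving memory for the maximum sequence length up front. The toy bookkeeping below (not ExLlamaV2's actual implementation, just an illustration of the idea) shows how a per-sequence block table maps growing sequences onto a shared pool of blocks.

```python
class PagedKVCache:
    """Toy PagedAttention bookkeeping: each sequence's KV cache is a list of
    fixed-size physical blocks grabbed from a shared free pool, so memory
    grows in block-sized steps and is returned when a sequence finishes."""

    def __init__(self, block_size=256, num_blocks=64):
        self.block_size = block_size
        self.free = list(range(num_blocks))  # ids of unused physical blocks
        self.tables = {}                     # seq_id -> list of block ids
        self.lengths = {}                    # seq_id -> tokens stored

    def append_token(self, seq_id):
        n = self.lengths.get(seq_id, 0)
        table = self.tables.setdefault(seq_id, [])
        if n % self.block_size == 0:         # current block full (or none yet)
            table.append(self.free.pop())    # allocate one more physical block
        self.lengths[seq_id] = n + 1

    def release(self, seq_id):
        # Return the sequence's blocks to the pool for reuse
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)
```

With a block size of 4, storing 5 tokens needs exactly 2 blocks, and releasing the sequence returns both to the pool.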
## Key Features

- Optimized CUDA kernels
- EXL2, GPTQ, HQQ quantization
- PagedAttention for memory efficiency
- Dynamic batching and speculative decoding
- Built-in chat server
- text-generation-webui backend

---

### FAQ

**Q: What is ExLlamaV2?**
A: A fast inference library for quantized LLMs on consumer NVIDIA GPUs, with optimized CUDA kernels, EXL2/GPTQ/HQQ quantization, and PagedAttention.

**Q: How do I install it?**
A: `pip install exllamav2`. Requires an NVIDIA GPU.

---
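Speculative decoding, listed among the features above, works by having a small draft model propose several tokens which the large target model then verifies in one batched pass. The pure-Python sketch below illustrates only the greedy acceptance rule (not ExLlamaV2's internals): keep the longest prefix where draft and target agree, then substitute the target's own token at the first disagreement.

```python
def accept_speculated(draft, target):
    """Greedy speculative-decoding acceptance: `draft` is the token list
    proposed by the small model, `target` the tokens the large model would
    emit at the same positions. Returns the tokens actually committed."""
    accepted = []
    for d, t in zip(draft, target):
        if d == t:
            accepted.append(d)     # draft guessed right: token is free
        else:
            accepted.append(t)     # target's correction; stop speculating
            break
    return accepted
```

When the draft model is right, several tokens are committed for the cost of one target-model forward pass, which is where the speedup comes from.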
🙏

## Sources & Acknowledgements

> [turboderp/exllamav2](https://github.com/turboderp/exllamav2)

