Scripts · April 1, 2026 · 1 min read
ExLlamaV2 — Fast Quantized LLM Inference
ExLlamaV2 runs quantized LLMs on consumer GPUs with optimized CUDA kernels. EXL2/GPTQ/HQQ, PagedAttention, speculative decoding.
TokRepo Featured · Community
Quick Start
Try it first, then decide whether to dig deeper.
One command tells both users and agents what to install:
```bash
pip install exllamav2
```
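Once installed, a minimal generation loop looks roughly like the sketch below. It follows the exllamav2 project's documented Python API (`ExLlamaV2Config`, `ExLlamaV2Cache`, `ExLlamaV2DynamicGenerator`), but treat it as a sketch: the model path is a placeholder, an NVIDIA GPU and an EXL2-quantized model directory are required, and details may differ between library versions.

```python
def run_demo(model_dir: str, prompt: str) -> str:
    """Load an EXL2-quantized model and generate a completion.

    Sketch based on the exllamav2 README; requires an NVIDIA GPU and a
    local directory containing an EXL2-quantized model.
    """
    # Imports are deferred so this file can be read without the library installed.
    from exllamav2 import (
        ExLlamaV2,
        ExLlamaV2Cache,
        ExLlamaV2Config,
        ExLlamaV2Tokenizer,
    )
    from exllamav2.generator import ExLlamaV2DynamicGenerator

    config = ExLlamaV2Config(model_dir)
    model = ExLlamaV2(config)
    cache = ExLlamaV2Cache(model, lazy=True)   # allocate cache as layers load
    model.load_autosplit(cache)                # split weights across available GPUs
    tokenizer = ExLlamaV2Tokenizer(config)

    generator = ExLlamaV2DynamicGenerator(
        model=model, cache=cache, tokenizer=tokenizer
    )
    return generator.generate(prompt=prompt, max_new_tokens=128)
```

Call it with a local model directory, e.g. `print(run_demo("/path/to/exl2-model", "Hello"))`.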
---
Introduction
ExLlamaV2 is a high-performance inference library for running quantized LLMs on consumer NVIDIA GPUs. It pairs optimized CUDA kernels for fast token generation with EXL2/GPTQ/HQQ quantization, PagedAttention, dynamic batching, speculative decoding, and a built-in chat server, and is widely used as a backend in text-generation-webui.
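To make the PagedAttention feature concrete: instead of reserving one large contiguous KV-cache region per sequence, the cache is carved into fixed-size blocks drawn from a shared pool, so many concurrent sequences can be batched without fragmentation. The toy allocator below illustrates that bookkeeping in plain Python; it is a conceptual sketch, not ExLlamaV2's actual implementation, and all names in it are invented for illustration.

```python
class PagedKVCache:
    """Toy block allocator illustrating the PagedAttention idea:
    each sequence's KV cache lives in fixed-size blocks from a shared pool."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # pool of physical block ids
        self.block_tables = {}                      # seq_id -> list of block ids
        self.lengths = {}                           # seq_id -> tokens stored

    def append_token(self, seq_id: int) -> int:
        """Reserve cache space for one new token; return the block it lands in."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length % self.block_size == 0:           # current block full, or none yet
            if not self.free_blocks:
                raise MemoryError("KV cache pool exhausted")
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1
        return table[-1]

    def free(self, seq_id: int) -> None:
        """Return a finished sequence's blocks to the shared pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


cache = PagedKVCache(num_blocks=4, block_size=2)
for _ in range(3):                  # sequence 0 stores 3 tokens -> needs 2 blocks
    cache.append_token(0)
print(len(cache.block_tables[0]))   # → 2
cache.free(0)
print(len(cache.free_blocks))       # → 4 (all blocks back in the pool)
```

Because finished sequences return their blocks immediately, the same pool serves a rotating set of requests, which is what makes dynamic batching memory-efficient.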
**Best for**: Users running quantized LLMs on consumer GPUs
**Works with**: Claude Code, OpenAI Codex, Cursor, Gemini CLI, Windsurf
---
🙏 Source & Credits
> [turboderp/exllamav2](https://github.com/turboderp/exllamav2)
Related Assets
Hoppscotch — Open-Source API Development Platform
Test APIs with a beautiful UI. REST, GraphQL, WebSocket, SSE, and gRPC. Self-hostable Postman alternative. 78K+ GitHub stars.
TokRepo Featured
AFFiNE — Open-Source Notion Alternative
Docs, whiteboards, and databases in one privacy-first workspace. Local-first with real-time collaboration. 66K+ GitHub stars.
TokRepo Featured
Uptime Kuma — Self-Hosted Uptime Monitoring
Monitor HTTP, TCP, DNS, Docker services with notifications to 90+ channels. Beautiful dashboard. 84K+ GitHub stars.
TokRepo Featured