Scripts · April 1, 2026 · 1 min read

ExLlamaV2 — Fast Quantized LLM Inference

ExLlamaV2 runs quantized LLMs on consumer GPUs with optimized CUDA kernels. It supports EXL2/GPTQ/HQQ quantization, PagedAttention, and speculative decoding.

Introduction

ExLlamaV2 is a high-performance inference library for running quantized LLMs on consumer NVIDIA GPUs. It provides optimized CUDA kernels for fast token generation, EXL2/GPTQ/HQQ quantization, PagedAttention, dynamic batching, speculative decoding, and a built-in chat server, and it is widely used as a backend in text-generation-webui.
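To see why speculative decoding speeds up generation, here is a toy pure-Python sketch of the idea (this is a conceptual illustration with greedy acceptance, not ExLlamaV2's actual implementation; the two lambdas stand in for real models): a cheap draft model proposes several tokens, the full target model verifies them in one pass, and the longest matching prefix is kept, so multiple tokens can be accepted per expensive target-model step.

```python
def speculative_step(target, draft, prefix, k):
    """Return the tokens accepted in one speculative step (greedy rule)."""
    # Draft model proposes k tokens autoregressively (cheap).
    proposed = []
    ctx = list(prefix)
    for _ in range(k):
        t = draft(ctx)
        proposed.append(t)
        ctx.append(t)

    # Target model verifies: keep proposals while they match what the
    # target would have produced; on the first mismatch, emit the
    # target's own token instead and stop.
    accepted = []
    ctx = list(prefix)
    for t in proposed:
        expected = target(ctx)
        if t == expected:
            accepted.append(t)
            ctx.append(t)
        else:
            accepted.append(expected)
            break
    else:
        # All k proposals matched; the target contributes one bonus token.
        accepted.append(target(ctx))
    return accepted

# Demo: the target counts up from the last token; the draft agrees
# everywhere except after token 2, where it guesses wrong.
target = lambda ctx: ctx[-1] + 1
draft = lambda ctx: ctx[-1] + 1 if ctx[-1] != 2 else 99

print(speculative_step(target, draft, [0], 4))  # → [1, 2, 3]
```

Output is always identical to what the target model alone would produce; the draft model only changes how many target-model passes are needed.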

Best for: users running quantized LLMs on consumer GPUs
Works with: Claude Code, OpenAI Codex, Cursor, Gemini CLI, Windsurf


Key Features

  • Optimized CUDA kernels
  • EXL2, GPTQ, HQQ quantization
  • PagedAttention for memory efficiency
  • Dynamic batching and speculative decoding
  • Built-in chat server
  • text-generation-webui backend
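The PagedAttention feature above stores the KV cache in fixed-size blocks indexed by a per-sequence block table, so memory grows on demand instead of being reserved up front for the maximum sequence length. A toy sketch of that bookkeeping (assumed semantics for illustration, not ExLlamaV2's CUDA implementation):

```python
BLOCK = 4  # tokens per physical cache block

class PagedKVCache:
    """Toy block-table bookkeeping in the style of PagedAttention."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))   # pool of physical blocks
        self.tables = {}                      # seq_id -> list of physical blocks
        self.lengths = {}                     # seq_id -> tokens stored

    def append_token(self, seq_id):
        table = self.tables.setdefault(seq_id, [])
        n = self.lengths.get(seq_id, 0)
        if n % BLOCK == 0:                    # current block is full (or first token):
            table.append(self.free.pop())     # allocate one more physical block
        self.lengths[seq_id] = n + 1

    def location(self, seq_id, pos):
        """Map a logical token position to a physical (block, offset)."""
        return self.tables[seq_id][pos // BLOCK], pos % BLOCK

cache = PagedKVCache(num_blocks=8)
for _ in range(6):
    cache.append_token("seq0")
# 6 tokens with BLOCK=4 occupy only 2 physical blocks.
print(len(cache.tables["seq0"]))  # → 2
print(cache.location("seq0", 5))  # token 5 lands in the second block, offset 1
```

The point is that a sequence's cache no longer needs to be contiguous, so short sequences in a batch stop wasting memory reserved for the longest one.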

FAQ

Q: What is ExLlamaV2? A: A fast inference library for quantized LLMs on consumer NVIDIA GPUs, built on optimized CUDA kernels with EXL2/GPTQ/HQQ quantization support and PagedAttention.

Q: How do I install it? A: Run pip install exllamav2. An NVIDIA GPU is required.
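After installing, a minimal generation script looks roughly like the sketch below, based on the dynamic-generator examples in the ExLlamaV2 repository. The model path is a placeholder for a local quantized (e.g. EXL2) model directory, and the API can shift between versions, so treat this as a starting point and check the repo's current examples.

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2DynamicGenerator

# Placeholder path to a quantized model directory (e.g. an EXL2 export).
config = ExLlamaV2Config("/path/to/exl2-model")
model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)            # split layers across available GPUs
tokenizer = ExLlamaV2Tokenizer(config)

generator = ExLlamaV2DynamicGenerator(model=model, cache=cache, tokenizer=tokenizer)
print(generator.generate(
    prompt="Explain PagedAttention in one sentence.",
    max_new_tokens=64,
))
```

Running this requires an NVIDIA GPU and downloaded model weights, so it is shown here as an untested sketch.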

