Key Features
- Optimized CUDA kernels
- EXL2, GPTQ, HQQ quantization
- PagedAttention for memory efficiency
- Dynamic batching and speculative decoding
- Built-in chat server
- Usable as a backend for text-generation-webui
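The PagedAttention item above refers to splitting the KV cache into fixed-size pages that are handed out to sequences on demand, so variable-length requests share GPU memory without large contiguous allocations. Below is a minimal, illustrative sketch of the page-allocation idea only; it is not ExLlamaV2's actual implementation, and all class and method names here are hypothetical:

```python
class PagedKVCacheAllocator:
    """Toy paged KV-cache allocator (hypothetical sketch, not library code):
    cache memory is divided into fixed-size pages, and each sequence is
    given pages on demand instead of one large contiguous block."""

    def __init__(self, num_pages: int, page_size: int):
        self.page_size = page_size
        self.free_pages = list(range(num_pages))   # pool of unused page ids
        self.page_table = {}                       # seq_id -> list of page ids
        self.seq_len = {}                          # seq_id -> tokens stored

    def append_token(self, seq_id: int) -> int:
        """Reserve room for one more token of a sequence, grabbing a new
        page only when the current page is full. Returns the page id."""
        pages = self.page_table.setdefault(seq_id, [])
        n = self.seq_len.get(seq_id, 0)
        if n % self.page_size == 0:                # current page full, or none yet
            if not self.free_pages:
                raise MemoryError("out of KV-cache pages")
            pages.append(self.free_pages.pop())
        self.seq_len[seq_id] = n + 1
        return pages[-1]

    def free_sequence(self, seq_id: int) -> None:
        """Return a finished sequence's pages to the free pool."""
        self.free_pages.extend(self.page_table.pop(seq_id, []))
        self.seq_len.pop(seq_id, None)
```

With 16-token pages, a 17-token sequence occupies exactly two pages, and freeing it immediately makes both pages available to other sequences; the fragmentation savings over per-sequence contiguous buffers are what makes larger batches fit in the same VRAM.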
FAQ
Q: What is ExLlamaV2? A: An inference library for running quantized LLMs fast on consumer NVIDIA GPUs. It combines optimized CUDA kernels, the EXL2/GPTQ/HQQ quantization formats, and PagedAttention for memory-efficient batching.
Q: How do I install it? A: Run `pip install exllamav2`. An NVIDIA GPU with CUDA support is required.