Key Features
- PagedAttention: Efficient KV cache memory management for higher throughput
- Continuous batching: Process requests without waiting for batch completion
- OpenAI-compatible API: Drop-in replacement server for any OpenAI client
- Multi-GPU serving: Tensor, pipeline, data, and expert parallelism
- Quantization: GPTQ, AWQ, AutoRound, INT4/INT8, and FP8 support
- Prefix caching: Reuse KV cache across requests with shared prefixes
- Multi-LoRA: Serve multiple LoRA adapters on one base model
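To make the PagedAttention idea above concrete, here is a minimal sketch (not vLLM's actual implementation; names like `BlockTable` and `BLOCK_SIZE` are illustrative): each sequence's KV cache is split into fixed-size blocks, and a per-sequence block table maps logical block indices to physical blocks allocated on demand, rather than reserving one contiguous slab up front.

```python
# Illustrative sketch of paged KV-cache allocation (assumed names, not vLLM internals).
BLOCK_SIZE = 16  # tokens per KV-cache block

class Allocator:
    """Pool of physical KV-cache blocks handed out on demand."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))

    def allocate(self):
        return self.free.pop()

    def num_free(self):
        return len(self.free)

class BlockTable:
    """Maps a sequence's logical block index -> physical block id."""
    def __init__(self, allocator):
        self.allocator = allocator
        self.physical_blocks = []
        self.num_tokens = 0

    def append_tokens(self, n):
        # Allocate a new physical block only when the last one fills up.
        self.num_tokens += n
        needed = -(-self.num_tokens // BLOCK_SIZE)  # ceiling division
        while len(self.physical_blocks) < needed:
            self.physical_blocks.append(self.allocator.allocate())

alloc = Allocator(num_blocks=64)
seq = BlockTable(alloc)
seq.append_tokens(20)            # ceil(20/16) = 2 blocks needed
print(len(seq.physical_blocks))  # 2
print(alloc.num_free())          # 62
```

Because allocation happens block by block, memory is wasted only in the final partially filled block of each sequence, which is what enables the higher batch sizes and throughput claimed above.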
FAQ
Q: What is vLLM? A: vLLM is an open-source LLM serving engine (Apache 2.0, 74.8K+ GitHub stars) built around PagedAttention for efficient KV cache memory use, with continuous batching, an OpenAI-compatible API server, and multi-GPU distributed inference.
Q: How do I install vLLM?
A: Run `pip install vllm`. Serve a model with `vllm serve <model-name>`, which starts an OpenAI-compatible API server (on port 8000 by default).
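Once the server is running, any OpenAI-style client can talk to it. Below is a hedged sketch using only the Python standard library; the model name, prompt, and base URL are placeholders, and the request body follows the OpenAI Chat Completions format that vLLM's server accepts.

```python
# Sketch: building a request for a running `vllm serve` endpoint using only
# the standard library. Model name and base URL are placeholder assumptions.
import json
import urllib.request

def build_chat_request(model, prompt, base_url="http://localhost:8000/v1"):
    """Return (url, JSON body) for an OpenAI-style chat completion call."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,
    }
    return f"{base_url}/chat/completions", json.dumps(body).encode()

url, data = build_chat_request("my-model", "Hello!")
print(url)  # http://localhost:8000/v1/chat/completions

# To actually send it (requires a running server):
# req = urllib.request.Request(
#     url, data=data, headers={"Content-Type": "application/json"})
# resp = json.loads(urllib.request.urlopen(req).read())
# print(resp["choices"][0]["message"]["content"])
```

The official `openai` Python client works the same way: point its `base_url` at the vLLM server and use any placeholder API key.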