vLLM — High-Throughput LLM Serving Engine
vLLM is a high-throughput and memory-efficient LLM inference engine. 74.8K+ GitHub stars. PagedAttention, continuous batching, OpenAI-compatible API, multi-GPU serving. Apache 2.0.
Instalación lista para agent
Este activo puede instalarse después de elegir el runtime, revisar el plan y ejecutar el comando correspondiente.
npx -y tokrepo@latest install ca2016fb-173e-4cc4-aad3-749d66377e89 --target codexEjecutar después de confirmar el plan con dry-run.
What it is
vLLM is a high-throughput, memory-efficient inference engine for serving large language models. Its key innovation is PagedAttention, which manages attention key-value caches like virtual memory pages, dramatically reducing memory waste and enabling higher concurrent request handling. It provides an OpenAI-compatible API server, continuous batching, multi-GPU tensor parallelism, and support for a wide range of model architectures.
It targets teams deploying LLMs to production who need maximum throughput and minimum latency per dollar of compute.
How it saves time or tokens
vLLM's PagedAttention eliminates up to 90% of the memory waste in traditional attention KV cache management. This means you serve more concurrent users on the same GPU hardware. Continuous batching ensures the GPU stays saturated -- new requests are added to the batch immediately rather than waiting for the current batch to complete. The result is 2-4x higher throughput compared to naive serving approaches.
How to use
- Install:
pip install vllm
- Serve a model with OpenAI-compatible API:
vllm serve meta-llama/Llama-3.1-8B-Instruct
- Query like any OpenAI API:
from openai import OpenAI
client = OpenAI(base_url='http://localhost:8000/v1', api_key='dummy')
response = client.chat.completions.create(
model='meta-llama/Llama-3.1-8B-Instruct',
messages=[{'role': 'user', 'content': 'Explain PagedAttention briefly.'}],
)
print(response.choices[0].message.content)
- For multi-GPU:
vllm serve meta-llama/Llama-3.1-70B-Instruct --tensor-parallel-size 4
Example
| Feature | vLLM |
|---|---|
| Throughput | 2-4x vs naive serving |
| Batching | Continuous (no waiting) |
| KV cache | PagedAttention (90% less waste) |
| API | OpenAI-compatible |
| Multi-GPU | Tensor + pipeline parallelism |
| Quantization | GPTQ, AWQ, SqueezeLLM, FP8 |
| License | Apache 2.0 |
Related on TokRepo
- Local LLM: vLLM -- vLLM deep-dive on TokRepo
- Local LLM tools -- all local LLM tools
Common pitfalls
- vLLM requires NVIDIA GPUs with CUDA. AMD ROCm support exists but is less mature. Apple Silicon and CPU-only inference are not supported -- use llama.cpp or Ollama for those platforms.
- Model loading requires enough GPU memory to hold the model weights plus KV cache. Check memory requirements before deploying. Quantized models reduce the memory footprint.
- The OpenAI-compatible API covers most endpoints but may not support every feature of the official OpenAI API. Test your specific use case.
Preguntas frecuentes
PagedAttention manages the attention key-value cache like an operating system manages virtual memory. Instead of pre-allocating a fixed block of memory per request, it allocates memory in small pages as needed. This eliminates internal fragmentation and allows more concurrent requests on the same GPU. It is vLLM's core innovation.
vLLM is optimized for server-side GPU inference with high throughput and concurrent request handling. llama.cpp is optimized for local inference on diverse hardware including CPU and Apple Silicon. Use vLLM for production serving on GPU servers; use llama.cpp for local development or CPU-based inference.
Yes. vLLM supports streaming responses through its OpenAI-compatible API. Tokens are streamed as they are generated, providing the same streaming experience as the OpenAI API. This is essential for interactive chat applications.
A single vLLM instance serves one model. To serve multiple models, run multiple vLLM instances on different ports or GPUs. Use a load balancer or API gateway to route requests to the appropriate model instance.
Yes. vLLM is open source under the Apache 2.0 license and free for all uses. You pay only for the GPU compute you use. There are no licensing fees. vLLM is widely deployed in production by companies of all sizes.
Referencias (3)
- vLLM GitHub— vLLM repository
- vLLM Docs— vLLM documentation
- PagedAttention Paper (arXiv)— PagedAttention paper
Relacionados en TokRepo
Fuente y agradecimientos
Created by UC Berkeley Sky Lab. Licensed under Apache 2.0. vllm-project/vllm — 74,800+ GitHub stars
Discusión
Activos relacionados
nano-vllm — Lightweight LLM Serving Engine
nano-vllm is a minimal, educational, and performant LLM inference engine that reimplements core vLLM concepts in clean Python for easy understanding and extension.
Varnish Cache — High-Performance HTTP Reverse Proxy and Accelerator
An open-source HTTP reverse proxy and caching engine designed to accelerate web applications by serving content from memory at high throughput.
Liger-Kernel — Efficient GPU Kernels for LLM Training
Liger-Kernel provides optimized Triton kernels for LLM training that reduce GPU memory usage and improve throughput, serving as drop-in replacements for standard HuggingFace Transformers layers.
FlashInfer — Kernel Library for LLM Serving
High-performance CUDA kernel library providing optimized attention, decoding, and prefill operations for LLM inference engines like vLLM and SGLang.