Scripts · Mar 31, 2026 · 2 min read

vLLM — High-Throughput LLM Serving Engine

vLLM is a high-throughput and memory-efficient LLM inference engine. 74.8K+ GitHub stars. PagedAttention, continuous batching, OpenAI-compatible API, multi-GPU serving. Apache 2.0.

TokRepo Curated · Community
Quick Use

Use it first, then decide how deep to go

The commands below show both the user and the agent what to copy, install, and run first.

# Install
pip install vllm

# Serve a model with OpenAI-compatible API
vllm serve meta-llama/Llama-3.1-8B-Instruct

# Or use in Python
python -c "
from vllm import LLM, SamplingParams
llm = LLM(model='meta-llama/Llama-3.1-8B-Instruct')
outputs = llm.generate(['Hello, who are you?'], SamplingParams(temperature=0.7, max_tokens=256))
print(outputs[0].outputs[0].text)
"

Intro

vLLM is a high-throughput and memory-efficient inference and serving engine for large language models, originally developed at UC Berkeley's Sky Computing Lab. With 74,800+ GitHub stars and Apache 2.0 license, vLLM introduces PagedAttention for efficient KV cache memory management, continuous request batching, and CUDA/HIP graph optimization. It supports multiple quantization methods (GPTQ, AWQ, INT4/8, FP8), distributed inference with tensor/pipeline parallelism, an OpenAI-compatible API server, and runs on NVIDIA, AMD, Intel, and TPU hardware.
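PagedAttention's core idea is to manage the KV cache like virtual memory: sequences map logical token positions onto fixed-size physical blocks drawn from a shared pool, so memory is never reserved for a sequence's maximum length up front. The toy sketch below illustrates that bookkeeping in plain Python; it is an illustration only, not vLLM's actual implementation, and the class names and shapes are assumptions (the block size of 16 tokens does match vLLM's default).

```python
# Toy sketch of PagedAttention-style KV cache paging (illustration only,
# not vLLM's real allocator). Names and sizes here are assumptions.

BLOCK_SIZE = 16  # tokens per KV cache block (vLLM's default block size)

class BlockAllocator:
    """Hands out fixed-size cache blocks from a shared physical pool."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))

    def allocate(self):
        return self.free.pop()

    def release(self, blocks):
        self.free.extend(blocks)

class Sequence:
    """Maps a growing token sequence onto non-contiguous blocks."""
    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []  # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self):
        if self.num_tokens % BLOCK_SIZE == 0:  # current block is full
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

alloc = BlockAllocator(num_blocks=64)
seq = Sequence(alloc)
for _ in range(40):          # 40 tokens -> ceil(40/16) = 3 blocks
    seq.append_token()
print(len(seq.block_table))  # 3 blocks used, 61 still free for other requests
```

Because blocks are allocated on demand and returned to the pool when a request finishes, many more concurrent sequences fit in the same GPU memory than with contiguous per-request KV buffers.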

Best for: Teams serving LLMs in production with high-throughput, low-latency requirements
Works with: Claude Code, OpenAI Codex, Cursor, Gemini CLI, Windsurf
Hardware: NVIDIA, AMD, Intel, TPU, AWS Neuron


Key Features

  • PagedAttention: Efficient KV cache memory management for higher throughput
  • Continuous batching: Process requests without waiting for batch completion
  • OpenAI-compatible API: Drop-in replacement server for any OpenAI client
  • Multi-GPU serving: Tensor, pipeline, data, and expert parallelism
  • Quantization: GPTQ, AWQ, AutoRound, INT4/8, FP8 support
  • Prefix caching: Reuse KV cache across requests with shared prefixes
  • Multi-LoRA: Serve multiple LoRA adapters on one base model
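The continuous-batching feature above can be sketched with a toy scheduler: instead of waiting for every request in a batch to finish, a finished request's slot is refilled from the wait queue on the very next decode step. The simulation below is an illustration only; the step counts and request lengths are made-up assumptions, not benchmarks.

```python
# Toy simulation of continuous vs. static batching (illustration only;
# request lengths and batch size are made-up assumptions).

def static_batching(lengths, max_batch):
    """Whole batch waits for its slowest member before the next batch starts."""
    steps = 0
    for i in range(0, len(lengths), max_batch):
        steps += max(lengths[i:i + max_batch])
    return steps

def continuous_batching(lengths, max_batch):
    """Finished requests are replaced immediately from the wait queue."""
    queue, running, steps = list(lengths), [], 0
    while queue or running:
        while queue and len(running) < max_batch:
            running.append(queue.pop(0))     # refill free slots each step
        steps += 1
        running = [r - 1 for r in running if r > 1]  # drop finished requests
    return steps

requests = [4, 1, 4, 1]  # decode steps each request needs
print(static_batching(requests, max_batch=2))      # 4 + 4 = 8 steps
print(continuous_batching(requests, max_batch=2))  # 5 steps
```

With total work of 10 decode steps and a batch size of 2, the continuous scheduler hits the 5-step lower bound, while static batching idles slots waiting on the longest request in each batch.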

FAQ

Q: What is vLLM? A: vLLM is an LLM serving engine with 74.8K+ stars featuring PagedAttention for efficient memory use, continuous batching, and an OpenAI-compatible API. Supports multi-GPU distributed inference. Apache 2.0.

Q: How do I install vLLM? A: Run pip install vllm. Serve models with vllm serve <model-name> which starts an OpenAI-compatible API server.
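Since the server started by `vllm serve` speaks the OpenAI chat-completions protocol on port 8000 by default, any OpenAI client can talk to it. The sketch below builds the request body you would POST to `http://localhost:8000/v1/chat/completions`; it constructs the payload without sending it, so it runs without a live server.

```python
# Sketch of a request to the OpenAI-compatible endpoint exposed by
# `vllm serve` (default port 8000). The body is built but not sent here.
import json

def build_chat_request(model, prompt, temperature=0.7, max_tokens=256):
    """Build an OpenAI-style /v1/chat/completions request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }

body = build_chat_request("meta-llama/Llama-3.1-8B-Instruct",
                          "Hello, who are you?")
print(json.dumps(body, indent=2))

# With a server running, POST this body to
# http://localhost:8000/v1/chat/completions
# with header Content-Type: application/json.
```

Because the wire format is the standard OpenAI one, existing SDKs work by pointing their base URL at the vLLM server instead of api.openai.com.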



Source & Thanks

Created by UC Berkeley Sky Lab. Licensed under Apache 2.0. vllm-project/vllm — 74,800+ GitHub stars
