# vLLM — High-Throughput LLM Serving Engine

> vLLM is a high-throughput and memory-efficient LLM inference engine. 74.8K+ GitHub stars. PagedAttention, continuous batching, OpenAI-compatible API, multi-GPU serving. Apache 2.0.

## Quick Use

```bash
# Install
pip install vllm

# Serve a model with an OpenAI-compatible API
vllm serve meta-llama/Llama-3.1-8B-Instruct

# Or use in Python
python -c "
from vllm import LLM, SamplingParams
llm = LLM(model='meta-llama/Llama-3.1-8B-Instruct')
outputs = llm.generate(['Hello, who are you?'], SamplingParams(temperature=0.7, max_tokens=256))
print(outputs[0].outputs[0].text)
"
```

---

## Intro

vLLM is a high-throughput, memory-efficient inference and serving engine for large language models, originally developed at UC Berkeley's Sky Computing Lab. With 74,800+ GitHub stars and an Apache 2.0 license, vLLM introduces PagedAttention for efficient KV cache memory management, continuous request batching, and CUDA/HIP graph optimization. It supports multiple quantization methods (GPTQ, AWQ, INT4/INT8, FP8), distributed inference with tensor and pipeline parallelism, and an OpenAI-compatible API server, and it runs on NVIDIA, AMD, Intel, and TPU hardware.
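Because the server started by `vllm serve` speaks the OpenAI chat-completions wire format, any OpenAI client can point at it. As a minimal sketch (the model name, prompt, and default `http://localhost:8000/v1` endpoint here are illustrative), this is the shape of the JSON body a client POSTs to `/v1/chat/completions`:

```python
import json

# Illustrative request body for the OpenAI-compatible endpoint that
# `vllm serve` exposes (by default at http://localhost:8000/v1).
# A client such as the `openai` Python package pointed at that base_url
# would send an equivalent payload to /v1/chat/completions.
request_body = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
        {"role": "user", "content": "Hello, who are you?"},
    ],
    "temperature": 0.7,
    "max_tokens": 256,
}

payload = json.dumps(request_body)
print(payload)
```

Swapping the base URL is the only change an existing OpenAI-client codebase typically needs; the model field selects among the models the server was launched with.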
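PagedAttention borrows the idea of paged virtual memory: each sequence's KV cache lives in fixed-size blocks addressed through a per-sequence block table, so memory is allocated on demand rather than reserved contiguously up front. A toy sketch of that bookkeeping, assuming a hypothetical block size and pool size (real vLLM does this inside CUDA kernels; this only mirrors the idea):

```python
# Toy sketch of PagedAttention-style block-table bookkeeping,
# NOT vLLM's actual implementation.

BLOCK_SIZE = 4  # tokens per KV cache block (illustrative; vLLM's default is 16)

class BlockManager:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))  # pool of physical block ids
        self.tables = {}                     # seq_id -> list of physical blocks
        self.lengths = {}                    # seq_id -> tokens written so far

    def append_token(self, seq_id):
        """Account for one more KV entry, allocating a block only when needed."""
        n = self.lengths.get(seq_id, 0)
        if n % BLOCK_SIZE == 0:              # current block full (or first token)
            self.tables.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def release(self, seq_id):
        """Return a finished sequence's blocks to the shared pool."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

mgr = BlockManager(num_blocks=8)
for _ in range(6):                           # 6 tokens -> ceil(6/4) = 2 blocks
    mgr.append_token("seq-0")
print(len(mgr.tables["seq-0"]))              # 2
```

Because blocks return to a shared pool the moment a sequence finishes, fragmentation stays low and more concurrent sequences fit in the same GPU memory, which is what enables the continuous batching described below.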
**Best for**: Teams serving LLMs in production with high-throughput, low-latency requirements

**Works with**: Claude Code, OpenAI Codex, Cursor, Gemini CLI, Windsurf

**Hardware**: NVIDIA, AMD, Intel, TPU, AWS Neuron

---

## Key Features

- **PagedAttention**: Efficient KV cache memory management for higher throughput
- **Continuous batching**: Processes incoming requests as capacity frees up, without waiting for the current batch to finish
- **OpenAI-compatible API**: Drop-in replacement server for any OpenAI client
- **Multi-GPU serving**: Tensor, pipeline, data, and expert parallelism
- **Quantization**: GPTQ, AWQ, AutoRound, INT4/INT8, and FP8 support
- **Prefix caching**: Reuses KV cache across requests that share a prefix
- **Multi-LoRA**: Serves multiple LoRA adapters on one base model

---

## FAQ

**Q: What is vLLM?**
A: vLLM is an LLM serving engine (74.8K+ GitHub stars) featuring PagedAttention for efficient memory use, continuous batching, and an OpenAI-compatible API. It supports multi-GPU distributed inference and is licensed under Apache 2.0.

**Q: How do I install vLLM?**
A: Run `pip install vllm`. Serve a model with `vllm serve <model>`, which starts an OpenAI-compatible API server.

---

## Source & Thanks

> Created by [UC Berkeley Sky Lab](https://github.com/vllm-project). Licensed under Apache 2.0.
> [vllm-project/vllm](https://github.com/vllm-project/vllm) — 74,800+ GitHub stars

---

Source: https://tokrepo.com/en/workflows/ca2016fb-173e-4cc4-aad3-749d66377e89
Author: Script Depot