# vLLM — High-Throughput LLM Serving Engine

> vLLM is a high-throughput and memory-efficient LLM inference engine. 74.8K+ GitHub stars. PagedAttention, continuous batching, OpenAI-compatible API, multi-GPU serving. Apache 2.0.

## Quick Use

```bash
# Install
pip install vllm

# Serve a model with an OpenAI-compatible API
vllm serve meta-llama/Llama-3.1-8B-Instruct

# Or use in Python
python -c "
from vllm import LLM, SamplingParams
llm = LLM(model='meta-llama/Llama-3.1-8B-Instruct')
outputs = llm.generate(['Hello, who are you?'], SamplingParams(temperature=0.7, max_tokens=256))
print(outputs[0].outputs[0].text)
"
```

---

## Intro

vLLM is a high-throughput, memory-efficient inference and serving engine for large language models, originally developed at UC Berkeley's Sky Computing Lab. With 74,800+ GitHub stars and an Apache 2.0 license, vLLM introduces PagedAttention for efficient KV cache memory management, continuous request batching, and CUDA/HIP graph optimization. It supports multiple quantization methods (GPTQ, AWQ, INT4/INT8, FP8), distributed inference with tensor and pipeline parallelism, and an OpenAI-compatible API server, and it runs on NVIDIA, AMD, Intel, and TPU hardware.
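Because the server started by `vllm serve` speaks the OpenAI chat-completions wire format, any OpenAI client can point at it. As a minimal sketch (the model name, prompt, and default `http://localhost:8000/v1` endpoint here are illustrative), this is the shape of the JSON body a client POSTs to `/v1/chat/completions`:

```python
import json

# Illustrative request body for the OpenAI-compatible endpoint that
# `vllm serve` exposes (by default at http://localhost:8000/v1).
# A client such as the `openai` Python package pointed at that base_url
# would send an equivalent payload to /v1/chat/completions.
request_body = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
        {"role": "user", "content": "Hello, who are you?"},
    ],
    "temperature": 0.7,
    "max_tokens": 256,
}

payload = json.dumps(request_body)
print(payload)
```

Swapping the base URL is the only change an existing OpenAI-client codebase typically needs; the model field selects among the models the server was launched with.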
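PagedAttention borrows the idea of paged virtual memory: each sequence's KV cache lives in fixed-size blocks addressed through a per-sequence block table, so memory is allocated on demand rather than reserved contiguously up front. A toy sketch of that bookkeeping, assuming a hypothetical block size and pool size (real vLLM does this inside CUDA kernels; this only mirrors the idea):

```python
# Toy sketch of PagedAttention-style block-table bookkeeping,
# NOT vLLM's actual implementation.

BLOCK_SIZE = 4  # tokens per KV cache block (illustrative; vLLM's default is 16)

class BlockManager:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))  # pool of physical block ids
        self.tables = {}                     # seq_id -> list of physical blocks
        self.lengths = {}                    # seq_id -> tokens written so far

    def append_token(self, seq_id):
        """Account for one more KV entry, allocating a block only when needed."""
        n = self.lengths.get(seq_id, 0)
        if n % BLOCK_SIZE == 0:              # current block full (or first token)
            self.tables.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def release(self, seq_id):
        """Return a finished sequence's blocks to the shared pool."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

mgr = BlockManager(num_blocks=8)
for _ in range(6):                           # 6 tokens -> ceil(6/4) = 2 blocks
    mgr.append_token("seq-0")
print(len(mgr.tables["seq-0"]))              # 2
```

Because blocks return to a shared pool the moment a sequence finishes, fragmentation stays low and more concurrent sequences fit in the same GPU memory, which is what enables the continuous batching described below.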
**Best for**: Teams serving LLMs in production with high-throughput, low-latency requirements

**Works with**: Claude Code, OpenAI Codex, Cursor, Gemini CLI, Windsurf

**Hardware**: NVIDIA, AMD, Intel, TPU, AWS Neuron

---

## Key Features

- **PagedAttention**: Efficient KV cache memory management for higher throughput
- **Continuous batching**: Processes incoming requests as capacity frees up, without waiting for the current batch to finish
- **OpenAI-compatible API**: Drop-in replacement server for any OpenAI client
- **Multi-GPU serving**: Tensor, pipeline, data, and expert parallelism
- **Quantization**: GPTQ, AWQ, AutoRound, INT4/INT8, and FP8 support
- **Prefix caching**: Reuses KV cache across requests that share a prefix
- **Multi-LoRA**: Serves multiple LoRA adapters on one base model

---

## FAQ

**Q: What is vLLM?**
A: vLLM is an LLM serving engine (74.8K+ GitHub stars) featuring PagedAttention for efficient memory use, continuous batching, and an OpenAI-compatible API. It supports multi-GPU distributed inference and is licensed under Apache 2.0.

**Q: How do I install vLLM?**
A: Run `pip install vllm`. Serve a model with `vllm serve <model>`, which starts an OpenAI-compatible API server.

---

## Source & Thanks

> Created by [UC Berkeley Sky Lab](https://github.com/vllm-project). Licensed under Apache 2.0.
> [vllm-project/vllm](https://github.com/vllm-project/vllm) — 74,800+ GitHub stars

---

Source: https://tokrepo.com/en/workflows/ca2016fb-173e-4cc4-aad3-749d66377e89
Author: Script Depot