Scripts · March 31, 2026 · 1 min read

vLLM — High-Throughput LLM Serving Engine

vLLM is a high-throughput and memory-efficient LLM inference engine. 74.8K+ GitHub stars. PagedAttention, continuous batching, OpenAI-compatible API, multi-GPU serving. Apache 2.0.

TokRepo Picks · Community
Quick Start

Try it first, then decide whether to dig deeper.

This section should tell both users and agents what to copy first, what to install, and where it goes.

# Install
pip install vllm

# Serve a model with OpenAI-compatible API
vllm serve meta-llama/Llama-3.1-8B-Instruct

# Or use in Python
python -c "
from vllm import LLM, SamplingParams
llm = LLM(model='meta-llama/Llama-3.1-8B-Instruct')
outputs = llm.generate(['Hello, who are you?'], SamplingParams(temperature=0.7, max_tokens=256))
print(outputs[0].outputs[0].text)
"

Introduction

vLLM is a high-throughput and memory-efficient inference and serving engine for large language models, originally developed at UC Berkeley's Sky Computing Lab. With 74,800+ GitHub stars and Apache 2.0 license, vLLM introduces PagedAttention for efficient KV cache memory management, continuous request batching, and CUDA/HIP graph optimization. It supports multiple quantization methods (GPTQ, AWQ, INT4/8, FP8), distributed inference with tensor/pipeline parallelism, an OpenAI-compatible API server, and runs on NVIDIA, AMD, Intel, and TPU hardware.
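PagedAttention borrows the operating-system idea of paged virtual memory: the KV cache is split into fixed-size blocks, and each sequence keeps a block table mapping logical token positions to physical blocks, so memory is allocated on demand instead of reserved up front. A toy sketch of that bookkeeping (illustrative only; class names are invented here, and vLLM's real implementation lives in CUDA kernels with actual KV tensors, though its default block size is also 16 tokens):

```python
# Toy model of PagedAttention-style KV cache bookkeeping.
# Names and structure are illustrative, not vLLM's actual code.

BLOCK_SIZE = 16  # tokens stored per KV cache block


class BlockAllocator:
    """Hands out fixed-size physical blocks from a shared GPU-memory pool."""

    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))

    def allocate(self) -> int:
        return self.free.pop()

    def release(self, block: int) -> None:
        self.free.append(block)


class Sequence:
    """Maps a sequence's logical token positions to physical blocks."""

    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table: list[int] = []  # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self) -> None:
        # A new physical block is allocated only when the last one fills up,
        # so at most BLOCK_SIZE - 1 slots are ever wasted per sequence.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1


allocator = BlockAllocator(num_blocks=64)
seq = Sequence(allocator)
for _ in range(40):  # 40 tokens need ceil(40 / 16) = 3 blocks
    seq.append_token()
print(len(seq.block_table))  # 3
```

Because blocks are small and allocated lazily, many concurrent sequences can share one pool with little fragmentation, which is what lets vLLM batch far more requests into the same GPU memory.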

Best for: Teams serving LLMs in production with high-throughput, low-latency requirements
Works with: Claude Code, OpenAI Codex, Cursor, Gemini CLI, Windsurf
Hardware: NVIDIA, AMD, Intel, TPU, AWS Neuron


Key Features

  • PagedAttention: Efficient KV cache memory management for higher throughput
  • Continuous batching: Process requests without waiting for batch completion
  • OpenAI-compatible API: Drop-in replacement server for any OpenAI client
  • Multi-GPU serving: Tensor, pipeline, data, and expert parallelism
  • Quantization: GPTQ, AWQ, AutoRound, INT4/8, FP8 support
  • Prefix caching: Reuse KV cache across requests with shared prefixes
  • Multi-LoRA: Serve multiple LoRA adapters on one base model
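Most of these features are toggled from the command line when starting the server. A sketch combining a few of them (the model name is illustrative, and flag spellings follow recent vLLM releases; confirm against vllm serve --help for your installed version):

```shell
# Serve an AWQ-quantized model across 2 GPUs with prefix caching enabled.
# Model name is a placeholder; verify flags with `vllm serve --help`.
vllm serve TheBloke/Llama-2-13B-AWQ \
  --quantization awq \
  --tensor-parallel-size 2 \
  --enable-prefix-caching \
  --max-model-len 4096
```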

FAQ

Q: What is vLLM? A: vLLM is an LLM serving engine with 74.8K+ stars featuring PagedAttention for efficient memory use, continuous batching, and an OpenAI-compatible API. Supports multi-GPU distributed inference. Apache 2.0.

Q: How do I install vLLM? A: Run pip install vllm. Serve models with vllm serve <model-name>, which starts an OpenAI-compatible API server.
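Once the server is running, any OpenAI-style client can talk to it; by default vllm serve listens on http://localhost:8000 and exposes the standard /v1/chat/completions endpoint. A minimal stdlib sketch that builds the request (model name and prompt are placeholders; the actual network call is commented out since it needs a live server):

```python
import json
from urllib import request

# Request payload in the OpenAI chat-completions format that vLLM accepts.
payload = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",  # placeholder model name
    "messages": [{"role": "user", "content": "Hello, who are you?"}],
    "temperature": 0.7,
    "max_tokens": 256,
}

req = request.Request(
    "http://localhost:8000/v1/chat/completions",  # vllm serve default host/port
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# Requires a running `vllm serve` instance -- uncomment to send the request:
# with request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])

print(req.full_url)
```

The official openai Python client works the same way: point its base_url at http://localhost:8000/v1 and use any placeholder API key.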



Sources & Acknowledgements

Created by UC Berkeley's Sky Computing Lab. Licensed under Apache 2.0. vllm-project/vllm (74,800+ GitHub stars)

Related Assets