Skills · March 31, 2026 · 1 min read

vLLM — High-Throughput LLM Serving Engine

vLLM is a high-throughput and memory-efficient LLM inference engine. 74.8K+ GitHub stars. PagedAttention, continuous batching, OpenAI-compatible API, multi-GPU serving. Apache 2.0.

Script Depot · Community

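Agent ready

Directly installable by agents

This asset is installable: the agent first selects the current runtime, reviews the install plan, and then runs the matching command.

Native · 98/100 · Policy: allowed

Agent entry point: any MCP/CLI agent
Type: Skill
Install: Single
Trust level: Established

Entry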

vLLM — High-Throughput LLM Serving Engine

Direct install command

npx -y tokrepo@latest install ca2016fb-173e-4cc4-aad3-749d66377e89 --target codex

Do a dry run to confirm the install plan before running this command.

TL;DR

vLLM serves LLMs with high throughput using PagedAttention and continuous batching. It offers an OpenAI-compatible API and multi-GPU support, and it is widely used for production LLM serving.

§01

What it is

vLLM is a high-throughput, memory-efficient inference engine for serving large language models. Its key innovation is PagedAttention, which manages attention key-value caches like virtual memory pages, dramatically reducing memory waste and enabling higher concurrent request handling. It provides an OpenAI-compatible API server, continuous batching, multi-GPU tensor parallelism, and support for a wide range of model architectures.

It targets teams deploying LLMs to production who need maximum throughput and minimum latency per dollar of compute.

§02

How it saves time or tokens

vLLM's PagedAttention eliminates up to 90% of the memory waste in traditional attention KV cache management, so the same GPU hardware can serve more concurrent users. Continuous batching keeps the GPU saturated: new requests join the running batch as soon as a slot frees up, rather than waiting for the current batch to complete. The result is 2-4x higher throughput compared to naive serving approaches.
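The effect of continuous batching can be illustrated with a toy scheduler simulation. This is a minimal sketch, not vLLM's scheduler: the function names, the batch size, and the request lengths are hypothetical, and every decode step is assumed to cost one time unit no matter how full the batch is.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Decode steps when a batch must fully finish before the next one starts."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        # The whole batch is held until its longest request finishes.
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Decode steps when a freed slot is refilled before the next step."""
    waiting = deque(lengths)
    running = []  # remaining tokens for each in-flight request
    steps = 0
    while waiting or running:
        # Admit waiting requests into any free slots.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One decode step produces one token for every running request;
        # requests that just produced their last token leave the batch.
        running = [r - 1 for r in running if r > 1]
        steps += 1
    return steps


# Hypothetical output lengths (in tokens) for six requests.
lengths = [2, 8, 2, 8, 2, 2]
print(static_batching_steps(lengths, batch_size=2))      # 18
print(continuous_batching_steps(lengths, batch_size=2))  # 12
```

In this toy workload the short requests no longer wait behind the long ones, so the same 24 tokens finish in 12 steps instead of 18. Real gains depend on the mix of request lengths and on the hardware.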

§03

How to use

Install:

pip install vllm

Serve a model behind an OpenAI-compatible API:

vllm serve meta-llama/Llama-3.1-8B-Instruct

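Query it with the OpenAI Python client, as you would any OpenAI-compatible API:

from openai import OpenAI

# Point the client at the local vLLM server. The key is a placeholder:
# the server only checks it if it was started with an API key configured.
client = OpenAI(base_url='http://localhost:8000/v1', api_key='dummy')

# The model name must match the model passed to `vllm serve`.
response = client.chat.completions.create(
    model='meta-llama/Llama-3.1-8B-Instruct',
    messages=[{'role': 'user', 'content': 'Explain PagedAttention briefly.'}],
)
print(response.choices[0].message.content)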

For multi-GPU serving, shard the model across GPUs with tensor parallelism:

vllm serve meta-llama/Llama-3.1-70B-Instruct --tensor-parallel-size 4
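Before picking a value for --tensor-parallel-size, a back-of-the-envelope check can help. The sketch below is illustrative only: the helper names are hypothetical, the parameter count and head count are assumed figures for a 70B-parameter model in 16-bit precision (check the model's own config), and it assumes weights shard evenly across GPUs.

```python
def weights_per_gpu_gib(num_params, bytes_per_param, tensor_parallel_size):
    """Approximate weight memory per GPU when weights are sharded evenly."""
    total_bytes = num_params * bytes_per_param
    return total_bytes / tensor_parallel_size / 2**30


def heads_divide_evenly(num_attention_heads, tensor_parallel_size):
    """Tensor parallelism splits attention heads across GPUs, so the head
    count should be divisible by the number of GPUs."""
    return num_attention_heads % tensor_parallel_size == 0


# Assumed figures: 70e9 parameters, 2 bytes each, 64 attention heads, 4 GPUs.
per_gpu = weights_per_gpu_gib(70e9, 2, 4)
print(f"{per_gpu:.1f} GiB of weights per GPU")
print(heads_divide_evenly(64, 4))
```

The per-GPU figure covers weights only; the KV cache and runtime overhead need additional headroom on each device.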

§04

Example

Feature	vLLM
Throughput	2-4x vs naive serving
Batching	Continuous (no waiting)
KV cache	PagedAttention (90% less waste)
API	OpenAI-compatible
Multi-GPU	Tensor + pipeline parallelism
Quantization	GPTQ, AWQ, SqueezeLLM, FP8
License	Apache 2.0

§05

Related on TokRepo

Local LLM: vLLM -- vLLM deep-dive on TokRepo
Local LLM tools -- all local LLM tools

§06

Common pitfalls

vLLM is built primarily for NVIDIA GPUs with CUDA. AMD ROCm support exists but is less mature, and CPU backends are not the project's main focus. For Apple Silicon or CPU-first setups, llama.cpp or Ollama is usually the better fit.
Model loading requires enough GPU memory to hold the model weights plus KV cache. Check memory requirements before deploying. Quantized models reduce the memory footprint.
The OpenAI-compatible API covers most endpoints but may not support every feature of the official OpenAI API. Test your specific use case.
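The memory pitfall above can be made concrete with a rough estimate. This is a sketch under stated assumptions, not vLLM's own accounting: the helper names are hypothetical, the model shape (32 layers, 8 KV heads, head dimension 128) and the 24 GiB GPU are assumed values, and activations and other runtime overhead are ignored.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache one token occupies: a key and a value vector
    per KV head, per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def max_cached_tokens(gpu_bytes, weight_bytes, per_token_bytes, utilization=0.9):
    """Tokens that fit in what is left after the weights are loaded,
    using only a `utilization` fraction of the GPU's memory."""
    budget = gpu_bytes * utilization - weight_bytes
    return max(0, int(budget // per_token_bytes))


# Assumed shape for an 8B Llama-style model with 16-bit KV values.
per_token = kv_bytes_per_token(32, 8, 128)
print(per_token)  # 131072 bytes, i.e. 128 KiB per token

# Assumed hardware: one 24 GiB GPU holding roughly 16 GiB of weights.
tokens = max_cached_tokens(24 * 2**30, 16 * 2**30, per_token)
print(tokens)
```

The token budget is shared by all concurrent requests, so dividing it by the expected context length gives a rough ceiling on concurrency.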

FAQ

What is PagedAttention?

PagedAttention manages the attention key-value cache the way an operating system manages virtual memory. Instead of pre-allocating one contiguous region per request, it allocates memory in small fixed-size pages as the sequence grows. This avoids fragmentation between requests and limits unused space to the last page of each sequence, which allows more concurrent requests on the same GPU. It is vLLM's core innovation.
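The paging idea can be sketched with a toy block allocator. This is an illustration of on-demand page allocation only, not vLLM's implementation: the class name, method names, and block sizes are hypothetical, and no attention computation is involved.

```python
class PagedKVAllocator:
    """Toy block allocator: each sequence gets a block table mapping its
    logical blocks to physical block ids, allocated on demand."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}   # seq_id -> list of physical block ids
        self.seq_lens = {}       # seq_id -> tokens stored so far

    def append_token(self, seq_id):
        """Reserve room for one more token, taking a new block only when
        the sequence's last block is full."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        if length == len(table) * self.block_size:
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1

    def free(self, seq_id):
        """Return all of a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


alloc = PagedKVAllocator(num_blocks=8, block_size=16)
for _ in range(20):
    alloc.append_token("request-a")
print(len(alloc.block_tables["request-a"]))  # 2 blocks hold 20 tokens
alloc.free("request-a")
print(len(alloc.free_blocks))  # 8
```

A 20-token sequence occupies two 16-token blocks, so at most 12 slots sit unused, and the blocks return to the shared pool as soon as the request finishes.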

How does vLLM compare to llama.cpp?

vLLM is optimized for server-side GPU inference with high throughput and concurrent request handling. llama.cpp is optimized for local inference on diverse hardware including CPU and Apple Silicon. Use vLLM for production serving on GPU servers; use llama.cpp for local development or CPU-based inference.

Does vLLM support streaming?

Yes. vLLM supports streaming responses through its OpenAI-compatible API. Tokens are streamed as they are generated, providing the same streaming experience as the OpenAI API. This is essential for interactive chat applications.
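With the OpenAI Python client, streaming is requested by passing stream=True, and each chunk carries a text delta. The helper below is a minimal sketch for consuming such a stream; collect_stream and fake_chunk are hypothetical names, and the fake chunks stand in for a live server so the example runs on its own.

```python
from types import SimpleNamespace


def collect_stream(chunks, on_token=None):
    """Join the text deltas from a chat-completions stream.

    `chunks` is any iterable of objects shaped like the OpenAI client's
    streaming chunks (chunk.choices[0].delta.content).
    """
    parts = []
    for chunk in chunks:
        if not chunk.choices:
            continue  # skip chunks that carry no choices
        text = chunk.choices[0].delta.content
        if text:  # role-only and final chunks have no content
            parts.append(text)
            if on_token is not None:
                on_token(text)
    return "".join(parts)


def fake_chunk(text):
    """Stand-in chunk for this sketch so it runs without a server."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


demo = [fake_chunk(None), fake_chunk("Paged"), fake_chunk("Attention")]
print(collect_stream(demo))  # PagedAttention

# Against a running server, the same helper would take the real stream:
#   stream = client.chat.completions.create(model=..., messages=..., stream=True)
#   collect_stream(stream, on_token=lambda t: print(t, end="", flush=True))
```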

Can vLLM serve multiple models?

A single vLLM instance serves one model. To serve multiple models, run multiple vLLM instances on different ports or GPUs. Use a load balancer or API gateway to route requests to the appropriate model instance.
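The routing step can be as small as a lookup table in front of the instances. The sketch below assumes a hypothetical deployment with two vLLM servers on ports 8000 and 8001; the mapping and the resolve_base_url helper are illustrative, not part of vLLM.

```python
# Hypothetical deployment: two vLLM instances on different ports.
MODEL_ENDPOINTS = {
    "meta-llama/Llama-3.1-8B-Instruct": "http://localhost:8000/v1",
    "meta-llama/Llama-3.1-70B-Instruct": "http://localhost:8001/v1",
}


def resolve_base_url(model, endpoints=MODEL_ENDPOINTS):
    """Return the base URL of the instance that serves `model`."""
    try:
        return endpoints[model]
    except KeyError:
        raise ValueError(f"no vLLM instance configured for {model!r}") from None


print(resolve_base_url("meta-llama/Llama-3.1-70B-Instruct"))
# http://localhost:8001/v1
```

A client would pass the returned URL as base_url when constructing its OpenAI client. A production gateway would add health checks and load balancing on top of this lookup.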

Is vLLM free?

Yes. vLLM is open source under the Apache 2.0 license and free for all uses. You pay only for the GPU compute you use. There are no licensing fees. vLLM is widely deployed in production by companies of all sizes.

References (3)

vLLM GitHub: vLLM repository
vLLM Docs: vLLM documentation
PagedAttention Paper (arXiv): PagedAttention paper

Source and credits

Originally developed at the Sky Computing Lab at UC Berkeley. Licensed under Apache 2.0. Repository: vllm-project/vllm (74,800+ GitHub stars).
