vLLM — High-Throughput GPU Inference Server (Production Scale)

vLLM is an open-source inference engine built for deployment at scale. PagedAttention, continuous batching, and prefix caching make it the highest-throughput option for multi-user production serving on GPUs.

Why Choose It

vLLM came out of UC Berkeley in 2023 with a specific insight: standard attention implementations waste GPU memory at serving time because they allocate full sequence-length buffers per request. PagedAttention chunks the KV cache into pages the engine can share and reclaim dynamically, dramatically increasing effective batch size on the same GPU. In 2026 PagedAttention is standard across most high-performance inference engines, but vLLM remains the most mature and widely deployed implementation.

The other critical feature is continuous batching — new requests stream into an active batch rather than waiting for the slowest current request to finish. Practical effect: latency stays low under load while throughput scales near-linearly with GPU capacity. For SaaS products serving many users, this alone can cut your GPU bill in half compared to naive batching.
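The gain from continuous batching can be illustrated with a toy simulation (request lengths and slot count are made up; this models scheduling only, not vLLM internals). Static batching holds every slot until the longest request in the group finishes; continuous batching refills a slot the moment its sequence completes.

```python
# Toy model: each request needs one decode step per output token and the
# server has 8 batch slots. Static batching processes requests in fixed
# groups; continuous batching refills freed slots immediately.
import heapq

def static_steps(lengths, slots=8):
    steps = 0
    for i in range(0, len(lengths), slots):
        steps += max(lengths[i:i + slots])  # whole batch waits for its longest request
    return steps

def continuous_steps(lengths, slots=8):
    active = list(lengths[:slots])          # finish times of in-flight requests
    heapq.heapify(active)
    for n in lengths[slots:]:
        done = heapq.heappop(active)        # a slot frees up at time `done`
        heapq.heappush(active, done + n)    # waiting request starts immediately
    return max(active)

lengths = [10, 200, 15, 180, 12, 160, 20, 190] * 4
print(static_steps(lengths), continuous_steps(lengths))  # continuous finishes far sooner
```

With mixed short and long requests, the static schedule pays the longest request's cost for every batch, which is exactly the waste continuous batching removes.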

Who should use vLLM: teams serving >10 concurrent users on GPU hardware. Below that threshold, Ollama or llama.cpp server get you there with less complexity. Above it, vLLM (or its derivatives like SGLang, TensorRT-LLM) is the default. Deployment is Python-heavy with CUDA dependencies — set aside real ops time.

Quick Start — Serve a Model with OpenAI API

Set --tensor-parallel-size N to shard across N GPUs on one node, and --pipeline-parallel-size to split across nodes. Use --quantization awq/gptq for pre-quantized weights, or --kv-cache-dtype fp8 on Hopper+ GPUs to roughly double effective KV cache capacity. Watch the metrics endpoint at :8000/metrics for request queue depth and GPU utilization.

# pip install vllm  (requires CUDA-capable GPU)

# 1. Start the OpenAI-compatible server
vllm serve meta-llama/Llama-3.2-3B-Instruct \
  --dtype auto \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90

# Server listens on http://0.0.0.0:8000/v1 with OpenAI-shape endpoints:
#   POST /v1/chat/completions
#   POST /v1/completions
#   POST /v1/embeddings
#   GET  /v1/models

# 2. Use any OpenAI SDK against it
python - <<'PY'
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="vllm")
r = client.chat.completions.create(
    model="meta-llama/Llama-3.2-3B-Instruct",
    messages=[{"role":"user","content":"What problem does PagedAttention solve?"}],
)
print(r.choices[0].message.content)
PY

# 3. Production deployment: Docker image
docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 --ipc=host \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.2-3B-Instruct \
  --tensor-parallel-size 1     # increase for multi-GPU
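The same command scales up to larger models; a sketch of a multi-GPU launch combining the flags above (model name and GPU count are illustrative):

```shell
# 4. Larger model: 4-way tensor parallel, AWQ-quantized weights,
#    fp8 KV cache (Hopper or newer) to fit more concurrent sequences
vllm serve hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4 \
  --tensor-parallel-size 4 \
  --quantization awq \
  --kv-cache-dtype fp8 \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.90
```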

Core Capabilities

PagedAttention

Chunked KV cache lets the engine share memory across requests — massively increases effective batch size. The headline feature that put vLLM on the map.
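A simplified sketch of the idea (illustrative, not vLLM's actual implementation): the KV cache becomes a shared pool of fixed-size pages, each sequence holds a page table that grows as it decodes, and pages return to the pool the moment the sequence finishes.

```python
# Minimal paged-KV-cache allocator sketch: a shared free list of
# fixed-size pages instead of one max-length buffer per request.
PAGE_TOKENS = 16

class PagedKVCache:
    def __init__(self, num_pages):
        self.free = list(range(num_pages))   # shared pool of physical pages
        self.tables = {}                     # seq_id -> page table (page indices)
        self.lengths = {}                    # seq_id -> tokens written so far

    def append_token(self, seq_id):
        n = self.lengths.get(seq_id, 0)
        if n % PAGE_TOKENS == 0:             # current page full: grab a new one
            if not self.free:
                raise MemoryError("KV cache exhausted; request must queue")
            self.tables.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def release(self, seq_id):               # sequence finished: reclaim pages
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

cache = PagedKVCache(num_pages=4)
for _ in range(20):                          # 20 tokens -> ceil(20/16) = 2 pages
    cache.append_token("req-1")
print(len(cache.tables["req-1"]), len(cache.free))  # 2 pages in use, 2 free
cache.release("req-1")
print(len(cache.free))                              # all 4 pages reclaimed
```

Allocation failure maps to request queueing in the real engine, and pages are what make cross-request sharing (the basis of prefix caching) possible.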

Continuous batching

New requests join active batches mid-flight; finished sequences free slots immediately. Keeps GPU utilization near 100% under realistic traffic.

Prefix caching

Reuses KV-cache prefixes across requests sharing the same system prompt or few-shot examples. Large speedup for agent workloads with repetitive prompts.
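The mechanism can be sketched as a lookup from page-aligned token prefixes to already-computed KV pages (again illustrative, not vLLM's code):

```python
# Prefix cache sketch: hash completed full pages of tokens, chained so a
# key identifies the entire prefix up to that page. A new request with the
# same system prompt reuses those pages instead of recomputing attention.
PAGE_TOKENS = 16

def page_keys(tokens):
    keys, prev = [], 0
    for i in range(0, len(tokens) - len(tokens) % PAGE_TOKENS, PAGE_TOKENS):
        prev = hash((prev, tuple(tokens[i:i + PAGE_TOKENS])))
        keys.append(prev)
    return keys

cache = {}  # prefix key -> KV page (stand-in payload)

def cached_prefix_pages(tokens):
    hits = []
    for key in page_keys(tokens):
        if key not in cache:
            break                            # prefix diverges from anything cached
        hits.append(cache[key])
    return hits                              # pages whose attention we can skip

system = list(range(40))                     # shared 40-token system prompt
for key in page_keys(system):
    cache[key] = f"page-{key}"               # first request fills the cache

request = system + [999, 1000]               # second request, same prefix
print(len(cached_prefix_pages(request)))     # 2 full pages (32 tokens) reused
```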

Tensor + pipeline parallelism

Shard a single model across multiple GPUs on one node (tensor) or across nodes (pipeline). Essential for 70B+ models or extreme throughput.

Broad model support

Llama 3.x, Qwen 2.5, Mistral, DeepSeek, Gemma, Phi, Command-R, multi-modal (LLaVA, Qwen-VL, Llama 3.2 vision). New architectures land quickly after release.

OpenAI-compatible API

chat/completions, completions, embeddings all match the OpenAI spec. Drop-in replacement; also works with LangChain, LiteLLM, and the OpenAI SDK with a base_url override.

Comparison

|  | Throughput | Setup Complexity | Model Size | Best For |
|---|---|---|---|---|
| vLLM (this) | Highest open-source (GPU) | Medium-high | 7B-671B with multi-GPU | Production multi-user GPU serving |
| llama.cpp server | Good (CPU+GPU) | Medium | Up to host memory | Single-machine, any hardware |
| Ollama | Good (llama.cpp) | Very low | Up to host memory | Small teams, desktop |
| TensorRT-LLM | Highest on NVIDIA | High | 7B-671B | Maximum throughput on NVIDIA |

Real-World Use Cases

01. SaaS products with high concurrency

Chatbots, copilots, or agent APIs serving dozens to thousands of concurrent users on GPU hardware. vLLM’s batching maintains latency at scale — critical for product-quality experiences.

02. Long-context agents

Agents that send large prompts (RAG, long chats, code context). PagedAttention + prefix caching make long-context serving economical at scale.

03. Internal AI infra platform

Platform teams exposing a shared LLM endpoint behind LiteLLM or Portkey. vLLM is the standard engine choice under the hood — reliable, fast, and widely supported.

Pricing & Licensing

vLLM: Apache 2.0 open source. Free to self-host. Backed by Neural Magic (Red Hat) plus a broad contributor community.

Hardware cost: one or more GPUs. 24GB VRAM covers 7-14B at fp16 or 30-34B at 4-bit. 48-80GB covers 70B 4-bit. Multi-node setups for the biggest models.
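These sizing figures follow from a rule of thumb: weight memory is parameter count times bytes per parameter, before KV cache and runtime overhead. A back-of-envelope helper (the 4.5-bit figure approximates 4-bit quantization plus scales/zero-points):

```python
def weight_gb(params_b, bits):
    """Approximate weight memory in GB for a model with params_b billion parameters."""
    return params_b * 1e9 * bits / 8 / 1e9

# fp16 = 16 bits per weight; 4-bit quantized stores ~4.5 bits effective
print(round(weight_gb(7, 16), 1))    # 7B at fp16   -> ~14.0 GB, fits 24 GB with KV cache
print(round(weight_gb(70, 4.5), 1))  # 70B at 4-bit -> ~39.4 GB, fits 48 GB
```

Leave headroom above the weight figure for KV cache; that headroom is what --gpu-memory-utilization carves out for concurrent requests.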

Operational cost: higher than Ollama/LM Studio. Python + CUDA environment, careful memory tuning, monitoring of request queues. Budget real DevOps time to run it well.

FAQ

Do I need vLLM instead of Ollama?

Only if you serve more than ~5-10 concurrent users or need latency SLOs under load. For desktop use, small teams, or dev machines, Ollama is strictly simpler. vLLM pays off at scale.

Can vLLM run without a GPU?

No — vLLM requires CUDA (or ROCm / Intel GPU with experimental support). For CPU-only inference use llama.cpp server, Ollama, or LocalAI. vLLM is optimized for GPU memory management; it does not make sense on CPU.

How does vLLM compare to TensorRT-LLM?

TensorRT-LLM is NVIDIA's proprietary inference stack: highest throughput on NVIDIA hardware, but harder to operate, tied to NVIDIA, and less portable across models. vLLM is open source, multi-architecture, and competitive on throughput for most models. Choose TensorRT-LLM when you need every last token/sec on NVIDIA; vLLM elsewhere.

Does vLLM support tool calls?

Yes — via the OpenAI chat completions API tool-calling parameters, on models with tool-call-capable weights (Llama 3.1+, Qwen 2.5, Mistral v0.3+). Tool grammar support via --enable-auto-tool-choice.
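The request shape is the standard OpenAI tools parameter; sketched here as the payload a client would POST (the get_weather function and its schema are made up for illustration):

```python
import json

# Hypothetical tool definition in the OpenAI function-calling schema.
# vLLM parses the model's tool-call output when the server is started
# with --enable-auto-tool-choice and a matching --tool-call-parser.
payload = {
    "model": "meta-llama/Llama-3.2-3B-Instruct",
    "messages": [{"role": "user", "content": "Weather in Berlin?"}],
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_weather",           # illustrative tool name
            "description": "Look up current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
    "tool_choice": "auto",
}
print(json.dumps(payload)[:60])              # POST this to /v1/chat/completions
```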

Multi-node deployment?

Yes. Use --pipeline-parallel-size across nodes plus --tensor-parallel-size within each node. Ray handles the orchestration. Not trivial to set up; budget time and expect to tune for your workload.
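A sketch of the launch shape (node and GPU counts are illustrative; the Ray cluster must exist before vLLM starts):

```shell
# Example: 16 GPUs as 2 nodes x 8 GPUs. Bring up Ray first:
#   node 0:  ray start --head
#   node 1:  ray start --address=<head-node-ip>:6379
# Then launch from the head node:
vllm serve deepseek-ai/DeepSeek-V3 \
  --tensor-parallel-size 8 \
  --pipeline-parallel-size 2
```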

What about SGLang / LMDeploy / TGI?

All are modern inference engines competing with vLLM. SGLang adds structured outputs and constrained decoding, LMDeploy is strong on NVIDIA hardware and quantization, and Hugging Face TGI is simpler but now lagging on features. vLLM remains the most general-purpose; evaluate alternatives only for specific needs.
