# LMDeploy Architecture & Performance
## TurboMind Engine
The high-performance inference backend:
| Feature | Benefit |
|---|---|
| Persistent Batching | Continuously processes requests without batch boundaries |
| Blocked KV Cache | Memory-efficient attention cache, no fragmentation |
| Paged Attention | Dynamic memory allocation like OS virtual memory |
| Optimized CUDA Kernels | Custom kernels for attention, GEMM, and rotary embeddings |
| Tensor Parallelism | Distribute model across multiple GPUs |
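The first three features above all hinge on block-based KV cache management. As a rough illustration (plain Python with made-up names, not LMDeploy's actual data structures), a blocked cache hands out fixed-size blocks from a shared pool, so sequences grow without large contiguous allocations and freed blocks are reused immediately:

```python
# Toy sketch of a blocked KV cache: token slots live in fixed-size
# blocks handed out on demand, so no large contiguous allocation is
# needed and freed blocks go straight back to the pool.
# Illustrative only -- not LMDeploy's internal implementation.

BLOCK_SIZE = 4  # token slots per block (real engines use e.g. 64)

class BlockedKVCache:
    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.lengths = {}       # seq_id -> number of tokens written

    def append_token(self, seq_id):
        """Reserve a slot for one new token; returns (block_id, offset)."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length % BLOCK_SIZE == 0:       # current block full -> grab a new one
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1
        return table[-1], length % BLOCK_SIZE

    def free(self, seq_id):
        """Return all of a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id))
        self.lengths.pop(seq_id)

cache = BlockedKVCache(num_blocks=8)
slots = [cache.append_token("req-1") for _ in range(6)]  # 6 tokens -> 2 blocks
print(slots)                    # logical positions mapped to (block, offset)
print(len(cache.free_blocks))   # blocks still free while req-1 is live
cache.free("req-1")
print(len(cache.free_blocks))   # pool fully restored
```

A real engine stores key/value tensors inside the blocks and feeds the per-sequence block table to the attention kernel; the bookkeeping idea is the same.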
## Performance Benchmarks
Tested on Llama 3 8B (A100 80GB):
| Metric | vLLM | LMDeploy | Improvement |
|---|---|---|---|
| Throughput (tok/s) | 3,200 | 5,800 | 1.8x |
| First token latency | 45ms | 38ms | 16% faster |
| Memory usage | 24GB | 22GB | 8% less |
## Quantization
Reduce VRAM usage with minimal quality loss:
```shell
# AWQ 4-bit quantization (recommended)
lmdeploy lite auto_awq meta-llama/Meta-Llama-3-8B-Instruct \
    --work-dir llama3-8b-awq4

# Serve the quantized model
lmdeploy serve api_server llama3-8b-awq4 \
    --model-format awq \
    --server-port 23333
```

| Quantization | VRAM | Quality (perplexity) |
|---|---|---|
| FP16 | 16 GB | 5.12 (baseline) |
| AWQ W4A16 | 8 GB | 5.18 (+1.2%) |
| GPTQ W4A16 | 8 GB | 5.21 (+1.8%) |
| KV Cache INT8 | ~30% smaller KV cache | Negligible loss |
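The weight-storage side of the VRAM column follows from simple arithmetic: bytes = parameters × bits ÷ 8. A back-of-the-envelope sketch (the table's serving figures are presumably higher than bare weight storage because they also include KV cache and runtime buffers):

```python
# Back-of-the-envelope weight memory for the quantization table above.
def weight_gb(n_params, bits_per_weight):
    """Approximate weight storage in GB (decimal, 1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

llama3_8b = 8.0e9  # parameter count, rounded

print(round(weight_gb(llama3_8b, 16), 1))  # FP16 weights  -> 16.0 GB
print(round(weight_gb(llama3_8b, 4), 1))   # 4-bit weights -> 4.0 GB
```

This is why 4-bit quantization roughly halves the end-to-end serving footprint even though the weights themselves shrink 4x: KV cache and activations are unaffected unless quantized separately (e.g. KV Cache INT8).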
## OpenAI-Compatible API
Drop-in replacement for OpenAI applications:
```python
from openai import OpenAI

# Point to the LMDeploy server
client = OpenAI(
    api_key="not-needed",
    base_url="http://localhost:23333/v1",
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
    temperature=0.7,
    max_tokens=500,
)
print(response.choices[0].message.content)
```

Works with LangChain, LlamaIndex, and any OpenAI SDK client.
## Multi-GPU Tensor Parallelism
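Conceptually, `--tp N` shards each weight matrix across N GPUs so every device computes a slice of each layer, and the slices are stitched back together. A pure-Python toy of a column-parallel matrix multiply (illustrative only, not LMDeploy's CUDA path):

```python
# Toy column-parallel linear layer: each "GPU" holds a column slice of
# the weight matrix, computes its partial output, and the partials are
# concatenated (an all-gather in a real multi-GPU setup).

def matmul(x, w):  # x: [m][k], w: [k][n]
    return [[sum(x[i][t] * w[t][j] for t in range(len(w)))
             for j in range(len(w[0]))] for i in range(len(x))]

def split_cols(w, parts):
    """Give each of `parts` devices a contiguous column slice of w."""
    n = len(w[0]) // parts
    return [[row[p * n:(p + 1) * n] for row in w] for p in range(parts)]

x = [[1.0, 2.0]]                  # one token, hidden size 2
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]        # hidden 2 -> 4 output features

shards = split_cols(w, parts=2)            # each "GPU" gets 2 columns
partials = [matmul(x, shard) for shard in shards]
out = [sum((p[0] for p in partials), [])]  # concatenate column slices
assert out == matmul(x, w)                 # matches the single-GPU result
print(out)  # [[11.0, 14.0, 17.0, 20.0]]
```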
```shell
# Serve a 70B model across 4 GPUs
lmdeploy serve api_server meta-llama/Meta-Llama-3-70B-Instruct \
    --tp 4 \
    --server-port 23333
```

## Vision-Language Model Support
```python
from lmdeploy import pipeline
from lmdeploy.vl import load_image

pipe = pipeline("OpenGVLab/InternVL2-8B")
image = load_image("photo.jpg")
# A single (text, image) prompt tuple yields a single response
response = pipe(("Describe this image", image))
print(response.text)
```

Supported VLMs: InternVL 2, LLaVA, CogVLM, MiniCPM-V.
## Supported Models
| Family | Models |
|---|---|
| Llama | Llama 3, Llama 3.1, Llama 3.2, CodeLlama |
| Qwen | Qwen 2, Qwen 2.5, Qwen-VL |
| DeepSeek | DeepSeek V3, DeepSeek R1, DeepSeek Coder |
| InternLM | InternLM 2, InternLM 2.5 |
| Mistral | Mistral 7B, Mixtral 8x7B, Mixtral 8x22B |
| Others | Gemma, Phi-3, Yi, Baichuan, ChatGLM |
## FAQ
**Q: What is LMDeploy?** A: LMDeploy is a production-grade LLM deployment toolkit with 7,700+ GitHub stars, featuring the TurboMind engine for up to 1.8x higher throughput than vLLM, 4-bit quantization, tensor parallelism, and OpenAI-compatible API serving.

**Q: When should I use LMDeploy instead of vLLM?** A: Use LMDeploy when you need maximum throughput per GPU, especially with quantized models. LMDeploy's TurboMind engine consistently benchmarks 1.5-1.8x faster than vLLM. vLLM has broader community adoption; LMDeploy has better raw performance.

**Q: Is LMDeploy free?** A: Yes, it is open source under the Apache-2.0 license.