MCP Configs · Apr 2, 2026 · 3 min read

LMDeploy — High-Performance LLM Deployment Toolkit

Deploy and serve LLMs with 1.8x higher throughput than vLLM. 4-bit quantization, OpenAI-compatible API. By InternLM. 7.7K+ stars.

Introduction

LMDeploy is a production-grade toolkit with 7,700+ GitHub stars for compressing, deploying, and serving large language models. Developed by the InternLM team at Shanghai AI Laboratory, its TurboMind engine delivers up to 1.8x higher throughput than vLLM with persistent batching, blocked KV cache, and optimized CUDA kernels. It supports 4-bit quantization (AWQ, GPTQ), tensor parallelism across GPUs, and an OpenAI-compatible API — making it a drop-in replacement for any OpenAI-based application. Supports LLMs (Llama 3, Qwen 2, DeepSeek V3) and VLMs (InternVL, LLaVA).

Works with: NVIDIA GPUs, Huawei Ascend, Llama 3, Qwen 2, DeepSeek, InternLM, Mistral, InternVL, LLaVA. Best for teams deploying LLMs in production who need maximum throughput per GPU dollar. Setup time: under 5 minutes.


LMDeploy Architecture & Performance

TurboMind Engine

The high-performance inference backend:

| Feature | Benefit |
|---|---|
| Persistent Batching | Continuously processes requests without batch boundaries |
| Blocked KV Cache | Memory-efficient attention cache, no fragmentation |
| Paged Attention | Dynamic memory allocation, like OS virtual memory |
| Optimized CUDA Kernels | Custom kernels for attention, GEMM, and rotary embeddings |
| Tensor Parallelism | Distributes the model across multiple GPUs |
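The blocked KV cache and paged attention work much like OS virtual memory: each sequence's attention cache is mapped onto fixed-size physical blocks through a per-sequence block table, so memory is allocated on demand and returned without fragmentation. A minimal, library-free sketch of the allocation idea (all names are illustrative, not LMDeploy API):

```python
# Minimal sketch of a blocked (paged) KV-cache allocator, in the spirit of
# TurboMind's blocked KV cache. Illustrative only, not LMDeploy internals.

BLOCK_SIZE = 64  # tokens per physical block

class BlockedKVCache:
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))  # physical block pool
        self.block_tables = {}                      # seq_id -> list of physical blocks

    def append_token(self, seq_id: int, pos: int) -> int:
        """Return the physical block holding token `pos` of sequence `seq_id`,
        allocating a new block only when a block boundary is crossed."""
        table = self.block_tables.setdefault(seq_id, [])
        if pos // BLOCK_SIZE >= len(table):         # crossed into a new block
            table.append(self.free_blocks.pop())
        return table[pos // BLOCK_SIZE]

    def release(self, seq_id: int):
        """Finished sequences hand their blocks back to the pool intact,
        so there is no fragmentation between requests."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))

cache = BlockedKVCache(num_blocks=128)
for pos in range(130):                 # a 130-token sequence spans 3 blocks
    cache.append_token(seq_id=0, pos=pos)
print(len(cache.block_tables[0]))      # → 3
cache.release(0)
print(len(cache.free_blocks))          # → 128
```

Because blocks are only bound to a sequence while it is live, persistent batching can admit new requests the moment any sequence finishes, without waiting for a batch boundary.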

Performance Benchmarks

Tested on Llama 3 8B (A100 80GB):

| Metric | vLLM | LMDeploy | Improvement |
|---|---|---|---|
| Throughput (tok/s) | 3,200 | 5,800 | 1.8x |
| First-token latency | 45 ms | 38 ms | 16% faster |
| Memory usage | 24 GB | 22 GB | 8% less |
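The improvement column follows directly from the raw numbers in the table:

```python
# Sanity-check the benchmark deltas (Llama 3 8B on A100 80GB, from the table above).
throughput_vllm, throughput_lmdeploy = 3200, 5800   # tokens/s
latency_vllm, latency_lmdeploy = 45, 38             # ms to first token
mem_vllm, mem_lmdeploy = 24, 22                     # GB

print(round(throughput_lmdeploy / throughput_vllm, 2))                # → 1.81
print(round(100 * (latency_vllm - latency_lmdeploy) / latency_vllm))  # → 16
print(round(100 * (mem_vllm - mem_lmdeploy) / mem_vllm))              # → 8
```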

Quantization

Reduce VRAM usage with minimal quality loss:

# AWQ 4-bit quantization (recommended)
lmdeploy lite auto_awq meta-llama/Meta-Llama-3-8B-Instruct \
  --work-dir llama3-8b-awq4

# Serve the quantized model
lmdeploy serve api_server llama3-8b-awq4 \
  --model-format awq \
  --server-port 23333

| Quantization | VRAM | Quality (perplexity) |
|---|---|---|
| FP16 | 16 GB | 5.12 (baseline) |
| AWQ W4A16 | 8 GB | 5.18 (+1.2%) |
| GPTQ W4A16 | 8 GB | 5.21 (+1.8%) |
| KV Cache INT8 | −30% KV memory | Negligible loss |
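W4A16 schemes keep activations in FP16 but store weights as 4-bit integers with one FP16 scale per small group of weights, which is where the roughly 2x VRAM saving comes from. A minimal sketch of that round-trip (the storage idea only, not the AWQ algorithm itself, which additionally rescales salient channels before quantizing):

```python
# Sketch of symmetric 4-bit group quantization (W4A16-style storage).
# Illustrative only; AWQ/GPTQ add error-minimizing tricks on top of this.

GROUP_SIZE = 4  # real deployments typically use groups of 64 or 128

def quantize_group(weights):
    """Map one group of floats to int4 values in [-8, 7] plus one float scale."""
    scale = max(abs(w) for w in weights) / 7 or 1.0
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize_group(q, scale):
    """Recover approximate FP weights at inference time."""
    return [v * scale for v in q]

weights = [0.12, -0.7, 0.33, 0.05]
q, scale = quantize_group(weights)
restored = dequantize_group(q, scale)
print(q)        # small integers, storable in 4 bits each
print(restored) # close to the original weights
```

The per-group error is bounded by half a quantization step (scale / 2), which is why perplexity degrades only slightly in the table above.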

OpenAI-Compatible API

Drop-in replacement for OpenAI applications:

from openai import OpenAI

# Point to LMDeploy server
client = OpenAI(
    api_key="not-needed",
    base_url="http://localhost:23333/v1"
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
    temperature=0.7,
    max_tokens=500,
)
print(response.choices[0].message.content)

Works with LangChain, LlamaIndex, and any OpenAI SDK client.

Multi-GPU Tensor Parallelism

# Serve a 70B model across 4 GPUs
lmdeploy serve api_server meta-llama/Meta-Llama-3-70B-Instruct \
  --tp 4 \
  --server-port 23333
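With `--tp 4`, each weight matrix is sharded so every GPU holds and multiplies only its slice, and the partial results are combined over NCCL. The core idea can be sketched in plain Python (column-wise splitting shown; real engines interleave row- and column-parallel layers to minimize communication):

```python
# Sketch of column-wise tensor parallelism: each "device" holds a column
# shard of the weight matrix, computes a partial output, and the partial
# outputs are concatenated. Pure Python for illustration only.

def matmul(x, w):
    """Plain matrix multiply: x (rows) @ w (rows) -> result rows."""
    cols = list(zip(*w))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in x]

def split_columns(w, tp):
    """Split weight matrix w into `tp` column shards, one per device."""
    n = len(w[0]) // tp
    return [[row[i * n:(i + 1) * n] for row in w] for i in range(tp)]

x = [[1.0, 2.0]]                      # one activation row
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]

full = matmul(x, w)                   # single-device reference
shards = split_columns(w, tp=2)       # what --tp 2 would distribute
parallel = [sum((matmul(x, s)[0] for s in shards), [])]  # concat partials
print(full == parallel)               # → True
```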

Vision-Language Model Support

from lmdeploy import pipeline
from lmdeploy.vl import load_image

pipe = pipeline("OpenGVLab/InternVL2-8B")
image = load_image("photo.jpg")
# Prompt and image are passed together as a (text, image) tuple
response = pipe(("Describe this image", image))
print(response.text)

Supported VLMs: InternVL 2, LLaVA, CogVLM, MiniCPM-V.

Supported Models

| Family | Models |
|---|---|
| Llama | Llama 3, Llama 3.1, Llama 3.2, CodeLlama |
| Qwen | Qwen 2, Qwen 2.5, Qwen-VL |
| DeepSeek | DeepSeek V3, DeepSeek R1, DeepSeek Coder |
| InternLM | InternLM 2, InternLM 2.5 |
| Mistral | Mistral 7B, Mixtral 8x7B, Mixtral 8x22B |
| Others | Gemma, Phi-3, Yi, Baichuan, ChatGLM |

FAQ

Q: What is LMDeploy? A: LMDeploy is a production-grade LLM deployment toolkit with 7,700+ GitHub stars, featuring TurboMind engine for 1.8x higher throughput than vLLM, 4-bit quantization, tensor parallelism, and OpenAI-compatible API serving.

Q: When should I use LMDeploy instead of vLLM? A: Use LMDeploy when you need maximum throughput per GPU, especially with quantized models. LMDeploy's TurboMind engine consistently benchmarks 1.5-1.8x faster than vLLM. vLLM has broader community adoption; LMDeploy has better raw performance.

Q: Is LMDeploy free? A: Yes, open-source under Apache-2.0.



Source and Acknowledgements

Created by InternLM (Shanghai AI Laboratory). Licensed under Apache-2.0.

lmdeploy — ⭐ 7,700+

Thanks to the InternLM team for pushing the boundaries of LLM serving performance.
