# LMDeploy Architecture & Performance
## TurboMind Engine
The high-performance inference backend:
| Feature | Benefit |
|---|---|
| Persistent Batching | Continuously processes requests without batch boundaries |
| Blocked KV Cache | Memory-efficient attention cache, no fragmentation |
| Paged Attention | Dynamic memory allocation like OS virtual memory |
| Optimized CUDA Kernels | Custom kernels for attention, GEMM, and rotary embeddings |
| Tensor Parallelism | Distribute model across multiple GPUs |
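The first three features above all hinge on block-based KV cache management. As a rough illustration (plain Python with made-up names, not LMDeploy's actual data structures), a blocked cache hands out fixed-size blocks from a shared pool, so sequences grow without large contiguous allocations and freed blocks are reused immediately:

```python
# Toy sketch of a blocked KV cache: token slots live in fixed-size
# blocks handed out on demand, so no large contiguous allocation is
# needed and freed blocks go straight back to the pool.
# Illustrative only -- not LMDeploy's internal implementation.

BLOCK_SIZE = 4  # token slots per block (real engines use e.g. 64)

class BlockedKVCache:
    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.lengths = {}       # seq_id -> number of tokens written

    def append_token(self, seq_id):
        """Reserve a slot for one new token; returns (block_id, offset)."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length % BLOCK_SIZE == 0:       # current block full -> grab a new one
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1
        return table[-1], length % BLOCK_SIZE

    def free(self, seq_id):
        """Return all of a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id))
        self.lengths.pop(seq_id)

cache = BlockedKVCache(num_blocks=8)
slots = [cache.append_token("req-1") for _ in range(6)]  # 6 tokens -> 2 blocks
print(slots)                    # logical positions mapped to (block, offset)
print(len(cache.free_blocks))   # blocks still free while req-1 is live
cache.free("req-1")
print(len(cache.free_blocks))   # pool fully restored
```

A real engine stores key/value tensors inside the blocks and feeds the per-sequence block table to the attention kernel; the bookkeeping idea is the same.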
## Performance Benchmarks
Tested on Llama 3 8B (A100 80GB):
| Metric | vLLM | LMDeploy | Improvement |
|---|---|---|---|
| Throughput (tok/s) | 3,200 | 5,800 | 1.8x |
| First token latency | 45ms | 38ms | 16% faster |
| Memory usage | 24GB | 22GB | 8% less |
## Quantization
Reduce VRAM usage with minimal quality loss:
```shell
# AWQ 4-bit quantization (recommended)
lmdeploy lite auto_awq meta-llama/Meta-Llama-3-8B-Instruct \
    --work-dir llama3-8b-awq4

# Serve the quantized model
lmdeploy serve api_server llama3-8b-awq4 \
    --model-format awq \
    --server-port 23333
```

| Quantization | VRAM | Quality (perplexity) |
|---|---|---|
| FP16 | 16 GB | 5.12 (baseline) |
| AWQ W4A16 | 8 GB | 5.18 (+1.2%) |
| GPTQ W4A16 | 8 GB | 5.21 (+1.8%) |
| KV Cache INT8 | ~30% smaller KV cache | Negligible loss |
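The weight-storage side of the VRAM column follows from simple arithmetic: bytes = parameters × bits ÷ 8. A back-of-the-envelope sketch (the table's serving figures are presumably higher than bare weight storage because they also include KV cache and runtime buffers):

```python
# Back-of-the-envelope weight memory for the quantization table above.
def weight_gb(n_params, bits_per_weight):
    """Approximate weight storage in GB (decimal, 1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

llama3_8b = 8.0e9  # parameter count, rounded

print(round(weight_gb(llama3_8b, 16), 1))  # FP16 weights  -> 16.0 GB
print(round(weight_gb(llama3_8b, 4), 1))   # 4-bit weights -> 4.0 GB
```

This is why 4-bit quantization roughly halves the end-to-end serving footprint even though the weights themselves shrink 4x: KV cache and activations are unaffected unless quantized separately (e.g. KV Cache INT8).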
## OpenAI-Compatible API
Drop-in replacement for OpenAI applications:
```python
from openai import OpenAI

# Point to the LMDeploy server
client = OpenAI(
    api_key="not-needed",
    base_url="http://localhost:23333/v1",
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
    temperature=0.7,
    max_tokens=500,
)
print(response.choices[0].message.content)
```

Works with LangChain, LlamaIndex, and any OpenAI SDK client.
## Multi-GPU Tensor Parallelism
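Conceptually, `--tp N` shards each weight matrix across N GPUs so every device computes a slice of each layer, and the slices are stitched back together. A pure-Python toy of a column-parallel matrix multiply (illustrative only, not LMDeploy's CUDA path):

```python
# Toy column-parallel linear layer: each "GPU" holds a column slice of
# the weight matrix, computes its partial output, and the partials are
# concatenated (an all-gather in a real multi-GPU setup).

def matmul(x, w):  # x: [m][k], w: [k][n]
    return [[sum(x[i][t] * w[t][j] for t in range(len(w)))
             for j in range(len(w[0]))] for i in range(len(x))]

def split_cols(w, parts):
    """Give each of `parts` devices a contiguous column slice of w."""
    n = len(w[0]) // parts
    return [[row[p * n:(p + 1) * n] for row in w] for p in range(parts)]

x = [[1.0, 2.0]]                  # one token, hidden size 2
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]        # hidden 2 -> 4 output features

shards = split_cols(w, parts=2)            # each "GPU" gets 2 columns
partials = [matmul(x, shard) for shard in shards]
out = [sum((p[0] for p in partials), [])]  # concatenate column slices
assert out == matmul(x, w)                 # matches the single-GPU result
print(out)  # [[11.0, 14.0, 17.0, 20.0]]
```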
```shell
# Serve a 70B model across 4 GPUs
lmdeploy serve api_server meta-llama/Meta-Llama-3-70B-Instruct \
    --tp 4 \
    --server-port 23333
```

## Vision-Language Model Support
```python
from lmdeploy import pipeline
from lmdeploy.vl import load_image

pipe = pipeline("OpenGVLab/InternVL2-8B")
image = load_image("photo.jpg")
# A single (text, image) prompt tuple yields a single response
response = pipe(("Describe this image", image))
print(response.text)
```

Supported VLMs: InternVL 2, LLaVA, CogVLM, MiniCPM-V.
## Supported Models
| Family | Models |
|---|---|
| Llama | Llama 3, Llama 3.1, Llama 3.2, CodeLlama |
| Qwen | Qwen 2, Qwen 2.5, Qwen-VL |
| DeepSeek | DeepSeek V3, DeepSeek R1, DeepSeek Coder |
| InternLM | InternLM 2, InternLM 2.5 |
| Mistral | Mistral 7B, Mixtral 8x7B, Mixtral 8x22B |
| Others | Gemma, Phi-3, Yi, Baichuan, ChatGLM |
## FAQ
**Q: What is LMDeploy?** A: LMDeploy is a production-grade LLM deployment toolkit with 7,700+ GitHub stars, featuring the TurboMind engine for up to 1.8x higher throughput than vLLM, 4-bit quantization, tensor parallelism, and OpenAI-compatible API serving.

**Q: When should I use LMDeploy instead of vLLM?** A: Use LMDeploy when you need maximum throughput per GPU, especially with quantized models. LMDeploy's TurboMind engine consistently benchmarks 1.5-1.8x faster than vLLM. vLLM has broader community adoption; LMDeploy has better raw performance.

**Q: Is LMDeploy free?** A: Yes, it is open source under the Apache-2.0 license.