Apr 2, 2026 · 3 min read

LMDeploy — High-Performance LLM Deployment Toolkit

Deploy and serve LLMs with up to 1.8x higher throughput than vLLM. 4-bit quantization, OpenAI-compatible API. By InternLM. 7.7K+ stars.

## Quick Use

*Use it first, then decide how deep to go.*

The commands below cover everything to copy, install, and run first.

```bash
pip install lmdeploy
```

```bash
# Serve a model with an OpenAI-compatible API
lmdeploy serve api_server meta-llama/Meta-Llama-3-8B-Instruct \
    --server-port 23333

# Now query it like OpenAI
curl http://localhost:23333/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3-8B-Instruct",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```

```python
from lmdeploy import pipeline

# Simple Python inference
pipe = pipeline("meta-llama/Meta-Llama-3-8B-Instruct")
response = pipe(["What is machine learning?"])
print(response[0].text)
```

For 4-bit quantized inference (roughly half the VRAM):

```bash
# Note: the model must be AWQ-quantized first (see the
# `lmdeploy lite auto_awq` example in the Quantization section)
lmdeploy serve api_server meta-llama/Meta-Llama-3-8B-Instruct \
    --model-format awq \
    --server-port 23333
```

---
## Intro

LMDeploy is a production-grade toolkit with 7,700+ GitHub stars for compressing, deploying, and serving large language models. Developed by the InternLM team at Shanghai AI Laboratory, its TurboMind engine delivers up to 1.8x higher throughput than vLLM through persistent batching, a blocked KV cache, and optimized CUDA kernels. It supports 4-bit quantization (AWQ, GPTQ), tensor parallelism across GPUs, and an OpenAI-compatible API, making it a drop-in replacement for any OpenAI-based application. It serves both LLMs (Llama 3, Qwen 2, DeepSeek V3) and VLMs (InternVL, LLaVA).

**Works with:** NVIDIA GPUs, Huawei Ascend, Llama 3, Qwen 2, DeepSeek, InternLM, Mistral, InternVL, LLaVA.

**Best for:** teams deploying LLMs in production who need maximum throughput per GPU dollar.

**Setup time:** under 5 minutes.

---
## LMDeploy Architecture & Performance

### TurboMind Engine

The high-performance inference backend:

| Feature | Benefit |
|---------|---------|
| **Persistent Batching** | Continuously processes requests without batch boundaries |
| **Blocked KV Cache** | Memory-efficient attention cache, no fragmentation |
| **Paged Attention** | Dynamic memory allocation like OS virtual memory |
| **Optimized CUDA Kernels** | Custom kernels for attention, GEMM, and rotary embeddings |
| **Tensor Parallelism** | Distribute the model across multiple GPUs |

### Performance Benchmarks

Tested on Llama 3 8B (A100 80GB):

| Metric | vLLM | LMDeploy | Improvement |
|--------|------|----------|-------------|
| **Throughput (tok/s)** | 3,200 | 5,800 | **1.8x** |
| **First-token latency** | 45 ms | 38 ms | **16% faster** |
| **Memory usage** | 24 GB | 22 GB | **8% less** |

### Quantization

Reduce VRAM usage with minimal quality loss:

```bash
# AWQ 4-bit quantization (recommended)
lmdeploy lite auto_awq meta-llama/Meta-Llama-3-8B-Instruct \
    --work-dir llama3-8b-awq4

# Serve the quantized model
lmdeploy serve api_server llama3-8b-awq4 \
    --model-format awq \
    --server-port 23333
```

| Quantization | VRAM | Quality (perplexity) |
|--------------|------|----------------------|
| FP16 | 16 GB | 5.12 (baseline) |
| AWQ W4A16 | 8 GB | 5.18 (+1.2%) |
| GPTQ W4A16 | 8 GB | 5.21 (+1.8%) |
| KV Cache INT8 | -30% KV | Negligible loss |

### OpenAI-Compatible API

Drop-in replacement for OpenAI applications:

```python
from openai import OpenAI

# Point to the LMDeploy server
client = OpenAI(
    api_key="not-needed",
    base_url="http://localhost:23333/v1",
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
    temperature=0.7,
    max_tokens=500,
)
print(response.choices[0].message.content)
```

Works with LangChain, LlamaIndex, and any OpenAI SDK client.
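The KV Cache INT8 row in the quantization table is easy to sanity-check with back-of-the-envelope math. The sketch below uses Llama 3 8B's published architecture (32 transformer layers, 8 grouped-query KV heads, head dimension 128); those figures come from the model's config, not from LMDeploy itself:

```python
# Back-of-the-envelope KV-cache size for Llama 3 8B.
# Architecture (from the published model config, an external assumption):
# 32 layers, 8 KV heads (grouped-query attention), head dimension 128.
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128

def kv_bytes_per_token(bytes_per_element: int) -> int:
    """Cache bytes per generated token: keys + values for every layer."""
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * bytes_per_element

fp16 = kv_bytes_per_token(2)  # 131072 bytes = 128 KiB per token
int8 = kv_bytes_per_token(1)  # 65536 bytes = 64 KiB per token

# Cache footprint of a single 8K-token sequence:
print(f"FP16 KV cache: {fp16 * 8192 / 2**30:.1f} GiB")  # 1.0 GiB
print(f"INT8 KV cache: {int8 * 8192 / 2**30:.1f} GiB")  # 0.5 GiB
```

Because the cache grows linearly with sequence length and batch size, halving its per-token footprint directly increases how many concurrent sequences fit on one GPU.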
### Multi-GPU Tensor Parallelism

```bash
# Serve a 70B model across 4 GPUs
lmdeploy serve api_server meta-llama/Meta-Llama-3-70B-Instruct \
    --tp 4 \
    --server-port 23333
```

### Vision-Language Model Support

```python
from lmdeploy import pipeline
from lmdeploy.vl import load_image

pipe = pipeline("OpenGVLab/InternVL2-8B")
image = load_image("photo.jpg")
response = pipe([{"role": "user", "content": [image, "Describe this image"]}])
print(response[0].text)
```

Supported VLMs: InternVL 2, LLaVA, CogVLM, MiniCPM-V.

### Supported Models

| Family | Models |
|--------|--------|
| **Llama** | Llama 3, Llama 3.1, Llama 3.2, CodeLlama |
| **Qwen** | Qwen 2, Qwen 2.5, Qwen-VL |
| **DeepSeek** | DeepSeek V3, DeepSeek R1, DeepSeek Coder |
| **InternLM** | InternLM 2, InternLM 2.5 |
| **Mistral** | Mistral 7B, Mixtral 8x7B, Mixtral 8x22B |
| **Others** | Gemma, Phi-3, Yi, Baichuan, ChatGLM |

---

## FAQ

**Q: What is LMDeploy?**

A: LMDeploy is a production-grade LLM deployment toolkit with 7,700+ GitHub stars, featuring the TurboMind engine for 1.8x higher throughput than vLLM, 4-bit quantization, tensor parallelism, and OpenAI-compatible API serving.

**Q: When should I use LMDeploy instead of vLLM?**

A: Use LMDeploy when you need maximum throughput per GPU, especially with quantized models. LMDeploy's TurboMind engine consistently benchmarks 1.5-1.8x faster than vLLM. vLLM has broader community adoption; LMDeploy has better raw performance.

**Q: Is LMDeploy free?**

A: Yes, open-source under Apache-2.0.

---
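The "throughput per GPU dollar" argument can be made concrete with simple arithmetic over the benchmark numbers reported above. The hourly A100 rate below is an illustrative assumption, not a quoted price:

```python
# Rough serving-cost arithmetic from the A100 benchmark table.
# ASSUMPTION: $2.00/hour for an A100 80GB is an illustrative cloud rate.
GPU_COST_PER_HOUR = 2.00

def cost_per_million_tokens(tokens_per_second: float) -> float:
    """Serving cost in dollars per 1M generated tokens at a given throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return GPU_COST_PER_HOUR / tokens_per_hour * 1_000_000

vllm_cost = cost_per_million_tokens(3200)      # ~$0.17 per 1M tokens
lmdeploy_cost = cost_per_million_tokens(5800)  # ~$0.10 per 1M tokens
print(f"vLLM:     ${vllm_cost:.3f} / 1M tokens")
print(f"LMDeploy: ${lmdeploy_cost:.3f} / 1M tokens")
print(f"Speedup:  {5800 / 3200:.2f}x")  # ~1.81x
```

Whatever hourly rate you actually pay, the ratio holds: at 1.8x the throughput, the per-token cost drops by the same factor on identical hardware.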
## 🙏 Source & Thanks

> Created by [InternLM](https://github.com/InternLM) (Shanghai AI Laboratory). Licensed under Apache-2.0.
>
> [lmdeploy](https://github.com/InternLM/lmdeploy) — ⭐ 7,700+

Thanks to the InternLM team for pushing the boundaries of LLM serving performance.
