MCP Configs · Apr 2, 2026 · 3 min read
LMDeploy — High-Performance LLM Deployment Toolkit
Deploy and serve LLMs with 1.8x higher throughput than vLLM. 4-bit quantization, OpenAI-compatible API. By InternLM. 7.7K+ stars.
TokRepo Picks · Community
Quick Use
Use it first, then decide how deep to go.
Copy the commands below to install LMDeploy, serve a model, and send your first request.
```bash
pip install lmdeploy
```
```bash
# Serve a model with OpenAI-compatible API
lmdeploy serve api_server meta-llama/Meta-Llama-3-8B-Instruct \
--server-port 23333
# In another terminal, query it like any OpenAI-compatible endpoint
curl http://localhost:23333/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Meta-Llama-3-8B-Instruct",
"messages": [{"role": "user", "content": "Hello!"}]
}'
```
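The same request can be sent from Python with just the standard library. This is a sketch, not part of the LMDeploy docs — it simply mirrors the OpenAI chat-completion schema the server exposes, assuming the default port 23333 from the command above:

```python
import json
import urllib.request

def build_chat_request(model, messages, base_url="http://localhost:23333"):
    """Build an OpenAI-style chat-completion POST request."""
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_chat_request(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    [{"role": "user", "content": "Hello!"}],
)
# With the server running, send it and read the reply:
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read())["choices"][0]["message"]["content"])
```

Because the API is OpenAI-compatible, any OpenAI SDK pointed at this base URL works the same way.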
```python
from lmdeploy import pipeline
# Simple Python inference
pipe = pipeline("meta-llama/Meta-Llama-3-8B-Instruct")
response = pipe(["What is machine learning?"])
print(response[0].text)
```
For 4-bit quantized inference (roughly half the total VRAM), serve an AWQ checkpoint. The weights must already be quantized — either run `lmdeploy lite auto_awq` on the model first or point at a pre-quantized AWQ repo:
```bash
# the checkpoint here must be an AWQ-quantized model
lmdeploy serve api_server meta-llama/Meta-Llama-3-8B-Instruct \
--model-format awq \
--server-port 23333
```
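A back-of-the-envelope sketch of where the savings come from (my own arithmetic, not from the LMDeploy docs): 4-bit weights are 4x smaller than fp16, but the KV cache and activations stay in fp16, so the total footprint lands near the quoted 2x reduction.

```python
def weight_vram_gib(n_params, bits_per_weight):
    """Approximate VRAM needed just for model weights, in GiB."""
    return n_params * bits_per_weight / 8 / 2**30

params = 8e9  # a Llama-3-8B-class model
fp16 = weight_vram_gib(params, 16)  # full-precision weights
awq4 = weight_vram_gib(params, 4)   # AWQ 4-bit weights
print(f"fp16 weights: {fp16:.1f} GiB, 4-bit weights: {awq4:.1f} GiB")
```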
---
Intro
LMDeploy is a production-grade toolkit with 7,700+ GitHub stars for compressing, deploying, and serving large language models. Developed by the InternLM team at Shanghai AI Laboratory, its TurboMind engine delivers up to 1.8x higher throughput than vLLM with persistent batching, blocked KV cache, and optimized CUDA kernels. It supports 4-bit quantization (AWQ, GPTQ), tensor parallelism across GPUs, and an OpenAI-compatible API — making it a drop-in replacement for any OpenAI-based application. Model coverage spans LLMs (Llama 3, Qwen 2, DeepSeek V3) and VLMs (InternVL, LLaVA).
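To see why blocked KV-cache management matters at this scale, here is a rough per-token cache-size calculation (my own arithmetic, assuming Llama-3-8B's published shape: 32 layers, 8 KV heads via grouped-query attention, head dim 128, fp16 cache):

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache one token occupies: a K and a V vector per layer."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes

per_token = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128)
per_8k_seq_gib = per_token * 8192 / 2**30  # a full 8k-token sequence
print(f"{per_token // 1024} KiB per token, {per_8k_seq_gib:.1f} GiB per 8k sequence")
```

Allocating that cache in fixed-size blocks on demand, rather than reserving the worst case per request, is what lets the engine pack more concurrent sequences onto one GPU.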
Works with: NVIDIA GPUs, Huawei Ascend, Llama 3, Qwen 2, DeepSeek, InternLM, Mistral, InternVL, LLaVA. Best for teams deploying LLMs in production who need maximum throughput per GPU dollar. Setup time: under 5 minutes.
---
🙏 Source & Thanks
> Created by [InternLM](https://github.com/InternLM) (Shanghai AI Laboratory). Licensed under Apache-2.0.
>
> [lmdeploy](https://github.com/InternLM/lmdeploy) — ⭐ 7,700+
Thanks to the InternLM team for pushing the boundaries of LLM serving performance.