# LMDeploy — High-Performance LLM Deployment Toolkit

> Deploy and serve LLMs with 1.8x higher throughput than vLLM. 4-bit quantization, OpenAI-compatible API. By InternLM. 7.7K+ stars.

## Quick Use

```bash
pip install lmdeploy
```

```bash
# Serve a model with an OpenAI-compatible API
lmdeploy serve api_server meta-llama/Meta-Llama-3-8B-Instruct \
    --server-port 23333

# Now query it like OpenAI
curl http://localhost:23333/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Meta-Llama-3-8B-Instruct",
        "messages": [{"role": "user", "content": "Hello!"}]
    }'
```

```python
from lmdeploy import pipeline

# Simple Python inference
pipe = pipeline("meta-llama/Meta-Llama-3-8B-Instruct")
response = pipe(["What is machine learning?"])
print(response[0].text)
```

For 4-bit quantized inference (roughly half the VRAM), quantize the model first, then serve the quantized weights:

```bash
lmdeploy lite auto_awq meta-llama/Meta-Llama-3-8B-Instruct \
    --work-dir llama3-8b-awq4

lmdeploy serve api_server llama3-8b-awq4 \
    --model-format awq \
    --server-port 23333
```

---

## Intro

LMDeploy is a production-grade toolkit with 7,700+ GitHub stars for compressing, deploying, and serving large language models. Developed by the InternLM team at Shanghai AI Laboratory, its TurboMind engine delivers up to 1.8x higher throughput than vLLM through persistent batching, a blocked KV cache, and optimized CUDA kernels. It supports 4-bit quantization (AWQ, GPTQ), tensor parallelism across GPUs, and an OpenAI-compatible API, making it a drop-in replacement for any OpenAI-based application. It serves both LLMs (Llama 3, Qwen 2, DeepSeek V3) and VLMs (InternVL, LLaVA).

Works with: NVIDIA GPUs, Huawei Ascend, Llama 3, Qwen 2, DeepSeek, InternLM, Mistral, InternVL, LLaVA.

Best for teams deploying LLMs in production who need maximum throughput per GPU dollar. Setup time: under 5 minutes.
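The "roughly half the VRAM" claim for 4-bit weights can be sanity-checked with back-of-envelope arithmetic: weight memory is just parameter count times bits per weight. A minimal sketch (ignoring KV cache and activation overhead, which is why a served AWQ model still needs about 8 GB rather than 4 GB):

```python
def weight_vram_gb(n_params: float, bits_per_weight: int) -> float:
    """Gigabytes needed just to hold the model weights."""
    return n_params * bits_per_weight / 8 / 1e9

# An 8B-parameter model such as Llama 3 8B:
fp16 = weight_vram_gb(8e9, 16)  # 16.0 GB in FP16
awq4 = weight_vram_gb(8e9, 4)   #  4.0 GB in AWQ W4A16
print(f"FP16 weights: {fp16:.1f} GB, 4-bit weights: {awq4:.1f} GB")
```

The 4x reduction applies only to weights; the KV cache and runtime buffers are unchanged unless you also enable KV cache quantization, which is why the end-to-end saving is closer to 2x.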
---

## LMDeploy Architecture & Performance

### TurboMind Engine

The high-performance inference backend:

| Feature | Benefit |
|---------|---------|
| **Persistent Batching** | Continuously processes requests without batch boundaries |
| **Blocked KV Cache** | Memory-efficient attention cache with no fragmentation |
| **Paged Attention** | Dynamic memory allocation, like OS virtual memory |
| **Optimized CUDA Kernels** | Custom kernels for attention, GEMM, and rotary embeddings |
| **Tensor Parallelism** | Distributes the model across multiple GPUs |

### Performance Benchmarks

Tested on Llama 3 8B (A100 80GB):

| Metric | vLLM | LMDeploy | Improvement |
|--------|------|----------|-------------|
| **Throughput (tok/s)** | 3,200 | 5,800 | **1.8x** |
| **First-token latency** | 45 ms | 38 ms | **16% faster** |
| **Memory usage** | 24 GB | 22 GB | **8% less** |

### Quantization

Reduce VRAM usage with minimal quality loss:

```bash
# AWQ 4-bit quantization (recommended)
lmdeploy lite auto_awq meta-llama/Meta-Llama-3-8B-Instruct \
    --work-dir llama3-8b-awq4

# Serve the quantized model
lmdeploy serve api_server llama3-8b-awq4 \
    --model-format awq \
    --server-port 23333
```

| Quantization | VRAM | Quality (perplexity) |
|--------------|------|----------------------|
| FP16 | 16 GB | 5.12 (baseline) |
| AWQ W4A16 | 8 GB | 5.18 (+1.2%) |
| GPTQ W4A16 | 8 GB | 5.21 (+1.8%) |
| KV Cache INT8 | -30% KV | Negligible loss |

### OpenAI-Compatible API

Drop-in replacement for OpenAI applications:

```python
from openai import OpenAI

# Point the client at the LMDeploy server
client = OpenAI(
    api_key="not-needed",
    base_url="http://localhost:23333/v1"
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
    temperature=0.7,
    max_tokens=500,
)
print(response.choices[0].message.content)
```

Works with LangChain, LlamaIndex, and any OpenAI SDK client.
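The same endpoint also supports streaming via the standard OpenAI `"stream": true` field. A minimal sketch using curl (this assumes the api_server from Quick Use is already running on port 23333; it is not runnable without a live server):

```shell
curl http://localhost:23333/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Meta-Llama-3-8B-Instruct",
        "messages": [{"role": "user", "content": "Write a haiku"}],
        "stream": true
    }'
```

With streaming enabled, the response arrives as server-sent events, one JSON delta per chunk, which is what OpenAI SDK clients consume when you pass `stream=True`.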
### Multi-GPU Tensor Parallelism

```bash
# Serve a 70B model across 4 GPUs
lmdeploy serve api_server meta-llama/Meta-Llama-3-70B-Instruct \
    --tp 4 \
    --server-port 23333
```

### Vision-Language Model Support

```python
from lmdeploy import pipeline
from lmdeploy.vl import load_image

pipe = pipeline("OpenGVLab/InternVL2-8B")
image = load_image("photo.jpg")
response = pipe([{"role": "user", "content": [image, "Describe this image"]}])
print(response[0].text)
```

Supported VLMs: InternVL 2, LLaVA, CogVLM, MiniCPM-V.

### Supported Models

| Family | Models |
|--------|--------|
| **Llama** | Llama 3, Llama 3.1, Llama 3.2, CodeLlama |
| **Qwen** | Qwen 2, Qwen 2.5, Qwen-VL |
| **DeepSeek** | DeepSeek V3, DeepSeek R1, DeepSeek Coder |
| **InternLM** | InternLM 2, InternLM 2.5 |
| **Mistral** | Mistral 7B, Mixtral 8x7B, Mixtral 8x22B |
| **Others** | Gemma, Phi-3, Yi, Baichuan, ChatGLM |

---

## FAQ

**Q: What is LMDeploy?**
A: LMDeploy is a production-grade LLM deployment toolkit with 7,700+ GitHub stars, featuring the TurboMind engine for 1.8x higher throughput than vLLM, 4-bit quantization, tensor parallelism, and OpenAI-compatible API serving.

**Q: When should I use LMDeploy instead of vLLM?**
A: Use LMDeploy when you need maximum throughput per GPU, especially with quantized models. LMDeploy's TurboMind engine consistently benchmarks 1.5-1.8x faster than vLLM. vLLM has broader community adoption; LMDeploy has better raw performance.

**Q: Is LMDeploy free?**
A: Yes, open-source under Apache-2.0.

---

## Source & Thanks

> Created by [InternLM](https://github.com/InternLM) (Shanghai AI Laboratory). Licensed under Apache-2.0.
>
> [lmdeploy](https://github.com/InternLM/lmdeploy) — ⭐ 7,700+

Thanks to the InternLM team for pushing the boundaries of LLM serving performance.
---

Source: https://tokrepo.com/en/workflows/3ed4f784-8fb5-4936-9b98-1a34a94567f2
Author: TokRepo Curated