Apr 2, 2026 · 3 min read

LMDeploy — High-Performance LLM Deployment Toolkit

Deploy and serve LLMs with up to 1.8x higher throughput than vLLM. 4-bit quantization, OpenAI-compatible API. By InternLM. 7.7K+ stars.

## Quick Use

*Use it first, then decide how deep to go.*

The commands below cover everything to copy, install, and run first.

```bash
pip install lmdeploy
```

```bash
# Serve a model with an OpenAI-compatible API
lmdeploy serve api_server meta-llama/Meta-Llama-3-8B-Instruct \
    --server-port 23333

# Now query it like OpenAI
curl http://localhost:23333/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3-8B-Instruct",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```

```python
from lmdeploy import pipeline

# Simple Python inference
pipe = pipeline("meta-llama/Meta-Llama-3-8B-Instruct")
response = pipe(["What is machine learning?"])
print(response[0].text)
```

For 4-bit quantized inference (roughly half the VRAM):

```bash
# Note: the model must be AWQ-quantized first (see the
# `lmdeploy lite auto_awq` example in the Quantization section)
lmdeploy serve api_server meta-llama/Meta-Llama-3-8B-Instruct \
    --model-format awq \
    --server-port 23333
```

---
## Intro

LMDeploy is a production-grade toolkit with 7,700+ GitHub stars for compressing, deploying, and serving large language models. Developed by the InternLM team at Shanghai AI Laboratory, its TurboMind engine delivers up to 1.8x higher throughput than vLLM through persistent batching, a blocked KV cache, and optimized CUDA kernels. It supports 4-bit quantization (AWQ, GPTQ), tensor parallelism across GPUs, and an OpenAI-compatible API, making it a drop-in replacement for any OpenAI-based application. It serves both LLMs (Llama 3, Qwen 2, DeepSeek V3) and VLMs (InternVL, LLaVA).

**Works with:** NVIDIA GPUs, Huawei Ascend, Llama 3, Qwen 2, DeepSeek, InternLM, Mistral, InternVL, LLaVA.

**Best for:** teams deploying LLMs in production who need maximum throughput per GPU dollar.

**Setup time:** under 5 minutes.

---
## LMDeploy Architecture & Performance

### TurboMind Engine

The high-performance inference backend:

| Feature | Benefit |
|---------|---------|
| **Persistent Batching** | Continuously processes requests without batch boundaries |
| **Blocked KV Cache** | Memory-efficient attention cache, no fragmentation |
| **Paged Attention** | Dynamic memory allocation like OS virtual memory |
| **Optimized CUDA Kernels** | Custom kernels for attention, GEMM, and rotary embeddings |
| **Tensor Parallelism** | Distribute the model across multiple GPUs |

### Performance Benchmarks

Tested on Llama 3 8B (A100 80GB):

| Metric | vLLM | LMDeploy | Improvement |
|--------|------|----------|-------------|
| **Throughput (tok/s)** | 3,200 | 5,800 | **1.8x** |
| **First-token latency** | 45 ms | 38 ms | **16% faster** |
| **Memory usage** | 24 GB | 22 GB | **8% less** |

### Quantization

Reduce VRAM usage with minimal quality loss:

```bash
# AWQ 4-bit quantization (recommended)
lmdeploy lite auto_awq meta-llama/Meta-Llama-3-8B-Instruct \
    --work-dir llama3-8b-awq4

# Serve the quantized model
lmdeploy serve api_server llama3-8b-awq4 \
    --model-format awq \
    --server-port 23333
```

| Quantization | VRAM | Quality (perplexity) |
|--------------|------|----------------------|
| FP16 | 16 GB | 5.12 (baseline) |
| AWQ W4A16 | 8 GB | 5.18 (+1.2%) |
| GPTQ W4A16 | 8 GB | 5.21 (+1.8%) |
| KV Cache INT8 | -30% KV | Negligible loss |

### OpenAI-Compatible API

Drop-in replacement for OpenAI applications:

```python
from openai import OpenAI

# Point to the LMDeploy server
client = OpenAI(
    api_key="not-needed",
    base_url="http://localhost:23333/v1",
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
    temperature=0.7,
    max_tokens=500,
)
print(response.choices[0].message.content)
```

Works with LangChain, LlamaIndex, and any OpenAI SDK client.
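The KV Cache INT8 row in the quantization table is easy to sanity-check with back-of-the-envelope math. The sketch below uses Llama 3 8B's published architecture (32 transformer layers, 8 grouped-query KV heads, head dimension 128); those figures come from the model's config, not from LMDeploy itself:

```python
# Back-of-the-envelope KV-cache size for Llama 3 8B.
# Architecture (from the published model config, an external assumption):
# 32 layers, 8 KV heads (grouped-query attention), head dimension 128.
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128

def kv_bytes_per_token(bytes_per_element: int) -> int:
    """Cache bytes per generated token: keys + values for every layer."""
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * bytes_per_element

fp16 = kv_bytes_per_token(2)  # 131072 bytes = 128 KiB per token
int8 = kv_bytes_per_token(1)  # 65536 bytes = 64 KiB per token

# Cache footprint of a single 8K-token sequence:
print(f"FP16 KV cache: {fp16 * 8192 / 2**30:.1f} GiB")  # 1.0 GiB
print(f"INT8 KV cache: {int8 * 8192 / 2**30:.1f} GiB")  # 0.5 GiB
```

Because the cache grows linearly with sequence length and batch size, halving its per-token footprint directly increases how many concurrent sequences fit on one GPU.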
### Multi-GPU Tensor Parallelism

```bash
# Serve a 70B model across 4 GPUs
lmdeploy serve api_server meta-llama/Meta-Llama-3-70B-Instruct \
    --tp 4 \
    --server-port 23333
```

### Vision-Language Model Support

```python
from lmdeploy import pipeline
from lmdeploy.vl import load_image

pipe = pipeline("OpenGVLab/InternVL2-8B")
image = load_image("photo.jpg")
response = pipe([{"role": "user", "content": [image, "Describe this image"]}])
print(response[0].text)
```

Supported VLMs: InternVL 2, LLaVA, CogVLM, MiniCPM-V.

### Supported Models

| Family | Models |
|--------|--------|
| **Llama** | Llama 3, Llama 3.1, Llama 3.2, CodeLlama |
| **Qwen** | Qwen 2, Qwen 2.5, Qwen-VL |
| **DeepSeek** | DeepSeek V3, DeepSeek R1, DeepSeek Coder |
| **InternLM** | InternLM 2, InternLM 2.5 |
| **Mistral** | Mistral 7B, Mixtral 8x7B, Mixtral 8x22B |
| **Others** | Gemma, Phi-3, Yi, Baichuan, ChatGLM |

---

## FAQ

**Q: What is LMDeploy?**

A: LMDeploy is a production-grade LLM deployment toolkit with 7,700+ GitHub stars, featuring the TurboMind engine for 1.8x higher throughput than vLLM, 4-bit quantization, tensor parallelism, and OpenAI-compatible API serving.

**Q: When should I use LMDeploy instead of vLLM?**

A: Use LMDeploy when you need maximum throughput per GPU, especially with quantized models. LMDeploy's TurboMind engine consistently benchmarks 1.5-1.8x faster than vLLM. vLLM has broader community adoption; LMDeploy has better raw performance.

**Q: Is LMDeploy free?**

A: Yes, open-source under Apache-2.0.

---
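The "throughput per GPU dollar" argument can be made concrete with simple arithmetic over the benchmark numbers reported above. The hourly A100 rate below is an illustrative assumption, not a quoted price:

```python
# Rough serving-cost arithmetic from the A100 benchmark table.
# ASSUMPTION: $2.00/hour for an A100 80GB is an illustrative cloud rate.
GPU_COST_PER_HOUR = 2.00

def cost_per_million_tokens(tokens_per_second: float) -> float:
    """Serving cost in dollars per 1M generated tokens at a given throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return GPU_COST_PER_HOUR / tokens_per_hour * 1_000_000

vllm_cost = cost_per_million_tokens(3200)      # ~$0.17 per 1M tokens
lmdeploy_cost = cost_per_million_tokens(5800)  # ~$0.10 per 1M tokens
print(f"vLLM:     ${vllm_cost:.3f} / 1M tokens")
print(f"LMDeploy: ${lmdeploy_cost:.3f} / 1M tokens")
print(f"Speedup:  {5800 / 3200:.2f}x")  # ~1.81x
```

Whatever hourly rate you actually pay, the ratio holds: at 1.8x the throughput, the per-token cost drops by the same factor on identical hardware.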
## 🙏 Source & Thanks

> Created by [InternLM](https://github.com/InternLM) (Shanghai AI Laboratory). Licensed under Apache-2.0.
>
> [lmdeploy](https://github.com/InternLM/lmdeploy) — ⭐ 7,700+

Thanks to the InternLM team for pushing the boundaries of LLM serving performance.
