What is LMDeploy?

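LMDeploy is a toolkit from the InternLM team for deploying and serving LLMs. It reports up to 1.8x higher throughput than vLLM, supports 4-bit quantization, and exposes an OpenAI-compatible API. The project has 7.7K+ GitHub stars.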

How do I install LMDeploy?

Visit the asset page on TokRepo and click "Copy for agent" to get the installation instructions. Most assets can be installed with a single command.

LMDeploy — High-Performance LLM Deployment Toolkit

from openai import OpenAI

# Point the OpenAI SDK at a local LMDeploy server; the API key is a placeholder.
client = OpenAI(api_key="not-needed", base_url="http://localhost:23333/v1")

# Send one chat turn to the served model.
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "What is quantum computing?"}],
)
print(response.choices[0].message.content)

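Introduction

LMDeploy is a production-grade LLM deployment toolkit developed by the InternLM team at Shanghai AI Laboratory, with 7,700+ GitHub stars. Its TurboMind engine reports up to 1.8x higher throughput than vLLM and supports 4-bit quantization (AWQ, GPTQ), tensor parallelism, and an OpenAI-compatible API, so it can stand in as a drop-in backend for any OpenAI application. It also supports vision-language models (InternVL, LLaVA).

Works with: NVIDIA GPUs, Huawei Ascend, Llama 3, Qwen 2, DeepSeek, InternLM. Best suited to teams deploying LLMs in production who want the most performance for their GPU spend.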

Key features

TurboMind engine

Continuous batching, a blocked KV cache, and optimized CUDA kernels give up to 1.8x higher throughput than vLLM.
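To make the continuous batching idea concrete, here is a toy model that counts decode steps under two scheduling policies. It is only a sketch under simplifying assumptions (one token per step per slot, no prefill cost, requests served in arrival order) and does not reflect TurboMind's actual scheduler; the function names and the sample request lengths are made up for illustration.

```python
import heapq

def static_batch_steps(lengths, slots):
    """Static batching: each batch runs until its longest request finishes."""
    return sum(max(lengths[i:i + slots]) for i in range(0, len(lengths), slots))

def continuous_batch_steps(lengths, slots):
    """Continuous batching: a freed slot is refilled on the very next step."""
    free_at = [0] * slots          # step at which each slot becomes free
    heapq.heapify(free_at)
    finish = 0
    for n in lengths:
        start = heapq.heappop(free_at)   # earliest available slot
        end = start + n
        heapq.heappush(free_at, end)
        finish = max(finish, end)
    return finish

# Hypothetical output lengths (in tokens) for six requests, two batch slots.
lengths = [2, 8, 3, 8, 1, 4]
print("static:", static_batch_steps(lengths, slots=2))          # static: 20
print("continuous:", continuous_batch_steps(lengths, slots=2))  # continuous: 13
```

In this toy workload the short requests no longer wait for the long ones in their batch, which is where the step savings come from.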

4-bit quantization

AWQ/GPTQ quantization halves GPU memory requirements, with less than 2% quality loss.
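A quick back-of-the-envelope calculation helps put the memory saving in context. The sketch below covers weight storage only, assuming an 8B-parameter model and ignoring the per-group scales that 4-bit formats also store. The KV cache and activations are not shrunk by weight quantization, so the end-to-end saving is smaller than the weights-only ratio shown here.

```python
def weight_memory_gib(num_params: float, bits_per_weight: float) -> float:
    """Memory needed for the weights alone, in GiB."""
    return num_params * bits_per_weight / 8 / 2**30

params = 8e9  # assumed model size: 8 billion parameters
fp16 = weight_memory_gib(params, 16)
int4 = weight_memory_gib(params, 4)

print(f"FP16 weights:  {fp16:.1f} GiB")
print(f"4-bit weights: {int4:.1f} GiB")
print(f"weights-only ratio: {int4 / fp16:.2f}")
```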

OpenAI-compatible API

Plug and play: works with LangChain, LlamaIndex, and all OpenAI SDK clients.
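Because the server speaks the OpenAI wire format, any HTTP client can talk to it, not only the OpenAI SDK. As a rough sketch using only the standard library, the request below targets the standard `/chat/completions` path. The port and model name are taken from the SDK example on this page, and `build_chat_request` is a helper invented here for illustration.

```python
import json
import urllib.request

def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_chat_request(
    "http://localhost:23333/v1",
    "meta-llama/Meta-Llama-3-8B-Instruct",
    "What is quantum computing?",
)
print(req.get_method(), req.full_url)

# To actually send it (requires a running server):
# with urllib.request.urlopen(req) as resp:
#     reply = json.load(resp)
#     print(reply["choices"][0]["message"]["content"])
```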

Multi-GPU tensor parallelism

Distributes a model across multiple GPUs, making it straightforward to deploy models with 70B+ parameters.
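The core idea behind tensor parallelism can be shown with a tiny matrix multiply: split a weight matrix by columns, let each device compute its slice, then concatenate the slices. The pure-Python sketch below simulates the devices with lists and is only an illustration of the column-parallel scheme in general, not LMDeploy's implementation; all function names are hypothetical.

```python
def matmul(x, w):
    """Multiply a (rows x inner) matrix by an (inner x cols) matrix."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]

def split_columns(w, parts):
    """Split the columns of w into `parts` equal contiguous shards."""
    cols = len(w[0])
    if cols % parts != 0:
        raise ValueError("column count must be divisible by the number of shards")
    step = cols // parts
    return [[row[i * step:(i + 1) * step] for row in w] for i in range(parts)]

def column_parallel_matmul(x, w, parts):
    """Compute x @ w shard by shard, then concatenate the partial outputs."""
    partials = [matmul(x, shard) for shard in split_columns(w, parts)]  # one per simulated GPU
    return [sum((p[r] for p in partials), []) for r in range(len(x))]   # gather along columns

x = [[1, 2], [3, 4]]
w = [[1, 0, 2, 1], [0, 1, 1, 3]]
full = matmul(x, w)
sharded = column_parallel_matmul(x, w, parts=2)
print(full)
print(sharded == full)
```

Each simulated device holds only half of the weight columns, which is why the approach lets a model too large for one GPU fit across several.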

Vision-language models

Supports multimodal models such as InternVL 2, LLaVA, and CogVLM.
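For vision-language models, a request typically carries both text and an image. The sketch below builds a message in the OpenAI vision-style content format. It assumes the server is running a vision-capable model and accepts this format; the helper name and the image URL are placeholders for illustration.

```python
def build_image_message(question: str, image_url: str) -> list[dict]:
    """Build a single user turn that pairs a text question with an image URL."""
    return [{
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }]

# Placeholder URL; replace with a real, reachable image.
messages = build_image_message("Describe this image.", "https://example.com/cat.jpg")
print(messages[0]["content"][1]["image_url"]["url"])
```

The resulting list can be passed as the `messages` argument of an OpenAI SDK chat completion call pointed at the server.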
