Skills · May 3, 2026 · 1 min read

LMCache — Supercharge LLM Inference with KV Cache Sharing

LMCache is an open-source KV cache management layer that accelerates LLM inference by sharing and reusing key-value caches across requests, reducing time-to-first-token and GPU memory usage.

Universal CLI install command:
npx tokrepo install bc541841-470d-11f1-9bc6-00163e2b0d79

Introduction

LMCache is an open-source library that adds a high-performance KV cache sharing and reuse layer to LLM serving engines. By caching and retrieving computed key-value tensors across requests that share common prefixes (system prompts, few-shot examples, document contexts), LMCache significantly reduces time-to-first-token and GPU memory consumption without sacrificing output quality.

What LMCache Does

  • Caches KV tensors from LLM attention layers and reuses them across requests with shared prefixes
  • Reduces time-to-first-token by skipping redundant prefill computation for cached prefixes
  • Stores KV caches in GPU memory, CPU memory, or remote storage with tiered eviction
  • Integrates as a plugin with vLLM and SGLang serving backends
  • Supports multi-instance cache sharing across distributed serving replicas

Architecture Overview

LMCache intercepts the prefill stage of LLM inference and checks whether KV tensors for the input prefix already exist in the cache hierarchy. The cache is organized in token-aligned chunks with content-based hashing for prefix matching. A tiered storage system keeps hot caches on GPU, warm caches in CPU DRAM, and cold caches on remote storage (Redis, S3). When a cache hit occurs, the serving engine skips prefill for the matched prefix and begins generation from the cached state.
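
The chunked, content-hashed prefix lookup described above can be illustrated with a small sketch. This is not LMCache's actual implementation: the chunk size, the SHA-256 chaining scheme, and the names chunk_keys and matched_prefix_tokens are assumptions made for illustration. Each chunk's key hashes the chunk's tokens together with the previous chunk's key, so a key only matches when the entire prefix up to that chunk is identical.

```python
import hashlib
from typing import Dict, List, Sequence

CHUNK_SIZE = 4  # hypothetical value, kept tiny so the example is easy to follow


def chunk_keys(tokens: Sequence[int], chunk_size: int = CHUNK_SIZE) -> List[str]:
    """Return one key per full chunk; each key covers the whole prefix up to that chunk."""
    keys: List[str] = []
    prev = ""
    full = len(tokens) - len(tokens) % chunk_size  # ignore the trailing partial chunk
    for start in range(0, full, chunk_size):
        chunk = tokens[start:start + chunk_size]
        payload = prev + "|" + ",".join(str(t) for t in chunk)
        prev = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        keys.append(prev)
    return keys


def matched_prefix_tokens(tokens: Sequence[int], cache: Dict[str, str],
                          chunk_size: int = CHUNK_SIZE) -> int:
    """Count how many leading tokens are covered by consecutive cached chunks."""
    matched = 0
    for key in chunk_keys(tokens, chunk_size):
        if key not in cache:
            break  # the first miss ends the reusable prefix
        matched += chunk_size
    return matched


# First request: store a placeholder for the KV tensors of each full chunk.
cache: Dict[str, str] = {}
first = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
for i, key in enumerate(chunk_keys(first)):
    cache[key] = f"kv-chunk-{i}"

# Second request shares the first 8 tokens, then diverges.
second = [1, 2, 3, 4, 5, 6, 7, 8, 42, 43, 44, 45]
hit = matched_prefix_tokens(second, cache)
print(hit)  # 8: two full chunks match, so prefill could start at token 8
```

Chaining the hashes is one way to make a chunk key depend on everything before it. Without it, identical chunks at different positions or after different prefixes would collide.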

Self-Hosting & Configuration

  • Install with pip, in the same environment as your LLM serving engine: pip install lmcache
  • Create a YAML config file specifying cache storage backends and eviction policies
  • Set chunk size and hash granularity based on your typical prefix lengths
  • Enable remote caching with Redis for multi-instance deployments
  • Monitor cache hit rates via the built-in metrics endpoint
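
To see why chunk size should track typical prefix lengths, here is a back-of-the-envelope sketch. It assumes that only whole chunks are reusable, which is a simplification made for this example rather than a documented LMCache guarantee. The helper names and the 1000-token prefix are hypothetical.

```python
from typing import Dict, Iterable


def reusable_tokens(prefix_len: int, chunk_size: int) -> int:
    """Tokens covered by whole chunks; the trailing partial chunk is assumed not reusable."""
    return (prefix_len // chunk_size) * chunk_size


def coverage_table(prefix_len: int, chunk_sizes: Iterable[int]) -> Dict[int, float]:
    """Fraction of a shared prefix that could be served from cache, per chunk size."""
    return {c: reusable_tokens(prefix_len, c) / prefix_len for c in chunk_sizes}


# A hypothetical 1000-token shared system prompt.
table = coverage_table(1000, [64, 256, 512])
for size, cov in table.items():
    print(f"chunk_size={size:4d} reusable fraction={cov:.1%}")
```

Under this assumption, smaller chunks waste fewer tokens at the end of a prefix but produce more cache entries to hash, store, and look up, so the right setting is a trade-off rather than "as small as possible".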

Key Features

  • Prefix-aware KV caching eliminates redundant prefill computation
  • Tiered storage (GPU, CPU, remote) with configurable eviction policies
  • Token-level chunking enables partial prefix cache hits
  • Multi-instance cache sharing across distributed serving replicas via remote storage
  • Compatible with vLLM and SGLang without modifying model code
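
The tiered storage and cross-replica sharing listed above can be sketched as a toy model. It is an illustration under stated assumptions, not LMCache's code: the TieredCache class is hypothetical, it collapses the GPU and CPU tiers into one bounded local LRU tier, it uses write-through to the remote tier, and a plain dict stands in for Redis or S3.

```python
from collections import OrderedDict
from typing import Dict, Optional


class TieredCache:
    """Toy cache: a bounded local LRU tier in front of a shared remote store."""

    def __init__(self, local_capacity: int, remote: Dict[str, bytes]) -> None:
        self.local: "OrderedDict[str, bytes]" = OrderedDict()
        self.local_capacity = local_capacity
        self.remote = remote  # stand-in for a remote backend shared by all replicas

    def _put_local(self, key: str, value: bytes) -> None:
        self.local[key] = value
        self.local.move_to_end(key)  # mark as most recently used
        while len(self.local) > self.local_capacity:
            self.local.popitem(last=False)  # evict the least recently used entry

    def put(self, key: str, value: bytes) -> None:
        self.remote[key] = value  # write through so other replicas can reuse it
        self._put_local(key, value)

    def get(self, key: str) -> Optional[bytes]:
        if key in self.local:  # hit in the fast local tier
            self.local.move_to_end(key)
            return self.local[key]
        value = self.remote.get(key)  # fall back to the shared remote tier
        if value is not None:
            self._put_local(key, value)  # promote for future local hits
        return value


shared_remote: Dict[str, bytes] = {}
replica_a = TieredCache(local_capacity=2, remote=shared_remote)
replica_b = TieredCache(local_capacity=2, remote=shared_remote)

replica_a.put("chunk-0", b"kv0")
replica_a.put("chunk-1", b"kv1")
replica_a.put("chunk-2", b"kv2")  # pushes chunk-0 out of replica A's local tier

from_b = replica_b.get("chunk-0")  # replica B never computed this chunk
print(from_b)  # b'kv0', served from the shared remote tier
```

A real deployment would also need serialization of KV tensors, eviction in the remote tier, and handling of network latency, all of which this sketch leaves out.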

Comparison with Similar Tools

  • vLLM prefix caching: Built in, but limited to a single instance; LMCache adds cross-instance sharing and tiered storage
  • SGLang RadixAttention: Radix-tree-based prefix caching inside the engine; LMCache provides a pluggable layer with remote storage
  • PagedAttention: Manages KV memory in fixed-size blocks within a single engine's GPU memory; LMCache stores and reuses KV caches across requests, storage tiers, and instances
  • Mooncake: Disaggregated serving with KV transfer; LMCache focuses on caching and reuse
  • Prompt caching (API-level): Provider-side feature; LMCache gives you self-hosted control over caching

FAQ

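Q: How much does LMCache reduce time-to-first-token? A: For requests with shared prefixes (system prompts, document context), LMCache can reduce TTFT by 50-90% by skipping prefill for cached portions. The actual saving depends on how much of the prompt is cached and on which storage tier serves the cached chunks.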
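
A rough way to reason about the saving is a linear cost model, sketched here. Every number in it is made up for illustration and is not a benchmark result. The model assumes prefill cost grows linearly with uncached tokens and that loading cached KV costs a smaller per-token amount; real prefill cost also grows with attention over the full context. The function name and parameters are hypothetical.

```python
def estimated_ttft(prompt_tokens: int, cached_tokens: int,
                   prefill_ms_per_token: float, load_ms_per_token: float,
                   fixed_ms: float = 0.0) -> float:
    """Linear TTFT estimate: compute the uncached tokens, load the cached ones."""
    if not 0 <= cached_tokens <= prompt_tokens:
        raise ValueError("cached_tokens must be between 0 and prompt_tokens")
    uncached = prompt_tokens - cached_tokens
    return fixed_ms + uncached * prefill_ms_per_token + cached_tokens * load_ms_per_token


# Hypothetical workload: 4000-token prompt, 3600 tokens shared with earlier requests.
baseline = estimated_ttft(4000, 0, prefill_ms_per_token=0.5, load_ms_per_token=0.05)
with_cache = estimated_ttft(4000, 3600, prefill_ms_per_token=0.5, load_ms_per_token=0.05)
reduction = 1 - with_cache / baseline
print(f"baseline={baseline:.0f} ms, cached={with_cache:.0f} ms, reduction={reduction:.0%}")
```

With these illustrative inputs the model gives a reduction of roughly 81%. Shrinking the cached share or raising the load cost (for example, when chunks come from remote storage) lowers the benefit accordingly.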

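Q: Does LMCache change the model outputs? A: Not by design. Reusing the KV cache for an exactly matching prefix is mathematically equivalent to recomputing the prefill. In practice, outputs can still differ in the last bits because GPU floating-point kernels are not always deterministic across batch shapes and runs.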

Q: Which serving engines are supported? A: LMCache currently integrates with vLLM and SGLang as serving backends, with a plugin API for adding others.

Q: Can multiple serving instances share a cache? A: Yes, by configuring a remote storage backend (Redis or S3), multiple vLLM or SGLang instances can share cached KV tensors.
