
LMCache — Supercharge LLM Inference with KV Cache Sharing

LMCache is an open-source KV cache management layer that accelerates LLM inference by sharing and reusing key-value caches across requests, reducing time-to-first-token and GPU memory usage.

Introduction

LMCache is an open-source library that adds a high-performance KV cache sharing and reuse layer to LLM serving engines. By caching and retrieving computed key-value tensors across requests that share common prefixes (system prompts, few-shot examples, document contexts), LMCache significantly reduces time-to-first-token and GPU memory consumption without sacrificing output quality.

What LMCache Does

  • Caches KV tensors from LLM attention layers and reuses them across requests with shared prefixes
  • Reduces time-to-first-token by skipping redundant prefill computation for cached prefixes
  • Stores KV caches in GPU memory, CPU memory, or remote storage with tiered eviction
  • Integrates as a plugin with vLLM and SGLang serving backends (a minimal vLLM launch sketch follows this list)
  • Supports multi-instance cache sharing across distributed serving replicas
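
To make the integration concrete, here is a minimal offline-inference sketch for vLLM. It assumes vLLM's KV-connector interface and the connector name LMCacheConnectorV1 used in LMCache's examples; the model name and config path are placeholders, and exact class and option names vary across vLLM and LMCache versions, so treat this as a sketch under those assumptions rather than a drop-in recipe.

```python
import os

from vllm import LLM, SamplingParams
from vllm.config import KVTransferConfig

# Point LMCache at its YAML config file (the path here is an example).
os.environ["LMCACHE_CONFIG_FILE"] = "lmcache_config.yaml"

# Route vLLM's KV caches through LMCache's connector. "kv_both" means this
# instance both stores KV tensors into the cache and loads them from it.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    kv_transfer_config=KVTransferConfig(
        kv_connector="LMCacheConnectorV1",
        kv_role="kv_both",
    ),
)

# Two prompts sharing a long prefix: the second request should hit the cache
# and skip prefill for the shared portion.
shared_prefix = "You are a meticulous legal analyst. " * 200
params = SamplingParams(max_tokens=128)
for question in ["Summarize clause 4.", "List the parties involved."]:
    out = llm.generate([shared_prefix + question], params)
    print(out[0].outputs[0].text)
```

For online serving, the same connector settings can be supplied when launching the vllm serve process instead of through the Python API.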

Architecture Overview

LMCache intercepts the prefill stage of LLM inference and checks whether KV tensors for the input prefix already exist in the cache hierarchy. The cache is organized in token-aligned chunks with content-based hashing for prefix matching. A tiered storage system keeps hot caches on GPU, warm caches in CPU DRAM, and cold caches on remote storage (Redis, S3). When a cache hit occurs, the serving engine skips prefill for the matched prefix and begins generation from the cached state.
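
The lookup flow can be pictured with a toy model. The sketch below is illustrative pseudologic under stated assumptions (a fixed 256-token chunk size, SHA-256 prefix hashing, dict-like tiers), not LMCache's actual internals:

```python
import hashlib

CHUNK_SIZE = 256  # tokens per chunk (assumed value; LMCache's default may differ)

def chunk_hashes(token_ids):
    """Hash each full chunk together with the hash of everything before it,
    so a chunk key identifies the entire prefix up to that point."""
    keys, running = [], b""
    usable = len(token_ids) - len(token_ids) % CHUNK_SIZE  # token-aligned chunks only
    for i in range(0, usable, CHUNK_SIZE):
        chunk = token_ids[i:i + CHUNK_SIZE]
        running = hashlib.sha256(running + str(chunk).encode()).digest()
        keys.append(running.hex())
    return keys

def matched_prefix_tokens(keys, tiers):
    """Count consecutive leading chunks present in any tier (GPU, CPU, remote,
    checked hottest-first). The first miss ends the match; prefill resumes
    from that token onward."""
    hits = 0
    for key in keys:
        if not any(key in tier for tier in tiers):
            break
        hits += 1
    return hits * CHUNK_SIZE  # prompt tokens whose KV tensors can be reused
```

Because each key commits to the whole prefix before it, matching the first k chunk keys guarantees the cached KV tensors were produced from an identical prefix, so reusing them is safe; the remaining tail is prefilled as usual. The same mechanism yields partial prefix hits: a request that diverges mid-prompt still reuses every chunk before the divergence point.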

Self-Hosting & Configuration

  • Install via pip (pip install lmcache) alongside your LLM serving engine
  • Create a YAML config file specifying cache storage backends and eviction policies (see the sketch after this list)
  • Set chunk size and hash granularity based on your typical prefix lengths
  • Enable remote caching with Redis for multi-instance deployments
  • Monitor cache hit rates via the built-in metrics endpoint
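
For illustration, the snippet below writes a minimal config. The keys shown (chunk_size, local_cpu, max_local_cpu_size, remote_url) follow the pattern in LMCache's documentation, but the exact schema is version-dependent, so verify against the docs for your release.

```python
# Sketch: generate a minimal LMCache YAML config from Python.
# Key names are assumptions based on LMCache's documented config format.
config_yaml = """\
chunk_size: 256                        # tokens per cache chunk
local_cpu: true                        # keep a warm tier in CPU DRAM
max_local_cpu_size: 5                  # CPU tier budget, in GB
remote_url: "redis://cache-host:6379"  # shared tier for multi-instance setups
"""

with open("lmcache_config.yaml", "w") as f:
    f.write(config_yaml)

# LMCache picks the file up via an environment variable at engine startup:
#   export LMCACHE_CONFIG_FILE=lmcache_config.yaml
```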

Key Features

  • Prefix-aware KV caching eliminates redundant prefill computation
  • Tiered storage (GPU, CPU, remote) with configurable eviction policies (a toy eviction sketch follows this list)
  • Token-level chunking enables partial prefix cache hits
  • Multi-instance cache sharing across distributed serving replicas via remote storage
  • Compatible with vLLM and SGLang without modifying model code
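
The tiered-eviction idea can be shown with a toy two-method cache. This is a simplified model, not LMCache's implementation: real tiers budget by bytes rather than entry counts, and the eviction policy is configurable.

```python
from collections import OrderedDict

class Tier:
    """Capacity-bounded LRU tier. Evicted entries are demoted to a colder
    fallback tier instead of being dropped."""

    def __init__(self, capacity, fallback=None):
        self.capacity = capacity
        self.fallback = fallback
        self.entries = OrderedDict()  # key -> KV blob, in LRU order

    def put(self, key, kv):
        self.entries[key] = kv
        self.entries.move_to_end(key)
        while len(self.entries) > self.capacity:
            old_key, old_kv = self.entries.popitem(last=False)  # least recent
            if self.fallback is not None:
                self.fallback.put(old_key, old_kv)  # demote, don't drop

    def get(self, key):
        if key in self.entries:
            self.entries.move_to_end(key)  # refresh recency on hit
            return self.entries[key]
        return self.fallback.get(key) if self.fallback else None

# Hot GPU tier spills to CPU DRAM, which spills to remote storage.
# Capacities are illustrative entry counts, not real sizes.
remote = Tier(capacity=10_000)
cpu = Tier(capacity=1_000, fallback=remote)
gpu = Tier(capacity=100, fallback=cpu)
```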

Comparison with Similar Tools

  • vLLM prefix caching — Built-in but single-instance only; LMCache adds cross-instance and tiered storage
  • SGLang RadixAttention — Radix-tree-based caching; LMCache provides a pluggable layer with remote storage
  • PagedAttention — Manages KV memory within a single request; LMCache shares across requests
  • Mooncake — Disaggregated serving with KV transfer; LMCache focuses on caching and reuse
  • Prompt caching (API-level) — Provider-side feature; LMCache gives you self-hosted control over caching

FAQ

Q: How much does LMCache reduce time-to-first-token? A: For requests with shared prefixes (system prompts, document context), LMCache can reduce TTFT by 50-90% by skipping prefill for cached portions. As a rough linear model: if 8,000 tokens of a 10,000-token prompt hit the cache, about 80% of the prefill work is skipped, and TTFT shrinks roughly in proportion.

Q: Does LMCache change the model outputs? A: No. Reusing cached KV tensors is mathematically equivalent to recomputing the prefill, so outputs match those of an uncached run (up to the usual floating-point nondeterminism of the serving engine).

Q: Which serving engines are supported? A: LMCache currently integrates with vLLM and SGLang as serving backends, with a plugin API for adding others.

Q: Can multiple serving instances share a cache? A: Yes, by configuring a remote storage backend (Redis or S3), multiple vLLM or SGLang instances can share cached KV tensors.
