
LMCache — Supercharge LLM Inference with KV Cache Sharing

LMCache is an open-source KV cache management layer that accelerates LLM inference by sharing and reusing key-value caches across requests, reducing time-to-first-token and GPU memory usage.

Introduction

LMCache is an open-source library that adds a high-performance KV cache sharing and reuse layer to LLM serving engines. By caching and retrieving computed key-value tensors across requests that share common prefixes (system prompts, few-shot examples, document contexts), LMCache significantly reduces time-to-first-token and GPU memory consumption without sacrificing output quality.

What LMCache Does

  • Caches KV tensors from LLM attention layers and reuses them across requests with shared prefixes
  • Reduces time-to-first-token by skipping redundant prefill computation for cached prefixes
  • Stores KV caches in GPU memory, CPU memory, or remote storage with tiered eviction
  • Integrates as a plugin with vLLM and SGLang serving backends (a minimal vLLM launch sketch follows this list)
  • Supports multi-instance cache sharing across distributed serving replicas
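
To make the integration concrete, here is a minimal offline-inference sketch for vLLM. It assumes vLLM's KV-connector interface and the connector name LMCacheConnectorV1 used in LMCache's examples; the model name and config path are placeholders, and exact class and option names vary across vLLM and LMCache versions, so treat this as a sketch under those assumptions rather than a drop-in recipe.

```python
import os

from vllm import LLM, SamplingParams
from vllm.config import KVTransferConfig

# Point LMCache at its YAML config file (the path here is an example).
os.environ["LMCACHE_CONFIG_FILE"] = "lmcache_config.yaml"

# Route vLLM's KV caches through LMCache's connector. "kv_both" means this
# instance both stores KV tensors into the cache and loads them from it.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    kv_transfer_config=KVTransferConfig(
        kv_connector="LMCacheConnectorV1",
        kv_role="kv_both",
    ),
)

# Two prompts sharing a long prefix: the second request should hit the cache
# and skip prefill for the shared portion.
shared_prefix = "You are a meticulous legal analyst. " * 200
params = SamplingParams(max_tokens=128)
for question in ["Summarize clause 4.", "List the parties involved."]:
    out = llm.generate([shared_prefix + question], params)
    print(out[0].outputs[0].text)
```

For online serving, the same connector settings can be supplied when launching the vllm serve process instead of through the Python API.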

Architecture Overview

LMCache intercepts the prefill stage of LLM inference and checks whether KV tensors for the input prefix already exist in the cache hierarchy. The cache is organized in token-aligned chunks with content-based hashing for prefix matching. A tiered storage system keeps hot caches on GPU, warm caches in CPU DRAM, and cold caches on remote storage (Redis, S3). When a cache hit occurs, the serving engine skips prefill for the matched prefix and begins generation from the cached state.
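
The lookup flow can be pictured with a toy model. The sketch below is illustrative pseudologic under stated assumptions (a fixed 256-token chunk size, SHA-256 prefix hashing, dict-like tiers), not LMCache's actual internals:

```python
import hashlib

CHUNK_SIZE = 256  # tokens per chunk (assumed value; LMCache's default may differ)

def chunk_hashes(token_ids):
    """Hash each full chunk together with the hash of everything before it,
    so a chunk key identifies the entire prefix up to that point."""
    keys, running = [], b""
    usable = len(token_ids) - len(token_ids) % CHUNK_SIZE  # token-aligned chunks only
    for i in range(0, usable, CHUNK_SIZE):
        chunk = token_ids[i:i + CHUNK_SIZE]
        running = hashlib.sha256(running + str(chunk).encode()).digest()
        keys.append(running.hex())
    return keys

def matched_prefix_tokens(keys, tiers):
    """Count consecutive leading chunks present in any tier (GPU, CPU, remote,
    checked hottest-first). The first miss ends the match; prefill resumes
    from that token onward."""
    hits = 0
    for key in keys:
        if not any(key in tier for tier in tiers):
            break
        hits += 1
    return hits * CHUNK_SIZE  # prompt tokens whose KV tensors can be reused
```

Because each key commits to the whole prefix before it, matching the first k chunk keys guarantees the cached KV tensors were produced from an identical prefix, so reusing them is safe; the remaining tail is prefilled as usual. The same mechanism yields partial prefix hits: a request that diverges mid-prompt still reuses every chunk before the divergence point.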

Self-Hosting & Configuration

  • Install via pip (pip install lmcache) alongside your LLM serving engine
  • Create a YAML config file specifying cache storage backends and eviction policies (see the sketch after this list)
  • Set chunk size and hash granularity based on your typical prefix lengths
  • Enable remote caching with Redis for multi-instance deployments
  • Monitor cache hit rates via the built-in metrics endpoint
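
For illustration, the snippet below writes a minimal config. The keys shown (chunk_size, local_cpu, max_local_cpu_size, remote_url) follow the pattern in LMCache's documentation, but the exact schema is version-dependent, so verify against the docs for your release.

```python
# Sketch: generate a minimal LMCache YAML config from Python.
# Key names are assumptions based on LMCache's documented config format.
config_yaml = """\
chunk_size: 256                        # tokens per cache chunk
local_cpu: true                        # keep a warm tier in CPU DRAM
max_local_cpu_size: 5                  # CPU tier budget, in GB
remote_url: "redis://cache-host:6379"  # shared tier for multi-instance setups
"""

with open("lmcache_config.yaml", "w") as f:
    f.write(config_yaml)

# LMCache picks the file up via an environment variable at engine startup:
#   export LMCACHE_CONFIG_FILE=lmcache_config.yaml
```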

Key Features

  • Prefix-aware KV caching eliminates redundant prefill computation
  • Tiered storage (GPU, CPU, remote) with configurable eviction policies (a toy eviction sketch follows this list)
  • Token-level chunking enables partial prefix cache hits
  • Multi-instance cache sharing across distributed serving replicas via remote storage
  • Compatible with vLLM and SGLang without modifying model code
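
The tiered-eviction idea can be shown with a toy two-method cache. This is a simplified model, not LMCache's implementation: real tiers budget by bytes rather than entry counts, and the eviction policy is configurable.

```python
from collections import OrderedDict

class Tier:
    """Capacity-bounded LRU tier. Evicted entries are demoted to a colder
    fallback tier instead of being dropped."""

    def __init__(self, capacity, fallback=None):
        self.capacity = capacity
        self.fallback = fallback
        self.entries = OrderedDict()  # key -> KV blob, in LRU order

    def put(self, key, kv):
        self.entries[key] = kv
        self.entries.move_to_end(key)
        while len(self.entries) > self.capacity:
            old_key, old_kv = self.entries.popitem(last=False)  # least recent
            if self.fallback is not None:
                self.fallback.put(old_key, old_kv)  # demote, don't drop

    def get(self, key):
        if key in self.entries:
            self.entries.move_to_end(key)  # refresh recency on hit
            return self.entries[key]
        return self.fallback.get(key) if self.fallback else None

# Hot GPU tier spills to CPU DRAM, which spills to remote storage.
# Capacities are illustrative entry counts, not real sizes.
remote = Tier(capacity=10_000)
cpu = Tier(capacity=1_000, fallback=remote)
gpu = Tier(capacity=100, fallback=cpu)
```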

Comparison with Similar Tools

  • vLLM prefix caching — Built-in but single-instance only; LMCache adds cross-instance and tiered storage
  • SGLang RadixAttention — Radix-tree-based caching; LMCache provides a pluggable layer with remote storage
  • PagedAttention — Manages KV memory within a single request; LMCache shares across requests
  • Mooncake — Disaggregated serving with KV transfer; LMCache focuses on caching and reuse
  • Prompt caching (API-level) — Provider-side feature; LMCache gives you self-hosted control over caching

FAQ

Q: How much does LMCache reduce time-to-first-token? A: For requests with shared prefixes (system prompts, document context), LMCache can reduce TTFT by 50-90% by skipping prefill for cached portions. As a rough linear model: if 8,000 tokens of a 10,000-token prompt hit the cache, about 80% of the prefill work is skipped, and TTFT shrinks roughly in proportion.

Q: Does LMCache change the model outputs? A: No. Reusing cached KV tensors is mathematically equivalent to recomputing the prefill, so outputs match those of an uncached run (up to the usual floating-point nondeterminism of the serving engine).

Q: Which serving engines are supported? A: LMCache currently integrates with vLLM and SGLang as serving backends, with a plugin API for adding others.

Q: Can multiple serving instances share a cache? A: Yes, by configuring a remote storage backend (Redis or S3), multiple vLLM or SGLang instances can share cached KV tensors.
