Knowledge · May 19, 2026 · 2 min read

LLM Prompt Caching — Cache-Key Design Runbook

LLM prompt caching techniques for agents and apps. Covers stable prefixes, cache keys, TTLs, metrics, and cached-output validation.

Native · 98/100 · Policy: allow
Agent surface: Any MCP/CLI agent
Kind: Knowledge
Install: Single
Trust: Verified publisher
Entrypoint: README.md
Universal CLI install command: npx tokrepo install bf4d41a0-4a12-4f95-83e3-417a6ddae333

What To Cache

Good prompt-cache candidates:

  • Long system prompts that change only when the app ships.
  • Policy packs, output schemas, rubrics, and static examples.
  • Tool descriptions when the tool catalog is stable.
  • Retrieval snippets that are shared by many users and have a clear version.

Bad prompt-cache candidates:

  • User-specific files, emails, tickets, or private context.
  • Prompts containing API tokens or session cookies.
  • Live market, legal, medical, or breaking-news facts.
  • Anything where a stale answer can cause a destructive action.

The most useful LLM prompt caching techniques are boring: stable-prefix extraction, schema-versioned keys, TTL or deploy-version invalidation, and cached-vs-uncached evaluation. Avoid clever semantic cache reuse until those basics are measured.
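
As a minimal sketch of stable-prefix extraction and schema-versioned keys, assuming an application-side cache or tracking layer: the function names, the model name, and the version strings below are hypothetical and not part of any provider's API. The key is derived only from the model, the schema version, and the stable prefix, so the volatile user message never influences it.

```python
import hashlib
import json


def split_prompt(system_prompt: str, user_message: str) -> tuple[str, str]:
    """Keep the reusable prefix and the volatile suffix apart.

    Hypothetical helper: real apps may also place tool descriptions,
    rubrics, or schemas in the stable prefix.
    """
    return system_prompt, user_message


def build_prefix_cache_key(model: str, schema_version: str, stable_prefix: str) -> str:
    """Return a deterministic key for the reusable part of a prompt."""
    payload = json.dumps(
        {"model": model, "schema_version": schema_version, "prefix": stable_prefix},
        sort_keys=True,
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


prefix, suffix = split_prompt(
    "You are a support agent. Follow policy pack v3.",
    "Where is my order?",
)

# Same model, version, and prefix: the key is reusable across users.
key_a = build_prefix_cache_key("example-model", "schema-7", prefix)
key_b = build_prefix_cache_key("example-model", "schema-7", prefix)

# Bumping the schema version yields a new key, which retires old entries.
key_c = build_prefix_cache_key("example-model", "schema-8", prefix)
```

Hashing a JSON object rather than a concatenated string avoids accidental collisions when one field's text happens to end where another begins.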

Validation Checklist

Before enabling prompt caching, verify these gates:

  1. The cache key includes model name and prompt schema version.
  2. Volatile user input is excluded from the reusable prefix key.
  3. Cache entries have a TTL or deploy-version invalidation rule.
  4. Evaluation compares cached and uncached outputs on at least 20 real tasks.
  5. Logs report hit rate, saved input tokens, first-token latency, and stale-cache rejects.
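
Gates 3 and 5 could be sketched as follows, assuming a small in-memory cache. The class names, the fields, and the injected clock are hypothetical, and first-token latency is left out because it depends on the model client in use. Entries are rejected when they outlive the TTL or were written under a different deploy version, and every rejection is counted.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CacheMetrics:
    hits: int = 0
    misses: int = 0
    stale_rejects: int = 0
    saved_input_tokens: int = 0

    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


@dataclass
class Entry:
    value: str
    stored_at: float
    deploy_version: str
    prefix_tokens: int


class PrefixCache:
    """Hypothetical in-memory cache with TTL and deploy-version invalidation."""

    def __init__(self, ttl_seconds: float, deploy_version: str,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl_seconds = ttl_seconds
        self.deploy_version = deploy_version
        self.clock = clock
        self.metrics = CacheMetrics()
        self._entries: dict[str, Entry] = {}

    def put(self, key: str, value: str, prefix_tokens: int) -> None:
        self._entries[key] = Entry(value, self.clock(), self.deploy_version, prefix_tokens)

    def get(self, key: str) -> Optional[str]:
        entry = self._entries.get(key)
        if entry is None:
            self.metrics.misses += 1
            return None
        expired = self.clock() - entry.stored_at > self.ttl_seconds
        wrong_version = entry.deploy_version != self.deploy_version
        if expired or wrong_version:
            # A stale reject is also a miss: the caller must rebuild the entry.
            del self._entries[key]
            self.metrics.stale_rejects += 1
            self.metrics.misses += 1
            return None
        self.metrics.hits += 1
        self.metrics.saved_input_tokens += entry.prefix_tokens
        return entry.value


# Usage with a fake clock so expiry can be shown without sleeping.
now = [0.0]
cache = PrefixCache(ttl_seconds=60, deploy_version="v1", clock=lambda: now[0])
cache.put("k", "cached-prefix-handle", prefix_tokens=1200)
first = cache.get("k")    # fresh entry
now[0] = 61.0
second = cache.get("k")   # past the TTL, so it is rejected
```

Counting stale rejects separately from plain misses is a design choice here: it lets a dashboard distinguish "never cached" from "cached but no longer trusted".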

Common Failure Modes

  • Over-broad key: two different policy versions share a cache entry.
  • Over-specific key: every user message creates a unique key, so hit rate stays near zero.
  • Hidden volatility: the system prompt embeds today's date or account-specific state.
  • Silent stale behavior: the cache hits as designed, but no metric reveals when a stale entry was wrongly reused.
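
Hidden volatility can often be caught before deployment with a simple lint over the candidate prefix. The sketch below is a heuristic under stated assumptions: the pattern set and the function name are hypothetical, the patterns are illustrative rather than exhaustive, and a clean result does not prove that a prefix is stable.

```python
import re

# Hypothetical, deliberately small pattern set for values that tend to
# change per request or per account.
VOLATILE_PATTERNS = {
    "iso_date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "clock_time": re.compile(r"\b\d{1,2}:\d{2}(?::\d{2})?\b"),
    "uuid": re.compile(
        r"\b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-"
        r"[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\b"
    ),
    "bearer_token": re.compile(r"\bBearer\s+[A-Za-z0-9._\-]+"),
}


def find_volatile_spans(prefix: str) -> list[tuple[str, str]]:
    """Return (pattern_name, matched_text) pairs found in a candidate prefix."""
    findings = []
    for name, pattern in VOLATILE_PATTERNS.items():
        for match in pattern.finditer(prefix):
            findings.append((name, match.group(0)))
    return findings


clean = "You are a support agent. Follow the refund policy."
leaky = "Today is 2026-05-19. You are a support agent."

clean_findings = find_volatile_spans(clean)
leaky_findings = find_volatile_spans(leaky)
```

A check like this could run in CI against every system prompt template, failing the build when a prefix intended for caching contains a match.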

Source & Thanks

This is an original TokRepo runbook by William Wang. It aligns with the general prompt caching concepts described in the Anthropic prompt caching docs and the OpenAI prompt caching guide.
