Knowledge · May 19, 2026 · 2 min read

LLM Prompt Caching — Cache-Key Design Runbook

LLM prompt caching techniques for agents and apps. Covers stable prefixes, cache keys, TTLs, metrics, and cached-output validation.

Agent ready

Ready-to-run agent install

This asset can be installed after the agent chooses its runtime, checks the plan, and runs the matching command.

Native · 98/100 · Policy: allow
Agent surface: Any MCP/CLI agent
Kind: Knowledge
Install: Single
Trust: Verified publisher
Entrypoint: README.md
Direct install command: npx -y tokrepo@latest install bf4d41a0-4a12-4f95-83e3-417a6ddae333 --target codex

Run after dry-run confirms the install plan.

What To Cache

Good prompt-cache candidates:

  • Long system prompts that change only when the app ships.
  • Policy packs, output schemas, rubrics, and static examples.
  • Tool descriptions when the tool catalog is stable.
  • Retrieval snippets that are shared by many users and have a clear version.

Bad prompt-cache candidates:

  • User-specific files, emails, tickets, or private context.
  • Prompts containing API tokens or session cookies.
  • Live market, legal, medical, or breaking-news facts.
  • Anything where a stale answer can cause a destructive action.
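
The split between the two lists can be sketched as a prompt assembler that keeps reusable blocks in a fixed order at the front and per-request content at the end. This is a minimal illustration, not part of the runbook itself: PromptParts, build_prompt, and the sample strings are hypothetical names chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptParts:
    stable_prefix: str    # reusable across requests; safe to cache
    volatile_suffix: str  # per-request content; never part of the reusable prefix


def build_prompt(system_prompt: str, policy_pack: str, tool_catalog: str,
                 user_message: str) -> PromptParts:
    """Assemble a prompt with stable blocks first and user input last.

    The block order is fixed so that two requests sharing the same system
    prompt, policy pack, and tool catalog produce a byte-identical prefix.
    """
    stable = "\n\n".join([system_prompt, policy_pack, tool_catalog])
    return PromptParts(stable_prefix=stable, volatile_suffix=user_message)


# Two different user requests share one prefix (hypothetical sample data).
a = build_prompt("SYSTEM v3", "POLICY v7", "TOOLS v2", "Summarize ticket 1")
b = build_prompt("SYSTEM v3", "POLICY v7", "TOOLS v2", "Translate this email")
assert a.stable_prefix == b.stable_prefix
assert a.volatile_suffix != b.volatile_suffix
```

Keeping user input out of the prefix matters because prefix-based caches can only reuse the leading part of a prompt that is identical across requests.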

The most useful LLM prompt caching techniques are boring: stable-prefix extraction, schema-versioned keys, TTL or deploy-version invalidation, and cached-vs-uncached evaluation. Avoid clever semantic cache reuse until those basics are measured.
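
A schema-versioned key for an application-side cache might look like the sketch below. It assumes the key is derived from the model name, a prompt schema version, a deploy version, and a hash of the stable prefix; the function name prefix_cache_key and all field names are illustrative, not taken from any provider API.

```python
import hashlib
import json


def prefix_cache_key(model: str, schema_version: str, deploy_version: str,
                     stable_prefix: str) -> str:
    """Derive a deterministic key for a reusable prompt prefix.

    Volatile user input is deliberately not an argument, so it cannot
    leak into the key.
    """
    payload = json.dumps(
        {
            "model": model,
            "schema_version": schema_version,
            "deploy_version": deploy_version,
            "prefix_sha256": hashlib.sha256(stable_prefix.encode("utf-8")).hexdigest(),
        },
        sort_keys=True,          # stable field order gives a stable hash
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Hypothetical values: bumping the schema version yields a new key.
k1 = prefix_cache_key("example-model", "schema-1", "deploy-42", "SYSTEM v3")
k2 = prefix_cache_key("example-model", "schema-2", "deploy-42", "SYSTEM v3")
assert k1 != k2
```

Hashing the prefix rather than embedding it keeps keys short and avoids writing prompt text into logs.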

Validation Checklist

Before enabling prompt caching, verify these gates:

  1. The cache key includes model name and prompt schema version.
  2. Volatile user input is excluded from the reusable prefix key.
  3. Cache entries have a TTL or deploy-version invalidation rule.
  4. Evaluation compares cached and uncached outputs on at least 20 real tasks.
  5. Logs report hit rate, saved input tokens, first-token latency, and stale-cache rejects.
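
Gates 3 and 5 can be sketched as a small in-memory cache that rejects entries past their TTL or from another deploy version, and counts what happened. Everything here is an assumption made for illustration: the PrefixCache and CacheMetrics classes, the injected clock, and the token counts. First-token latency is left out because measuring it requires a real model call.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple


@dataclass
class CacheMetrics:
    hits: int = 0
    misses: int = 0
    stale_rejects: int = 0
    saved_input_tokens: int = 0

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


@dataclass
class PrefixCache:
    ttl_seconds: float
    deploy_version: str
    clock: Callable[[], float] = time.monotonic  # injectable for tests
    metrics: CacheMetrics = field(default_factory=CacheMetrics)
    # key -> (stored_at, deploy_version at store time, prefix token count)
    _entries: Dict[str, Tuple[float, str, int]] = field(
        default_factory=dict, init=False, repr=False)

    def put(self, key: str, prefix_tokens: int) -> None:
        self._entries[key] = (self.clock(), self.deploy_version, prefix_tokens)

    def lookup(self, key: str) -> bool:
        """Return True on a fresh hit; record a stale reject otherwise."""
        entry = self._entries.get(key)
        if entry is None:
            self.metrics.misses += 1
            return False
        stored_at, version, prefix_tokens = entry
        expired = self.clock() - stored_at > self.ttl_seconds
        if expired or version != self.deploy_version:
            del self._entries[key]
            self.metrics.stale_rejects += 1
            self.metrics.misses += 1  # a stale reject also counts as a miss
            return False
        self.metrics.hits += 1
        self.metrics.saved_input_tokens += prefix_tokens
        return True


# Fake clock so the example does not need to sleep.
now = [0.0]
cache = PrefixCache(ttl_seconds=300, deploy_version="deploy-42", clock=lambda: now[0])
cache.put("k1", prefix_tokens=1200)
assert cache.lookup("k1") is True    # fresh entry
now[0] = 301.0
assert cache.lookup("k1") is False   # past TTL, rejected as stale
assert cache.metrics.hit_rate == 0.5
```

Counting stale rejects separately from plain misses makes it visible when the TTL or a deploy is evicting entries, as opposed to keys simply never repeating.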

Common Failure Modes

  • Over-broad key: two different policy versions share a cache entry.
  • Over-narrow key: volatile input leaks into the key, so every user message creates a unique key and hit rate stays near zero.
  • Hidden volatility: the system prompt embeds today's date or account-specific state.
  • Silent stale behavior: the cache works technically, but no metric shows wrong reuse.
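
Hidden volatility can be caught before a prefix is ever cached by scanning it for content that tends to change per day or per account. The patterns below are a rough, hypothetical starting set (ISO dates, bearer tokens, email addresses), not an exhaustive detector, and find_volatile is an illustrative name.

```python
import re
from typing import List

# Assumed heuristics; extend for your own prompts.
VOLATILE_PATTERNS = {
    "iso_date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "bearer_token": re.compile(r"\bBearer\s+[A-Za-z0-9._\-]{16,}"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}


def find_volatile(prefix: str) -> List[str]:
    """Return the sorted names of volatile patterns found in a prefix."""
    return sorted(
        name for name, pattern in VOLATILE_PATTERNS.items()
        if pattern.search(prefix)
    )


clean = "You are a support agent. Follow policy pack v7."
leaky = "Today is 2026-05-19. Escalate to ops@example.com when unsure."
assert find_volatile(clean) == []
assert find_volatile(leaky) == ["email", "iso_date"]
```

A check like this could run in CI against the assembled prefix, failing the build when a date or account identifier slips into text that is meant to be stable.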

Source & Thanks

This is an original TokRepo runbook by William Wang. It aligns with the general prompt caching concepts described in the Anthropic prompt caching docs and the OpenAI prompt caching guide.
