What To Cache
Good prompt-cache candidates:
- Long system prompts that change only when the app ships.
- Policy packs, output schemas, rubrics, and static examples.
- Tool descriptions when the tool catalog is stable.
- Retrieval snippets that are shared by many users and have a clear version.
Bad prompt-cache candidates:
- User-specific files, emails, tickets, or private context.
- Prompts containing API tokens or session cookies.
- Live market, legal, medical, or breaking-news facts.
- Anything where a stale answer can cause a destructive action.
The most useful LLM prompt caching techniques are boring: stable-prefix extraction, schema-versioned keys, TTL or deploy-version invalidation, and cached-vs-uncached evaluation. Avoid clever semantic cache reuse until those basics are measured.
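The first three techniques can be sketched together in a few lines. This is a minimal illustration, not a reference implementation: the names `PrefixKey`, `make_prefix_key`, and `TTLPrefixCache` are hypothetical, the cache is an in-process dictionary, and the cached value stands in for whatever handle or result a real system would reuse.

```python
import hashlib
import time
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple


@dataclass(frozen=True)
class PrefixKey:
    """Hypothetical cache key: everything that must match for safe reuse."""
    model: str
    schema_version: str
    deploy_version: str
    prefix_sha256: str


def make_prefix_key(model: str, schema_version: str,
                    deploy_version: str, stable_prefix: str) -> PrefixKey:
    # Only the stable prefix is hashed; the volatile user message never
    # enters the key, so many requests can share one entry.
    digest = hashlib.sha256(stable_prefix.encode("utf-8")).hexdigest()
    return PrefixKey(model, schema_version, deploy_version, digest)


class TTLPrefixCache:
    """In-memory cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested
        self._entries: Dict[PrefixKey, Tuple[float, str]] = {}

    def put(self, key: PrefixKey, value: str) -> None:
        self._entries[key] = (self._clock() + self._ttl, value)

    def get(self, key: PrefixKey) -> Optional[str]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            # Expired: drop the entry and report a miss.
            del self._entries[key]
            return None
        return value
```

Because the deploy version is part of the key, shipping a new build makes old entries unreachable without an explicit purge, and the TTL bounds how long any entry can survive in between.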
Validation Checklist
Before enabling prompt caching, verify these gates:
- The cache key includes model name and prompt schema version.
- Volatile user input is excluded from the reusable prefix key.
- Cache entries have a TTL or deploy-version invalidation rule.
- Evaluation compares cached and uncached outputs on at least 20 real tasks.
- Logs report hit rate, saved input tokens, first-token latency, and stale-cache rejects.
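The evaluation and logging gates could be wired up roughly as follows. This is a sketch under stated assumptions: `CacheStats` and `evaluate_parity` are hypothetical names, outputs are compared by exact match after trimming whitespace (a task-specific scorer would be needed for non-deterministic models), a stale reject is counted as a miss, and first-token latency is left out for brevity.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple, Union


@dataclass
class CacheStats:
    """Hypothetical counters behind the logged cache metrics."""
    hits: int = 0
    misses: int = 0
    saved_input_tokens: int = 0
    stale_rejects: int = 0

    def record_hit(self, prefix_tokens: int) -> None:
        self.hits += 1
        self.saved_input_tokens += prefix_tokens

    def record_miss(self) -> None:
        self.misses += 1

    def record_stale_reject(self) -> None:
        # An entry was found but refused (for example, wrong version).
        # Assumption: this also counts as a miss for hit-rate purposes.
        self.stale_rejects += 1
        self.misses += 1

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


def evaluate_parity(
    pairs: Sequence[Tuple[str, str]], min_tasks: int = 20
) -> Dict[str, Union[int, float, List[int]]]:
    """Compare (cached_output, uncached_output) pairs from real tasks."""
    if len(pairs) < min_tasks:
        raise ValueError(f"need at least {min_tasks} tasks, got {len(pairs)}")
    mismatches = [
        i for i, (cached, uncached) in enumerate(pairs)
        if cached.strip() != uncached.strip()
    ]
    return {
        "tasks": len(pairs),
        "mismatches": mismatches,  # indices to inspect by hand
        "match_rate": (len(pairs) - len(mismatches)) / len(pairs),
    }
```

Returning the mismatching indices rather than only a rate keeps the report actionable: each divergent task can be inspected before caching is enabled.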
Common Failure Modes
- Over-broad key: two different policy versions share a cache entry.
- Over-narrow key: every user message creates a unique key, so hit rate stays near zero.
- Hidden volatility: the system prompt embeds today's date or account-specific state.
- Silent stale behavior: the cache serves hits as designed, but no metric flags when a stale entry is reused.
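Hidden volatility in particular is cheap to test for before launch. One possible check, sketched below, renders the supposedly stable prefix under several different request contexts and confirms the result never changes. The function names and the context fields (`today`, `account_id`) are hypothetical, chosen only to mirror the failure mode described above.

```python
import hashlib
from typing import Callable, Iterable, Mapping


def prefix_digest(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def prefix_is_volatile(
    render_prefix: Callable[[Mapping[str, str]], str],
    contexts: Iterable[Mapping[str, str]],
) -> bool:
    """Return True if the rendered prefix differs across contexts.

    A reusable prefix should render identically regardless of date,
    user, or account, so more than one distinct digest signals that
    volatile state has leaked into it.
    """
    digests = {prefix_digest(render_prefix(ctx)) for ctx in contexts}
    return len(digests) > 1


# Illustrative prompt templates (assumptions, not from any real app).
def leaky_prefix(ctx: Mapping[str, str]) -> str:
    return f"You are a support agent. Today is {ctx['today']}."


def stable_prefix(ctx: Mapping[str, str]) -> str:
    return "You are a support agent. Follow policy pack v7."


sample_contexts = [
    {"today": "2024-01-01", "account_id": "a1"},
    {"today": "2024-01-02", "account_id": "b2"},
]
```

With these samples, `prefix_is_volatile(leaky_prefix, sample_contexts)` returns `True` and `prefix_is_volatile(stable_prefix, sample_contexts)` returns `False`. The check only covers the contexts it is given, so the sample set should vary every field the template can read.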