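Verification checklist
- The cache key includes the model and the prompt schema version.
- User input never enters the key for the reusable prefix.
- The cache has a TTL or is invalidated when the deployment version changes.
- Cached and uncached outputs are compared on at least 20 real tasks.
- Logs record hit rate, input tokens saved, time to first token, and stale rejects.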
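As a rough illustration of the first three items and part of the logging item, here is a minimal in-memory sketch. All names in it (`prefix_cache_key`, `PrefixCache`, `deploy_version`, `prefix_tokens`) are hypothetical and not taken from any particular library. It assumes the reusable prefix is a system prompt string and that the caller knows how many input tokens the prefix represents. Time to first token and the 20-task output comparison are left out.

```python
import hashlib
import json
import time
from dataclasses import dataclass, field
from typing import Any, Callable, Optional


def prefix_cache_key(model: str, schema_version: str,
                     deploy_version: str, system_prompt: str) -> str:
    """Build a key for the reusable prefix only.

    User input is deliberately not a parameter, so it cannot leak into the key.
    Including deploy_version means a new deployment naturally misses old entries.
    """
    payload = json.dumps(
        {
            "model": model,
            "schema": schema_version,
            "deploy": deploy_version,
            "prefix": system_prompt,
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


@dataclass
class PrefixCache:
    """Hypothetical TTL cache that also tracks the metrics from the checklist."""
    ttl_seconds: float
    clock: Callable[[], float] = time.monotonic  # injectable for tests
    entries: dict = field(default_factory=dict)
    hits: int = 0
    misses: int = 0
    stale_rejects: int = 0
    saved_input_tokens: int = 0

    def get(self, key: str, prefix_tokens: int) -> Optional[Any]:
        entry = self.entries.get(key)
        if entry is None:
            self.misses += 1
            return None
        value, stored_at = entry
        if self.clock() - stored_at > self.ttl_seconds:
            # Expired: drop the entry and count it as a stale reject and a miss.
            del self.entries[key]
            self.stale_rejects += 1
            self.misses += 1
            return None
        self.hits += 1
        self.saved_input_tokens += prefix_tokens
        return value

    def put(self, key: str, value: Any) -> None:
        self.entries[key] = (value, self.clock())

    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

Counting a stale reject as a miss keeps the hit rate honest: an entry that exists but cannot be used saves nothing. In a real deployment the counters would more likely be emitted to a metrics system than held on the object.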
LLM prompt caching techniques for agents and apps. Covers stable prefixes, cache keys, TTLs, metrics, and cached-output validation.