Prompts2026年4月8日·1 分钟阅读

Anthropic Prompt Caching — Cut AI API Costs 90%

Use Anthropic's prompt caching to reduce Claude API costs by up to 90%. Cache system prompts, tool definitions, and long documents across requests for massive savings.

Agent 就绪

先审查再安装

这个资产需要先审查。复制的指令会要求 Agent dry-run、列出写入项,确认后再继续。

Needs Confirmation · 62/100策略:需确认
Agent 入口
任意 MCP/CLI Agent
类型
Prompt
安装
Single
信任
信任等级:Community
入口
Anthropic Prompt Caching — Cut AI API Costs 90%
先审查命令
npx -y tokrepo@latest install ed25d3cb-413d-40e5-8c36-a063e8a5ca99 --target codex

先 dry-run,确认写入项后再运行此命令。

TL;DR
Anthropic prompt caching lets you cache system prompts and long contexts to cut Claude API costs by up to 90%.
§01

What it is

Anthropic prompt caching is an API feature that lets you cache frequently reused content (system prompts, tool definitions, long documents) across multiple Claude API requests. Cached tokens are read at a fraction of the cost of uncached input tokens, reducing total API spending significantly for applications that reuse the same context.

This feature targets developers building applications that send the same system prompt, tool definitions, or reference documents with every request. Chatbots, code assistants, and RAG pipelines benefit the most because they repeat large context blocks across conversations.

§02

How it saves time or tokens

Without caching, every API request processes the full system prompt and context from scratch. With caching, the first request pays the full price plus a small cache write fee, but all subsequent requests read cached tokens at a 90% discount. For a 10,000-token system prompt sent across 100 requests, you pay for 10,000 tokens once instead of 1,000,000 tokens total. Cached content also reduces latency because the model does not need to reprocess it.

§03

How to use

  1. Add cache_control to content blocks you want cached:
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model='claude-sonnet-4-20250514',
    max_tokens=1024,
    system=[{
        'type': 'text',
        'text': 'You are an expert code reviewer... (long system prompt)',
        'cache_control': {'type': 'ephemeral'}
    }],
    messages=[{'role': 'user', 'content': 'Review this function.'}]
)
  1. Cache tool definitions:
response = client.messages.create(
    model='claude-sonnet-4-20250514',
    max_tokens=1024,
    tools=[{
        'name': 'search_codebase',
        'description': 'Search the codebase for patterns',
        'input_schema': {'type': 'object', 'properties': {'query': {'type': 'string'}}},
        'cache_control': {'type': 'ephemeral'}
    }],
    messages=[{'role': 'user', 'content': 'Find all TODO comments.'}]
)
  1. Check cache usage in the response:
print(response.usage.cache_creation_input_tokens)  # Tokens cached on first call
print(response.usage.cache_read_input_tokens)       # Tokens read from cache
§04

Example

# Caching a long document for RAG
import anthropic

client = anthropic.Anthropic()

long_document = open('docs/api-reference.md').read()

response = client.messages.create(
    model='claude-sonnet-4-20250514',
    max_tokens=2048,
    system=[{
        'type': 'text',
        'text': f'Reference document:\n\n{long_document}',
        'cache_control': {'type': 'ephemeral'}
    }],
    messages=[{'role': 'user', 'content': 'What authentication methods does the API support?'}]
)
§05

Related on TokRepo

§06

Common pitfalls

  • Cache has a minimum token threshold (currently 1024 tokens for Claude Sonnet); content blocks smaller than this threshold will not be cached.
  • Cached content expires after a TTL (time-to-live) period; for ephemeral caching, the cache lasts approximately 5 minutes of inactivity. Plan your request frequency accordingly.
  • Cache write tokens cost 25% more than regular input tokens; caching only saves money when the same content is reused across multiple requests.

常见问题

How much does prompt caching save?+

Cache read tokens cost approximately 90% less than regular input tokens. For applications that reuse the same system prompt or context across many requests, this translates to significant cost reduction. The exact savings depend on your cache hit rate and the size of cached content.

What content can I cache?+

You can cache system prompts, tool definitions, and content within message blocks. Add a cache_control field with type 'ephemeral' to any content block you want cached. The content must meet the minimum token threshold.

How long does the cache last?+

Ephemeral caches last approximately 5 minutes of inactivity. Each cache hit refreshes the TTL. If no requests use the cached content within the TTL window, it expires and the next request incurs a cache write fee.

Does caching affect response quality?+

No. Caching only affects how input tokens are processed and billed. The model produces identical responses whether content is cached or not. Caching is a performance and cost optimization, not a quality tradeoff.

Which Claude models support prompt caching?+

Prompt caching is available on Claude Sonnet, Claude Opus, and Claude Haiku via the Anthropic API. Check the Anthropic documentation for the latest model support and minimum token thresholds.

引用来源 (3)
🙏

来源与感谢

讨论

登录后参与讨论。
还没有评论,来写第一条吧。

相关资产