Prompts · Apr 8, 2026 · 3 min read

Anthropic Prompt Caching — Cut AI API Costs 90%

Use Anthropic's prompt caching to reduce Claude API costs by up to 90%. Cache system prompts, tool definitions, and long documents across requests for massive savings.

TL;DR
Anthropic prompt caching lets you cache system prompts and long contexts to cut Claude API costs by up to 90%.
§01

What it is

Anthropic prompt caching is an API feature that lets you cache frequently reused content (system prompts, tool definitions, long documents) across multiple Claude API requests. Cached tokens are read at a fraction of the cost of uncached input tokens, reducing total API spending significantly for applications that reuse the same context.

This feature targets developers building applications that send the same system prompt, tool definitions, or reference documents with every request. Chatbots, code assistants, and RAG pipelines benefit the most because they repeat large context blocks across conversations.

§02

How it saves time or tokens

Without caching, every API request processes the full system prompt and context from scratch. With caching, the first request pays the full price plus a 25% cache write surcharge, and every subsequent request reads the cached tokens at a 90% discount. For a 10,000-token system prompt sent across 100 requests, that works out to roughly 111,500 token-equivalents (12,500 for the cache write plus 99 reads at 1,000 each) instead of 1,000,000. Cached content also reduces latency, because the model does not reprocess it.
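
A back-of-the-envelope sketch makes the arithmetic concrete. The helper below assumes the standard multipliers (1.25x for cache writes, 0.1x for cache reads) and, purely for illustration, a $3-per-million-token input rate; substitute your model's actual pricing:

# Rough cost comparison: cached vs. uncached prompt prefix (illustrative)
def caching_cost(prompt_tokens, num_requests, price_per_mtok):
    base = price_per_mtok / 1_000_000
    write = prompt_tokens * 1.25 * base                        # first request writes the cache
    reads = prompt_tokens * 0.10 * base * (num_requests - 1)   # later requests read it
    return write + reads

def uncached_cost(prompt_tokens, num_requests, price_per_mtok):
    return prompt_tokens * num_requests * price_per_mtok / 1_000_000

# 10,000-token system prompt across 100 requests at $3/MTok
print(f'{caching_cost(10_000, 100, 3.0):.2f}')   # ~0.33
print(f'{uncached_cost(10_000, 100, 3.0):.2f}')  # 3.00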

§03

How to use

  1. Add cache_control to content blocks you want cached:
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model='claude-sonnet-4-20250514',
    max_tokens=1024,
    system=[{
        'type': 'text',
        'text': 'You are an expert code reviewer... (long system prompt)',
        'cache_control': {'type': 'ephemeral'}
    }],
    messages=[{'role': 'user', 'content': 'Review this function.'}]
)
  2. Cache tool definitions (a cache_control breakpoint caches the prompt prefix up to and including the marked block, so placing it on the last tool caches the whole tools array):
response = client.messages.create(
    model='claude-sonnet-4-20250514',
    max_tokens=1024,
    tools=[{
        'name': 'search_codebase',
        'description': 'Search the codebase for patterns',
        'input_schema': {'type': 'object', 'properties': {'query': {'type': 'string'}}},
        'cache_control': {'type': 'ephemeral'}
    }],
    messages=[{'role': 'user', 'content': 'Find all TODO comments.'}]
)
  3. Check cache usage in the response:
print(response.usage.cache_creation_input_tokens)  # Tokens cached on first call
print(response.usage.cache_read_input_tokens)       # Tokens read from cache
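
To watch cache behavior across calls, a small helper like the hypothetical report_cache below (not part of the SDK, just an illustration built on the usage fields above) can log what each response did:

def report_cache(response):
    # Summarize cache activity from a Messages API response (illustrative helper)
    usage = response.usage
    written = getattr(usage, 'cache_creation_input_tokens', 0) or 0
    read = getattr(usage, 'cache_read_input_tokens', 0) or 0
    if read:
        print(f'Cache hit: {read} tokens read at ~10% of the input price')
    elif written:
        print(f'Cache write: {written} tokens cached (expires after ~5 min idle)')
    else:
        print('No cache activity (content may be below the minimum token threshold)')

report_cache(response)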
§04

Example

# Caching a long document for RAG
import anthropic

client = anthropic.Anthropic()

with open('docs/api-reference.md') as f:
    long_document = f.read()

response = client.messages.create(
    model='claude-sonnet-4-20250514',
    max_tokens=2048,
    system=[{
        'type': 'text',
        'text': f'Reference document:\n\n{long_document}',
        'cache_control': {'type': 'ephemeral'}
    }],
    messages=[{'role': 'user', 'content': 'What authentication methods does the API support?'}]
)
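
A follow-up request sent within the cache TTL with the byte-identical system block reads the document from cache instead of reprocessing it. A minimal illustration (the follow-up question is hypothetical):

# Second request within the TTL: the identical system block is a cache hit
followup = client.messages.create(
    model='claude-sonnet-4-20250514',
    max_tokens=2048,
    system=[{
        'type': 'text',
        'text': f'Reference document:\n\n{long_document}',
        'cache_control': {'type': 'ephemeral'}
    }],
    messages=[{'role': 'user', 'content': 'How do I paginate list endpoints?'}]
)
print(followup.usage.cache_read_input_tokens)  # roughly the document's token count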
§05

Common pitfalls

  • Caching has a minimum token threshold (currently 1024 tokens for Claude Sonnet and Opus, 2048 for Claude Haiku); content blocks below the threshold are not cached.
  • Cached content expires after a TTL (time-to-live) period; for ephemeral caching, the cache lasts approximately 5 minutes of inactivity. Plan your request frequency accordingly.
  • Cache write tokens cost 25% more than regular input tokens, so caching only saves money when the same content is reused; with the 90% read discount, a single reuse is already enough to come out ahead.

Frequently Asked Questions

How much does prompt caching save?

Cache read tokens cost approximately 90% less than regular input tokens. For applications that reuse the same system prompt or context across many requests, this translates to significant cost reduction. The exact savings depend on your cache hit rate and the size of cached content.

What content can I cache?

You can cache system prompts, tool definitions, and content within message blocks. Add a cache_control field with type 'ephemeral' to any content block you want cached. The content must meet the minimum token threshold.

How long does the cache last?

Ephemeral caches last approximately 5 minutes of inactivity. Each cache hit refreshes the TTL. If no requests use the cached content within the TTL window, it expires and the next request incurs a cache write fee.

Does caching affect response quality?

No. Caching only affects how input tokens are processed and billed. The model produces identical responses whether content is cached or not. Caching is a performance and cost optimization, not a quality tradeoff.

Which Claude models support prompt caching?

Prompt caching is available on Claude Sonnet, Claude Opus, and Claude Haiku via the Anthropic API. Check the Anthropic documentation for the latest model support and minimum token thresholds.
