# Anthropic Prompt Caching — Cut AI API Costs 90%

> Use Anthropic's prompt caching to reduce Claude API costs by up to 90%. Cache system prompts, tool definitions, and long documents across requests for massive savings.

## Quick Use

```python
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": "You are an expert code reviewer... (long system prompt)",
        "cache_control": {"type": "ephemeral"},  # Cache this!
    }],
    messages=[{"role": "user", "content": "Review this code..."}],
)
# First request: full price
# Subsequent requests: 90% cheaper for the cached portion
```

## What is Prompt Caching?

Prompt caching lets you cache frequently reused content (system prompts, tool definitions, large documents) across API requests. Instead of reprocessing the same tokens on every request, Claude reads them from cache at 1/10th the cost. For applications with long system prompts or RAG context, this can reduce costs by up to 90%.

**Answer-Ready**: Anthropic's prompt caching reduces Claude API costs by up to 90%. Cache system prompts, tool definitions, and documents across requests. Cached tokens cost 1/10th of regular input tokens. 5-minute TTL, auto-extended on each hit. Essential for production AI applications with recurring context.

**Best for**: Teams running the Claude API at scale with recurring system prompts.

**Works with**: Claude Sonnet, Opus, and Haiku via the Anthropic API.

**Setup time**: Add one field to existing code.
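The headline figures follow directly from the per-token rates. A minimal sketch of the arithmetic, assuming Sonnet pricing ($3/M uncached input, $3.75/M cache write, $0.30/M cache read) and a hypothetical 2,000-token system prompt reused across 100 requests:

```python
# Illustrative cost arithmetic for caching a 2,000-token system prompt
# across 100 requests, using Sonnet per-million-token rates.
INPUT_RATE = 3.00        # $/M tokens, uncached input
CACHE_WRITE_RATE = 3.75  # $/M tokens, first write (25% premium)
CACHE_READ_RATE = 0.30   # $/M tokens, cache hit (1/10th of input)

def cost_without_cache(prompt_tokens: int, requests: int) -> float:
    """Every request pays the full input rate for the prompt."""
    return prompt_tokens * requests * INPUT_RATE / 1_000_000

def cost_with_cache(prompt_tokens: int, requests: int) -> float:
    """The first request writes the cache; the rest read from it."""
    write = prompt_tokens * CACHE_WRITE_RATE / 1_000_000
    reads = prompt_tokens * (requests - 1) * CACHE_READ_RATE / 1_000_000
    return write + reads

baseline = cost_without_cache(2_000, 100)  # $0.60
cached = cost_with_cache(2_000, 100)       # ~$0.067
savings = 1 - cached / baseline            # ~89%
print(f"baseline=${baseline:.3f} cached=${cached:.3f} savings={savings:.0%}")
```

The savings land just under 90% because the one-time write premium is amortized over many cheap reads; the more requests reuse the prefix, the closer you get to the full 90%.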
## How It Works

### Pricing Impact

| Token Type | Cost (Sonnet) | Savings |
|------------------|----------------|--------------------------------|
| Input (no cache) | $3/M tokens | Baseline |
| Cache write | $3.75/M tokens | +25% premium, first write only |
| Cache read | $0.30/M tokens | **90% savings** |
| Output | $15/M tokens | No change |

### Cache TTL

- Default: 5 minutes
- Auto-extended: each cache hit resets the 5-minute timer
- Minimum cacheable: 1,024 tokens (Sonnet and Opus), 2,048 tokens (Haiku)

## Cacheable Content

### 1. System Prompts

```python
system=[{
    "type": "text",
    "text": "Your 2000-token system prompt here...",
    "cache_control": {"type": "ephemeral"},
}]
```

### 2. Tool Definitions

```python
tools = [
    {
        "name": "search",
        "description": "...",
        "input_schema": {...},
        "cache_control": {"type": "ephemeral"},
    },
]
```

### 3. Long Documents (RAG Context)

```python
messages = [
    {"role": "user", "content": [
        {"type": "text",
         "text": "Here is the full codebase:\n" + large_document,
         "cache_control": {"type": "ephemeral"}},
        {"type": "text", "text": "Now answer: what does the auth module do?"},
    ]},
]
```

### 4. Multi-Turn with Cached Prefix

```python
# Turn 1: cache the document
msg1 = {"role": "user", "content": [
    {"type": "text", "text": document, "cache_control": {"type": "ephemeral"}},
    {"type": "text", "text": "Summarize this document"},
]}

# Turn 2: the document prefix is read from cache;
# only the new question is billed at the full input rate
msg2 = {"role": "user", "content": "Now list the key findings"}
```

## Cost Calculator

| Scenario | Without Cache | With Cache | Savings |
|--------------------------------|---------------|------------|---------|
| 4K system prompt, 100 requests | $1.20 | $0.24 | 80% |
| 10K RAG context, 50 requests | $1.50 | $0.19 | 87% |
| 20K tools, 200 requests | $12.00 | $1.35 | 89% |

## Best Practices

1. **Cache the longest, most stable content first** — System prompts and tool definitions change rarely
2. **Order matters** — Cached content must be a prefix; put cached content before dynamic content
3. **Monitor cache hits** — Check `usage.cache_creation_input_tokens` and `usage.cache_read_input_tokens` in the response
4. **Minimum size** — Content must be at least 1,024 tokens to be cacheable
5. **Keep cache warm** — If requests arrive more than 5 minutes apart, the cache expires

## FAQ

**Q: Does caching affect response quality?**

A: No. Caching is purely an optimization — the model sees identical input.

**Q: Can I cache across different conversations?**

A: Yes. If the cached prefix is identical (same system prompt and tools), it hits the cache regardless of the user message.

**Q: Does Claude Code use prompt caching?**

A: Yes. Claude Code automatically uses prompt caching for CLAUDE.md content, tool definitions, and conversation history.

## Source & Thanks

> - [Anthropic Prompt Caching Docs](https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching)
> - [Pricing page](https://anthropic.com/pricing)

---

Source: https://tokrepo.com/en/workflows/ed25d3cb-413d-40e5-8c36-a063e8a5ca99
Author: Prompt Lab