# Helicone Cache — Cut LLM Spend with Drop-In Response Caching

> Helicone Cache short-circuits identical LLM requests at the proxy. Set the `Helicone-Cache-Enabled` header and exact-match responses come back in milliseconds at zero cost.

## Install

Copy the content below into your project:

## Quick Use

1. Already have the Helicone proxy URL set in your LLM SDK?
2. Add the header `Helicone-Cache-Enabled: true`
3. Optional: `Cache-Control: max-age=3600` to set the TTL

---

## Intro

Helicone Cache short-circuits identical LLM requests at the proxy layer — same prompt + same model = cached response, no upstream call, zero LLM cost. Set one header and get responses in milliseconds on cache hits.

- Best for: production apps where the same prompt repeats (system instructions, common queries, batch evaluations).
- Works with: any LLM provider Helicone proxies.
- Setup time: 1 minute.

---

### Enable cache

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://oai.helicone.ai/v1",
    default_headers={
        "Helicone-Auth": f"Bearer {HELICONE_KEY}",
        "Helicone-Cache-Enabled": "true",
        "Cache-Control": "max-age=3600",  # cache for 1 hour
    },
)

# First call hits the LLM
resp1 = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is 2+2?"}],
)

# Second identical call returns from cache — same content, $0
resp2 = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is 2+2?"}],
)
```

The response includes a `Helicone-Cache: HIT` header so you know which calls were free (see the sketch after the FAQ for reading it in code).

### Per-request override

```python
# Override the TTL for a single call; pass these via the SDK call, e.g.
# client.chat.completions.create(..., extra_headers=extra_headers)
extra_headers = {"Cache-Control": "max-age=86400"}  # 24h for this one call
```

### Bucket size for diversity

```python
# Allow 3 distinct cached responses per prompt (served round-robin)
extra_headers = {"Helicone-Cache-Bucket-Max-Size": "3"}
```

Useful when you want some variety on common prompts (e.g. greeting messages) without paying for fresh inference each time.

### What gets cached

Cache key = method + URL + body (model, messages, temperature, etc.). Any change to any request parameter is a cache miss.

Useful for:

- Stable system prompts (e.g. classification with fixed instructions)
- Batch evaluations on a fixed set of inputs
- Internal tooling (Slack bots, etc.) that asks repeated questions

Not useful for high-temperature creative generation where you actually want variety.

---

### FAQ

**Q: Is Helicone Cache free?**
A: Yes — Cache is part of the Helicone free tier. Cached responses count toward your request quota but don't trigger upstream LLM costs. The free tier covers 10K cached requests/month.

**Q: How does this differ from prompt caching (Anthropic / OpenAI)?**
A: Native prompt caching reuses the prefix of a prompt to cut input token costs. Helicone Cache short-circuits the entire call when prompts are identical, returning the previous full response. They're complementary — use both for maximum savings.

**Q: Can I see the cache hit rate?**
A: Yes — the Helicone dashboard shows cache hits/misses per project, model, and time range. Use it to find prompts that should be cached (high repeat rate, high cost) and ones that shouldn't be (low repeat rate, high temperature).
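---

### Check cache hits in code

The `Helicone-Cache: HIT` response header described above can also be read programmatically. Below is a minimal sketch, assuming the OpenAI Python SDK v1+ (whose `with_raw_response` interface exposes response headers) and the same proxy-configured client as in the Enable cache section; the `HELICONE_API_KEY` environment variable name is an assumption, adjust to your setup.

```python
import os

from openai import OpenAI

HELICONE_KEY = os.environ["HELICONE_API_KEY"]  # assumed env var name; adjust to your setup

# Same proxy configuration as in "Enable cache" above.
client = OpenAI(
    base_url="https://oai.helicone.ai/v1",
    default_headers={
        "Helicone-Auth": f"Bearer {HELICONE_KEY}",
        "Helicone-Cache-Enabled": "true",
        "Cache-Control": "max-age=3600",
    },
)

# with_raw_response returns the raw HTTP response (including proxy headers)
# alongside the parsed completion.
raw = client.chat.completions.with_raw_response.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is 2+2?"}],
)

completion = raw.parse()                            # the usual ChatCompletion object
cache_status = raw.headers.get("Helicone-Cache")    # "HIT" when served from cache
print(cache_status, completion.choices[0].message.content)
```

Logging `cache_status` during development is a quick way to spot-check that caching behaves as expected before turning to the dashboard metrics.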
---

## Source & Thanks

> Built by [Helicone](https://github.com/Helicone). Licensed under Apache-2.0.
>
> [Helicone/helicone](https://github.com/Helicone/helicone) — ⭐ 4,000+

---

Source: https://tokrepo.com/en/workflows/helicone-cache-cut-llm-spend-with-drop-in-response-caching
Author: Helicone