# Helicone Cache — Cut LLM Spend with Drop-In Response Caching

> Helicone Cache short-circuits identical LLM requests at the proxy. Set the `Helicone-Cache-Enabled` header and exact-match responses come back in milliseconds at zero cost.

## Install

Copy the content below into your project:

## Quick Use

1. Already have the Helicone proxy URL set in your LLM SDK?
2. Add the header `Helicone-Cache-Enabled: true`
3. Optional: `Cache-Control: max-age=3600` to set the TTL

---

## Intro

Helicone Cache short-circuits identical LLM requests at the proxy layer — same prompt + same model = cached response, no upstream call, zero LLM cost. Set one header and get responses in milliseconds on cache hits.

- Best for: production apps where the same prompt repeats (system instructions, common queries, batch evaluations).
- Works with: any LLM provider Helicone proxies.
- Setup time: 1 minute.

---

### Enable cache

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://oai.helicone.ai/v1",
    default_headers={
        "Helicone-Auth": f"Bearer {HELICONE_KEY}",
        "Helicone-Cache-Enabled": "true",
        "Cache-Control": "max-age=3600",  # cache for 1 hour
    },
)

# First call hits the LLM
resp1 = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is 2+2?"}],
)

# Second identical call returns from cache — same content, $0
resp2 = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is 2+2?"}],
)
```

The response includes a `Helicone-Cache: HIT` header so you know which calls were free (see the sketch after the FAQ for reading it in code).

### Per-request override

```python
# Override the TTL for a single call; pass these via the SDK call, e.g.
# client.chat.completions.create(..., extra_headers=extra_headers)
extra_headers = {"Cache-Control": "max-age=86400"}  # 24h for this one call
```

### Bucket size for diversity

```python
# Allow 3 distinct cached responses per prompt (served round-robin)
extra_headers = {"Helicone-Cache-Bucket-Max-Size": "3"}
```

Useful when you want some variety on common prompts (e.g. greeting messages) without paying for fresh inference each time.

### What gets cached

Cache key = method + URL + body (model, messages, temperature, etc.). Any change to any request parameter is a cache miss.

Useful for:

- Stable system prompts (e.g. classification with fixed instructions)
- Batch evaluations on a fixed set of inputs
- Internal tooling (Slack bots, etc.) that asks repeated questions

Not useful for high-temperature creative generation where you actually want variety.

---

### FAQ

**Q: Is Helicone Cache free?**
A: Yes — Cache is part of the Helicone free tier. Cached responses count toward your request quota but don't trigger upstream LLM costs. The free tier covers 10K cached requests/month.

**Q: How does this differ from prompt caching (Anthropic / OpenAI)?**
A: Native prompt caching reuses the prefix of a prompt to cut input token costs. Helicone Cache short-circuits the entire call when prompts are identical, returning the previous full response. They're complementary — use both for maximum savings.

**Q: Can I see the cache hit rate?**
A: Yes — the Helicone dashboard shows cache hits/misses per project, model, and time range. Use it to find prompts that should be cached (high repeat rate, high cost) and ones that shouldn't be (low repeat rate, high temperature).
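---

### Check cache hits in code

The `Helicone-Cache: HIT` response header described above can also be read programmatically. Below is a minimal sketch, assuming the OpenAI Python SDK v1+ (whose `with_raw_response` interface exposes response headers) and the same proxy-configured client as in the Enable cache section; the `HELICONE_API_KEY` environment variable name is an assumption, adjust to your setup.

```python
import os

from openai import OpenAI

HELICONE_KEY = os.environ["HELICONE_API_KEY"]  # assumed env var name; adjust to your setup

# Same proxy configuration as in "Enable cache" above.
client = OpenAI(
    base_url="https://oai.helicone.ai/v1",
    default_headers={
        "Helicone-Auth": f"Bearer {HELICONE_KEY}",
        "Helicone-Cache-Enabled": "true",
        "Cache-Control": "max-age=3600",
    },
)

# with_raw_response returns the raw HTTP response (including proxy headers)
# alongside the parsed completion.
raw = client.chat.completions.with_raw_response.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is 2+2?"}],
)

completion = raw.parse()                            # the usual ChatCompletion object
cache_status = raw.headers.get("Helicone-Cache")    # "HIT" when served from cache
print(cache_status, completion.choices[0].message.content)
```

Logging `cache_status` during development is a quick way to spot-check that caching behaves as expected before turning to the dashboard metrics.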
---

## Source & Thanks

> Built by [Helicone](https://github.com/Helicone). Licensed under Apache-2.0.
>
> [Helicone/helicone](https://github.com/Helicone/helicone) — ⭐ 4,000+

---

Source: https://tokrepo.com/en/workflows/helicone-cache-cut-llm-spend-with-drop-in-response-caching
Author: Helicone