Helicone Cache — Cut LLM Spend with Drop-In Response Caching

Name: Helicone Cache — Cut LLM Spend with Drop-In Response Caching
Author: Helicone

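Introduction

Helicone Cache short-circuits identical LLM requests at the proxy layer: same prompt plus same model means a cached response, no upstream call, and zero LLM cost. Set one header to turn it on; cache hits return in sub-millisecond time. It suits production apps that repeat the same prompts (system instructions, common queries, batch evals) and works with every LLM provider that the Helicone proxy supports. Setup takes about one minute.

Enable caching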

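import os

from openai import OpenAI

# Helicone API key. The environment variable name is an assumption;
# the OpenAI key itself is read from OPENAI_API_KEY by the SDK.
HELICONE_KEY = os.environ["HELICONE_API_KEY"]

client = OpenAI(
    base_url="https://oai.helicone.ai/v1",
    default_headers={
        "Helicone-Auth": f"Bearer {HELICONE_KEY}",
        "Helicone-Cache-Enabled": "true",
        "Cache-Control": "max-age=3600",  # cache for 1 hour
    },
)

# The first call goes through to the LLM
resp1 = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is 2+2?"}],
)

# The second identical call is served from the cache: same content, $0
resp2 = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is 2+2?"}],
)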

Cached responses carry a Helicone-Cache: HIT header, so you can see which calls were free.
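One way to check that header from the OpenAI Python SDK is its `with_raw_response` accessor, which exposes the HTTP headers alongside the parsed body. The helper below is a minimal sketch: `is_cache_hit` is a hypothetical name, and it assumes the header value is `HIT` on a cached response, as described here.

```python
from typing import Mapping


def is_cache_hit(headers: Mapping[str, str]) -> bool:
    """Return True when the Helicone-Cache header reports a hit.

    Header names are compared case-insensitively because HTTP clients
    normalize them differently.
    """
    for name, value in headers.items():
        if name.lower() == "helicone-cache":
            return value.strip().upper() == "HIT"
    return False  # header absent: treat the call as not cached


# Usage with the client from the example above (needs network access):
# raw = client.chat.completions.with_raw_response.create(
#     model="gpt-4o",
#     messages=[{"role": "user", "content": "What is 2+2?"}],
# )
# cached = is_cache_hit(raw.headers)
# resp = raw.parse()  # the usual ChatCompletion object
```

Keeping the check in a small pure function makes it easy to unit test without calling the proxy.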

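Per-request override

# Override the TTL for this one request
# (pass it as client.chat.completions.create(..., extra_headers=extra_headers))
extra_headers = {"Cache-Control": "max-age=86400"}  # 24 hours for this call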

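Bucket size for variety

# Allow up to 3 cached responses for the same prompt (rotated)
extra_headers = {"Helicone-Cache-Bucket-Max-Size": "3"}

Useful for common prompts (greeting messages, for example) where you want some variation without paying for inference on every call.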
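To make the bucket idea concrete, here is a toy in-memory model. It is not Helicone's implementation: `BucketCache` is a hypothetical class, and the fill-then-rotate policy is an assumption based on the description above. The real proxy may select among stored responses differently.

```python
from collections import defaultdict
from typing import Callable, Dict, List


class BucketCache:
    """Toy model of a bucketed response cache (illustration only)."""

    def __init__(self, max_size: int) -> None:
        self.max_size = max_size
        self.buckets: Dict[str, List[str]] = defaultdict(list)
        self.cursor: Dict[str, int] = defaultdict(int)
        self.upstream_calls = 0

    def get(self, key: str, call_upstream: Callable[[], str]) -> str:
        bucket = self.buckets[key]
        if len(bucket) < self.max_size:
            # Bucket not full yet: pay for a real call and store the result.
            self.upstream_calls += 1
            response = call_upstream()
            bucket.append(response)
            return response
        # Bucket full: serve stored responses in rotation, no upstream call.
        response = bucket[self.cursor[key] % self.max_size]
        self.cursor[key] += 1
        return response


counter = iter(range(100))
cache = BucketCache(max_size=3)
replies = [cache.get("greeting", lambda: f"hello #{next(counter)}") for _ in range(7)]
```

In this model, seven requests for the same key trigger only three upstream calls; the remaining four are served from the bucket.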

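What gets cached

Cache key = method + URL + body (model, messages, temperature, and so on). A change to any parameter is a miss. Good fits:

Stable system prompts (for example, classification with fixed instructions)
Batch evals over a fixed input set
Internal tools that get asked the same questions repeatedly (Slack bots and the like)

Not a fit for high-temperature creative generation, where variety is the whole point.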
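The key rule above can be illustrated with a small hashing sketch. This is an assumption about the general shape of such a key, not Helicone's actual derivation: `cache_key` is a hypothetical function, and the sorted-key JSON serialization is a choice made here so that dict ordering does not affect the result.

```python
import hashlib
import json
from typing import Any, Dict


def cache_key(method: str, url: str, body: Dict[str, Any]) -> str:
    """Hash method + URL + body into a single lookup key (illustrative)."""
    # sort_keys makes the serialization independent of dict ordering.
    canonical_body = json.dumps(body, sort_keys=True, separators=(",", ":"))
    raw = "\n".join([method.upper(), url, canonical_body])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


URL = "https://oai.helicone.ai/v1/chat/completions"
base = {
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "What is 2+2?"}],
    "temperature": 0,
}

same = cache_key("POST", URL, dict(base))                        # identical request
hotter = cache_key("POST", URL, {**base, "temperature": 0.7})    # one parameter changed
```

Here `same` equals `cache_key("POST", URL, base)`, while `hotter` does not, which is the "any parameter change is a miss" behavior.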

FAQ

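Q: Is Helicone Cache free? A: Yes. Cache is part of Helicone's free tier. Cached responses count toward your request quota but incur no upstream LLM charges. The free tier covers 10K cached requests per month.

Q: How does it differ from native prompt caching (Anthropic / OpenAI)? A: Native prompt caching reuses a prompt's prefix to cut input-token cost. Helicone Cache short-circuits the whole call when the prompt is identical and returns the earlier full response. The two are complementary; using both saves the most.

Q: Can I see the cache hit rate? A: Yes. The Helicone dashboard shows cache hits and misses by project, model, and time. Use it to find prompts that should be cached (high repetition, high cost) and ones that should not (low repetition, high temperature).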
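If you want a rough offline estimate before turning caching on, you could count repeats in your own request logs. The sketch below is not a Helicone API: `repeat_stats` is a hypothetical helper, the sample keys are made up, and it assumes a bucket size of 1 with no expiry, so only the first call per distinct request misses.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def repeat_stats(request_keys: Iterable[str]) -> Tuple[float, List[Tuple[str, int]]]:
    """Return (best-case hit rate, requests ranked by repetition)."""
    counts = Counter(request_keys)
    total = sum(counts.values())
    if total == 0:
        return 0.0, []
    # Every distinct request misses once; all later repeats would hit.
    best_case_hit_rate = (total - len(counts)) / total
    return best_case_hit_rate, counts.most_common()


# Made-up log: one classification prompt repeated, two one-off prompts.
keys = ["classify:A", "classify:A", "classify:A", "poem:1", "poem:2"]
rate, ranked = repeat_stats(keys)
```

The most repeated entries in `ranked` are the candidates worth caching; one-off requests contribute nothing to the hit rate.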

