TOKREPO · ARSENAL
New · this week

AI Cost Optimization — Token-Saving Engineering Stack

Ten picks for SaaS / agent teams whose LLM bill is now a real line item — LiteLLM, OpenRouter, Manifest router, Portkey, Helicone cache, Cloudflare AI Gateway, LLMLingua compression, TokenCost calculator, LiteLLM cost dashboard, Fireworks fine-tune. Five layers: measure first, then cache, route, compress, fine-tune. 10–50% savings typical without quality loss.

10 assets

What's in this pack

When the monthly LLM invoice crosses five figures, every engineer suddenly has an opinion about caching. This pack is the boring, ordered playbook that actually moves the number: measure before you optimize, cache before you route, route before you compress, compress before you fine-tune. Most teams that touch every layer in order book 10–50% savings without users noticing — the upper end shows up in support chat (high cache hit) and bulk classification (small fine-tuned models), not in greenfield agent reasoning where you should be careful.

# Asset Layer What it does
1 LiteLLM Proxy router one OpenAI-compatible endpoint to 100+ providers, fallback chains, per-key budgets
2 OpenRouter Unified API router hosted gateway over 300+ models, single key, automatic failover
3 Manifest Smart Router router semantic routing — cheap model first, escalate only when confidence is low
4 Portkey AI Gateway router enterprise gateway, 250+ LLMs, virtual keys, guardrails, caching
5 Helicone Cache cache drop-in response cache via proxy header; deterministic and semantic modes
6 Cloudflare AI Gateway cache edge-level proxy with cache, analytics, retries, rate limits — free tier
7 LLMLingua compression up to 20× prompt compression with minimal task-quality loss
8 LiteLLM Cost Dashboard monitoring per-project, per-user, per-model spend tracking with hard budget blocks
9 TokenCost monitoring offline calculator for 400+ models — sanity-check estimates before shipping
10 Fireworks Serverless LoRA fine-tune serverless LoRA on Llama in 30 minutes — replace a frontier model for a narrow task

Install in this order — measure first, fine-tune last

# Layer 1: measure (do this before changing anything)
tokrepo install tokencost                  # offline price model
tokrepo install litellm-cost-tracking      # live per-project dashboard

# Layer 2: cache (highest ROI, lowest risk for repetitive workloads)
tokrepo install helicone-cache             # drop-in response cache
# or: tokrepo install cloudflare-ai-gateway  # edge cache, free tier

# Layer 3: route (cheap-model-first with safe escalation)
tokrepo install litellm-proxy              # self-hosted, BYOK
# or: tokrepo install openrouter-unified-api  # hosted, 300+ models
tokrepo install manifest-smart-router      # semantic router on top

# Layer 4: compress (only after layers 1–3 are baselined)
tokrepo install llmlingua                  # 2–20× prompt compression

# Layer 5: fine-tune (last resort — costs engineering time)
tokrepo install fireworks-fine-tune        # LoRA on Llama, narrow task

The order matters more than the picks. Skip ahead and you'll either burn engineering time fine-tuning a model that a cache would have replaced, or you'll silently degrade quality with prompt compression you can't attribute because you never instrumented the baseline. The unglamorous truth: the biggest savings usually come from layer 2 (cache) and layer 3 (route), not the fashionable layer 5.

Layer 1 — measure. Install TokenCost as a library so every PR prints before/after token math on staging. Install LiteLLM cost tracking (or Portkey) so production has a per-call ledger by project / user / model. Don't move on until you can answer 'what does one user session cost' to two significant figures.

Layer 2 — cache. Helicone gives you exact-match caching via a single proxy header and semantic caching via embedding similarity. Cloudflare AI Gateway gives you the same at the edge with a free tier. For chatbots, FAQ, RAG Q&A, and idempotent classification, hit rates of 30–70% are realistic. For agent planning loops and creative generation, they are not — don't try.

Layer 3 — route. LiteLLM Proxy is the self-hosted default — one OpenAI-compatible URL maps to Anthropic, Bedrock, Vertex, OpenAI, with failover chains and per-key budgets. OpenRouter is the hosted equivalent. Manifest sits on top to classify prompts and route cheap models first, escalating only on low confidence. Portkey adds enterprise features (SSO, audit, virtual keys, guardrails).

Layer 4 — compress. LLMLingua compresses prompts up to 20× by token-level importance scoring. Quality loss depends on task: tolerated on summarization, extraction, and classification; risky on math, code generation, and complex reasoning. Always A/B against an eval suite. Treat compression ratio as a budget, not a target.

Layer 5 — fine-tune. Fireworks serverless LoRA on Llama replaces a frontier model for one narrow, high-volume task in roughly 30 minutes of training. Worth doing when you have ≥10k labeled or LLM-generated examples, the task is stable, and the frontier bill on that one task justifies the engineering time. Don't fine-tune to save 5% on a low-volume endpoint.

How they fit together

client app
   │
   ▼
┌──────────────┐  cache hit   ┌────────────┐
│ Helicone /   │─────────────▶│ cached     │
│ Cloudflare   │              │ response   │
│ AI Gateway   │              └────────────┘
└──────┬───────┘ cache miss
       │
       ▼
┌──────────────────┐  classify  ┌─────────────────┐
│ Manifest router  │───────────▶│ cheap model     │
│ (semantic)       │            │ (Llama / Haiku) │
└──────┬───────────┘            └─────────────────┘
       │ low confidence / escalate
       ▼
┌─────────────────┐  optional   ┌──────────────┐
│ LiteLLM /       │────────────▶│ LLMLingua    │
│ OpenRouter /    │  compress   │ pre-compress │
│ Portkey gateway │             └──────┬───────┘
└──────┬──────────┘                    │
       │                               │
       ▼                               ▼
  frontier model (Opus / GPT-4 / Gemini Ultra)
       │
       ▼
LiteLLM cost ledger + TokenCost reconciliation

Tradeoffs (the honest part)

  • Cache hit rate vs freshness. 70% hit on support chat is a fortune; 70% on a stock-price assistant is a customer-trust disaster. Set TTL per route, not globally.
  • Router latency overhead. A semantic router adds 50–200 ms (embedding + classification). Invisible on chat, visible on streaming voice. Measure end-to-end p95 before and after.
  • Compression quality loss. LLMLingua at 5× is mostly free on summarization; at 20× it starts dropping facts on extraction. Pair every rollout with a held-out eval set.
  • Cheap-model misrouting. Routing a math problem to Haiku because the router thought it was 'simple Q&A' is a silent regression that surfaces a week later. Log routing decisions with the trace, review bottom-decile confidence weekly.
  • Fine-tune lock-in. A LoRA against Llama 3.1 70B is yours to host anywhere; a fine-tune on a proprietary model isn't. Pick the base deliberately.
  • Observability isn't free either. Break-even is usually around 1M calls/month — below that, the free tiers are fine.

Common pitfalls

  • Optimizing before measuring. Engineers fine-tune 'because GPT-4 is expensive' without ever instrumenting the top-spend endpoint. Eight times out of ten the bill is one feature, not the whole product.
  • Caching private content by mistake. A semantic cache keyed on prompt text alone will serve user A's medical chat to user B if queries embed similarly. Always scope cache key by user / tenant / auth-context.
  • Routing the wrong task to the wrong model. Tool use and JSON-mode structured output break on many cheap models. Run the routing classifier against real production traffic distribution before rollout.
  • Confusing 'tokens saved' with 'dollars saved'. Input tokens are 3–5× cheaper than output tokens at most providers. Track dollars, not tokens.
  • Treating cost optimization as one-shot. Provider prices change monthly; new cheaper models ship quarterly. Re-run the routing benchmark every quarter.
  • Skipping the eval gate. Every layer 3–5 change must ship behind an eval suite. 'Save 20% on tokens, lose 4% on accuracy' is rarely the trade you wanted.

Pair with these packs

This pack is the cost layer. Pair with Agent Observability + Tracing for debugging — you cannot optimize spend you can't attribute to a span. Pair with LLM Eval & Guardrails so every routing and compression change ships behind a quality gate. Pair with Vector DB + RAG if retrieval context is what is making prompts long; sometimes the cheapest token is the one you don't send.

INSTALL · ONE COMMAND
$ tokrepo install pack/ai-cost-optimization-stack
hand it to your agent — or paste it in your terminal
What's inside

10 assets in this pack

Agent#01
LiteLLM Proxy — Unified Gateway for 100+ LLM APIs

LiteLLM Proxy maps 100+ LLM providers (Anthropic, OpenAI, Bedrock, Vertex) to one OpenAI-compatible endpoint. Auth, rate limit, cost track, fallbacks.

by LiteLLM (BerriAI)·92 views
$ tokrepo install litellm-proxy-unified-gateway-for-100-llm-apis
Skill#02
OpenRouter — Unified API for 300+ LLMs with Auto Failover

OpenRouter is one OpenAI-compatible endpoint for 300+ LLMs across 60+ providers. Transparent pricing, no markup, automatic failover when a route is down.

by OpenRouter·96 views
$ tokrepo install openrouter-unified-api-for-300-llms-with-auto-failover
Skill#03
Manifest — Smart LLM Router That Cuts Costs 70%

Intelligent LLM routing that scores requests across 23 dimensions in under 2ms. Routes to the cheapest capable model among 300+ options from 13+ providers. MIT, 4,200+ stars.

by AI Open Source·185 views
$ tokrepo install manifest-smart-llm-router-cuts-costs-70-15266cba
Skill#04
Portkey AI Gateway — Route to 250+ LLMs

Portkey AI Gateway routes to 250+ LLMs with sub-1ms latency, 40+ guardrails, retries, fallbacks, and caching. 11.1K+ stars. Apache 2.0.

by AI Open Source·143 views
$ tokrepo install portkey-ai-gateway-route-250-llms-585d3a26
Skill#05
Helicone Cache — Cut LLM Spend with Drop-In Response Caching

Helicone Cache short-circuits identical LLM requests at the proxy. Set Helicone-Cache-Enabled header, exact-match responses come back in ms at zero cost.

by Helicone·112 views
$ tokrepo install helicone-cache-cut-llm-spend-with-drop-in-response-caching
Skill#06
Cloudflare AI Gateway — LLM Proxy, Cache & Analytics

Free proxy gateway for LLM API calls with caching, rate limiting, cost tracking, and fallback routing across providers. Reduce costs up to 95% with response caching. 7,000+ stars.

by Cloudflare·173 views
$ tokrepo install cloudflare-ai-gateway-llm-proxy-cache-analytics-b1962c77
Prompt#07
LLMLingua — Compress Prompts 20x with Minimal Loss

Microsoft research tool for prompt compression. Reduce token usage up to 20x while maintaining LLM performance. Solves lost-in-the-middle for RAG. MIT, 6,000+ stars.

by Script Depot·249 views
$ tokrepo install llmlingua-compress-prompts-20x-minimal-loss-1510da0c
Skill#08
LiteLLM Cost Tracking — Per-Project LLM Spend Dashboard

LiteLLM ships a built-in cost dashboard. Track LLM spend by project, user, model, tag. Hard budgets that block at the proxy. SOC2 / SSO via Pro tier.

by LiteLLM (BerriAI)·75 views
$ tokrepo install litellm-cost-tracking-per-project-llm-spend-dashboard
Skill#09
TokenCost — LLM Price Calculator for 400+ Models

Client-side token counting and USD cost estimation for 400+ LLMs. 3 lines of Python to track prompt and completion costs. Supports OpenAI, Anthropic, Mistral, AWS Bedrock. MIT, 2K+ stars.

by Script Depot·181 views
$ tokrepo install tokencost-llm-price-calculator-400-models-43b26691
Skill#10
Fireworks Fine-Tuning — Serverless LoRA on Llama in 30 min

Fireworks runs serverless LoRA fine-tuning on Llama, Qwen, Mixtral. Upload JSONL, get a deployed fine-tune in 30 min on the same endpoint.

by Fireworks AI·49 views
$ tokrepo install fireworks-fine-tuning-serverless-lora-on-llama-in-30-min
FAQ

Frequently asked questions

How much can I actually save with this stack?

10–50% is the honest range for most production workloads, and it's heavily workload-dependent. The lower end (10–20%) is what you get from layer 2–3 alone on a typical mixed-traffic API. The upper end (40–50%) shows up in two specific shapes: high-cache-hit chat / FAQ / support workloads where Helicone or Cloudflare AI Gateway catches the long tail of repeats, and high-volume narrow tasks (classification, extraction) where a fine-tuned small model replaces a frontier call. Anyone quoting '70%+' without naming the workload is either selling you something or has a specific case (e.g. 95% cache hit on a stable bot) that won't generalize. Measure your own baseline first.

Is semantic caching safe for private or multi-tenant data?

Only if you scope the cache key correctly. The default Helicone / GPTCache / Cloudflare semantic cache keys on prompt content; if user A asks 'what's my balance' and user B asks the same phrasing, embeddings will match and the cache will serve A's answer to B. Always add user_id, tenant_id, or auth-context to the cache key, and never cache content that contains PII in the response body. For regulated industries (health, finance) keep semantic caching off the user-data path entirely and only cache system-side things like documentation lookups and tool descriptions.

When is fine-tuning actually worth the engineering time?

Fine-tuning pays off when three things are true: (1) the task is stable — you're not still iterating on the prompt every week, (2) you have ≥10k labeled examples or can generate them from frontier model traces, and (3) the frontier-model bill on that single task alone justifies 1–2 engineer-weeks plus ongoing eval cost. Classic wins: PII extraction, intent classification, structured-data extraction from semi-structured docs, in-domain summarization. Classic losses: 'general agent reasoning', 'creative writing', anything where the prompt or task definition is still moving. Fireworks serverless LoRA on Llama keeps your weights portable — pick that over a closed fine-tune unless you have a specific reason.

OpenRouter vs LiteLLM — which one should I pick?

OpenRouter is the hosted answer: one API key, 300+ models, automatic failover, you pay them a small markup and they handle the multi-provider plumbing. LiteLLM is the self-hosted answer: you run the proxy (or use it as a Python library), bring your own provider keys, and pay only the underlying model cost. Pick OpenRouter if you want one bill, fast time-to-value, and don't want to operate a proxy. Pick LiteLLM if you have direct provider contracts (often cheaper at scale), care about data sovereignty, want a per-project cost dashboard, or are already running infra. Many teams use both: LiteLLM for production critical paths, OpenRouter for prototyping and rare-model access.

What's the cheapest way to start monitoring cost?

Start with TokenCost — it's a free offline library that handles 400+ models and lets you print before/after estimates in any script or PR. For production, the cheapest live option is Cloudflare AI Gateway's free tier (cache + analytics + per-model breakdown, no SDK install — you just point your base_url at it) or self-hosted Langfuse / Helicone open source. If you're already on LiteLLM Proxy, its built-in cost tracking is the path of least resistance — same proxy, no extra service. Hosted Helicone, Portkey, and Datadog LLM Observability are all good but only worth paying for once you're above roughly 1M calls/month.

MORE FROM THE ARSENAL

12 packs · 80+ hand-picked assets

Browse every curated bundle on the home page

Back to all packs