LLM Observability
Langfuse, AgentOps, LangSmith, Phoenix — the dashboards that catch token blow-ups before your CFO does.
What's in this pack
You can't fix what you can't see. The day a prompt regression silently triples your token bill is the day you wish you'd installed an observability layer last quarter. This pack collects the seven assets that turn an opaque LLM black box into a debuggable, alertable, optimizable system.
| # | Asset | Tier | What it does |
|---|---|---|---|
| 1 | Langfuse | open-source | full traces, eval, prompt management — self-host or cloud |
| 2 | AgentOps | open-source | agent-specific observability with session replay |
| 3 | Arize Phoenix | open-source | OpenInference traces with built-in evaluators |
| 4 | LangSmith | hosted | LangChain's first-party tracing & dataset platform |
| 5 | Token cost dashboards | pattern | per-user, per-feature, per-prompt-version breakdown |
| 6 | Latency budget alerts | pattern | p95 / p99 with PagerDuty wiring |
| 7 | Prompt version diffs | pattern | side-by-side trace replay across two prompt versions |
Why this matters
Three production failure modes that observability catches and intuition misses:
- Silent token inflation. A "minor" prompt edit adds a 200-token reminder. At 1M requests/day that's 200M extra tokens daily — roughly $2-6k/mo you didn't budget for, depending on model pricing. Langfuse's per-prompt-version cost view surfaces it on day one.
- The 95th-percentile tail. Average latency looks fine — but the 5% of queries hitting cold cache, retry loops, or oversized RAG payloads tank user experience. p99 dashboards from Phoenix or LangSmith make the tail visible.
- Quality regression invisible at the unit level. Each individual response looks plausible. Aggregate evaluator scores (LLM-as-judge, retrieval recall, hallucination rate) over the last 24h vs the previous 7d, and the regression jumps out.
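The tail math in the second failure mode is cheap to verify yourself. A minimal sketch (pure Python, no platform SDK assumed) of why the mean hides what p99 exposes:

```python
import math

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: smallest value >= p% of the samples."""
    ranked = sorted(samples)
    idx = max(0, math.ceil(p / 100 * len(ranked)) - 1)
    return ranked[idx]

# 95 fast requests plus the 5% slow tail described above.
latencies_ms = [120.0] * 95 + [4800.0] * 5
mean = sum(latencies_ms) / len(latencies_ms)   # 354.0 — looks tolerable
p99 = percentile(latencies_ms, 99)             # 4800.0 — the tail is visible
```

The dashboards in this pack compute exactly this kind of aggregate continuously, so you don't have to pull raw traces to see the tail.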
Install in one command
```shell
# Install the entire pack
tokrepo install pack/llm-observability

# Or pick the platform you want to start with
tokrepo install langfuse
tokrepo install agentops
tokrepo install phoenix
```
The TokRepo CLI drops the SDK config and dashboard scaffolding into your project so traces start flowing on the next request — no manual instrumentation walk-through required.
Common pitfalls
- Logging full prompts and PII to a third-party SaaS. If your prompts include user data, self-host Langfuse or Phoenix; don't ship raw payloads to LangSmith Cloud without redaction. All three open-source options run on a single VM with under 4 GB of RAM for typical loads.
- No sampling on high-volume endpoints. Tracing 100% of requests at 1M/day will overwhelm both your storage and your wallet. Sample 10% by default, 100% on errors. Langfuse and Phoenix both support this natively.
- Tracking tokens but not dollars. Different models price differently per token. Configure model-pricing in your platform once; track cost in dollars, not just token counts. CFOs care about dollars.
- One generic dashboard for everyone. Build one dashboard per persona — eng (latency, error rate), product (cost per feature), exec (cost per active user, week-over-week trend). Generic dashboards get ignored.
- No alert on prompt-version cost delta. Add an alert that fires when a new prompt version's avg-cost-per-call deviates >20% from the previous version. This is the single highest-ROI alert you'll set up.
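The last two pitfalls combine into one check. A hedged sketch (pure Python; the prices and function names are illustrative placeholders, not any platform's API or current list prices) of dollar-cost tracking plus the version-delta alert:

```python
# Illustrative per-million-token prices — placeholders, configure real ones.
PRICING_PER_M = {
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
    "gpt-4o": {"input": 2.50, "output": 10.00},
}

def call_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Convert token counts to dollars via the model's price table."""
    p = PRICING_PER_M[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

def cost_delta_alert(prev_avg: float, new_avg: float, threshold: float = 0.20) -> bool:
    """Fire when a new prompt version's avg cost-per-call deviates >20%."""
    return abs(new_avg - prev_avg) / prev_avg > threshold

old = call_cost_usd("gpt-4o-mini", 800, 300)    # previous prompt version
new = call_cost_usd("gpt-4o-mini", 1000, 500)   # longer prompt, longer replies
```

Here the "minor" edit grows input and output tokens enough that `cost_delta_alert(old, new)` fires, which is the whole point: the check is trivial, you just have to wire it to your platform's per-version cost aggregate.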
Relationship to other packs
LLM Observability is the runtime telemetry layer. The complementary LLM Eval & Guardrails pack is the offline scoring layer — DeepEval, Promptfoo, Ragas. You want both: observability shows you what's happening in production, eval tells you whether a proposed change is better before you ship.
Multi-Agent Frameworks (CAMEL, LangGraph, DeepAgents) are the systems being instrumented. If you're running a LangGraph workflow and can't see which node failed, you don't have observability — you have a print-statement debugger. Pair the framework pack with this one from day one.
Frequently asked questions
Is this stuff free?
Langfuse, Phoenix, and AgentOps are open-source under MIT/Apache 2.0 and run on a single VM. Self-hosted is free; you only pay for storage and compute. LangSmith is hosted-only and metered per trace — the free tier covers small teams, and prices scale to enterprise. For most teams the right answer is to start with self-hosted Langfuse, switching to LangSmith only if you're already deep in the LangChain ecosystem and want first-party integration.
How does Langfuse compare to LangSmith?
Langfuse is open-source, self-hostable, and framework-agnostic — it works with LangChain, LlamaIndex, raw OpenAI SDK, custom code. LangSmith is closed-source, hosted, and tightly coupled to LangChain. Feature-wise they're roughly equivalent on tracing and prompt management; LangSmith has a slight edge on LangChain-specific features, Langfuse has a stronger evaluator framework and self-host story. Pick Langfuse if data sovereignty matters, LangSmith if you want zero-ops and are LangChain-native.
Will this work with Cursor or Codex CLI?
Observability is at the API call level, not the editor level — so any tool that hits an LLM API can be instrumented. The TokRepo install adds SDK init code to your project. If you're proxying through Claude Code, Cursor, or Codex CLI, instrument the agent backend (the framework or service that calls the LLM), not the editor. Each platform's SDK is a 5-line import.
What's the difference vs the LLM Eval pack?
Eval is offline scoring — given a prompt and a reference answer, how good is the output. Observability is runtime telemetry — what happened in production: latency, cost, errors, traces. Eval feeds CI; observability feeds dashboards and alerts. You need both. A common pattern: eval scores from your golden set get logged into your observability platform so quality, cost, and latency live on the same dashboard.
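That "same dashboard" pattern just means eval results and runtime traces share one record shape keyed by trace ID. A hypothetical sketch (field names are made up for illustration, not any platform's schema):

```python
from datetime import datetime, timezone

def runtime_record(trace_id: str, latency_ms: float, cost_usd: float) -> dict:
    """Production telemetry: what actually happened on a live request."""
    return {
        "trace_id": trace_id,
        "source": "production",
        "latency_ms": latency_ms,
        "cost_usd": cost_usd,
        "ts": datetime.now(timezone.utc).isoformat(),
    }

def eval_record(trace_id: str, judge_score: float, retrieval_recall: float) -> dict:
    """Offline golden-set eval scores, logged in the same shape."""
    return {
        "trace_id": trace_id,
        "source": "golden-set-eval",
        "judge_score": judge_score,
        "retrieval_recall": retrieval_recall,
        "ts": datetime.now(timezone.utc).isoformat(),
    }

# Both land in one store, so quality, cost, and latency share a dashboard.
records = [
    runtime_record("t-1", 840.0, 0.0031),
    eval_record("t-1", 0.92, 0.88),
]
```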
How much instrumentation overhead does this add?
Async batched logging adds ~1-3ms p50 latency to LLM calls — negligible compared to the model latency itself (often 500-3000ms). All four platforms ship async SDKs that batch traces in the background. Set sampling to 10% on high-volume endpoints to keep storage costs sane. The actual hot-path overhead is so low that there's no good reason to ship without observability.
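The batching pattern behind that ~1-3ms figure is simple: the hot path only enqueues, and a background thread ships batches. A minimal sketch (pure Python stdlib, not any vendor's SDK; the network export is stubbed with a list):

```python
import queue
import threading

class TraceBatcher:
    """Hot path enqueues in O(1); a daemon thread batches and ships."""

    def __init__(self, batch_size: int = 10):
        self.q: queue.Queue = queue.Queue()
        self.batch_size = batch_size
        self.shipped: list[list[dict]] = []   # stand-in for a network export
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def log(self, trace: dict) -> None:
        self.q.put(trace)                     # microseconds on the hot path

    def _run(self) -> None:
        batch: list[dict] = []
        while True:
            item = self.q.get()
            if item is None:                  # shutdown sentinel: flush and exit
                if batch:
                    self.shipped.append(batch)
                return
            batch.append(item)
            if len(batch) >= self.batch_size:
                self.shipped.append(batch)
                batch = []

    def shutdown(self) -> None:
        self.q.put(None)
        self._worker.join()

batcher = TraceBatcher(batch_size=10)
for i in range(25):
    batcher.log({"trace_id": i, "latency_ms": 120})
batcher.shutdown()   # 25 traces → two full batches of 10, one flush of 5
```

The real SDKs add retries, size limits, and sampling on top, but the structure — cheap enqueue, async ship — is the same, which is why the hot-path cost stays in single-digit milliseconds.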