TOKREPO · ARSENAL
Stable

LLM Observability

Langfuse, AgentOps, LangSmith, Phoenix — the dashboards that catch token blow-ups before your CFO does.

7 assets

What's in this pack

You can't fix what you can't see. The day a prompt regression silently triples your token bill is the day you wish you'd installed an observability layer last quarter. This pack collects the seven assets that turn an opaque LLM black box into a debuggable, alertable, optimizable system.

#  Asset                  Tier         What it does
1  Langfuse               open-source  full traces, eval, prompt management — self-host or cloud
2  AgentOps               open-source  agent-specific observability with session replay
3  Arize Phoenix          open-source  OpenInference traces with built-in evaluators
4  LangSmith              hosted       LangChain's first-party tracing & dataset platform
5  Token cost dashboards  pattern      per-user, per-feature, per-prompt-version breakdown
6  Latency budget alerts  pattern      p95 / p99 with PagerDuty wiring
7  Prompt version diffs   pattern      side-by-side trace replay across two prompt versions
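
Pattern #6 is, at its core, a percentile computation over a window of recent request latencies. A minimal illustrative sketch in plain Python — the 2000ms budget is a placeholder and the PagerDuty wiring is left out:

# p95/p99 over a window of request latencies, in ms. Illustrative only;
# the budget value and the alert action are placeholders.
import statistics

def tail_latencies(latencies_ms: list[float]) -> tuple[float, float]:
    cuts = statistics.quantiles(latencies_ms, n=100)  # 99 percentile cut points
    return cuts[94], cuts[98]  # p95, p99

# 95% fast, 5% slow tail: the mean (~350ms) hides seconds-long requests
samples = [230.0] * 190 + [2400.0] * 8 + [3100.0] * 2
p95, p99 = tail_latencies(samples)
if p99 > 2000:
    print(f"ALERT: p99={p99:.0f}ms blows the 2000ms budget (p95={p95:.0f}ms)")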

Why this matters

Three production failure modes that observability catches and intuition misses:

  1. Silent token inflation. A "minor" prompt edit adds a 200-token reminder. Multiply by 1M requests/day and that's $2-6k/mo extra you didn't budget for — back-of-envelope math after this list. Langfuse's per-prompt-version cost view surfaces it on day one.
  2. The 95th-percentile tail. Average latency looks fine — but the 5% of queries hitting cold cache, retry loops, or oversized RAG payloads tank user experience. p99 dashboards from Phoenix or LangSmith make the tail visible.
  3. Quality regression invisible at the unit level. Each individual response looks plausible. Aggregate evaluator scores (LLM-as-judge, retrieval recall, hallucination rate) over the last 24h vs the previous 7d, and the regression jumps out.
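
The arithmetic behind failure mode 1, as a back-of-envelope sketch — the per-million-token prices are illustrative assumptions, not any vendor's quote:

# Cost of a "minor" 200-token prompt addition at 1M requests/day.
# Prices below are illustrative assumptions, not quotes.
EXTRA_TOKENS = 200
REQUESTS_PER_DAY = 1_000_000
DAYS_PER_MONTH = 30

extra_tokens = EXTRA_TOKENS * REQUESTS_PER_DAY * DAYS_PER_MONTH  # 6B tokens/mo

for tier, usd_per_mtok in {"cheap tier": 0.35, "premium tier": 1.00}.items():
    print(f"{tier}: ~${extra_tokens / 1e6 * usd_per_mtok:,.0f}/mo")
# cheap tier: ~$2,100/mo · premium tier: ~$6,000/mo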

Install in one command

# Install the entire pack
tokrepo install pack/llm-observability

# Or pick the platform you want to start with
tokrepo install langfuse
tokrepo install agentops
tokrepo install phoenix

The TokRepo CLI drops the SDK config and dashboard scaffolding into your project so traces start flowing on the next request — no manual instrumentation walk-through required.
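
In the Langfuse case, that scaffolding amounts to roughly a one-import change. A minimal sketch assuming the Langfuse Python SDK and its standard LANGFUSE_* environment variables (plus OPENAI_API_KEY):

# Langfuse's drop-in wrapper for the OpenAI SDK: same interface, but
# every call is traced (tokens, cost, latency) with no other changes.
# Assumes LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY / LANGFUSE_HOST
# and OPENAI_API_KEY are set in the environment.
from langfuse.openai import openai  # instead of: import openai

response = openai.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Ping"}],
)
print(response.choices[0].message.content)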

Common pitfalls

  • Logging full prompts and PII to a third-party SaaS. If your prompts include user data, self-host Langfuse or Phoenix; don't ship raw payloads to LangSmith Cloud without redaction. All three open-source options run on a single VM with under 4GB of RAM for typical loads.
  • No sampling on high-volume endpoints. Tracing 100% of requests at 1M/day will overwhelm both your storage and your wallet. Sample 10% by default, 100% on errors. Langfuse and Phoenix both support this natively.
  • Tracking tokens but not dollars. Different models price differently per token. Configure model pricing in your platform once; track cost in dollars, not just token counts. CFOs care about dollars.
  • One generic dashboard for everyone. Build one dashboard per persona — eng (latency, error rate), product (cost per feature), exec (cost per active user, week-over-week trend). Generic dashboards get ignored.
  • No alert on prompt-version cost delta. Add an alert that fires when a new prompt version's avg-cost-per-call deviates >20% from the previous version — see the sketch after this list. This is the single highest-ROI alert you'll set up.
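
A minimal sketch of that last alert; fetch_avg_cost_per_call and page are hypothetical stand-ins for your observability platform's query API and your PagerDuty wiring:

# Fire when a new prompt version's avg cost per call drifts >20% from
# the previous version. Both helpers below are hypothetical stand-ins.
THRESHOLD = 0.20  # 20% deviation, in either direction

def fetch_avg_cost_per_call(prompt: str, version: str) -> float:
    """Hypothetical: query your observability platform's API."""
    raise NotImplementedError

def page(summary: str) -> None:
    """Hypothetical: post to PagerDuty's Events API."""
    raise NotImplementedError

def check_cost_delta(prompt: str, prev_version: str, new_version: str) -> None:
    prev = fetch_avg_cost_per_call(prompt, prev_version)
    new = fetch_avg_cost_per_call(prompt, new_version)
    delta = (new - prev) / prev
    if abs(delta) > THRESHOLD:
        page(f"{prompt}@{new_version}: cost/call {delta:+.0%} vs "
             f"{prev_version} (${prev:.4f} -> ${new:.4f})")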

Relationship to other packs

LLM Observability is the runtime telemetry layer. The complementary LLM Eval & Guardrails pack is the offline scoring layer — DeepEval, Promptfoo, Ragas. You want both: observability shows you what's happening in production, eval tells you whether a proposed change is better before you ship.

Multi-Agent Frameworks (CAMEL, LangGraph, DeepAgents) are the systems being instrumented. If you're running a LangGraph workflow and can't see which node failed, you don't have observability — you have a print-statement debugger. Pair the framework pack with this one from day one.

What's inside

7 assets in this pack

Config #01
Langfuse — Open Source LLM Observability

Langfuse is an open-source LLM engineering platform for tracing, prompt management, evaluation, and debugging AI apps. 24.1K+ GitHub stars. Self-hosted or cloud. MIT.

by AI Open Source · 100 views
$ tokrepo install langfuse-open-source-llm-observability-49a8eb0b
Script #02
AgentOps — Observability for AI Agents

Python SDK for AI agent monitoring. LLM cost tracking, session replay, benchmarking, and error analysis. Integrates with CrewAI, LangChain, AutoGen, and more. 5.4K+ stars.

by Script Depot · 98 views
$ tokrepo install agentops-observability-ai-agents-d570c84f
Prompt #03
LangSmith — Prompt Debugging and LLM Observability

Debug, test, and monitor LLM applications in production. LangSmith provides trace visualization, prompt playground, dataset evaluation, and regression testing for AI.

by Prompt Lab · 93 views
$ tokrepo install langsmith-prompt-debugging-llm-observability-4d9432ea
Config #04
Phoenix — Open Source AI Observability

Phoenix is an AI observability platform for tracing, evaluating, and debugging LLM apps. 9.1K+ stars. OpenTelemetry, evals, prompt management.

by AI Open Source · 89 views
$ tokrepo install phoenix-open-source-ai-observability-42fa8573
Config #05
OpenLIT — OpenTelemetry LLM Observability

Monitor LLM costs, latency, and quality with OpenTelemetry-native tracing. GPU monitoring and guardrails built in. 2.3K+ stars.

by AI Open Source · 75 views
$ tokrepo install openlit-opentelemetry-llm-observability-13e3c714
Config #06
Langtrace — Open Source AI Observability Platform

Open-source observability for LLM apps. Trace OpenAI, Anthropic, and LangChain calls with OpenTelemetry-native instrumentation and a real-time dashboard.

by AI Open Source · 75 views
$ tokrepo install langtrace-open-source-ai-observability-platform-a53444d6
Skill #07
Gemini CLI Extension: Observability — Monitoring & Logs

Gemini CLI extension for Google Cloud observability. Set up monitoring, analyze logs, create dashboards, and configure alerts.

by Google · Gemini Team · 102 views
$ tokrepo install gemini-cli-extension-observability-monitoring-logs-aa41279c
FAQ

Frequently asked questions

Is this stuff free?

Langfuse, Phoenix, and AgentOps are open-source under MIT/Apache 2.0 and run on a single VM. Self-hosted is free; you only pay for storage and compute. LangSmith is hosted-only and metered per trace — the free tier covers small teams, and prices scale to enterprise. For most teams the right answer is to start with self-hosted Langfuse and switch to LangSmith only if you're already deep in the LangChain ecosystem and want first-party integration.

How does Langfuse compare to LangSmith?

Langfuse is open-source, self-hostable, and framework-agnostic — it works with LangChain, LlamaIndex, the raw OpenAI SDK, or custom code. LangSmith is closed-source, hosted, and tightly coupled to LangChain. Feature-wise they're roughly equivalent on tracing and prompt management; LangSmith has a slight edge on LangChain-specific features, while Langfuse has a stronger evaluator framework and self-host story. Pick Langfuse if data sovereignty matters, LangSmith if you want zero-ops and are LangChain-native.

Will this work with Cursor or Codex CLI?

Observability happens at the API-call level, not the editor level — so any tool that hits an LLM API can be instrumented. The TokRepo install adds SDK init code to your project. If you're proxying through Claude Code, Cursor, or Codex CLI, instrument the agent backend (the framework or service that calls the LLM), not the editor. Each platform's SDK is a 5-line import — see the sketch below.
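
As a concrete example, AgentOps initialization really is that small — a sketch assuming AGENTOPS_API_KEY is set in the backend's environment:

# Two lines at backend startup; AgentOps then traces the LLM calls
# your agent framework makes. Assumes AGENTOPS_API_KEY is set.
import agentops

agentops.init()  # also accepts api_key="..." explicitly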

What's the difference vs the LLM Eval pack?

Eval is offline scoring — given a prompt and a reference answer, how good is the output. Observability is runtime telemetry — what happened in production: latency, cost, errors, traces. Eval feeds CI; observability feeds dashboards and alerts. You need both. A common pattern: eval scores from your golden set get logged into your observability platform so quality, cost, and latency live on the same dashboard.
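
A sketch of that pattern using Langfuse's v2-style Python client — treat the exact method name as an assumption and check it against your SDK version:

# Push offline eval scores into the observability platform so quality
# sits next to cost and latency. v2-style Langfuse client; method
# names differ across SDK versions (assumption — verify before use).
from langfuse import Langfuse

langfuse = Langfuse()  # reads keys from the environment

# Assumed shape of your offline eval run's output
golden_set_results = [
    {"trace_id": "trace-abc123", "metric": "answer_relevance", "score": 0.91},
    {"trace_id": "trace-def456", "metric": "answer_relevance", "score": 0.62},
]

for result in golden_set_results:
    langfuse.score(
        trace_id=result["trace_id"],
        name=result["metric"],
        value=result["score"],
    )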

How much instrumentation overhead does this add?

Async batched logging adds ~1-3ms p50 latency to LLM calls — negligible compared to the model latency itself (often 500-3000ms). All four platforms ship async SDKs that batch traces in the background. Set sampling to 10% on high-volume endpoints to keep storage costs sane. The actual hot-path overhead is so low that there's no good reason to ship without observability.
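
To see why the hot-path cost stays that low, here's a toy version of the pattern these SDKs use — the request path only appends to an in-memory queue, and a daemon thread ships batches in the background (illustrative, not any vendor's actual code):

# Toy async batched logger: the hot path is a queue append; a daemon
# thread drains the queue and ships batches. Illustrative only.
import queue
import threading
import time

trace_queue: "queue.Queue[dict]" = queue.Queue()

def log_trace(event: dict) -> None:
    """Hot path: an in-memory append, never blocks on network I/O."""
    trace_queue.put(event)

def flusher(batch_size: int = 50, interval_s: float = 2.0) -> None:
    """Background: drain up to batch_size events and ship them."""
    while True:
        time.sleep(interval_s)
        batch = []
        while len(batch) < batch_size:
            try:
                batch.append(trace_queue.get_nowait())
            except queue.Empty:
                break
        if batch:
            print(f"shipped {len(batch)} traces")  # stand-in for an HTTP POST

threading.Thread(target=flusher, daemon=True).start()

for i in range(120):  # simulate traffic; each call costs ~microseconds
    log_trace({"trace_id": i, "latency_ms": 250})
time.sleep(5)  # give the flusher time to ship before exit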

MORE FROM THE ARSENAL

12 packs · 80+ hand-picked assets

Browse every curated bundle on the home page
