TOKREPO · AI Arsenal
New · this week

Agent Observability + Tracing

Seven picks for the ML/LLM engineer who needs to answer 'why did the agent do that?': LangSmith, Langfuse, Phoenix, Helicone, AgentOps, OpenTelemetry for LLM. Span-by-span traces of tool calls, retries, sub-agents, and reflection loops, not just per-prompt cost dashboards.

7 resources

What's in this pack

The day an agent silently loops between two tools for 47 turns and returns a confident wrong answer is the day you wish you had span-level traces, not a per-prompt cost chart. This pack is built for the ML/LLM engineer trying to reconstruct what an agent actually did: which sub-agent fired, what arguments went into each tool call, how many retries it ate, what the planner thought before it pivoted.

#  | Asset                    | Tier        | What it traces
1  | LangSmith                | hosted      | first-party LangChain / LangGraph spans, dataset replay, eval bridge
2  | Langfuse                 | open-source | framework-agnostic span trees, prompt versioning, evaluator hooks
3  | Arize Phoenix            | open-source | OpenInference spans, built-in retrieval / agent evaluators, notebook-first
4  | Helicone                 | hybrid      | proxy-based tracing, no SDK install, cost + caching + sessions
5  | AgentOps                 | open-source | agent session replay, tool-call timelines, multi-agent step graphs
6  | OpenTelemetry for LLM    | spec        | OpenInference + GenAI semantic conventions, a vendor-neutral span format
7  | Eval-bridged trace store | pattern     | every trace gets a quality score, with an alert when the score regresses inside a session

How this is different from the LLM Observability pack

If you're not sure which pack to install: LLM Observability is the runtime telemetry layer — token cost, p95 latency, error rate, prompt-version dashboards. The audience is anyone shipping LLM calls to production. Agent Observability + Tracing is the debugging layer for systems where one user request fans out into 10–100 LLM calls, tool invocations, and sub-agent handoffs. The audience is the engineer staring at a 4-minute agent run that returned garbage and trying to figure out which step lied.

A cost dashboard tells you the bill went up 30%. A deep trace tells you the planner sub-agent retried the same search 8 times because the tool returned an empty array and the prompt didn't handle it. You want both, but they answer different questions.
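As a minimal sketch of the kind of question a deep trace answers, the snippet below counts identical tool calls inside one trace. The `Span` dataclass and its field names are hypothetical, not any vendor's schema; a real backend would expose comparable data through its own export or query API.

```python
from collections import Counter
from dataclasses import dataclass
import json


@dataclass(frozen=True)
class Span:
    """Hypothetical, simplified span record (not a vendor schema)."""
    name: str       # e.g. "planner", "search"
    kind: str       # "LLM", "TOOL", "AGENT", ...
    arguments: str  # JSON-encoded tool arguments, "" for non-tool spans
    output: str     # JSON-encoded result


def repeated_tool_calls(spans, min_repeats=3):
    """Return {(tool_name, arguments): count} for tool calls seen at least min_repeats times."""
    counts = Counter((s.name, s.arguments) for s in spans if s.kind == "TOOL")
    return {key: n for key, n in counts.items() if n >= min_repeats}


# Illustrative trace: the planner retries the same search 8 times on an empty result.
args = json.dumps({"query": "q3 revenue"}, sort_keys=True)
trace = [Span("planner", "LLM", "", "call search")]
trace += [Span("search", "TOOL", args, "[]") for _ in range(8)]

suspects = repeated_tool_calls(trace)
for (tool, tool_args), n in suspects.items():
    print(f"{tool} called {n}x with {tool_args}")
```

Serializing arguments with `sort_keys=True` is a design choice here so that two calls with the same arguments in a different key order still count as the same call.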

Install in a deliberate order

# Full pack
tokrepo install pack/agent-observability-tracing

# Or layer it up
tokrepo install langfuse        # 1. trace store + UI
tokrepo install opentelemetry-llm  # 2. instrument once, swap backends later
tokrepo install phoenix         # 3. eval bridge for trace-level scoring

Five layers, install in this order:

  1. Instrumenter: wrap your LLM SDK and agent framework. OpenTelemetry + OpenInference semantic conventions is the right default: instrument once, swap backends without code changes. If you're LangChain-native, the callback system already covers this: the built-in LangChainTracer in langchain_core.tracers writes to LangSmith, and Langfuse ships its own LangChain callback handler.
  2. Trace store + UI: pick Langfuse self-hosted (data sovereignty), LangSmith Cloud (zero ops, LangChain-tight), Phoenix local (notebook-first, no infra), or Helicone proxy (no SDK install at all). They all consume OTel spans now.
  3. Eval bridge: wire your offline evals (LLM-as-judge, retrieval recall, tool-call correctness) into the same trace store so quality scores land on every span. Phoenix and Langfuse both ship this; LangSmith calls it 'feedback'.
  4. Alerts: fire on trace-level anomalies, not just per-call ones: agent ran >20 steps, retry depth >5, sub-agent never called expected tool, planner output didn't include required schema. These are the failure modes a per-prompt dashboard misses entirely.
  5. Session replay: AgentOps and Helicone both group spans into 'sessions' (one user request = one session). For multi-agent systems this is non-negotiable. Without it you cannot tell two simultaneous user runs apart in the timeline.
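The trace-level alert conditions from layer 4 can be sketched as plain checks over a per-trace rollup. `TraceSummary`, its fields, and the function name are hypothetical; the thresholds (20 steps, retry depth 5) are the ones named above, and a real setup would compute the rollup from the trace store rather than by hand.

```python
from dataclasses import dataclass, field


@dataclass
class TraceSummary:
    """Hypothetical per-trace rollup; field names are illustrative."""
    trace_id: str
    step_count: int
    max_retry_depth: int
    tools_called: set = field(default_factory=set)
    planner_output_keys: set = field(default_factory=set)


def trace_alerts(t, expected_tools=frozenset(), required_keys=frozenset(),
                 max_steps=20, max_retries=5):
    """Return a list of human-readable alert reasons for one trace."""
    alerts = []
    if t.step_count > max_steps:
        alerts.append(f"agent ran {t.step_count} steps (> {max_steps})")
    if t.max_retry_depth > max_retries:
        alerts.append(f"retry depth {t.max_retry_depth} (> {max_retries})")
    missing_tools = sorted(set(expected_tools) - t.tools_called)
    if missing_tools:
        alerts.append(f"expected tools never called: {missing_tools}")
    missing_keys = sorted(set(required_keys) - t.planner_output_keys)
    if missing_keys:
        alerts.append(f"planner output missing keys: {missing_keys}")
    return alerts


run = TraceSummary("t-1", step_count=47, max_retry_depth=8,
                   tools_called={"search"}, planner_output_keys={"plan"})
alerts = trace_alerts(run, expected_tools={"search", "fetch"},
                      required_keys={"plan", "tools"})
for reason in alerts:
    print(f"[{run.trace_id}] {reason}")
```

None of these conditions is visible from a single LLM call, which is why they have to be evaluated after the whole trace (or session) has been assembled.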

Common pitfalls

  • Tracing the LLM call but not the tool call. The model emits a tool call; your code runs the tool; the result feeds the next LLM turn. If you only instrument the LLM SDK, the tool execution is a black hole. Wrap your tool dispatcher with the same tracer.
  • No parent_span_id on sub-agent handoffs. If sub-agent B is spawned by agent A, B's spans must carry A's trace ID, and B's root span must name the spawning span in A as its parent. Otherwise the UI shows two disconnected timelines and you cannot answer 'who called whom'.
  • Logging full reasoning chains as a single blob. A reflection loop with 30 thoughts shouldn't be one giant string field — it should be 30 sibling spans under a reflection parent. Filtering, search, and diff all break on the blob shape.
  • Sampling agent traces uniformly. Sample 10% of normal runs, but always keep 100% of runs where the trace exited with an error, hit max retries, or had eval score below threshold. The bugs you need to debug are exactly the runs you'd otherwise drop.
  • Vendor-locking your spans. Use OpenInference / OpenTelemetry GenAI semantic conventions. Every backend in this pack speaks them. Hand-rolling proprietary JSON means rewriting your instrumentation when you migrate.
  • No prompt-version → trace-version link. When a trace was generated by prompt version 7 but you've since shipped version 9, the trace UI must surface the version so you can diff old vs new behavior. Langfuse and LangSmith both support this; wire it on day one.

Pair with these packs

Agent Observability is the debugger. The LLM Observability pack is the production dashboard. The Multi-Agent Frameworks pack is the system being traced (LangGraph, CrewAI, AutoGen). The LLM Eval & Guardrails pack is the scoring engine that turns raw traces into quality signals on the same dashboard. Real teams run all four together — observability without eval is just pretty timelines, eval without traces is just averages.

INSTALL · ONE COMMAND
$ tokrepo install pack/agent-observability-tracing
hand it to your agent, or paste it into your terminal
What's included

7 resources ready to install

Skill#01
Langfuse — Open Source LLM Observability

Langfuse is an open-source LLM engineering platform for tracing, prompt management, evaluation, and debugging AI apps. 24.1K+ GitHub stars. Self-hosted or cloud. MIT.

by Langfuse·190 views
$ tokrepo install langfuse-open-source-llm-observability-49a8eb0b
Skill#02
AgentOps — Observability for AI Agents

Python SDK for AI agent monitoring. LLM cost tracking, session replay, benchmarking, and error analysis. Integrates with CrewAI, LangChain, AutoGen, and more. 5.4K+ stars.

by Script Depot·161 views
$ tokrepo install agentops-observability-ai-agents-d570c84f
Prompt#03
LangSmith — Prompt Debugging and LLM Observability

Debug, test, and monitor LLM applications in production. LangSmith provides trace visualization, prompt playground, dataset evaluation, and regression testing for AI.

by Prompt Lab·199 views
$ tokrepo install langsmith-prompt-debugging-llm-observability-4d9432ea
Skill#04
Phoenix — Open Source AI Observability

Phoenix is an AI observability platform for tracing, evaluating, and debugging LLM apps. 9.1K+ stars. OpenTelemetry, evals, prompt management.

by Arize AI·175 views
$ tokrepo install phoenix-open-source-ai-observability-42fa8573
Skill#05
OpenLIT — OpenTelemetry LLM Observability

Monitor LLM costs, latency, and quality with OpenTelemetry-native tracing. GPU monitoring and guardrails built in. 2.3K+ stars.

by AI Open Source·150 views
$ tokrepo install openlit-opentelemetry-llm-observability-13e3c714
Skill#06
Langtrace — Open Source AI Observability Platform

Open-source observability for LLM apps. Trace OpenAI, Anthropic, and LangChain calls with OpenTelemetry-native instrumentation and a real-time dashboard.

by AI Open Source·155 views
$ tokrepo install langtrace-open-source-ai-observability-platform-a53444d6
Skill#07
Gemini CLI Extension: Observability — Monitoring & Logs

Gemini CLI extension for Google Cloud observability. Set up monitoring, analyze logs, create dashboards, and configure alerts.

by Google · Gemini Team·212 views
$ tokrepo install gemini-cli-extension-observability-monitoring-logs-aa41279c
Frequently asked questions

How is this different from the LLM Observability pack?

LLM Observability is the runtime telemetry layer — per-prompt cost, p95 latency, error rate, version-over-version dashboards. The audience is everyone shipping LLM calls. Agent Observability + Tracing is the debugging layer for agentic systems where a single user request fans out into many LLM calls, tool invocations, and sub-agent handoffs. The audience is the engineer trying to reconstruct what an agent actually did, span by span. Most production teams need both: observability for the dashboard, tracing for the post-mortem.

Do I need all six platforms, or can I pick one?

Start with one trace store: Langfuse if you want self-host and framework-agnostic, LangSmith if you're LangChain-native and want zero-ops, Phoenix if you live in a notebook and want eval-first, Helicone if you want a one-line proxy with no SDK changes. AgentOps adds session replay specifically for agent workflows — pair it with one of the four above. OpenTelemetry isn't a platform; it's the wire format your instrumentation should emit so you can swap backends later without rewriting code.

Can I trace tool calls and sub-agents, not just LLM calls?

Yes — that's the whole point of this pack. OpenInference semantic conventions define span kinds for LLM, CHAIN, RETRIEVER, TOOL, AGENT, and EMBEDDING. Every platform here renders the full tree with parent-child links. The pitfall is that you have to actually instrument the tool dispatcher, not just the LLM SDK — if you only wrap the model call, tool execution time is invisible and you'll mis-attribute slowness to the model.
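A minimal sketch of why the tool dispatcher needs its own spans: routing every tool call through one traced entry point lets time be attributed to TOOL spans rather than to the model. Everything here (`traced`, `dispatch_tool`, `fake_llm`, `slow_search`, the `SPANS` list) is a hypothetical stand-in, not a real SDK; only the span kind labels follow the OpenInference names mentioned above.

```python
import time

SPANS = []  # stand-in for a span exporter


def traced(kind, name, fn, /, *args, **kwargs):
    """Run fn and record a span with its kind, name, status, and wall-clock duration."""
    start = time.perf_counter()
    status = "OK"
    try:
        return fn(*args, **kwargs)
    except Exception:
        status = "ERROR"
        raise
    finally:
        SPANS.append({"kind": kind, "name": name, "status": status,
                      "duration_s": time.perf_counter() - start})


# Hypothetical tool and model stubs for illustration.
def slow_search(query):
    time.sleep(0.05)  # simulate a slow external API
    return []


def fake_llm(prompt):
    return {"tool": "search", "arguments": {"query": prompt}}


TOOLS = {"search": slow_search}


def dispatch_tool(call):
    """Every tool invocation goes through here, so every one gets a TOOL span."""
    return traced("TOOL", call["tool"], TOOLS[call["tool"]], **call["arguments"])


call = traced("LLM", "planner", fake_llm, "q3 revenue")
result = dispatch_tool(call)

time_by_kind = {}
for s in SPANS:
    time_by_kind[s["kind"]] = time_by_kind.get(s["kind"], 0.0) + s["duration_s"]
print(time_by_kind)
```

If only the `fake_llm` call were wrapped, the 50 ms spent in `slow_search` would not appear in any span, and the gap would be easy to blame on the model.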

How much overhead does deep tracing add?

Per-span overhead is sub-millisecond with async batched export. The real cost is storage: a single agent run with 30 LLM calls, 50 tool calls, and full input/output payloads is roughly 200–500 KB. At 10k runs/day that's 2–5 GB/day. Sample 100% of runs that error, hit max retries, or score below your eval threshold, sample 10% of routine runs, and self-host Langfuse or Phoenix to keep the storage bill predictable.
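The storage arithmetic and the sampling rule can be sketched together. The function names, the 0.5 score threshold, and the hash-bucket approach are assumptions for illustration; the 200–500 KB per run, 10k runs/day, and 10% routine rate come from the answer above.

```python
import hashlib


def daily_storage_gb(runs_per_day, kb_per_run, sample_rate=1.0):
    """Storage per day in GB (decimal: 1 GB = 1,000,000 KB)."""
    return runs_per_day * kb_per_run * sample_rate / 1_000_000


def keep_trace(trace_id, had_error, hit_max_retries, eval_score,
               score_threshold=0.5, routine_rate=0.10):
    """Tail-sampling decision, made after the run has finished."""
    if had_error or hit_max_retries:
        return True
    if eval_score is not None and eval_score < score_threshold:
        return True
    # Deterministic bucket from the trace ID so the decision is reproducible.
    bucket = int(hashlib.sha256(trace_id.encode()).hexdigest(), 16) % 10_000
    return bucket < routine_rate * 10_000


low, high = daily_storage_gb(10_000, 200), daily_storage_gb(10_000, 500)
print(f"unsampled: {low:.0f}-{high:.0f} GB/day")

kept = sum(keep_trace(f"t-{i}", False, False, 0.9) for i in range(10_000))
print(f"routine runs kept: {kept} of 10000")
```

The decision has to be made at the end of the run (tail sampling), since errors, retry exhaustion, and eval scores are not known when the first span is emitted.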

Will this work with non-LangChain agents (CrewAI, AutoGen, custom)?

Yes. Langfuse, Phoenix, Helicone, and AgentOps are all framework-agnostic: they accept OpenInference spans from any source. CrewAI ships built-in AgentOps integration; AutoGen has Langfuse adapters; for custom Python agents the OpenInference SDK gives you tracer decorators (such as @tracer.chain and @tracer.tool) and context managers that work without a framework. LangSmith is the one that pushes hardest on LangChain-specific features, but its API also accepts arbitrary spans.

MORE FROM THE ARSENAL

12 packs · 80+ curated resources
