Agent Observability + Tracing
Seven picks for the ML/LLM engineer who needs to answer "why did the agent do that?": LangSmith, Langfuse, Phoenix, Helicone, AgentOps, OpenTelemetry for LLM. Span-by-span traces of tool calls, retries, sub-agents, and reflection loops, not just per-prompt cost dashboards.
What's in this pack
The day an agent silently loops between two tools for 47 turns and returns a confident wrong answer is the day you wish you had span-level traces, not a per-prompt cost chart. This pack is built for the ML/LLM engineer trying to reconstruct what an agent actually did: which sub-agent fired, what arguments went into each tool call, how many retries it ate, what the planner thought before it pivoted.
| # | Asset | Tier | What it traces |
|---|---|---|---|
| 1 | LangSmith | hosted | first-party LangChain / LangGraph spans, dataset replay, eval bridge |
| 2 | Langfuse | open-source | framework-agnostic span trees, prompt versioning, evaluator hooks |
| 3 | Arize Phoenix | open-source | OpenInference spans, built-in retrieval / agent evaluators, notebook-first |
| 4 | Helicone | hybrid | proxy-based tracing, no SDK install, cost + caching + sessions |
| 5 | AgentOps | open-source | agent session replay, tool-call timelines, multi-agent step graphs |
| 6 | OpenTelemetry for LLM | spec | OpenInference + GenAI semantic conventions — vendor-neutral span format |
| 7 | Eval-bridged trace store | pattern | every trace gets a quality score, alerted when score regresses inside a session |
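The "eval-bridged trace store" pattern in row 7 can be sketched in a few lines. This is a minimal illustration, not any platform's API: the function name `score_regressions`, the `drop` threshold of 0.2, and the running-mean baseline are all assumptions chosen for the example.

```python
def score_regressions(scores, drop=0.2):
    """Return indexes of traces whose quality score falls at least `drop`
    below the running mean of the earlier traces in the same session.

    `scores` is the ordered list of per-trace eval scores for one session.
    The running-mean baseline and the default threshold are illustrative
    assumptions, not a rule taken from any specific tool.
    """
    flagged = []
    total = 0.0
    for i, score in enumerate(scores):
        # The first trace has no baseline, so it can never be flagged.
        if i > 0 and (total / i) - score >= drop:
            flagged.append(i)
        total += score
    return flagged


session_scores = [0.9, 0.9, 0.5, 0.9]
regressed = score_regressions(session_scores)
```

In a real deployment the scores would come from whichever evaluator writes back to the trace store, and the flagged indexes would feed the alerting layer.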
How this is different from the LLM Observability pack
If you're not sure which pack to install: LLM Observability is the runtime telemetry layer — token cost, p95 latency, error rate, prompt-version dashboards. The audience is anyone shipping LLM calls to production. Agent Observability + Tracing is the debugging layer for systems where one user request fans out into 10–100 LLM calls, tool invocations, and sub-agent handoffs. The audience is the engineer staring at a 4-minute agent run that returned garbage and trying to figure out which step lied.
A cost dashboard tells you the bill went up 30%. A deep trace tells you the planner sub-agent retried the same search 8 times because the tool returned an empty array and the prompt didn't handle it. You want both, but they answer different questions.
Install in a deliberate order
# Full pack
tokrepo install pack/agent-observability-tracing
# Or layer it up
tokrepo install langfuse # 1. trace store + UI
tokrepo install opentelemetry-llm # 2. instrument once, swap backends later
tokrepo install phoenix # 3. eval bridge for trace-level scoring
Five layers, install in this order:
- Instrumenter: wrap your LLM SDK and agent framework. OpenTelemetry + OpenInference semantic conventions is the right default: instrument once, swap backends without code changes. If you're LangChain-native, the built-in `langchain_core.tracers` writes directly to LangSmith / Langfuse.
- Trace store + UI: pick Langfuse self-hosted (data sovereignty), LangSmith Cloud (zero ops, LangChain-tight), Phoenix local (notebook-first, no infra), or Helicone proxy (no SDK install at all). They all consume OTel spans now.
- Eval bridge — wire your offline evals (LLM-as-judge, retrieval recall, tool-call correctness) into the same trace store so quality scores land on every span. Phoenix and Langfuse both ship this; LangSmith calls it 'feedback'.
- Alerts — fire on trace-level anomalies, not just per-call ones: agent ran >20 steps, retry depth >5, sub-agent never called expected tool, planner output didn't include required schema. These are the failure modes a per-prompt dashboard misses entirely.
- Session replay — AgentOps and Helicone both group spans into 'sessions' (one user request = one session). For multi-agent systems this is non-negotiable. Without it you cannot tell two simultaneous user runs apart in the timeline.
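The trace-level alert rules from the Alerts layer can be expressed as a check over a finished trace. The sketch below is plain Python with hypothetical `Span` and `Trace` types and a hypothetical `retry_depth` attribute; real backends expose their own span schema and alerting hooks, so treat this as the shape of the logic only.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Span:
    name: str
    kind: str             # e.g. "LLM", "TOOL", "AGENT"
    retry_depth: int = 0  # hypothetical attribute: retries that preceded this span


@dataclass
class Trace:
    spans: list[Span] = field(default_factory=list)


def trace_alerts(trace, expected_tools=frozenset(), max_steps=20, max_retry_depth=5):
    """Return human-readable alerts for one finished trace."""
    alerts = []
    if len(trace.spans) > max_steps:
        alerts.append(f"agent ran {len(trace.spans)} steps (limit {max_steps})")
    deepest = max((s.retry_depth for s in trace.spans), default=0)
    if deepest > max_retry_depth:
        alerts.append(f"retry depth {deepest} (limit {max_retry_depth})")
    called = {s.name for s in trace.spans if s.kind == "TOOL"}
    for tool in sorted(set(expected_tools) - called):
        alerts.append(f"expected tool never called: {tool}")
    return alerts


# A run with 21 spans, one of them on its sixth retry, that never calls "calculator".
noisy = Trace([Span("search", "TOOL", retry_depth=6)] + [Span("llm", "LLM") for _ in range(20)])
noisy_alerts = trace_alerts(noisy, expected_tools={"search", "calculator"})
```

None of these rules can be evaluated from a single LLM call, which is why they need the whole trace as input.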
Common pitfalls
- Tracing the LLM call but not the tool call. The model emits a tool call; your code runs the tool; the result feeds the next LLM turn. If you only instrument the LLM SDK, the tool execution is a black hole. Wrap your tool dispatcher with the same tracer.
- No `parent_span_id` on sub-agent handoffs. If sub-agent B is spawned by agent A, B's spans must carry A's trace ID, and B's root span must name the spawning span in A as its parent. Otherwise the UI shows two disconnected timelines and you cannot answer "who called whom".
- Logging full reasoning chains as a single blob. A reflection loop with 30 thoughts shouldn't be one giant string field; it should be 30 sibling spans under a `reflection` parent. Filtering, search, and diff all break on the blob shape.
- Sampling agent traces uniformly. Sample 10% of normal runs, but always keep 100% of runs where the trace exited with an error, hit max retries, or had an eval score below threshold. The bugs you need to debug are exactly the runs you'd otherwise drop.
- Vendor-locking your spans. Use OpenInference / OpenTelemetry GenAI semantic conventions. Every backend in this pack speaks them. Hand-rolling proprietary JSON means rewriting your instrumentation when you migrate.
- No prompt-version → trace-version link. When a trace was generated by prompt version 7 but you've since shipped version 9, the trace UI must surface the version so you can diff old vs new behavior. Langfuse and LangSmith both support this; wire it on day one.
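The sampling pitfall above comes down to deciding after the run has finished, once its outcome is known. Below is a minimal sketch of that keep/drop decision. The function name `keep_trace`, the 0.5 score threshold, and the injectable `rng` parameter are assumptions for illustration; they are not part of any listed platform's API.

```python
import random


def keep_trace(had_error, hit_max_retries, eval_score,
               score_threshold=0.5, base_rate=0.10, rng=random.random):
    """Decide whether to store a finished trace.

    Problem runs are always kept; routine runs are kept at `base_rate`.
    `eval_score` may be None when no evaluator ran on this trace.
    `rng` is injectable so the decision can be tested deterministically.
    """
    if had_error or hit_max_retries:
        return True
    if eval_score is not None and eval_score < score_threshold:
        return True
    return rng() < base_rate


# Routine, healthy run: stored only about 10% of the time.
routine_kept = keep_trace(False, False, 0.9)
# Failed run: always stored.
failed_kept = keep_trace(True, False, None)
```

Because the decision needs the error flag and the eval score, it has to run at the end of the trace rather than at the first span.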
Pair with these packs
Agent Observability is the debugger. The LLM Observability pack is the production dashboard. The Multi-Agent Frameworks pack is the system being traced (LangGraph, CrewAI, AutoGen). The LLM Eval & Guardrails pack is the scoring engine that turns raw traces into quality signals on the same dashboard. Real teams run all four together — observability without eval is just pretty timelines, eval without traces is just averages.
7 resources ready to install
Frequently asked questions
How is this different from the LLM Observability pack?
LLM Observability is the runtime telemetry layer — per-prompt cost, p95 latency, error rate, version-over-version dashboards. The audience is everyone shipping LLM calls. Agent Observability + Tracing is the debugging layer for agentic systems where a single user request fans out into many LLM calls, tool invocations, and sub-agent handoffs. The audience is the engineer trying to reconstruct what an agent actually did, span by span. Most production teams need both: observability for the dashboard, tracing for the post-mortem.
Do I need all six platforms, or can I pick one?
Start with one trace store: Langfuse if you want self-host and framework-agnostic, LangSmith if you're LangChain-native and want zero-ops, Phoenix if you live in a notebook and want eval-first, Helicone if you want a one-line proxy with no SDK changes. AgentOps adds session replay specifically for agent workflows — pair it with one of the four above. OpenTelemetry isn't a platform; it's the wire format your instrumentation should emit so you can swap backends later without rewriting code.
Can I trace tool calls and sub-agents, not just LLM calls?
Yes — that's the whole point of this pack. OpenInference semantic conventions define span kinds for LLM, CHAIN, RETRIEVER, TOOL, AGENT, and EMBEDDING. Every platform here renders the full tree with parent-child links. The pitfall is that you have to actually instrument the tool dispatcher, not just the LLM SDK — if you only wrap the model call, tool execution time is invisible and you'll mis-attribute slowness to the model.
How much overhead does deep tracing add?
Per-span overhead is sub-millisecond with async batched export. The real cost is storage: a single agent run with 30 LLM calls, 50 tool calls, and full input/output payloads is roughly 200–500 KB. At 10k runs/day that's 2–5 GB/day. Sample 100% on errors and high-eval-cost runs, 10% on routine runs, and self-host Langfuse or Phoenix to keep the storage bill predictable.
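The storage figures above are straightforward arithmetic, and the same calculation shows what sampling buys. The helper names below are hypothetical, and the 5% share of flagged runs used in the last line is an assumed number for illustration, not a figure from this pack.

```python
def daily_storage_gb(runs_per_day, kb_per_run, kept_fraction=1.0):
    """Daily trace storage in decimal units (1 GB = 1,000,000 KB)."""
    return runs_per_day * kept_fraction * kb_per_run / 1_000_000


def stored_share(flagged_share, routine_rate=0.10):
    """Share of runs stored when flagged runs are kept at 100%
    and all other runs are sampled at `routine_rate`."""
    return flagged_share + (1.0 - flagged_share) * routine_rate


# Unsampled: 10k runs/day at 200 KB and 500 KB per run.
low = daily_storage_gb(10_000, 200)
high = daily_storage_gb(10_000, 500)

# Sampled, ASSUMING 5% of runs are flagged (errors or low eval scores).
share = stored_share(0.05)
sampled_high = daily_storage_gb(10_000, 500, kept_fraction=share)
```

Under that assumed 5% flag rate, roughly one run in seven is stored, so the sampled volume is a small fraction of the unsampled one.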
Will this work with non-LangChain agents (CrewAI, AutoGen, custom)?
Yes. Langfuse, Phoenix, Helicone, and AgentOps are all framework-agnostic — they accept OpenInference spans from any source. CrewAI ships built-in AgentOps integration; AutoGen has Langfuse adapters; for custom Python agents the OpenInference SDK gives you decorators (@trace) and context managers that work without a framework. LangSmith is the one that pushes hardest on LangChain-specific features, but its API also accepts arbitrary spans.