TOKREPO · AI Arsenal

Stable

Agent Observability + Tracing

Seven picks for the ML/LLM engineer who needs to answer 'why did the agent do that?': LangSmith, Langfuse, Phoenix, Helicone, AgentOps, OpenTelemetry for LLM, and an eval-bridged trace store pattern. Span-by-span traces of tool calls, retries, sub-agents, and reflection loops, not just per-prompt cost dashboards.

7 resources

About this pack

What's in this pack

The day an agent silently loops between two tools for 47 turns and returns a confident wrong answer is the day you wish you had span-level traces, not a per-prompt cost chart. This pack is built for the ML/LLM engineer trying to reconstruct what an agent actually did: which sub-agent fired, what arguments went into each tool call, how many retries it ate, what the planner thought before it pivoted.

#	Asset	Tier	What it traces
1	LangSmith	hosted	first-party LangChain / LangGraph spans, dataset replay, eval bridge
2	Langfuse	open-source	framework-agnostic span trees, prompt versioning, evaluator hooks
3	Arize Phoenix	open-source	OpenInference spans, built-in retrieval / agent evaluators, notebook-first
4	Helicone	hybrid	proxy-based tracing, no SDK install, cost + caching + sessions
5	AgentOps	open-source	agent session replay, tool-call timelines, multi-agent step graphs
6	OpenTelemetry for LLM	spec	OpenInference + GenAI semantic conventions — vendor-neutral span format
7	Eval-bridged trace store	pattern	every trace gets a quality score, alerted when score regresses inside a session

How this is different from the LLM Observability pack

If you're not sure which pack to install: LLM Observability is the runtime telemetry layer — token cost, p95 latency, error rate, prompt-version dashboards. The audience is anyone shipping LLM calls to production. Agent Observability + Tracing is the debugging layer for systems where one user request fans out into 10–100 LLM calls, tool invocations, and sub-agent handoffs. The audience is the engineer staring at a 4-minute agent run that returned garbage and trying to figure out which step lied.

A cost dashboard tells you the bill went up 30%. A deep trace tells you the planner sub-agent retried the same search 8 times because the tool returned an empty array and the prompt didn't handle it. You want both, but they answer different questions.

Install in a deliberate order

# Full pack
tokrepo install pack/agent-observability-tracing

# Or layer it up
tokrepo install langfuse        # 1. trace store + UI
tokrepo install opentelemetry-llm  # 2. instrument once, swap backends later
tokrepo install phoenix         # 3. eval bridge for trace-level scoring

Five layers, install in this order:

Instrumenter: wrap your LLM SDK and agent framework. OpenTelemetry with the OpenInference semantic conventions is the right default, because you instrument once and swap backends without code changes. If you're LangChain-native, the built-in tracer in langchain_core.tracers writes directly to LangSmith, and Langfuse attaches through its own LangChain callback handler.
Trace store + UI: pick Langfuse self-hosted (data sovereignty), LangSmith Cloud (zero ops, LangChain-tight), Phoenix local (notebook-first, no infra), or Helicone proxy (no SDK install at all). They all consume OTel spans now.
Eval bridge: wire your offline evals (LLM-as-judge, retrieval recall, tool-call correctness) into the same trace store so quality scores land on every span. Phoenix and Langfuse both ship this; LangSmith calls it 'feedback'.
Alerts: fire on trace-level anomalies, not just per-call ones, such as an agent that ran >20 steps, retry depth >5, a sub-agent that never called its expected tool, or planner output that is missing the required schema. These are the failure modes a per-prompt dashboard misses entirely.
Session replay: AgentOps and Helicone both group spans into 'sessions' (one user request = one session). For multi-agent systems this is non-negotiable, because without it you cannot tell two simultaneous user runs apart in the timeline.
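The trace-level alert rules in the Alerts layer can be sketched in plain Python. This is a minimal illustration, not any backend's alerting API: the Span record, the trace_alerts function, and the retry_index field are hypothetical names, and the thresholds simply mirror the ">20 steps" and "retry depth >5" examples above.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """Hypothetical minimal span record; real backends carry many more fields."""
    name: str
    kind: str             # e.g. "LLM", "TOOL", "AGENT"
    retry_index: int = 0  # 0 for the first attempt, 1 for the first retry, ...


def trace_alerts(spans, expected_tools=(), max_steps=20, max_retry_depth=5):
    """Return a list of human-readable alerts for one agent trace."""
    alerts = []
    # A "step" here is any LLM or tool span (an assumption of this sketch).
    steps = [s for s in spans if s.kind in ("LLM", "TOOL")]
    if len(steps) > max_steps:
        alerts.append(f"agent ran {len(steps)} steps (limit {max_steps})")
    deepest = max((s.retry_index for s in spans), default=0)
    if deepest > max_retry_depth:
        alerts.append(f"retry depth {deepest} (limit {max_retry_depth})")
    called = {s.name for s in spans if s.kind == "TOOL"}
    for tool in expected_tools:
        if tool not in called:
            alerts.append(f"expected tool '{tool}' was never called")
    return alerts


# One planner call followed by the same search attempted 8 times.
trace = [Span("planner", "LLM")] + [
    Span("search", "TOOL", retry_index=i) for i in range(8)
]
alerts = trace_alerts(trace, expected_tools=("search", "fetch_page"))
```

None of these rules can be evaluated from a single LLM call; each one needs the whole trace, which is why they belong in the trace store and not in a per-prompt dashboard.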

Common pitfalls

Tracing the LLM call but not the tool call. The model emits a tool call; your code runs the tool; the result feeds the next LLM turn. If you only instrument the LLM SDK, the tool execution is a black hole. Wrap your tool dispatcher with the same tracer.
No parent_span_id on sub-agent handoffs. If sub-agent B is spawned by agent A, B's spans must share A's trace ID, and B's root span must record the spawning span in A as its parent_span_id. Otherwise the UI shows two disconnected timelines and you cannot answer 'who called whom'.
Logging full reasoning chains as a single blob. A reflection loop with 30 thoughts shouldn't be one giant string field — it should be 30 sibling spans under a reflection parent. Filtering, search, and diff all break on the blob shape.
Sampling agent traces uniformly. Sample 10% of normal runs, but always keep 100% of runs where the trace exited with an error, hit max retries, or had an eval score below threshold. The bugs you need to debug are exactly the runs you'd otherwise drop.
Vendor-locking your spans. Use OpenInference / OpenTelemetry GenAI semantic conventions. Every backend in this pack speaks them. Hand-rolling proprietary JSON means rewriting your instrumentation when you migrate.
No prompt-version → trace-version link. When a trace was generated by prompt version 7 but you've since shipped version 9, the trace UI must surface the version so you can diff old vs new behavior. Langfuse and LangSmith both support this; wire it on day one.
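The sampling pitfall above amounts to a keep-or-drop decision made after a run finishes. Below is a minimal sketch of that decision in plain Python. The keep_trace function, its parameters, and the 0.5 score threshold are illustrative assumptions, not part of any platform in this pack.

```python
import hashlib


def keep_trace(trace_id, had_error, hit_max_retries, eval_score,
               score_threshold=0.5, base_rate=0.10):
    """Decide, once a run has finished, whether to store its trace."""
    # Always keep the runs that are worth debugging.
    if had_error or hit_max_retries:
        return True
    if eval_score is not None and eval_score < score_threshold:
        return True
    # Deterministic sample of the rest: hash the trace ID into [0, 1)
    # so every service makes the same decision for the same trace.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < base_rate


# Roughly 10% of healthy runs survive; every failing run does.
healthy_kept = sum(
    keep_trace(f"trace-{i}", False, False, 0.9) for i in range(10_000)
)
```

Hashing the trace ID instead of calling a random number generator is a design choice: all spans of one trace get the same verdict, even when they are exported by different processes.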

Pair with these packs

Agent Observability is the debugger. The LLM Observability pack is the production dashboard. The Multi-Agent Frameworks pack is the system being traced (LangGraph, CrewAI, AutoGen). The LLM Eval & Guardrails pack is the scoring engine that turns raw traces into quality signals on the same dashboard. Real teams run all four together — observability without eval is just pretty timelines, eval without traces is just averages.

INSTALL · ONE COMMAND

$ tokrepo install pack/agent-observability-tracing

hand it to your agent, or paste it into your terminal

What's included

7 ready-to-install resources

Skill#01

Langfuse — Open Source LLM Observability

Langfuse is an open-source LLM engineering platform for tracing, prompt management, evaluation, and debugging AI apps. 24.1K+ GitHub stars. Self-hosted or cloud. MIT.

by Langfuse·343 views

$ tokrepo install langfuse-open-source-llm-observability-49a8eb0b

Skill#02

AgentOps — Observability for AI Agents

Python SDK for AI agent monitoring. LLM cost tracking, session replay, benchmarking, and error analysis. Integrates with CrewAI, LangChain, AutoGen, and more. 5.4K+ stars.

by Script Depot·276 views

$ tokrepo install agentops-observability-ai-agents-d570c84f

Prompt#03

LangSmith — Prompt Debugging and LLM Observability

Debug, test, and monitor LLM applications in production. LangSmith provides trace visualization, prompt playground, dataset evaluation, and regression testing for AI.

by Prompt Lab·371 views

$ tokrepo install langsmith-prompt-debugging-llm-observability-4d9432ea

Skill#04

Phoenix — Open Source AI Observability

Phoenix is an AI observability platform for tracing, evaluating, and debugging LLM apps. 9.1K+ stars. OpenTelemetry, evals, prompt management.

by Arize AI·324 views

$ tokrepo install phoenix-open-source-ai-observability-42fa8573

Skill#05

OpenLIT — OpenTelemetry LLM Observability

Monitor LLM costs, latency, and quality with OpenTelemetry-native tracing. GPU monitoring and guardrails built in. 2.3K+ stars.

by AI Open Source·330 views

$ tokrepo install openlit-opentelemetry-llm-observability-13e3c714

Skill#06

Langtrace — Open Source AI Observability Platform

Open-source observability for LLM apps. Trace OpenAI, Anthropic, and LangChain calls with OpenTelemetry-native instrumentation and a real-time dashboard.

by AI Open Source·289 views

$ tokrepo install langtrace-open-source-ai-observability-platform-a53444d6

Skill#07

Gemini CLI Extension: Observability — Monitoring & Logs

Gemini CLI extension for Google Cloud observability. Set up monitoring, analyze logs, create dashboards, and configure alerts.

by Google · Gemini Team·377 views

$ tokrepo install gemini-cli-extension-observability-monitoring-logs-aa41279c

Frequently asked questions

How is this different from the LLM Observability pack?

LLM Observability is the runtime telemetry layer — per-prompt cost, p95 latency, error rate, version-over-version dashboards. The audience is everyone shipping LLM calls. Agent Observability + Tracing is the debugging layer for agentic systems where a single user request fans out into many LLM calls, tool invocations, and sub-agent handoffs. The audience is the engineer trying to reconstruct what an agent actually did, span by span. Most production teams need both: observability for the dashboard, tracing for the post-mortem.

Do I need all of these platforms, or can I pick one?

Start with one trace store: Langfuse if you want self-host and framework-agnostic, LangSmith if you're LangChain-native and want zero-ops, Phoenix if you live in a notebook and want eval-first, Helicone if you want a one-line proxy with no SDK changes. AgentOps adds session replay specifically for agent workflows — pair it with one of the four above. OpenTelemetry isn't a platform; it's the wire format your instrumentation should emit so you can swap backends later without rewriting code.

Can I trace tool calls and sub-agents, not just LLM calls?

Yes — that's the whole point of this pack. OpenInference semantic conventions define span kinds for LLM, CHAIN, RETRIEVER, TOOL, AGENT, and EMBEDDING. Every platform here renders the full tree with parent-child links. The pitfall is that you have to actually instrument the tool dispatcher, not just the LLM SDK — if you only wrap the model call, tool execution time is invisible and you'll mis-attribute slowness to the model.
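To make the "instrument the tool dispatcher" advice concrete, here is a minimal sketch using only the Python standard library. The span context manager, the FINISHED list, and dispatch_tool are hypothetical stand-ins for a real tracer, not the OpenInference or OpenTelemetry API; the span kinds are borrowed from the OpenInference list above.

```python
import contextvars
import time
import uuid
from contextlib import contextmanager

# Hypothetical in-memory tracer for illustration; not any vendor's SDK.
_current = contextvars.ContextVar("current_span", default=None)
FINISHED = []  # finished spans, appended in completion order


@contextmanager
def span(name, kind):
    """Open a span that inherits trace_id and parent from the active span."""
    parent = _current.get()
    record = {
        "name": name,
        "kind": kind,  # e.g. "AGENT", "LLM", "TOOL"
        "span_id": uuid.uuid4().hex[:16],
        "trace_id": parent["trace_id"] if parent else uuid.uuid4().hex,
        "parent_span_id": parent["span_id"] if parent else None,
    }
    token = _current.set(record)
    start = time.perf_counter()
    try:
        yield record
    finally:
        record["duration_s"] = time.perf_counter() - start
        _current.reset(token)
        FINISHED.append(record)


TOOLS = {"search": lambda query: []}  # stand-in tool that returns no results


def dispatch_tool(name, **kwargs):
    """Run a tool inside its own TOOL span so its latency is visible."""
    with span(name, "TOOL") as record:
        record["input"] = kwargs
        record["output"] = TOOLS[name](**kwargs)
        return record["output"]


with span("planner", "AGENT") as root:
    with span("choose_tool", "LLM"):
        pass  # the model call would go here
    dispatch_tool("search", query="span-level tracing")
```

Because the tool span is opened inside the dispatcher, an empty result or a slow tool shows up as its own child of the planner span, with its own duration, instead of being folded into the model call.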

How much overhead does deep tracing add?

Per-span overhead is sub-millisecond with async batched export. The real cost is storage: a single agent run with 30 LLM calls, 50 tool calls, and full input/output payloads is roughly 200–500 KB. At 10k runs/day that's 2–5 GB/day before sampling. Keep 100% of runs that error or score below your eval threshold, sample 10% of routine runs, and self-host Langfuse or Phoenix to keep the storage bill predictable.
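The storage figures in the answer above can be checked with a few lines of arithmetic. The 5% share of runs kept at full rate is an assumption made purely for illustration; substitute your own error and low-score rate.

```python
def daily_storage_gb(runs_per_day, kb_per_run, kept_fraction=1.0):
    """Daily trace volume in decimal units (1 GB = 1,000,000 KB)."""
    return runs_per_day * kb_per_run * kept_fraction / 1_000_000


# Unsampled: 10k runs/day at 200 KB and 500 KB per run.
low = daily_storage_gb(10_000, 200)
high = daily_storage_gb(10_000, 500)

# Assumption for illustration: 5% of runs are kept at 100% (errors or
# flagged by evals), and the other 95% are routine runs sampled at 10%.
kept = 0.05 * 1.0 + 0.95 * 0.10
sampled_low = daily_storage_gb(10_000, 200, kept)
sampled_high = daily_storage_gb(10_000, 500, kept)
```

Under that assumption about 14.5% of runs are stored, which brings the 2–5 GB/day range down to roughly 0.3–0.7 GB/day.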

Will this work with non-LangChain agents (CrewAI, AutoGen, custom)?

Yes. Langfuse, Phoenix, Helicone, and AgentOps are all framework-agnostic — they accept OpenInference spans from any source. CrewAI ships built-in AgentOps integration; AutoGen has Langfuse adapters; for custom Python agents the OpenInference SDK gives you decorators (@trace) and context managers that work without a framework. LangSmith is the one that pushes hardest on LangChain-specific features, but its API also accepts arbitrary spans.
