Practical Notes
- Quant: define 3 core metrics per agent workflow (latency, tool-call count, success rate) and baseline them before you optimize prompts.
- Quant: keep a replay set of 20 representative runs; compare traces after every change to detect regressions.
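The two bullets above can be sketched in a few lines. This is a minimal, illustrative snapshot of the three core metrics over a replay set; the `Run` record and its field names are hypothetical, not from any specific tracing library:

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class Run:
    """Hypothetical record of one agent run; field names are illustrative."""
    latency_s: float
    tool_calls: int
    success: bool

def baseline(runs: list[Run]) -> dict:
    """Compute the three core metrics (latency, tool-call count, success rate)."""
    latencies = sorted(r.latency_s for r in runs)
    return {
        "p50_latency_s": latencies[len(runs) // 2],
        "avg_tool_calls": mean(r.tool_calls for r in runs),
        "success_rate": sum(r.success for r in runs) / len(runs),
    }

# A tiny replay set stands in for the ~20 representative runs.
replay = [Run(2.1, 4, True), Run(3.4, 6, False), Run(1.8, 3, True)]
print(baseline(replay))
```

Snapshot this dict before touching a prompt; every later change gets compared against it rather than against memory.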
Observability-first iteration
If you can’t answer these questions with data, you’re guessing:
- Which step dominates latency?
- Which tool calls fail most often?
- Which prompt change measurably improved success rate, versus just "felt better"?
Minimal instrumentation strategy
- Trace every run with a stable run id.
- Attach tool-call spans with inputs/outputs (redact secrets).
- Capture final outcomes (pass/fail + reason).
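The three bullets above fit in one small object. This is a stdlib-only sketch, not a real tracing SDK; the redaction key list and span schema are assumptions you would adapt to your backend:

```python
import time
import uuid
from contextlib import contextmanager

SECRET_KEYS = {"api_key", "token", "password"}  # illustrative redaction list

def redact(payload: dict) -> dict:
    """Mask known secret fields before they land in a span."""
    return {k: "***" if k in SECRET_KEYS else v for k, v in payload.items()}

class Trace:
    """Minimal trace: one stable run id, tool-call spans, a final outcome."""

    def __init__(self):
        self.run_id = str(uuid.uuid4())  # stable id for the whole run
        self.spans = []
        self.outcome = None

    @contextmanager
    def tool_span(self, name: str, inputs: dict):
        start = time.monotonic()
        span = {"tool": name, "inputs": redact(inputs)}
        try:
            yield span  # caller records outputs onto the span dict
            span["status"] = "ok"
        except Exception as exc:
            span["status"] = f"error: {exc}"
            raise
        finally:
            span["duration_s"] = time.monotonic() - start
            self.spans.append(span)

    def finish(self, passed: bool, reason: str):
        self.outcome = {"pass": passed, "reason": reason}

trace = Trace()
with trace.tool_span("search", {"query": "docs", "api_key": "sk-123"}) as span:
    span["output"] = "3 results"
trace.finish(passed=True, reason="answer matched expected")
```

Wrap every tool call in `tool_span`, then ship the finished trace to whatever store or dashboard you already have; the point is the shape, not the transport.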
Don’t drown in dashboards
Start with one workflow and one team. Once the metrics are trusted, scale to more services.
FAQ
Q: Do I need to self-host? A: No. The repo documents self-hosting; teams can choose managed options or local-only usage.
Q: What should I instrument first? A: One end-to-end workflow that currently fails or is slow; make it measurable before anything else.
Q: How do I compare prompt changes? A: Use a fixed replay set and compare traces/metrics, not anecdotes.
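A trace/metric comparison can be as simple as diffing two baseline snapshots over the same replay set. A minimal sketch, assuming the metric dicts use the same keys on both sides (the names here are illustrative):

```python
def compare(before: dict, after: dict) -> dict:
    """Per-metric deltas between two baseline snapshots on the same replay set."""
    return {key: round(after[key] - before[key], 4) for key in before}

before = {"p50_latency_s": 2.1, "success_rate": 0.65}
after = {"p50_latency_s": 1.9, "success_rate": 0.80}
print(compare(before, after))  # lower latency plus higher success rate is a real win
```

A negative latency delta together with a higher success rate is evidence; "the answers read better" is an anecdote.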