Practical Notes
- Quantify: start with 3–5 golden prompts and record a baseline score per release (a minimal runner sketch follows this list).
- Quantify: monitor eval latency and cost; cap evaluations per request in production.
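A minimal sketch of that baseline loop, assuming a hypothetical `run_model` call and a toy keyword-overlap scorer; swap in your real model client and metric:

```python
import json
from datetime import date

def run_model(prompt: str) -> str:
    """Hypothetical stand-in for your actual model call."""
    return "stubbed response mentioning 30 days and receipt"

def score(response: str, expected_keywords: list[str]) -> float:
    """Toy scorer: fraction of expected keywords present in the response."""
    hits = sum(1 for kw in expected_keywords if kw.lower() in response.lower())
    return hits / len(expected_keywords)

# 3-5 golden prompts, each with the facts a good answer must mention.
GOLDEN = [
    {"prompt": "What is the refund window?", "expect": ["30 days", "receipt"]},
    {"prompt": "Summarize the privacy policy.", "expect": ["data", "retention"]},
    {"prompt": "Can the agent delete my account?", "expect": ["confirmation"]},
]

def baseline(release: str) -> dict:
    scores = [score(run_model(g["prompt"]), g["expect"]) for g in GOLDEN]
    record = {"release": release, "date": date.today().isoformat(),
              "mean_score": sum(scores) / len(scores), "per_prompt": scores}
    # Append to a log so each release's baseline is comparable over time.
    with open("golden_baseline.jsonl", "a") as f:
        f.write(json.dumps(record) + "\n")
    return record

if __name__ == "__main__":
    print(baseline(release="v1.2.0"))
```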
Pattern: separate tracing from judging
Treat tracing as the source of truth (what happened), and judging as an asynchronous step (how good it was).
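One way to wire that separation, sketched with Python's standard library: the request path appends to the trace log synchronously, while a background worker judges events off a queue. The judge heuristic here is a hypothetical placeholder.

```python
import queue
import threading

trace_log = []               # source of truth: what happened
judge_queue = queue.Queue()  # hand-off to the asynchronous judge

def trace(event: dict) -> None:
    """Synchronous and always-on: record exactly what happened."""
    trace_log.append(event)
    judge_queue.put(event)   # enqueue for judging; never block the request path

def judge_worker() -> None:
    """Asynchronous judge: scores events off the request path."""
    while True:
        event = judge_queue.get()
        if event is None:    # sentinel: shut down
            break
        # Hypothetical judge: flag outputs that leak the word "password".
        verdict = "fail" if "password" in event["output"].lower() else "pass"
        print(f"judged {event['id']}: {verdict}")
        judge_queue.task_done()

threading.Thread(target=judge_worker, daemon=True).start()

# The request path only traces; judging happens off to the side.
trace({"id": "req-1", "input": "reset my password",
       "output": "Sure, your password is hunter2"})
trace({"id": "req-2", "input": "store hours?", "output": "Open 9-5 weekdays."})
judge_queue.join()           # wait for judging to drain (demo only)
judge_queue.put(None)        # stop the worker
```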
A practical rollout:
- Trace everything in staging.
- Pick three high-risk paths (tool-call safety, RAG correctness, refusal behavior).
- Add a small set of evals and expand only once the signal is stable (a sketch of such a suite follows this list).
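A sketch of what such a small suite might look like; the three check functions are hypothetical heuristics, meant to be replaced with real judges as the signal stabilizes:

```python
DESTRUCTIVE_TOOLS = {"delete_account", "send_payment"}

def tool_call_safety(trace: dict) -> bool:
    """Destructive tools must only run after an explicit user confirmation."""
    calls = trace.get("tool_calls", [])
    return all(c["name"] not in DESTRUCTIVE_TOOLS or c.get("confirmed")
               for c in calls)

def rag_correctness(trace: dict) -> bool:
    """Cheap grounding proxy: the answer should reuse some retrieved text."""
    answer = trace.get("output", "").lower()
    chunks = trace.get("retrieved", [])
    return any(chunk.lower()[:40] in answer for chunk in chunks) if chunks else True

def refusal_compliance(trace: dict) -> bool:
    """Prompts flagged as disallowed should be met with a refusal."""
    if not trace.get("disallowed"):
        return True
    return "can't help" in trace.get("output", "").lower()

EVALS = [tool_call_safety, rag_correctness, refusal_compliance]

def run_evals(trace: dict) -> dict:
    return {fn.__name__: fn(trace) for fn in EVALS}

example = {"output": "I can't help with that.", "disallowed": True,
           "tool_calls": [{"name": "send_payment", "confirmed": False}]}
print(run_evals(example))
# -> {'tool_call_safety': False, 'rag_correctness': True, 'refusal_compliance': True}
```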
Operational Note
Store API keys securely and keep sensitive payloads out of traces; redaction/scrubbing should be part of the initial setup, not an afterthought (a minimal sketch follows).
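A minimal redaction sketch, assuming regex-based scrubbing at the trace boundary; the patterns are illustrative and should be extended for whatever your payloads actually contain:

```python
import re

# Hypothetical patterns; add more for your own data shapes.
PATTERNS = [
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), "<EMAIL>"),
    (re.compile(r"\b(?:\d[ -]?){13,16}\b"), "<CARD>"),
    (re.compile(r"(?i)(api[_-]?key\s*[:=]\s*)\S+"), r"\1<REDACTED>"),
]

def scrub(text: str) -> str:
    """Apply every redaction pattern before anything is written to a trace."""
    for pattern, repl in PATTERNS:
        text = pattern.sub(repl, text)
    return text

# Scrub at the trace boundary so raw payloads never leave the process.
print(scrub("Contact me at jane@example.com, api_key=sk-12345"))
# -> Contact me at <EMAIL>, api_key=<REDACTED>
```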
FAQ
Q: Do I need an account? A: The README references API keys and a dashboard; plan on setting up an account for full functionality.
Q: What should I evaluate first? A: Tool-call safety, correctness of retrieved facts, and refusal/guardrail compliance.
Q: How do I keep costs under control? A: Sample traffic, cap evaluations per request, and run heavier suites in CI/staging.
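A sketch of the first two levers together; the sample rate and the cap are placeholder values to tune against your traffic and budget:

```python
import random

SAMPLE_RATE = 0.05           # judge roughly 5% of production traffic
MAX_EVALS_PER_REQUEST = 2    # hard cap on eval work per sampled request

def select_evals(available: list[str]) -> list[str]:
    """Pick which evals (if any) run for one production request."""
    if random.random() > SAMPLE_RATE:
        return []  # most traffic is traced, not judged
    # Rotate evals under the cap so coverage stays broad across requests.
    return random.sample(available, min(MAX_EVALS_PER_REQUEST, len(available)))

print(select_evals(["tool_call_safety", "rag_correctness", "refusal_compliance"]))
```

Heavier suites (full golden-prompt runs, expensive judges) stay out of the request path entirely and run against CI/staging builds instead.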