# Judgeval — Tracing + Evaluation for Agent Apps

> Judgeval adds tracing and evaluation to agent apps, helping teams score behavior and monitor live traffic with a small SDK and dashboard workflow.

## Install

```bash
pip install judgeval
```

## Quick Use

1. Install:

   ```bash
   pip install judgeval
   ```

2. Set credentials:

   ```bash
   export JUDGMENT_API_KEY=...
   export JUDGMENT_ORG_ID=...
   ```

3. Verify:
   - Run the README tracing example and confirm at least one trace arrives in the dashboard.

## Intro

Judgeval provides tracing and evaluation for agent apps: a lightweight SDK captures run traces, and key behaviors become scorable, regression-testable metrics, so agent quality moves from gut feel to something you can monitor and compare across releases.

- **Best for:** teams shipping agent backends who need tracing + scoring to catch regressions
- **Works with:** Python agent services, common model SDKs, and production traffic you want to monitor
- **Setup time:** 20–45 minutes

## Practical Notes

- Quantify early: start with 3–5 golden prompts and record a baseline score per release.
- Watch overhead: monitor eval latency and cost, and cap evaluations per request in production.

## Pattern: separate tracing from judging

Treat tracing as the source of truth (what happened) and judging as an asynchronous step (how good it was). A practical rollout:

- Trace everything in staging.
- Pick three high-risk paths (tool-call safety, RAG correctness, refusal behavior).
- Add a small set of evals and expand only once the signal is stable.

## Operational note

Store keys securely and avoid placing sensitive payloads in traces. Redaction/scrubbing should be part of the initial setup.

### FAQ

**Q: Do I need an account?**
A: The README references API keys and a dashboard; plan on setting up an account for full functionality.

**Q: What should I evaluate first?**
A: Tool-call safety, correctness of retrieved facts, and refusal/guardrail compliance.

**Q: How do I keep costs under control?**
A: Sample traffic, cap evaluations per request, and run heavier suites in CI/staging.
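The tracing-vs-judging split above can be sketched in plain Python. This is a generic illustration, not the judgeval API; all names (`make_trace`, `judge_worker`, the placeholder scorer) are hypothetical. The request path only records a trace and enqueues it; a background worker scores it later, so judging never blocks serving.

```python
import queue
import threading

# Hypothetical trace record: what happened (the source of truth).
def make_trace(request_id, prompt, response, tool_calls):
    return {"request_id": request_id, "prompt": prompt,
            "response": response, "tool_calls": tool_calls}

# Judging runs asynchronously off a queue so it never blocks the request path.
judge_queue = queue.Queue()
scores = {}

def judge_worker():
    while True:
        trace = judge_queue.get()
        if trace is None:  # shutdown sentinel
            break
        # Placeholder scorer: a real setup would run an eval suite here.
        scores[trace["request_id"]] = 1.0 if trace["response"] else 0.0
        judge_queue.task_done()

threading.Thread(target=judge_worker, daemon=True).start()

# Request path: record the trace, enqueue judging, return immediately.
trace = make_trace("req-1", "What is 2+2?", "4", [])
judge_queue.put(trace)

judge_queue.join()  # in production the worker drains continuously
print(scores["req-1"])  # -> 1.0
```

The same shape works with a message broker instead of an in-process queue; the point is that the trace is durable before any judging happens.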
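The cost-control advice (sample traffic, cap evaluations per request) can be expressed as a small gate in front of the judging step. A minimal sketch with hypothetical names and thresholds; tune `SAMPLE_RATE` and `MAX_EVALS_PER_REQUEST` to your traffic:

```python
import random

MAX_EVALS_PER_REQUEST = 2   # cap heavy evals on the production path
SAMPLE_RATE = 0.1           # judge roughly 10% of live traffic

def select_evals(candidate_evals, rng=random.random):
    """Decide which evals (if any) to run for one request."""
    if rng() > SAMPLE_RATE:
        return []  # most requests are traced but not judged
    return candidate_evals[:MAX_EVALS_PER_REQUEST]
```

Heavier suites then run only in CI/staging, where latency and cost caps do not apply.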
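The redaction step from the operational note can start as simple pattern scrubbing applied to every payload before it is traced. A minimal sketch with illustrative patterns only; real deployments need patterns matched to their own secret and PII formats:

```python
import re

# Illustrative patterns: an "sk-..."-style API key and an email address.
PATTERNS = [
    (re.compile(r"sk-[A-Za-z0-9]{16,}"), "[REDACTED_API_KEY]"),
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "[REDACTED_EMAIL]"),
]

def scrub(payload: str) -> str:
    """Redact known-sensitive patterns before the payload enters a trace."""
    for pattern, replacement in PATTERNS:
        payload = pattern.sub(replacement, payload)
    return payload

print(scrub("contact alice@example.com with key sk-abcdefghijklmnop"))
# -> contact [REDACTED_EMAIL] with key [REDACTED_API_KEY]
```

Running the scrubber at trace-creation time (rather than at export time) keeps raw secrets out of every downstream store.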
## Source & Thanks

> Source: https://github.com/JudgmentLabs/judgeval
> License: Apache-2.0
> GitHub stars: 1,031 · forks: 93

---

Source: https://tokrepo.com/en/workflows/judgeval-tracing-evaluation-for-agent-apps
Author: Agent Toolkit