# Judgeval — Tracing + Evaluation for Agent Apps

> Judgeval adds tracing and evaluation to agent apps, helping teams score behavior and monitor live traffic with a small SDK and dashboard workflow.

## Install

```bash
pip install judgeval
```

## Quick Use

1. Install:

   ```bash
   pip install judgeval
   ```

2. Set credentials:

   ```bash
   export JUDGMENT_API_KEY=...
   export JUDGMENT_ORG_ID=...
   ```

3. Verify:
   - Run the README tracing example and confirm at least one trace arrives in the dashboard.

## Intro

Judgeval provides tracing and evaluation for agent apps: a lightweight SDK captures run traces, and key behaviors become scorable, regression-testable metrics, so agent quality moves from gut feel to something you can monitor and compare across releases.

- **Best for:** teams shipping agent backends who need tracing + scoring to catch regressions
- **Works with:** Python agent services, common model SDKs, and production traffic you want to monitor
- **Setup time:** 20–45 minutes

## Practical Notes

- Quantify early: start with 3–5 golden prompts and record a baseline score per release.
- Watch overhead: monitor eval latency and cost, and cap evaluations per request in production.

## Pattern: separate tracing from judging

Treat tracing as the source of truth (what happened) and judging as an asynchronous step (how good it was). A practical rollout:

- Trace everything in staging.
- Pick three high-risk paths (tool-call safety, RAG correctness, refusal behavior).
- Add a small set of evals and expand only once the signal is stable.

## Operational note

Store keys securely and avoid placing sensitive payloads in traces. Redaction/scrubbing should be part of the initial setup.

### FAQ

**Q: Do I need an account?**
A: The README references API keys and a dashboard; plan on setting up an account for full functionality.

**Q: What should I evaluate first?**
A: Tool-call safety, correctness of retrieved facts, and refusal/guardrail compliance.

**Q: How do I keep costs under control?**
A: Sample traffic, cap evaluations per request, and run heavier suites in CI/staging.
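The tracing-vs-judging split above can be sketched in plain Python. This is a generic illustration, not the judgeval API; all names (`make_trace`, `judge_worker`, the placeholder scorer) are hypothetical. The request path only records a trace and enqueues it; a background worker scores it later, so judging never blocks serving.

```python
import queue
import threading

# Hypothetical trace record: what happened (the source of truth).
def make_trace(request_id, prompt, response, tool_calls):
    return {"request_id": request_id, "prompt": prompt,
            "response": response, "tool_calls": tool_calls}

# Judging runs asynchronously off a queue so it never blocks the request path.
judge_queue = queue.Queue()
scores = {}

def judge_worker():
    while True:
        trace = judge_queue.get()
        if trace is None:  # shutdown sentinel
            break
        # Placeholder scorer: a real setup would run an eval suite here.
        scores[trace["request_id"]] = 1.0 if trace["response"] else 0.0
        judge_queue.task_done()

threading.Thread(target=judge_worker, daemon=True).start()

# Request path: record the trace, enqueue judging, return immediately.
trace = make_trace("req-1", "What is 2+2?", "4", [])
judge_queue.put(trace)

judge_queue.join()  # in production the worker drains continuously
print(scores["req-1"])  # -> 1.0
```

The same shape works with a message broker instead of an in-process queue; the point is that the trace is durable before any judging happens.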
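The cost-control advice (sample traffic, cap evaluations per request) can be expressed as a small gate in front of the judging step. A minimal sketch with hypothetical names and thresholds; tune `SAMPLE_RATE` and `MAX_EVALS_PER_REQUEST` to your traffic:

```python
import random

MAX_EVALS_PER_REQUEST = 2   # cap heavy evals on the production path
SAMPLE_RATE = 0.1           # judge roughly 10% of live traffic

def select_evals(candidate_evals, rng=random.random):
    """Decide which evals (if any) to run for one request."""
    if rng() > SAMPLE_RATE:
        return []  # most requests are traced but not judged
    return candidate_evals[:MAX_EVALS_PER_REQUEST]
```

Heavier suites then run only in CI/staging, where latency and cost caps do not apply.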
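The redaction step from the operational note can start as simple pattern scrubbing applied to every payload before it is traced. A minimal sketch with illustrative patterns only; real deployments need patterns matched to their own secret and PII formats:

```python
import re

# Illustrative patterns: an "sk-..."-style API key and an email address.
PATTERNS = [
    (re.compile(r"sk-[A-Za-z0-9]{16,}"), "[REDACTED_API_KEY]"),
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "[REDACTED_EMAIL]"),
]

def scrub(payload: str) -> str:
    """Redact known-sensitive patterns before the payload enters a trace."""
    for pattern, replacement in PATTERNS:
        payload = pattern.sub(replacement, payload)
    return payload

print(scrub("contact alice@example.com with key sk-abcdefghijklmnop"))
# -> contact [REDACTED_EMAIL] with key [REDACTED_API_KEY]
```

Running the scrubber at trace-creation time (rather than at export time) keeps raw secrets out of every downstream store.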
## Source & Thanks

> Source: https://github.com/JudgmentLabs/judgeval
> License: Apache-2.0
> GitHub stars: 1,031 · forks: 93

---

Source: https://tokrepo.com/en/workflows/judgeval-tracing-evaluation-for-agent-apps
Author: Agent Toolkit