# Phoenix Evals — LLM-as-Judge Library with Built-in Templates

> Phoenix Evals runs LLM-as-judge on traces or datasets. Pre-built templates: hallucination, relevance, toxicity, QA. Outputs scored DataFrames.

## Install

Save the content below to `.claude/skills/` or append to your `CLAUDE.md`:

## Quick Use

1. `pip install arize-phoenix[evals]`
2. Pick an evaluator (`HallucinationEvaluator` / `QAEvaluator` / etc.)
3. `run_evals(df, [evaluator(judge_model)])` — get a scored DataFrame back

---

## Intro

Phoenix Evals runs LLM-as-judge evaluations on traces or datasets — score outputs for hallucination, retrieval relevance, QA correctness, toxicity, summarization quality, and code readability with battle-tested prompt templates. It returns a pandas DataFrame; merge it back onto spans to filter the bad ones in the UI.

Best for: regression testing prompts before deploy, finding the bottom 5% of agent runs, building human-curated datasets from production traces.

Works with: OpenAI, Anthropic, Bedrock, VertexAI — any model usable as a judge.

Setup time: 5 minutes.

---

### Quick eval — hallucination + relevance

```python
import pandas as pd
from phoenix.evals import (
    HallucinationEvaluator,
    RelevanceEvaluator,
    OpenAIModel,
    run_evals,
)

df = pd.DataFrame({
    "input": ["Who was the first US president?"] * 3,
    "reference": ["George Washington was the first US president, serving 1789–1797."] * 3,
    "output": [
        "George Washington was the first US president.",   # correct
        "Thomas Jefferson was the first US president.",    # hallucinated
        "George Washington was the third US president.",   # wrong fact
    ],
})

judge = OpenAIModel(model="gpt-4o", temperature=0.0)

hallucination_evals, relevance_evals = run_evals(
    dataframe=df,
    evaluators=[HallucinationEvaluator(judge), RelevanceEvaluator(judge)],
    provide_explanation=True,
)
print(hallucination_evals[["label", "score", "explanation"]])
```

### Run on production traces

```python
import phoenix as px
from phoenix.evals import HallucinationEvaluator, OpenAIModel, run_evals
from phoenix.trace import SpanEvaluations

# Pull spans for the project from Phoenix
spans_df = px.Client().query_spans(project_name="my-rag-app")

# Adapt span columns to eval inputs
spans_df = spans_df.rename(columns={
    "attributes.input.value": "input",
    "attributes.output.value": "output",
    "attributes.retrieval.documents": "reference",
})

(evals_df,) = run_evals(spans_df, [HallucinationEvaluator(OpenAIModel(model="gpt-4o"))])

# Send eval scores back to the Phoenix UI
px.Client().log_evaluations(SpanEvaluations(eval_name="hallucination", dataframe=evals_df))
```

### Built-in evaluator templates

| Evaluator | Labels | What it judges |
|---|---|---|
| `HallucinationEvaluator` | factual / hallucinated | Is the output supported by the reference? |
| `RelevanceEvaluator` | relevant / unrelated | Does the retrieved chunk match the query? |
| `QAEvaluator` | correct / incorrect | Does the answer match the ground truth? |
| `ToxicityEvaluator` | toxic / non-toxic | Hate, harassment, or violence in the output |
| `SummarizationEvaluator` | good / poor | Does the summary cover the source faithfully? |
| `CodeReadabilityEvaluator` | readable / unreadable | Is the generated code clean and idiomatic? |

---

### FAQ

**Q: Why use a smaller LLM as judge?**
A: Cost. Judging 10K traces with gpt-4o-mini costs ~$2; with gpt-4o it's ~$30. gpt-4o-mini agrees with gpt-4o on roughly 90% of factual evals, so reserve gpt-4o for the runs that resolve disagreements.

**Q: Can I write a custom evaluator?**
A: Yes — subclass `LLMEvaluator`, supply a prompt template with `{input}`, `{output}`, and `{reference}` placeholders, and a set of rails (allowed labels). The framework handles batching, retries, and parsing.
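If you don't need a full evaluator class, the same idea can be sketched with the lower-level `llm_classify` helper: pass your own template string and a `rails` list of allowed labels. The "conciseness" template, labels, and sample data below are illustrative assumptions, not a Phoenix built-in.

```python
# Minimal sketch of a custom eval via phoenix.evals.llm_classify.
# The template text, rails, and example rows are illustrative assumptions.
import pandas as pd
from phoenix.evals import OpenAIModel, llm_classify

CONCISENESS_TEMPLATE = """You are judging whether a response is concise.

[Question]: {input}
[Response]: {output}

A "concise" response answers the question without unnecessary filler.
Answer with a single word: "concise" or "verbose"."""

rails = ["concise", "verbose"]  # allowed labels the judge must choose from

df = pd.DataFrame({
    "input": ["What port does HTTPS use?"],
    "output": ["HTTPS typically uses port 443, which is the standard default port."],
})

results = llm_classify(
    dataframe=df,
    model=OpenAIModel(model="gpt-4o-mini", temperature=0.0),
    template=CONCISENESS_TEMPLATE,
    rails=rails,
    provide_explanation=True,  # adds an "explanation" column next to "label"
)
print(results[["label", "explanation"]])
```

The returned DataFrame is row-aligned with the input, so labels can be merged back onto the source rows or spans the same way as `run_evals` output.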
**Q: Are these reliable for production gating?**
A: Treat them as smoke tests, not gates. LLM judges show roughly 85–92% agreement with humans on the standard tasks. Use evals to surface candidates for human review, not to block deploys silently.

---

## Source & Thanks

> Built by [Arize AI](https://github.com/Arize-ai). Licensed under Apache-2.0.
>
> [Arize-ai/phoenix](https://github.com/Arize-ai/phoenix) — ⭐ 4,500+

---

Source: https://tokrepo.com/en/workflows/phoenix-evals-llm-as-judge-library-with-built-in-templates
Author: Arize AI