Skills · May 8, 2026 · 4 min read

Phoenix Evals — LLM-as-Judge Library with Built-in Templates

Phoenix Evals runs LLM-as-judge evaluations on traces or datasets. Pre-built templates: hallucination, relevance, toxicity, QA. Outputs scored DataFrames.

Arize AI · Community
Agent-ready

This asset can be read and installed directly by agents

TokRepo exposes a universal CLI command, an install contract, JSON metadata, a per-adapter plan, and the raw content to help agents assess fit, risk, and next actions.

Native · 98/100 · Policy: allow
Agent surface
Any MCP/CLI agent
Type
Skill
Installation
Single
Trust
New
Entry point
Asset
Universal CLI command
npx tokrepo install 91b1b2a3-8be3-42c3-9366-c71fe29ed30d
Introduction

Phoenix Evals runs LLM-as-judge evaluations on traces or datasets: score outputs for hallucination, retrieval relevance, QA correctness, toxicity, summarization quality, and code readability with battle-tested prompt templates. It returns a scored pandas DataFrame that you can merge back onto spans to filter out the failing runs in the UI. Best for: regression-testing prompts before deploy, finding the bottom 5% of agent runs, and building human-curated datasets from production traces. Works with: OpenAI, Anthropic, Bedrock, VertexAI, or any model usable as a judge. Setup time: 5 minutes.


Quick eval — hallucination + relevance

import pandas as pd
from phoenix.evals import (
    HallucinationEvaluator, RelevanceEvaluator, OpenAIModel, run_evals,
)

df = pd.DataFrame({
    "input": ["Who was the first US president?"] * 3,
    "reference": ["George Washington was the first US president, serving 1789–1797."] * 3,
    "output": [
        "George Washington was the first US president.",        # correct
        "Thomas Jefferson was the first US president.",         # hallucinated
        "George Washington was the third US president.",        # wrong fact
    ],
})

judge = OpenAIModel(model="gpt-4o", temperature=0.0)
hallucination_evals, relevance_evals = run_evals(
    dataframe=df,
    evaluators=[HallucinationEvaluator(judge), RelevanceEvaluator(judge)],
    provide_explanation=True,
)

print(hallucination_evals[["label", "score", "explanation"]])
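The result DataFrames come back index-aligned with the input, so you can join them straight onto the source rows to isolate failures. A minimal pandas sketch; the hallucination_ column prefix is an illustrative choice, and it assumes the template's default "hallucinated" label:

# run_evals preserves the input index, so a plain join lines rows up
scored = df.join(hallucination_evals.add_prefix("hallucination_"))

# Keep only the rows the judge flagged as hallucinated
flagged = scored[scored["hallucination_label"] == "hallucinated"]
print(flagged[["output", "hallucination_explanation"]])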

Run on production traces

import phoenix as px
from phoenix.evals import HallucinationEvaluator, OpenAIModel, run_evals
from phoenix.trace import SpanEvaluations

# Pull recorded spans for the project from the Phoenix server
spans_df = px.Client().get_spans_dataframe(project_name="my-rag-app")

# Adapt span columns to the evaluator's expected inputs
spans_df = spans_df.rename(columns={
    "attributes.input.value": "input",
    "attributes.output.value": "output",
    "attributes.retrieval.documents": "reference",
})

(evals_df,) = run_evals(spans_df, [HallucinationEvaluator(OpenAIModel(model="gpt-4o"))])

# Send eval scores back to the Phoenix UI as span annotations
px.Client().log_evaluations(SpanEvaluations(eval_name="hallucination", dataframe=evals_df))
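If you want more than one signal per span, pass several evaluators in a single run_evals call and log each result set separately; log_evaluations accepts multiple SpanEvaluations objects. A minimal sketch, assuming the span columns have already been renamed as above:

from phoenix.evals import RelevanceEvaluator

judge = OpenAIModel(model="gpt-4o")
hallucination_df, relevance_df = run_evals(
    spans_df,
    [HallucinationEvaluator(judge), RelevanceEvaluator(judge)],
)

# Each evaluation set shows up under its own name in the Phoenix UI
px.Client().log_evaluations(
    SpanEvaluations(eval_name="hallucination", dataframe=hallucination_df),
    SpanEvaluations(eval_name="relevance", dataframe=relevance_df),
)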

Built-in evaluator templates

Evaluator | Labels | What it judges
HallucinationEvaluator | factual / hallucinated | Is the output supported by the reference?
RelevanceEvaluator | relevant / unrelated | Does the retrieved chunk match the query?
QAEvaluator | correct / incorrect | Does the answer match the ground truth?
ToxicityEvaluator | toxic / non-toxic | Hate, harassment, or violence in the output
SummarizationEvaluator | good / poor | Does the summary cover the source faithfully?
CodeReadabilityEvaluator | readable / unreadable | Is the generated code clean and idiomatic?
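The other templates follow the same call pattern and only differ in which columns they read. For example, QAEvaluator compares an answer against a ground-truth reference. A minimal sketch, reusing the judge from the first example and assuming the same input/output/reference column names:

from phoenix.evals import QAEvaluator

qa_df = pd.DataFrame({
    "input": ["What year did Apollo 11 land on the Moon?"],
    "output": ["Apollo 11 landed on the Moon in 1969."],
    "reference": ["Apollo 11 landed on the Moon on July 20, 1969."],
})

(qa_evals,) = run_evals(qa_df, [QAEvaluator(judge)], provide_explanation=True)
print(qa_evals[["label", "score"]])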

FAQ

Q: Why use a smaller LLM as judge? A: Cost. Judging 10K traces with gpt-4o-mini costs ~$2; with gpt-4o it's ~$30, and gpt-4o-mini agrees with gpt-4o on roughly 90% of factual evals. Reserve gpt-4o for the runs where you need to resolve disagreements.
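One way to act on that advice is a two-pass setup: judge everything with gpt-4o-mini, then re-judge only the flagged rows with gpt-4o. A minimal sketch on top of the first example; the variable names are illustrative, and it relies on run_evals preserving the input index:

cheap_judge = OpenAIModel(model="gpt-4o-mini", temperature=0.0)
strong_judge = OpenAIModel(model="gpt-4o", temperature=0.0)

# First pass: cheap judge over the full dataset
(first_pass,) = run_evals(df, [HallucinationEvaluator(cheap_judge)])

# Second pass: re-judge only the rows the cheap judge flagged
flagged_rows = df[first_pass["label"] == "hallucinated"]
(second_pass,) = run_evals(flagged_rows, [HallucinationEvaluator(strong_judge)])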

Q: Can I write a custom evaluator? A: Yes: subclass LLMEvaluator, supply a prompt template with {input}, {output}, and {reference} placeholders, and define the rails (the list of allowed output labels). The framework handles batching, retries, and parsing.
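A rough sketch of the same ingredients in practice. The exact subclassing API varies between releases, so this uses the lower-level llm_classify helper from phoenix.evals, which takes a template, a rail of labels, and a judge model; the template text and label names here are purely illustrative:

from phoenix.evals import OpenAIModel, llm_classify

CONCISENESS_TEMPLATE = """You are judging whether a response is concise.
Question: {input}
Response: {output}
Answer with a single word: "concise" or "verbose"."""

conciseness_evals = llm_classify(
    dataframe=df,                          # reuses the toy DataFrame from the first example
    template=CONCISENESS_TEMPLATE,
    model=OpenAIModel(model="gpt-4o-mini"),
    rails=["concise", "verbose"],
    provide_explanation=True,
)
print(conciseness_evals["label"].value_counts())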

Q: Are these reliable for production gating? A: Treat them as smoke tests, not gates. LLM judges have ~85-92% agreement with humans on the standard tasks. Use evals to surface candidates for human review, not to block deploys silently.


Quick Use

  1. pip install arize-phoenix[evals]
  2. Pick an evaluator (HallucinationEvaluator / QAEvaluator / etc.)
  3. run_evals(df, [evaluator(judge_model)]) — get a scored DataFrame back

Source & Thanks

Built by Arize AI. Licensed under Apache-2.0.

Arize-ai/phoenix — ⭐ 4,500+
