Skills · May 8, 2026 · 4 min read

Phoenix Evals — LLM-as-Judge Library with Built-in Templates

Phoenix Evals runs LLM-as-judge evaluations on traces or datasets. Pre-built templates: hallucination, relevance, toxicity, QA. Outputs scored DataFrames.

Agent-ready

This asset can be read and installed directly by agents.

TokRepo exposes a universal CLI command, an install contract, JSON metadata, an adapter-specific plan, and the raw content so agents can assess compatibility, risk, and next steps.

Native · 98/100 · Policy: allow
Agent surface: any MCP/CLI agent
Type: Skill
Installation: Single
Trust: New
Entry: Asset

Universal CLI command
npx tokrepo install 91b1b2a3-8be3-42c3-9366-c71fe29ed30d
Introduction

Phoenix Evals runs LLM-as-judge evaluations on traces or datasets — score outputs for hallucination, retrieval relevance, QA correctness, toxicity, summarization quality, and code readability with battle-tested prompt templates. Returns a pandas DataFrame; merge back to spans to filter the bad ones in the UI. Best for: regression testing prompts before deploy, finding the bottom 5% of agent runs, building human-curated datasets from production traces. Works with: OpenAI, Anthropic, Bedrock, VertexAI, any model usable as a judge. Setup time: 5 minutes.


Quick eval — hallucination + relevance

import pandas as pd
from phoenix.evals import (
    HallucinationEvaluator, RelevanceEvaluator, OpenAIModel, run_evals,
)

df = pd.DataFrame({
    "input": ["Who was the first US president?"] * 3,
    "reference": ["George Washington was the first US president, serving 1789–1797."] * 3,
    "output": [
        "George Washington was the first US president.",        # correct
        "Thomas Jefferson was the first US president.",         # hallucinated
        "George Washington was the third US president.",        # wrong fact
    ],
})

judge = OpenAIModel(model="gpt-4o", temperature=0.0)
hallucination_evals, relevance_evals = run_evals(
    dataframe=df,
    evaluators=[HallucinationEvaluator(judge), RelevanceEvaluator(judge)],
    provide_explanation=True,
)

print(hallucination_evals[["label", "score", "explanation"]])

Run on production traces

import phoenix as px
from phoenix.evals import HallucinationEvaluator, OpenAIModel, run_evals
from phoenix.trace import SpanEvaluations

# Pull traces from the Phoenix project (pass start_time/end_time to limit to, e.g., the last 24h)
spans_df = px.Client().query_spans(project_name="my-rag-app")

# Adapt span columns to eval inputs
spans_df = spans_df.rename(columns={
    "attributes.input.value": "input",
    "attributes.output.value": "output",
    "attributes.retrieval.documents": "reference",
})

(evals_df,) = run_evals(spans_df, [HallucinationEvaluator(OpenAIModel(model="gpt-4o"))])

# Send eval scores back to the Phoenix UI
px.Client().log_evaluations(SpanEvaluations(eval_name="hallucination", dataframe=evals_df))
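
The eval DataFrame keeps the span-ID index of the spans it was computed from, so a plain pandas join also surfaces the worst spans locally. A minimal sketch, assuming the renamed spans_df above and the hallucination evaluator's label values:

# Join eval results back onto the spans and pull out the flagged ones.
# (Add provide_explanation=True to run_evals above if you also want an explanation column.)
flagged = spans_df.join(evals_df[["label", "score"]])
worst = flagged[flagged["label"] == "hallucinated"]

print(f"{len(worst)} of {len(flagged)} spans flagged as hallucinated")
print(worst[["input", "output"]].head())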

Built-in evaluator templates

Evaluator                 | Labels                 | What it judges
HallucinationEvaluator    | factual / hallucinated | Is the output supported by the reference?
RelevanceEvaluator        | relevant / unrelated   | Does the retrieved chunk match the query?
QAEvaluator               | correct / incorrect    | Does the answer match the ground truth?
ToxicityEvaluator         | toxic / non-toxic      | Hate, harassment, or violence in the output
SummarizationEvaluator    | good / poor            | Does the summary cover the source faithfully?
CodeReadabilityEvaluator  | readable / unreadable  | Is the generated code clean and idiomatic?
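
All of these share the same interface and the same three input columns (input, reference, output). A minimal sketch with QAEvaluator, using an illustrative single-row dataset:

import pandas as pd
from phoenix.evals import OpenAIModel, QAEvaluator, run_evals

# QAEvaluator grades "output" against the ground-truth "reference" for each "input".
qa_df = pd.DataFrame({
    "input": ["In what year did Apollo 11 land on the Moon?"],
    "reference": ["Apollo 11 landed on the Moon in 1969."],
    "output": ["It landed in 1969."],
})

(qa_evals,) = run_evals(
    dataframe=qa_df,
    evaluators=[QAEvaluator(OpenAIModel(model="gpt-4o-mini"))],
    provide_explanation=True,
)
print(qa_evals[["label", "score"]])  # expect the label "correct"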

FAQ

Q: Why use a smaller LLM as judge? A: Cost. Judging 10K traces with gpt-4o-mini costs roughly $2; with gpt-4o it's roughly $30. gpt-4o-mini agrees with gpt-4o on roughly 90% of factual evals, so use gpt-4o only for the runs where you need to resolve disagreements.
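
Back-of-envelope behind those numbers, as a sketch: it assumes roughly 800 prompt tokens plus 100 completion tokens per judgment and typical per-million-token list prices for the two models (both the token counts and the prices are assumptions here, not from the source; check current pricing).

# Rough cost model for judging 10K traces; token counts and prices are assumed.
traces = 10_000
prompt_toks, completion_toks = 800, 100  # per judgment (assumed)

def judge_cost(price_in_per_1m: float, price_out_per_1m: float) -> float:
    per_trace = (prompt_toks * price_in_per_1m + completion_toks * price_out_per_1m) / 1e6
    return traces * per_trace

print(f"gpt-4o-mini: ~${judge_cost(0.15, 0.60):.2f}")   # ≈ $1.80
print(f"gpt-4o:      ~${judge_cost(2.50, 10.00):.2f}")  # ≈ $30.00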

Q: Can I write a custom evaluator? A: Yes — subclass LLMEvaluator, supply a prompt template with {input}, {output}, and {reference} placeholders, plus rails (the allowed output labels). The framework handles batching, retries, and parsing.
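
For a lighter-weight path than subclassing, the library also exposes llm_classify, which takes a custom template and rails directly. A minimal sketch; the template text, rails, and example rows are illustrative, not from the library:

import pandas as pd
from phoenix.evals import OpenAIModel, llm_classify

# Hypothetical custom eval: does the response stay on topic for the question?
ON_TOPIC_TEMPLATE = """You are judging whether a response stays on topic.
Question: {input}
Response: {output}
Answer with exactly one word: "on_topic" or "off_topic"."""

df = pd.DataFrame({
    "input": ["How do I reset my password?"] * 2,
    "output": ["Click 'Forgot password' on the login page.", "Our company was founded in 1998."],
})

results = llm_classify(
    dataframe=df,
    model=OpenAIModel(model="gpt-4o-mini"),
    template=ON_TOPIC_TEMPLATE,
    rails=["on_topic", "off_topic"],  # the parser snaps the judge's answer onto these labels
    provide_explanation=True,
)
print(results[["label", "explanation"]])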

Q: Are these reliable for production gating? A: Treat them as smoke tests, not gates. LLM judges have ~85-92% agreement with humans on the standard tasks. Use evals to surface candidates for human review, not to block deploys silently.
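
One way to use them that way: report a rate on a fixed regression set before deploy and route flagged rows to a human rather than hard-failing the pipeline. A minimal sketch; the dataset path and 5% threshold are illustrative:

import pandas as pd
from phoenix.evals import HallucinationEvaluator, OpenAIModel, run_evals

# Hypothetical fixed regression set with "input", "reference", "output" columns.
regression_df = pd.read_parquet("regression_set.parquet")

(evals,) = run_evals(
    dataframe=regression_df,
    evaluators=[HallucinationEvaluator(OpenAIModel(model="gpt-4o-mini"))],
    provide_explanation=True,
)

rate = (evals["label"] == "hallucinated").mean()
print(f"hallucination rate: {rate:.1%}")

# Surface, don't silently gate: flag rows for human review instead of failing the deploy.
if rate > 0.05:  # illustrative threshold
    flagged = regression_df.join(evals[["label", "explanation"]])
    flagged[flagged["label"] == "hallucinated"].to_csv("needs_review.csv", index=False)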


Quick Use

  1. pip install arize-phoenix[evals]
  2. Pick an evaluator (HallucinationEvaluator / QAEvaluator / etc.)
  3. run_evals(df, [evaluator(judge_model)]) — get a scored DataFrame back



Source & Thanks

Built by Arize AI. Licensed under Apache-2.0.

Arize-ai/phoenix — ⭐ 4,500+

🙏

