Skills · May 8, 2026 · 4 min read

Phoenix Evals — LLM-as-Judge Library with Built-in Templates

Phoenix Evals runs LLM-as-judge evaluations on traces or datasets. Pre-built templates: hallucination, relevance, toxicity, QA. Outputs scored DataFrames.

Agent-ready

This asset can be read and installed directly by agents.

TokRepo exposes a universal CLI command, an install contract, JSON metadata, an adapter-specific plan, and the raw content so agents can assess compatibility, risk, and next steps.

Native · 98/100 · Policy: allow
Agent surface: any MCP/CLI agent
Type: Skill
Installation: Single
Trust: New
Entry: Asset

Universal CLI command
npx tokrepo install 91b1b2a3-8be3-42c3-9366-c71fe29ed30d
Introduction

Phoenix Evals runs LLM-as-judge evaluations on traces or datasets — score outputs for hallucination, retrieval relevance, QA correctness, toxicity, summarization quality, and code readability with battle-tested prompt templates. Returns a pandas DataFrame; merge back to spans to filter the bad ones in the UI. Best for: regression testing prompts before deploy, finding the bottom 5% of agent runs, building human-curated datasets from production traces. Works with: OpenAI, Anthropic, Bedrock, VertexAI, any model usable as a judge. Setup time: 5 minutes.


Quick eval — hallucination + relevance

import pandas as pd
from phoenix.evals import (
    HallucinationEvaluator, RelevanceEvaluator, OpenAIModel, run_evals,
)

df = pd.DataFrame({
    "input": ["Who was the first US president?"] * 3,
    "reference": ["George Washington was the first US president, serving 1789–1797."] * 3,
    "output": [
        "George Washington was the first US president.",        # correct
        "Thomas Jefferson was the first US president.",         # hallucinated
        "George Washington was the third US president.",        # wrong fact
    ],
})

judge = OpenAIModel(model="gpt-4o", temperature=0.0)
hallucination_evals, relevance_evals = run_evals(
    dataframe=df,
    evaluators=[HallucinationEvaluator(judge), RelevanceEvaluator(judge)],
    provide_explanation=True,
)

print(hallucination_evals[["label", "score", "explanation"]])

Run on production traces

import phoenix as px
from phoenix.evals import HallucinationEvaluator, OpenAIModel, run_evals
from phoenix.trace import SpanEvaluations

# Pull traces from the Phoenix project (pass start_time/end_time to limit to, e.g., the last 24h)
spans_df = px.Client().query_spans(project_name="my-rag-app")

# Adapt span columns to eval inputs
spans_df = spans_df.rename(columns={
    "attributes.input.value": "input",
    "attributes.output.value": "output",
    "attributes.retrieval.documents": "reference",
})

(evals_df,) = run_evals(spans_df, [HallucinationEvaluator(OpenAIModel(model="gpt-4o"))])

# Send eval scores back to the Phoenix UI
px.Client().log_evaluations(SpanEvaluations(eval_name="hallucination", dataframe=evals_df))
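
The eval DataFrame keeps the span-ID index of the spans it was computed from, so a plain pandas join also surfaces the worst spans locally. A minimal sketch, assuming the renamed spans_df above and the hallucination evaluator's label values:

# Join eval results back onto the spans and pull out the flagged ones.
# (Add provide_explanation=True to run_evals above if you also want an explanation column.)
flagged = spans_df.join(evals_df[["label", "score"]])
worst = flagged[flagged["label"] == "hallucinated"]

print(f"{len(worst)} of {len(flagged)} spans flagged as hallucinated")
print(worst[["input", "output"]].head())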

Built-in evaluator templates

Evaluator                 | Labels                 | What it judges
HallucinationEvaluator    | factual / hallucinated | Is the output supported by the reference?
RelevanceEvaluator        | relevant / unrelated   | Does the retrieved chunk match the query?
QAEvaluator               | correct / incorrect    | Does the answer match the ground truth?
ToxicityEvaluator         | toxic / non-toxic      | Hate, harassment, or violence in the output
SummarizationEvaluator    | good / poor            | Does the summary cover the source faithfully?
CodeReadabilityEvaluator  | readable / unreadable  | Is the generated code clean and idiomatic?
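
All of these share the same interface and the same three input columns (input, reference, output). A minimal sketch with QAEvaluator, using an illustrative single-row dataset:

import pandas as pd
from phoenix.evals import OpenAIModel, QAEvaluator, run_evals

# QAEvaluator grades "output" against the ground-truth "reference" for each "input".
qa_df = pd.DataFrame({
    "input": ["In what year did Apollo 11 land on the Moon?"],
    "reference": ["Apollo 11 landed on the Moon in 1969."],
    "output": ["It landed in 1969."],
})

(qa_evals,) = run_evals(
    dataframe=qa_df,
    evaluators=[QAEvaluator(OpenAIModel(model="gpt-4o-mini"))],
    provide_explanation=True,
)
print(qa_evals[["label", "score"]])  # expect the label "correct"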

FAQ

Q: Why use a smaller LLM as judge? A: Cost. Judging 10K traces with gpt-4o-mini costs roughly $2; with gpt-4o it's roughly $30. gpt-4o-mini agrees with gpt-4o on roughly 90% of factual evals, so use gpt-4o only for the runs where you need to resolve disagreements.
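
Back-of-envelope behind those numbers, as a sketch: it assumes roughly 800 prompt tokens plus 100 completion tokens per judgment and typical per-million-token list prices for the two models (both the token counts and the prices are assumptions here, not from the source; check current pricing).

# Rough cost model for judging 10K traces; token counts and prices are assumed.
traces = 10_000
prompt_toks, completion_toks = 800, 100  # per judgment (assumed)

def judge_cost(price_in_per_1m: float, price_out_per_1m: float) -> float:
    per_trace = (prompt_toks * price_in_per_1m + completion_toks * price_out_per_1m) / 1e6
    return traces * per_trace

print(f"gpt-4o-mini: ~${judge_cost(0.15, 0.60):.2f}")   # ≈ $1.80
print(f"gpt-4o:      ~${judge_cost(2.50, 10.00):.2f}")  # ≈ $30.00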

Q: Can I write a custom evaluator? A: Yes — subclass LLMEvaluator, supply a prompt template with {input}, {output}, and {reference} placeholders, plus rails (the allowed output labels). The framework handles batching, retries, and parsing.
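
For a lighter-weight path than subclassing, the library also exposes llm_classify, which takes a custom template and rails directly. A minimal sketch; the template text, rails, and example rows are illustrative, not from the library:

import pandas as pd
from phoenix.evals import OpenAIModel, llm_classify

# Hypothetical custom eval: does the response stay on topic for the question?
ON_TOPIC_TEMPLATE = """You are judging whether a response stays on topic.
Question: {input}
Response: {output}
Answer with exactly one word: "on_topic" or "off_topic"."""

df = pd.DataFrame({
    "input": ["How do I reset my password?"] * 2,
    "output": ["Click 'Forgot password' on the login page.", "Our company was founded in 1998."],
})

results = llm_classify(
    dataframe=df,
    model=OpenAIModel(model="gpt-4o-mini"),
    template=ON_TOPIC_TEMPLATE,
    rails=["on_topic", "off_topic"],  # the parser snaps the judge's answer onto these labels
    provide_explanation=True,
)
print(results[["label", "explanation"]])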

Q: Are these reliable for production gating? A: Treat them as smoke tests, not gates. LLM judges have ~85-92% agreement with humans on the standard tasks. Use evals to surface candidates for human review, not to block deploys silently.
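
One way to use them that way: report a rate on a fixed regression set before deploy and route flagged rows to a human rather than hard-failing the pipeline. A minimal sketch; the dataset path and 5% threshold are illustrative:

import pandas as pd
from phoenix.evals import HallucinationEvaluator, OpenAIModel, run_evals

# Hypothetical fixed regression set with "input", "reference", "output" columns.
regression_df = pd.read_parquet("regression_set.parquet")

(evals,) = run_evals(
    dataframe=regression_df,
    evaluators=[HallucinationEvaluator(OpenAIModel(model="gpt-4o-mini"))],
    provide_explanation=True,
)

rate = (evals["label"] == "hallucinated").mean()
print(f"hallucination rate: {rate:.1%}")

# Surface, don't silently gate: flag rows for human review instead of failing the deploy.
if rate > 0.05:  # illustrative threshold
    flagged = regression_df.join(evals[["label", "explanation"]])
    flagged[flagged["label"] == "hallucinated"].to_csv("needs_review.csv", index=False)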


Quick Use

  1. pip install arize-phoenix[evals]
  2. Pick an evaluator (HallucinationEvaluator / QAEvaluator / etc.)
  3. run_evals(df, [evaluator(judge_model)]) — get a scored DataFrame back



Source & Thanks

Built by Arize AI. Licensed under Apache-2.0.

Arize-ai/phoenix — ⭐ 4,500+

🙏

