Quick Use
- `pip install "arize-phoenix[evals]"`
- Pick an evaluator (`HallucinationEvaluator` / `QAEvaluator` / etc.)
- `run_evals(df, [evaluator(judge_model)])` — get a scored DataFrame back
Intro
Phoenix Evals runs LLM-as-judge evaluations on traces or datasets — score outputs for hallucination, retrieval relevance, QA correctness, toxicity, summarization quality, and code readability with battle-tested prompt templates. Returns a pandas DataFrame; merge back to spans to filter the bad ones in the UI. Best for: regression testing prompts before deploy, finding the bottom 5% of agent runs, building human-curated datasets from production traces. Works with: OpenAI, Anthropic, Bedrock, VertexAI, any model usable as a judge. Setup time: 5 minutes.
Quick eval — hallucination + relevance
```python
import pandas as pd
from phoenix.evals import (
    HallucinationEvaluator,
    OpenAIModel,
    RelevanceEvaluator,
    run_evals,
)

df = pd.DataFrame({
    "input": ["Who was the first US president?"] * 3,
    "reference": ["George Washington was the first US president, serving 1789–1797."] * 3,
    "output": [
        "George Washington was the first US president.",  # correct
        "Thomas Jefferson was the first US president.",   # hallucinated
        "George Washington was the third US president.",  # wrong fact
    ],
})

judge = OpenAIModel(model="gpt-4o", temperature=0.0)

# One result DataFrame is returned per evaluator, in order
hallucination_evals, relevance_evals = run_evals(
    dataframe=df,
    evaluators=[HallucinationEvaluator(judge), RelevanceEvaluator(judge)],
    provide_explanation=True,
)
print(hallucination_evals[["label", "score", "explanation"]])
```

Run on production traces
```python
import phoenix as px
from phoenix.evals import HallucinationEvaluator, OpenAIModel, run_evals
from phoenix.trace import SpanEvaluations

# Pull traced spans out of Phoenix as a DataFrame
spans_df = px.Client().get_spans_dataframe(project_name="my-rag-app")

# Adapt span columns to the evaluator's expected input schema
spans_df = spans_df.rename(columns={
    "attributes.input.value": "input",
    "attributes.output.value": "output",
    "attributes.retrieval.documents": "reference",
})

(evals_df,) = run_evals(spans_df, [HallucinationEvaluator(OpenAIModel(model="gpt-4o"))])

# Send eval scores back to the Phoenix UI, keyed by span ID
px.Client().log_evaluations(SpanEvaluations(eval_name="hallucination", dataframe=evals_df))
```

Built-in evaluator templates
| Evaluator | Labels | What it judges |
|---|---|---|
| `HallucinationEvaluator` | factual / hallucinated | Is the output supported by the reference? |
| `RelevanceEvaluator` | relevant / unrelated | Does the retrieved chunk match the query? |
| `QAEvaluator` | correct / incorrect | Does the answer match the ground truth? |
| `ToxicityEvaluator` | toxic / non-toxic | Hate, harassment, or violence in the output |
| `SummarizationEvaluator` | good / poor | Does the summary cover the source faithfully? |
| `CodeReadabilityEvaluator` | readable / unreadable | Is the generated code clean and idiomatic? |
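
Every evaluator in the table is constructed the same way: pass a judge model, then hand it to `run_evals` with a DataFrame containing the columns its template expects. A minimal sketch, reusing `df` and `judge` from the quick-eval example above:

```python
from phoenix.evals import QAEvaluator, ToxicityEvaluator, run_evals

# Same call shape as the hallucination example; one result
# DataFrame comes back per evaluator, in order
qa_evals, toxicity_evals = run_evals(
    dataframe=df,  # "input", "output", "reference" cover both templates here
    evaluators=[QAEvaluator(judge), ToxicityEvaluator(judge)],
    provide_explanation=True,
)
print(qa_evals["label"].value_counts())
```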
FAQ
Q: Why use a smaller LLM as judge? A: Cost. Judging 10K traces with gpt-4o-mini costs ~$2; with gpt-4o, ~$30. gpt-4o-mini agrees with gpt-4o on roughly 90% of factual evals, so reserve gpt-4o for resolving the disagreements.
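
A sketch of the two-tier pattern: judge everything with gpt-4o-mini, then re-judge only the flagged rows with the stronger model. The split below is one practical reading of the advice above, not a Phoenix-prescribed workflow:

```python
from phoenix.evals import HallucinationEvaluator, OpenAIModel, run_evals

# First pass: cheap judge over the full dataset
cheap_judge = OpenAIModel(model="gpt-4o-mini", temperature=0.0)
(cheap_evals,) = run_evals(df, [HallucinationEvaluator(cheap_judge)])

# Second pass: stronger judge, but only on rows the cheap judge flagged
# (run_evals preserves the input index, so a boolean mask lines up)
flagged = df.loc[cheap_evals["label"] == "hallucinated"]
strong_judge = OpenAIModel(model="gpt-4o", temperature=0.0)
(strong_evals,) = run_evals(flagged, [HallucinationEvaluator(strong_judge)])
```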
Q: Can I write a custom evaluator?
A: Yes. Build on `LLMEvaluator`: supply a prompt template with `{input}`, `{output}`, and `{reference}` placeholders plus a rail of allowed labels, and the framework handles batching, retries, and output parsing.
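
A minimal sketch, assuming the `ClassificationTemplate` helper in `phoenix.evals`; the citation-checking template and its rails are invented for illustration, so check your installed version for exact constructor signatures:

```python
from phoenix.evals import ClassificationTemplate, LLMEvaluator, OpenAIModel, run_evals

# Hypothetical check: does the answer actually draw on the reference?
citation_template = ClassificationTemplate(
    rails=["cited", "uncited"],  # allowed output labels
    template=(
        "You are checking whether an answer is grounded in its source.\n"
        "Question: {input}\n"
        "Reference: {reference}\n"
        "Answer: {output}\n"
        "Respond with a single word, 'cited' or 'uncited'."
    ),
)

citation_evaluator = LLMEvaluator(
    model=OpenAIModel(model="gpt-4o-mini"),
    template=citation_template,
)
(citation_evals,) = run_evals(df, [citation_evaluator])
```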
Q: Are these reliable for production gating? A: Treat them as smoke tests, not gates. LLM judges agree with human raters roughly 85–92% of the time on the standard tasks. Use evals to surface candidates for human review, not to silently block deploys.
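
To make that concrete, a sketch of a review queue built from the production-trace example above (`spans_df` and `evals_df` share a span-ID index; the 5% cutoff is an arbitrary choice):

```python
# Join judge scores onto spans and keep the worst 5% for human review
reviewed = spans_df.join(evals_df[["label", "score", "explanation"]], how="inner")
cutoff = reviewed["score"].quantile(0.05)
review_queue = reviewed[reviewed["score"] <= cutoff]
review_queue.to_csv("review_queue.csv")  # hand off to human annotators
```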
Source & Thanks
Built by Arize AI. Licensed under Apache-2.0.
Arize-ai/phoenix — ⭐ 4,500+