Quick Use
- `pip install "arize-phoenix[evals]"`
- Pick an evaluator (`HallucinationEvaluator` / `QAEvaluator` / etc.)
- `run_evals(df, [evaluator(judge_model)])` — get a scored DataFrame back
Intro
Phoenix Evals runs LLM-as-judge evaluations on traces or datasets — score outputs for hallucination, retrieval relevance, QA correctness, toxicity, summarization quality, and code readability with battle-tested prompt templates. Returns a pandas DataFrame; merge back to spans to filter the bad ones in the UI. Best for: regression testing prompts before deploy, finding the bottom 5% of agent runs, building human-curated datasets from production traces. Works with: OpenAI, Anthropic, Bedrock, VertexAI, any model usable as a judge. Setup time: 5 minutes.
Quick eval — hallucination + relevance
```python
import pandas as pd
from phoenix.evals import (
    HallucinationEvaluator,
    OpenAIModel,
    RelevanceEvaluator,
    run_evals,
)

df = pd.DataFrame({
    "input": ["Who was the first US president?"] * 3,
    "reference": ["George Washington was the first US president, serving 1789–1797."] * 3,
    "output": [
        "George Washington was the first US president.",  # correct
        "Thomas Jefferson was the first US president.",   # hallucinated
        "George Washington was the third US president.",  # wrong fact
    ],
})

judge = OpenAIModel(model="gpt-4o", temperature=0.0)

# One result DataFrame is returned per evaluator, in order
hallucination_evals, relevance_evals = run_evals(
    dataframe=df,
    evaluators=[HallucinationEvaluator(judge), RelevanceEvaluator(judge)],
    provide_explanation=True,
)
print(hallucination_evals[["label", "score", "explanation"]])
```

Run on production traces
```python
import phoenix as px
from phoenix.evals import HallucinationEvaluator, OpenAIModel, run_evals
from phoenix.trace import SpanEvaluations

# Pull traced spans out of Phoenix as a DataFrame
spans_df = px.Client().get_spans_dataframe(project_name="my-rag-app")

# Adapt span columns to the evaluator's expected input schema
spans_df = spans_df.rename(columns={
    "attributes.input.value": "input",
    "attributes.output.value": "output",
    "attributes.retrieval.documents": "reference",
})

(evals_df,) = run_evals(spans_df, [HallucinationEvaluator(OpenAIModel(model="gpt-4o"))])

# Send eval scores back to the Phoenix UI, keyed by span ID
px.Client().log_evaluations(SpanEvaluations(eval_name="hallucination", dataframe=evals_df))
```

Built-in evaluator templates
| Evaluator | Labels | What it judges |
|---|---|---|
| `HallucinationEvaluator` | factual / hallucinated | Is the output supported by the reference? |
| `RelevanceEvaluator` | relevant / unrelated | Does the retrieved chunk match the query? |
| `QAEvaluator` | correct / incorrect | Does the answer match the ground truth? |
| `ToxicityEvaluator` | toxic / non-toxic | Hate, harassment, or violence in the output |
| `SummarizationEvaluator` | good / poor | Does the summary cover the source faithfully? |
| `CodeReadabilityEvaluator` | readable / unreadable | Is the generated code clean and idiomatic? |
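
Every evaluator in the table is constructed the same way: pass a judge model, then hand it to `run_evals` with a DataFrame containing the columns its template expects. A minimal sketch, reusing `df` and `judge` from the quick-eval example above:

```python
from phoenix.evals import QAEvaluator, ToxicityEvaluator, run_evals

# Same call shape as the hallucination example; one result
# DataFrame comes back per evaluator, in order
qa_evals, toxicity_evals = run_evals(
    dataframe=df,  # "input", "output", "reference" cover both templates here
    evaluators=[QAEvaluator(judge), ToxicityEvaluator(judge)],
    provide_explanation=True,
)
print(qa_evals["label"].value_counts())
```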
FAQ
Q: Why use a smaller LLM as judge? A: Cost. Judging 10K traces with gpt-4o-mini costs ~$2; with gpt-4o, ~$30. gpt-4o-mini agrees with gpt-4o on roughly 90% of factual evals, so reserve gpt-4o for resolving the disagreements.
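
A sketch of the two-tier pattern: judge everything with gpt-4o-mini, then re-judge only the flagged rows with the stronger model. The split below is one practical reading of the advice above, not a Phoenix-prescribed workflow:

```python
from phoenix.evals import HallucinationEvaluator, OpenAIModel, run_evals

# First pass: cheap judge over the full dataset
cheap_judge = OpenAIModel(model="gpt-4o-mini", temperature=0.0)
(cheap_evals,) = run_evals(df, [HallucinationEvaluator(cheap_judge)])

# Second pass: stronger judge, but only on rows the cheap judge flagged
# (run_evals preserves the input index, so a boolean mask lines up)
flagged = df.loc[cheap_evals["label"] == "hallucinated"]
strong_judge = OpenAIModel(model="gpt-4o", temperature=0.0)
(strong_evals,) = run_evals(flagged, [HallucinationEvaluator(strong_judge)])
```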
Q: Can I write a custom evaluator?
A: Yes. Build on `LLMEvaluator`: supply a prompt template with `{input}`, `{output}`, and `{reference}` placeholders plus a rail of allowed labels, and the framework handles batching, retries, and output parsing.
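
A minimal sketch, assuming the `ClassificationTemplate` helper in `phoenix.evals`; the citation-checking template and its rails are invented for illustration, so check your installed version for exact constructor signatures:

```python
from phoenix.evals import ClassificationTemplate, LLMEvaluator, OpenAIModel, run_evals

# Hypothetical check: does the answer actually draw on the reference?
citation_template = ClassificationTemplate(
    rails=["cited", "uncited"],  # allowed output labels
    template=(
        "You are checking whether an answer is grounded in its source.\n"
        "Question: {input}\n"
        "Reference: {reference}\n"
        "Answer: {output}\n"
        "Respond with a single word, 'cited' or 'uncited'."
    ),
)

citation_evaluator = LLMEvaluator(
    model=OpenAIModel(model="gpt-4o-mini"),
    template=citation_template,
)
(citation_evals,) = run_evals(df, [citation_evaluator])
```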
Q: Are these reliable for production gating? A: Treat them as smoke tests, not gates. LLM judges agree with human raters roughly 85–92% of the time on the standard tasks. Use evals to surface candidates for human review, not to silently block deploys.
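
To make that concrete, a sketch of a review queue built from the production-trace example above (`spans_df` and `evals_df` share a span-ID index; the 5% cutoff is an arbitrary choice):

```python
# Join judge scores onto spans and keep the worst 5% for human review
reviewed = spans_df.join(evals_df[["label", "score", "explanation"]], how="inner")
cutoff = reviewed["score"].quantile(0.05)
review_queue = reviewed[reviewed["score"] <= cutoff]
review_queue.to_csv("review_queue.csv")  # hand off to human annotators
```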
Source & Thanks
Built by Arize AI. Licensed under Apache-2.0.
Arize-ai/phoenix — ⭐ 4,500+