LLM Observability

Arize Phoenix — Open-source LLM Observability & Evals

Arize Phoenix is the open-source observability and evaluation library from Arize AI. OpenTelemetry-native, with strong eval primitives — built for data scientists and ML engineers who want notebooks + production in one stack.

Why Phoenix

Phoenix leans toward experimentation and eval. A data scientist can spin up Phoenix in a notebook with phoenix.launch_app(), send in OpenTelemetry traces from a RAG pipeline, run evals against datasets, and iterate in the same environment — without any server deployment. For production, the same library deploys as a long-running service with Postgres.

The eval library is a standout. Phoenix ships pre-built evaluators for hallucination, toxicity, relevance, QA correctness, and retrieval accuracy. Each is a tested prompt template you can apply at dataset scale with a few lines of code. This is the fastest path from "I think my RAG is bad" to "here are the specific queries it fails on".

Relative to Langfuse, Phoenix is more research tool than product tool. Its UI is functional but less polished, its prompt management is lighter, and its focus is on diagnosing and improving quality rather than running a production ops dashboard. Many teams use both: Phoenix in notebooks during development, Langfuse in production.

Quick Start — Notebook Launch + OpenAI

launch_app() is the notebook-friendly mode — Phoenix runs in-process with an HTTP endpoint for OTEL ingestion and a web UI. For production, deploy Phoenix Server (docker-compose) and point instrumentation at it. OpenInference is Arize’s OTEL instrumentation library; it supports OpenAI, Anthropic, LangChain, LlamaIndex, DSPy, Haystack, and LiteLLM out of the box.

# pip install 'arize-phoenix[evals]' openinference-instrumentation-openai opentelemetry-sdk
import phoenix as px

# Launch the Phoenix UI locally (notebook or script)
session = px.launch_app()
print(session.url)   # open in browser

# Instrument OpenAI via OpenInference (Arize's OTEL libraries for LLM frameworks)
from openinference.instrumentation.openai import OpenAIInstrumentor
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(OTLPSpanExporter(f"{session.url}/v1/traces")))
trace.set_tracer_provider(provider)
OpenAIInstrumentor().instrument()

from openai import OpenAI
client = OpenAI()

for q in ["Why is the sky blue?", "How do planes fly?", "What is photosynthesis?"]:
    client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": q}],
    )

# Phoenix UI now shows traces with prompt/response/latency/token counts per request.
# Run a hallucination eval over the captured spans:
from phoenix.evals import HallucinationEvaluator, OpenAIModel, run_evals

# HallucinationEvaluator expects "input", "reference", and "output" columns;
# derive/rename them from the spans dataframe to match your pipeline.
trace_df = px.Client().get_spans_dataframe()
hallu_evaluator = HallucinationEvaluator(OpenAIModel(model="gpt-4o-mini"))
results = run_evals(dataframe=trace_df, evaluators=[hallu_evaluator])
print(results[0])  # run_evals returns one dataframe per evaluator

Key Features

OpenTelemetry native

Ingests OTEL traces from any instrumented framework via OpenInference (Arize’s LLM-specific OTEL library). Interoperates with Jaeger, Tempo, and generic OTEL collectors.

Pre-built evaluators

Hallucination, toxicity, relevance, QA correctness, retrieval precision, code generation — LLM-as-judge evaluators with tested prompts. Apply to trace datasets with one line.

Dataset workflows

Curate datasets from production traces, tag examples, replay on new prompts or models, diff results. Tight loop between "production bug" and "fixed evaluation dataset".
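The curation loop — pull traces, filter to eval failures, save as a regression dataset, replay — can be sketched with plain records (field names here are hypothetical; in Phoenix you would work with the spans dataframe and the datasets API):

```python
# Minimal stand-in for the curation step: collect traced examples that
# failed an eval into a regression dataset. Plain dicts with hypothetical
# field names; Phoenix itself works on dataframes and its datasets API.

traces = [
    {"input": "Why is the sky blue?", "output": "Rayleigh scattering...", "hallucination": 0.0},
    {"input": "Who won the 2030 World Cup?", "output": "Brazil won 3-1.", "hallucination": 1.0},
    {"input": "What is photosynthesis?", "output": "Plants convert light...", "hallucination": 0.0},
]

# Curate: keep only the examples the hallucination eval flagged.
regression_set = [t for t in traces if t["hallucination"] >= 0.5]

print(len(regression_set))         # 1 failing example
print(regression_set[0]["input"])  # the query to replay on new prompts/models
```

Replaying this set after every prompt or model change is what turns a one-off production bug into a permanent regression test.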

Embedding + RAG diagnostics

UMAP visualization of embeddings, RAG-specific metrics (context relevance, answer relevance, groundedness). Particularly strong for debugging retrieval pipelines.

Notebook-first

launch_app() runs Phoenix in a notebook; same library runs as a production server with Postgres. Low friction between research and production.

OSS + commercial sibling

Phoenix OSS is Elastic License 2.0. For enterprise, Arize AX is the paid managed offering with role-based access, team features, and deeper analytics.

Comparison

| Tool | Primary Strength | Deployment | Eval Library | Audience |
| --- | --- | --- | --- | --- |
| Arize Phoenix | Eval + embedding diagnostics | Notebook + self-host | Strongest (pre-built evaluators) | Data scientists / ML engineers |
| Langfuse | Production ops + prompt mgmt | Cloud + self-host | LLM-as-judge framework | Production engineers |
| Helicone | Zero-code integration | Cloud + self-host | Basic | Full-stack teams |
| Traceloop | OTEL evangelism | Agent + backend | Via integrations | OTEL users |

Use Cases

01. RAG debugging

When retrieval quality is the problem, Phoenix’s RAG metrics (context relevance, groundedness) + embedding UMAP visualization isolate whether the issue is chunking, embeddings, or the generation prompt.
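To see what a groundedness signal captures, here is a deliberately crude lexical stand-in — what fraction of the answer's words appear in the retrieved context. Phoenix's real metric is LLM-judged, not lexical; this toy version only illustrates the idea:

```python
# Toy groundedness: fraction of the answer's words found in the retrieved
# context. Phoenix's actual groundedness eval is LLM-judged; this lexical
# version just shows what the signal is measuring.

def toy_groundedness(answer: str, context: str) -> float:
    answer_words = set(answer.lower().split())
    context_words = set(context.lower().split())
    if not answer_words:
        return 0.0
    return len(answer_words & context_words) / len(answer_words)

context = "the sky appears blue because of rayleigh scattering of sunlight"
grounded = toy_groundedness("blue because of rayleigh scattering", context)
ungrounded = toy_groundedness("green due to chlorophyll in the air", context)
print(grounded > ungrounded)  # True: low scores point at retrieval gaps
```

A consistently low groundedness score with high context relevance suggests the generation prompt is the problem; low scores on both point back at chunking or embeddings.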

02. ML/Data Science research

Teams where LLM work lives in notebooks — experiment with prompts, run evals on held-out datasets, compare model versions. Phoenix’s notebook-first design fits the workflow.

03. Production + dev parity

Same instrumentation, same Phoenix UI, dev and production. Reduces the usual "works on my laptop, mysterious in prod" gap for LLM apps.
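One common way to get that parity is to read the collector endpoint from the standard OTEL_EXPORTER_OTLP_ENDPOINT environment variable and fall back to the local port. A sketch — the fallback assumes Phoenix's default local port (6006); adjust for your deployment:

```python
# Keep dev and prod instrumentation identical: resolve the collector
# endpoint from the standard OTEL env var, falling back to the local
# launch_app() default (port 6006 is an assumption; adjust as needed).
import os

def phoenix_endpoint() -> str:
    base = os.environ.get("OTEL_EXPORTER_OTLP_ENDPOINT", "http://localhost:6006")
    return base.rstrip("/") + "/v1/traces"

print(phoenix_endpoint())  # local default when the env var is unset
os.environ["OTEL_EXPORTER_OTLP_ENDPOINT"] = "https://phoenix.internal.example"
print(phoenix_endpoint())  # same code path, production collector
```

The application code never branches on environment; only the deployment's environment variables differ.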

Pricing & License

Phoenix: Elastic License 2.0 — free to use (including for commercial purposes); restrictions apply to resale as a hosted service. Full feature set available self-hosted.

Arize AX: managed enterprise offering from Arize AI. Adds SSO, team management, enterprise support, deeper analytics, enhanced dashboards. Pricing by volume — contact Arize sales.

Cost reality: self-hosted Phoenix is free for compute; you pay for the Postgres it needs plus your own LLM eval calls. For teams already set up for OTEL, marginal cost is low.
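For a sense of the footprint, a self-hosted deployment is roughly one Phoenix container plus Postgres. The sketch below is illustrative only — the Phoenix repo ships the canonical docker-compose files, and the image tag and environment variable names here are assumptions to verify against them:

```yaml
# Illustrative sketch -- check the Phoenix repo's docker-compose for the
# canonical config; image tags and env var names here are assumptions.
services:
  phoenix:
    image: arizephoenix/phoenix:latest
    ports:
      - "6006:6006"   # UI + OTLP HTTP ingestion
    environment:
      - PHOENIX_SQL_DATABASE_URL=postgresql://phoenix:phoenix@db:5432/phoenix
    depends_on:
      - db
  db:
    image: postgres:16
    environment:
      - POSTGRES_USER=phoenix
      - POSTGRES_PASSWORD=phoenix
      - POSTGRES_DB=phoenix
```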

Frequently Asked Questions

Phoenix vs Langfuse — which is better?

Different bets. Phoenix is stronger on eval library and embedding diagnostics; Langfuse is stronger on production ops UX and prompt management. Research/eval-heavy teams often prefer Phoenix; production engineering teams often prefer Langfuse. Many shops use both.

Is Phoenix really OpenTelemetry-compatible?

Yes. Phoenix is an OTEL backend — accepts OTLP over HTTP/gRPC. The OpenInference library (from Arize) provides LLM-specific instrumentation on top of OTEL’s base conventions. You can mix Phoenix traces with generic OTEL traces from other sources.

Can I use Phoenix with LangChain?

Yes. pip install openinference-instrumentation-langchain; call LangChainInstrumentor().instrument() once at startup. All LangChain components (chains, agents, retrievers) emit structured OTEL spans to Phoenix.

Do I need to deploy Phoenix Server for a small app?

No. For dev, px.launch_app() runs an in-process server in your notebook. For production, deploy Phoenix Server with Postgres — docker-compose configs ship in the repo.

Is the Elastic License a problem for commercial use?

For most commercial users, no — you can run Phoenix in production inside your company without issue. The license restriction targets resale as a SaaS competing with Arize. Check with your legal team for specifics if you’re building a platform product.
