Why Choose It
Phoenix leans toward experimentation and eval. A data scientist can spin up Phoenix in a notebook with phoenix.launch_app(), send in OpenTelemetry traces from a RAG pipeline, run evals against datasets, and iterate in the same environment — without any server deployment. For production, the same library deploys as a long-running service with Postgres.
The eval library is a standout. Phoenix ships pre-built evaluators for hallucination, toxicity, relevance, QA correctness, and retrieval accuracy. Each is a tested prompt template you can apply at dataset-scale with a few lines of code. This is the fastest path from "I think my RAG is bad" to "here are the specific queries it fails on".
Relative to Langfuse: Phoenix is more research-tool, less product-tool. Its UI is functional but less polished; its prompt management is lighter; its focus is on helping you diagnose and improve more than on operating a production ops dashboard. Many teams use both — Phoenix in notebooks during development, Langfuse in production.
Quick Start — Notebook Launch + OpenAI
launch_app() is the notebook-friendly mode — Phoenix runs in-process with an HTTP endpoint for OTEL ingestion and a web UI. For production, deploy Phoenix Server (docker-compose) and point instrumentation at it. OpenInference is Arize’s OTEL instrumentation library — supports OpenAI, Anthropic, LangChain, LlamaIndex, DSPy, Haystack, LiteLLM out of the box.
# pip install 'arize-phoenix[evals]' openinference-instrumentation-openai opentelemetry-sdk opentelemetry-exporter-otlp
import phoenix as px

# Launch the Phoenix UI locally (notebook or script)
session = px.launch_app()
print(session.url)  # open in browser

# Instrument OpenAI via OpenInference (Arize's OTEL libraries for LLM frameworks)
from openinference.instrumentation.openai import OpenAIInstrumentor
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

provider = TracerProvider()
provider.add_span_processor(
    SimpleSpanProcessor(OTLPSpanExporter(endpoint=f"{session.url.rstrip('/')}/v1/traces"))
)
trace.set_tracer_provider(provider)
OpenAIInstrumentor().instrument()

from openai import OpenAI

client = OpenAI()
for q in ["Why is the sky blue?", "How do planes fly?", "What is photosynthesis?"]:
    client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": q}],
    )

# Phoenix UI now shows traces with prompt/response/latency/cost per request.

# Now run an eval over the collected traces:
from phoenix.evals import HallucinationEvaluator, OpenAIModel, run_evals

# get_spans_dataframe() returns all ingested spans; HallucinationEvaluator
# expects input/reference/output columns, so rename span columns as needed.
trace_df = px.Client().get_spans_dataframe()
hallu_evaluator = HallucinationEvaluator(OpenAIModel(model="gpt-4o-mini"))
scores = run_evals(dataframe=trace_df, evaluators=[hallu_evaluator])
print(scores)

Core Capabilities
OpenTelemetry native
Ingests OTEL traces from any instrumented framework via OpenInference (Arize’s LLM-specific OTEL library). Interoperates with Jaeger, Tempo, and generic OTEL collectors.
Pre-built evaluators
Hallucination, toxicity, relevance, QA correctness, retrieval precision, code generation — LLM-as-judge evaluators with tested prompts. Apply to trace datasets with one line.
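The pre-built evaluators are designed to be batched over a dataframe of spans. A minimal sketch, assuming arize-phoenix[evals] is installed and an OpenAI key is available when the function is actually called; the sample rows, the QAEvaluator pairing, and the deferred imports are illustrative choices, not the only way to wire this up:

```python
# Sketch: batch-scoring a spans dataframe with several pre-built evaluators.
# The phoenix.evals imports are deferred into the function so the file can
# be run and inspected without an OpenAI key; invoking score_spans needs one.
import pandas as pd

# Miniature stand-in for px.Client().get_spans_dataframe(), already renamed
# to the input/reference/output columns the evaluators expect.
spans = pd.DataFrame(
    {
        "input": ["Why is the sky blue?"],
        "reference": ["Rayleigh scattering affects short wavelengths most."],
        "output": ["Because shorter (blue) wavelengths scatter more in air."],
    }
)

def score_spans(df: pd.DataFrame):
    from phoenix.evals import (
        HallucinationEvaluator,
        QAEvaluator,
        OpenAIModel,
        run_evals,
    )

    judge = OpenAIModel(model="gpt-4o-mini")
    # run_evals returns one labels/scores dataframe per evaluator
    return run_evals(
        dataframe=df,
        evaluators=[HallucinationEvaluator(judge), QAEvaluator(judge)],
        provide_explanation=True,  # keep the judge's reasoning per row
    )

print(list(spans.columns))  # ['input', 'reference', 'output']
```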
Dataset workflows
Curate datasets from production traces, tag examples, replay on new prompts or models, diff results. Tight loop between "production bug" and "fixed evaluation dataset".
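The curation loop fits in a few lines. A sketch: the user_feedback column and the flagging rule are hypothetical, and upload_dataset with these keyword names reflects recent Phoenix releases, so check it against your installed version:

```python
# Sketch: turning flagged production traces into a regression dataset.
import pandas as pd

# Stand-in for traces pulled from Phoenix; column names are illustrative.
traces = pd.DataFrame(
    {
        "question": ["Why is the sky blue?", "How do planes fly?"],
        "answer": ["Rayleigh scattering.", "I don't know."],
        "user_feedback": ["thumbs_up", "thumbs_down"],
    }
)

# Keep only traces a user flagged as bad; these become regression cases.
failures = traces[traces["user_feedback"] == "thumbs_down"]

def upload_failures(df: pd.DataFrame):
    import phoenix as px  # deferred so the sketch runs without a server

    return px.Client().upload_dataset(
        dataset_name="prod-failures",
        dataframe=df,
        input_keys=["question"],
        output_keys=["answer"],
    )

print(len(failures))  # 1
```

Once uploaded, the dataset can be replayed against new prompts or models and diffed in the UI, which is the "production bug to fixed evaluation dataset" loop described above.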
Embedding + RAG diagnostics
UMAP visualization of embeddings, RAG-specific metrics (context relevance, answer relevance, groundedness). Particularly strong for debugging retrieval pipelines.
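A rough shape for feeding document embeddings into the UMAP view, assuming Phoenix's inferences API (px.Schema, px.EmbeddingColumnNames, px.Inferences); these class names have shifted across versions, so treat this as a guide to the data layout rather than copy-paste code:

```python
# Sketch: loading chunk embeddings into Phoenix for embedding inspection.
import random
import pandas as pd

# Each row pairs raw text with its embedding vector (toy 8-dim vectors here).
docs = pd.DataFrame(
    {
        "text": ["chunk about planes", "chunk about the sky"],
        "vector": [[random.random() for _ in range(8)] for _ in range(2)],
    }
)

def launch_embedding_view(df: pd.DataFrame):
    import phoenix as px  # deferred so the sketch runs without Phoenix installed

    schema = px.Schema(
        embedding_feature_column_names={
            "chunks": px.EmbeddingColumnNames(
                vector_column_name="vector",
                raw_data_column_name="text",
            )
        }
    )
    return px.launch_app(primary=px.Inferences(dataframe=df, schema=schema))

print(len(docs["vector"][0]))  # 8
```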
Notebook-first
launch_app() runs Phoenix in a notebook; same library runs as a production server with Postgres. Low friction between research and production.
OSS + commercial sibling
Phoenix OSS is Elastic License 2.0. For enterprise, Arize AX is the paid managed offering with role-based access, team features, and deeper analytics.
Comparison
| Tool | Primary Strength | Deployment | Eval Library | Audience |
|---|---|---|---|---|
| Arize Phoenix (this tool) | Eval + embedding diagnostics | Notebook + self-host | Strongest (pre-built evaluators) | Data scientists / ML engineers |
| Langfuse | Production ops + prompt mgmt | Cloud + self-host | LLM-as-judge framework | Production engineers |
| Helicone | Zero-code integration | Cloud + self-host | Basic | Full-stack teams |
| Traceloop | OTEL evangelism | Agent + backend | Via integrations | OTEL users |
Use Cases
01. RAG debugging
When retrieval quality is the problem, Phoenix’s RAG metrics (context relevance, groundedness) + embedding UMAP visualization isolate whether the issue is chunking, embeddings, or the generation prompt.
02. ML/Data Science research
Teams where LLM work lives in notebooks — experiment with prompts, run evals on held-out datasets, compare model versions. Phoenix’s notebook-first design fits the workflow.
03. Production + dev parity
Same instrumentation, same Phoenix UI, dev and production. Reduces the usual "works on my laptop, mysterious in prod" gap for LLM apps.
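One way to get that parity is to key the exporter off an environment variable. A sketch, assuming PHOENIX_COLLECTOR_ENDPOINT (the variable Phoenix documents for pointing clients at a remote server) with a local fallback for notebook use:

```python
# Sketch: one tracing setup for both notebook and production.
# In dev, the fallback targets a local launch_app(); in prod, set
# PHOENIX_COLLECTOR_ENDPOINT to your Phoenix Server URL.
import os

endpoint = os.environ.get("PHOENIX_COLLECTOR_ENDPOINT", "http://localhost:6006")

def configure_tracing():
    # Deferred imports so the sketch runs without the OTEL packages installed.
    from opentelemetry import trace
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor
    from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

    provider = TracerProvider()
    # BatchSpanProcessor (not Simple) is the sensible choice under real load.
    provider.add_span_processor(
        BatchSpanProcessor(OTLPSpanExporter(endpoint=f"{endpoint}/v1/traces"))
    )
    trace.set_tracer_provider(provider)

print(endpoint.startswith("http"))  # True
```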
Pricing & Licensing
Phoenix: Elastic License 2.0 — free to use (including for commercial purposes); restrictions apply to resale as a hosted service. Full feature set available self-hosted.
Arize AX: managed enterprise offering from Arize AI. Adds SSO, team management, enterprise support, deeper analytics, enhanced dashboards. Pricing by volume — contact Arize sales.
Cost reality: self-hosted Phoenix has no license fees; you pay for the compute and Postgres it runs on, plus your own LLM calls for evals. For teams already set up for OTEL, the marginal cost is low.
Related TokRepo Assets
Fabric — AI Prompt Patterns for Everything
Collection of 100+ AI prompt patterns for real-world tasks. Summarize articles, extract wisdom, analyze code, write essays, create presentations, and more.
Docker (Moby) — The Container Platform That Changed DevOps
Docker is the platform that popularized containerization. It packages applications with their dependencies into standardized containers that run consistently everywhere. Moby is the open-source project behind Docker Engine, the runtime that powers container-based development and deployment.
FAQ
Phoenix vs Langfuse — which is better?
Different bets. Phoenix is stronger on eval library and embedding diagnostics; Langfuse is stronger on production ops UX and prompt management. Research/eval-heavy teams often prefer Phoenix; production engineering teams often prefer Langfuse. Many shops use both.
Is Phoenix really OpenTelemetry-compatible?
Yes. Phoenix is an OTEL backend — accepts OTLP over HTTP/gRPC. The OpenInference library (from Arize) provides LLM-specific instrumentation on top of OTEL’s base conventions. You can mix Phoenix traces with generic OTEL traces from other sources.
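In practice that means hand-written spans and OpenInference LLM spans share one trace tree. A sketch of wrapping a pipeline step in a manual span (the span and attribute names are arbitrary choices, not a Phoenix convention):

```python
# Sketch: a manual OTEL span around a RAG step. Any instrumented LLM call
# made inside the `with` block nests under this span in Phoenix's trace view.
def traced_pipeline(question: str) -> str:
    from opentelemetry import trace  # deferred; assumes a provider is configured

    tracer = trace.get_tracer("rag-pipeline")
    with tracer.start_as_current_span("retrieve-and-answer") as span:
        span.set_attribute("app.question", question)
        # ...retrieval + the instrumented LLM call would happen here...
        return "answer"

print(callable(traced_pipeline))  # True
```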
Can I use Phoenix with LangChain?
Yes. pip install openinference-instrumentation-langchain; call LangChainInstrumentor().instrument() once at startup. All LangChain components (chains, agents, retrievers) emit structured OTEL spans to Phoenix.
Do I need to deploy Phoenix Server for a small app?
No. For dev, px.launch_app() runs an in-process server in your notebook. For production, deploy Phoenix Server with Postgres — docker-compose configs ship in the repo.
Is the Elastic License a problem for commercial use?
For most commercial users, no — you can run Phoenix in production inside your company without issue. The license restriction targets resale as a SaaS competing with Arize. Check with your legal team for specifics if you’re building a platform product.