Why Choose It
Phoenix leans toward experimentation and eval. A data scientist can spin up Phoenix in a notebook with phoenix.launch_app(), send in OpenTelemetry traces from a RAG pipeline, run evals against datasets, and iterate in the same environment — without any server deployment. For production, the same library deploys as a long-running service with Postgres.
The eval library is a standout. Phoenix ships pre-built evaluators for hallucination, toxicity, relevance, QA correctness, and retrieval accuracy. Each is a tested prompt template you can apply at dataset-scale with a few lines of code. This is the fastest path from "I think my RAG is bad" to "here are the specific queries it fails on".
Relative to Langfuse: Phoenix is more research-tool, less product-tool. Its UI is functional but less polished; its prompt management is lighter; its focus is on helping you diagnose and improve more than on operating a production ops dashboard. Many teams use both — Phoenix in notebooks during development, Langfuse in production.
Quick Start — Notebook Launch + OpenAI
launch_app() is the notebook-friendly mode — Phoenix runs in-process with an HTTP endpoint for OTEL ingestion and a web UI. For production, deploy Phoenix Server (docker-compose) and point instrumentation at it. OpenInference is Arize’s OTEL instrumentation library — supports OpenAI, Anthropic, LangChain, LlamaIndex, DSPy, Haystack, LiteLLM out of the box.
# pip install 'arize-phoenix[evals]' openinference-instrumentation-openai opentelemetry-sdk opentelemetry-exporter-otlp
import phoenix as px

# Launch the Phoenix UI locally (notebook or script)
session = px.launch_app()
print(session.url)  # open in browser

# Instrument OpenAI via OpenInference (Arize's OTEL libraries for LLM frameworks)
from openinference.instrumentation.openai import OpenAIInstrumentor
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

provider = TracerProvider()
provider.add_span_processor(
    SimpleSpanProcessor(OTLPSpanExporter(endpoint=f"{session.url.rstrip('/')}/v1/traces"))
)
trace.set_tracer_provider(provider)
OpenAIInstrumentor().instrument()

from openai import OpenAI

client = OpenAI()
for q in ["Why is the sky blue?", "How do planes fly?", "What is photosynthesis?"]:
    client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": q}],
    )

# Phoenix UI now shows traces with prompt/response/latency/cost per request.

# Now run an eval over the collected traces:
from phoenix.evals import HallucinationEvaluator, OpenAIModel, run_evals

# get_spans_dataframe() returns all ingested spans; HallucinationEvaluator
# expects input/reference/output columns, so rename span columns as needed.
trace_df = px.Client().get_spans_dataframe()
hallu_evaluator = HallucinationEvaluator(OpenAIModel(model="gpt-4o-mini"))
scores = run_evals(dataframe=trace_df, evaluators=[hallu_evaluator])
print(scores)

Core Capabilities
OpenTelemetry native
Ingests OTEL traces from any instrumented framework via OpenInference (Arize’s LLM-specific OTEL library). Interoperates with Jaeger, Tempo, and generic OTEL collectors.
Pre-built evaluators
Hallucination, toxicity, relevance, QA correctness, retrieval precision, code generation — LLM-as-judge evaluators with tested prompts. Apply to trace datasets with one line.
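The pre-built evaluators are designed to be batched over a dataframe of spans. A minimal sketch, assuming arize-phoenix[evals] is installed and an OpenAI key is available when the function is actually called; the sample rows, the QAEvaluator pairing, and the deferred imports are illustrative choices, not the only way to wire this up:

```python
# Sketch: batch-scoring a spans dataframe with several pre-built evaluators.
# The phoenix.evals imports are deferred into the function so the file can
# be run and inspected without an OpenAI key; invoking score_spans needs one.
import pandas as pd

# Miniature stand-in for px.Client().get_spans_dataframe(), already renamed
# to the input/reference/output columns the evaluators expect.
spans = pd.DataFrame(
    {
        "input": ["Why is the sky blue?"],
        "reference": ["Rayleigh scattering affects short wavelengths most."],
        "output": ["Because shorter (blue) wavelengths scatter more in air."],
    }
)

def score_spans(df: pd.DataFrame):
    from phoenix.evals import (
        HallucinationEvaluator,
        QAEvaluator,
        OpenAIModel,
        run_evals,
    )

    judge = OpenAIModel(model="gpt-4o-mini")
    # run_evals returns one labels/scores dataframe per evaluator
    return run_evals(
        dataframe=df,
        evaluators=[HallucinationEvaluator(judge), QAEvaluator(judge)],
        provide_explanation=True,  # keep the judge's reasoning per row
    )

print(list(spans.columns))  # ['input', 'reference', 'output']
```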
Dataset workflows
Curate datasets from production traces, tag examples, replay on new prompts or models, diff results. Tight loop between "production bug" and "fixed evaluation dataset".
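The curation loop fits in a few lines. A sketch: the user_feedback column and the flagging rule are hypothetical, and upload_dataset with these keyword names reflects recent Phoenix releases, so check it against your installed version:

```python
# Sketch: turning flagged production traces into a regression dataset.
import pandas as pd

# Stand-in for traces pulled from Phoenix; column names are illustrative.
traces = pd.DataFrame(
    {
        "question": ["Why is the sky blue?", "How do planes fly?"],
        "answer": ["Rayleigh scattering.", "I don't know."],
        "user_feedback": ["thumbs_up", "thumbs_down"],
    }
)

# Keep only traces a user flagged as bad; these become regression cases.
failures = traces[traces["user_feedback"] == "thumbs_down"]

def upload_failures(df: pd.DataFrame):
    import phoenix as px  # deferred so the sketch runs without a server

    return px.Client().upload_dataset(
        dataset_name="prod-failures",
        dataframe=df,
        input_keys=["question"],
        output_keys=["answer"],
    )

print(len(failures))  # 1
```

Once uploaded, the dataset can be replayed against new prompts or models and diffed in the UI, which is the "production bug to fixed evaluation dataset" loop described above.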
Embedding + RAG diagnostics
UMAP visualization of embeddings, RAG-specific metrics (context relevance, answer relevance, groundedness). Particularly strong for debugging retrieval pipelines.
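A rough shape for feeding document embeddings into the UMAP view, assuming Phoenix's inferences API (px.Schema, px.EmbeddingColumnNames, px.Inferences); these class names have shifted across versions, so treat this as a guide to the data layout rather than copy-paste code:

```python
# Sketch: loading chunk embeddings into Phoenix for embedding inspection.
import random
import pandas as pd

# Each row pairs raw text with its embedding vector (toy 8-dim vectors here).
docs = pd.DataFrame(
    {
        "text": ["chunk about planes", "chunk about the sky"],
        "vector": [[random.random() for _ in range(8)] for _ in range(2)],
    }
)

def launch_embedding_view(df: pd.DataFrame):
    import phoenix as px  # deferred so the sketch runs without Phoenix installed

    schema = px.Schema(
        embedding_feature_column_names={
            "chunks": px.EmbeddingColumnNames(
                vector_column_name="vector",
                raw_data_column_name="text",
            )
        }
    )
    return px.launch_app(primary=px.Inferences(dataframe=df, schema=schema))

print(len(docs["vector"][0]))  # 8
```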
Notebook-first
launch_app() runs Phoenix in a notebook; same library runs as a production server with Postgres. Low friction between research and production.
OSS + commercial sibling
Phoenix OSS is Elastic License 2.0. For enterprise, Arize AX is the paid managed offering with role-based access, team features, and deeper analytics.
Comparison
| Tool | Primary Strength | Deployment | Eval Library | Audience |
|---|---|---|---|---|
| Arize Phoenix (this tool) | Eval + embedding diagnostics | Notebook + self-host | Strongest (pre-built evaluators) | Data scientists / ML engineers |
| Langfuse | Production ops + prompt mgmt | Cloud + self-host | LLM-as-judge framework | Production engineers |
| Helicone | Zero-code integration | Cloud + self-host | Basic | Full-stack teams |
| Traceloop | OTEL evangelism | Agent + backend | Via integrations | OTEL users |
Use Cases
01. RAG debugging
When retrieval quality is the problem, Phoenix’s RAG metrics (context relevance, groundedness) + embedding UMAP visualization isolate whether the issue is chunking, embeddings, or the generation prompt.
02. ML/Data Science research
Teams where LLM work lives in notebooks — experiment with prompts, run evals on held-out datasets, compare model versions. Phoenix’s notebook-first design fits the workflow.
03. Production + dev parity
Same instrumentation, same Phoenix UI, dev and production. Reduces the usual "works on my laptop, mysterious in prod" gap for LLM apps.
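One way to get that parity is to key the exporter off an environment variable. A sketch, assuming PHOENIX_COLLECTOR_ENDPOINT (the variable Phoenix documents for pointing clients at a remote server) with a local fallback for notebook use:

```python
# Sketch: one tracing setup for both notebook and production.
# In dev, the fallback targets a local launch_app(); in prod, set
# PHOENIX_COLLECTOR_ENDPOINT to your Phoenix Server URL.
import os

endpoint = os.environ.get("PHOENIX_COLLECTOR_ENDPOINT", "http://localhost:6006")

def configure_tracing():
    # Deferred imports so the sketch runs without the OTEL packages installed.
    from opentelemetry import trace
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor
    from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

    provider = TracerProvider()
    # BatchSpanProcessor (not Simple) is the sensible choice under real load.
    provider.add_span_processor(
        BatchSpanProcessor(OTLPSpanExporter(endpoint=f"{endpoint}/v1/traces"))
    )
    trace.set_tracer_provider(provider)

print(endpoint.startswith("http"))  # True
```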
Pricing & Licensing
Phoenix: Elastic License 2.0 — free to use (including for commercial purposes); restrictions apply to resale as a hosted service. Full feature set available self-hosted.
Arize AX: managed enterprise offering from Arize AI. Adds SSO, team management, enterprise support, deeper analytics, enhanced dashboards. Pricing by volume — contact Arize sales.
Cost reality: self-hosted Phoenix has no license fees; you pay for the compute and Postgres it runs on, plus your own LLM calls for evals. For teams already set up for OTEL, the marginal cost is low.
Related TokRepo Assets
Fabric — AI Prompt Patterns for Everything
Collection of 100+ AI prompt patterns for real-world tasks. Summarize articles, extract wisdom, analyze code, write essays, create presentations, and more.
Docker (Moby) — The Container Platform That Changed DevOps
Docker is the platform that popularized containerization. It packages applications with their dependencies into standardized containers that run consistently everywhere. Moby is the open-source project behind Docker Engine, the runtime that powers container-based development and deployment.
FAQ
Phoenix vs Langfuse — which is better?
Different bets. Phoenix is stronger on eval library and embedding diagnostics; Langfuse is stronger on production ops UX and prompt management. Research/eval-heavy teams often prefer Phoenix; production engineering teams often prefer Langfuse. Many shops use both.
Is Phoenix really OpenTelemetry-compatible?
Yes. Phoenix is an OTEL backend — accepts OTLP over HTTP/gRPC. The OpenInference library (from Arize) provides LLM-specific instrumentation on top of OTEL’s base conventions. You can mix Phoenix traces with generic OTEL traces from other sources.
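In practice that means hand-written spans and OpenInference LLM spans share one trace tree. A sketch of wrapping a pipeline step in a manual span (the span and attribute names are arbitrary choices, not a Phoenix convention):

```python
# Sketch: a manual OTEL span around a RAG step. Any instrumented LLM call
# made inside the `with` block nests under this span in Phoenix's trace view.
def traced_pipeline(question: str) -> str:
    from opentelemetry import trace  # deferred; assumes a provider is configured

    tracer = trace.get_tracer("rag-pipeline")
    with tracer.start_as_current_span("retrieve-and-answer") as span:
        span.set_attribute("app.question", question)
        # ...retrieval + the instrumented LLM call would happen here...
        return "answer"

print(callable(traced_pipeline))  # True
```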
Can I use Phoenix with LangChain?
Yes. pip install openinference-instrumentation-langchain; call LangChainInstrumentor().instrument() once at startup. All LangChain components (chains, agents, retrievers) emit structured OTEL spans to Phoenix.
Do I need to deploy Phoenix Server for a small app?
No. For dev, px.launch_app() runs an in-process server in your notebook. For production, deploy Phoenix Server with Postgres — docker-compose configs ship in the repo.
Is the Elastic License a problem for commercial use?
For most commercial users, no — you can run Phoenix in production inside your company without issue. The license restriction targets resale as a SaaS competing with Arize. Check with your legal team for specifics if you’re building a platform product.