Langfuse — Open-source LLM Engineering Platform
Langfuse is a leading open-source platform for LLM traces, prompts, evaluations, and datasets. Instrument your agent with the SDK or OpenTelemetry and get production-grade debugging and evaluation.
Why Langfuse
Langfuse wins on trace depth and eval integration. A multi-step agent with retrieval, tool calls, and LLM calls produces a nested span tree you can drill into — input/output at every level, cost and latency rolled up, errors attached to the right span. It’s the closest thing to "APM for LLM apps" that the ecosystem has.
The platform bundles four tightly-linked products: traces, prompt management (versioned prompts with deployment labels), evaluations (LLM-as-judge, user feedback, custom scores), and datasets (curate examples from real traces, replay them on new prompts). They fit together because they share the same trace model — no CSV exports and imports between tools.
Where Langfuse asks for more than Helicone: you must instrument. Either add the SDK decorator to your functions or configure OpenTelemetry. For existing codebases that can spare an afternoon of instrumentation, the payoff is far richer data than proxy-based observability provides.
Quick Start — Python SDK with OpenAI
The @observe() decorator creates a span for any function. langfuse.openai wraps the OpenAI SDK so every call becomes an automatic child span with prompt, response, usage, and cost. For non-OpenAI providers, use Langfuse’s generic SDK or the OpenTelemetry instrumentation.
# pip install langfuse openai
import os

os.environ["LANGFUSE_PUBLIC_KEY"] = "pk-lf-..."
os.environ["LANGFUSE_SECRET_KEY"] = "sk-lf-..."
os.environ["LANGFUSE_HOST"] = "https://cloud.langfuse.com"  # or self-host URL

# The drop-in: langfuse-wrapped OpenAI client → every call auto-traced
from langfuse.openai import openai  # instead of: from openai import OpenAI

client = openai.OpenAI()

def greet(name: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"Greet {name} in one sentence."}],
        name="greet",  # span name shown in Langfuse UI
    )
    return resp.choices[0].message.content

# Trace a whole agent turn with nested spans
from langfuse.decorators import observe

@observe()
def agent(user_msg: str) -> str:
    greeting = greet("William")  # nested LLM span
    return f"{greeting}\nYou said: {user_msg}"

print(agent("Hello from Langfuse"))

# Langfuse UI now shows a tree: agent → greet (LLM call, prompt/response, cost, latency).
Key Features
Nested trace model
Parent/child spans capture multi-step agents: retrievers, tool calls, chained LLM calls. Drill into each span for input/output/cost/latency.
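A minimal sketch of deeper nesting with the same @observe() decorator used in the quick start; retrieve_docs, check_inventory, and support_agent are hypothetical stand-ins for your own retrieval, tool, and orchestration functions:

from langfuse.decorators import observe
from langfuse.openai import openai  # traced OpenAI client, as in the quick start

client = openai.OpenAI()

@observe()  # child span: retrieval step
def retrieve_docs(query: str) -> list[str]:
    return ["doc-1 text...", "doc-2 text..."]  # placeholder for your vector store

@observe()  # child span: tool call
def check_inventory(sku: str) -> dict:
    return {"sku": sku, "in_stock": True}  # placeholder for a real API call

@observe()  # root span for the whole agent turn
def support_agent(question: str) -> str:
    docs = retrieve_docs(question)
    stock = check_inventory("SKU-42")
    resp = client.chat.completions.create(  # generation nested under support_agent
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"{question}\nContext: {docs}\nStock: {stock}"}],
    )
    return resp.choices[0].message.content

In the Langfuse UI this renders as support_agent → retrieve_docs / check_inventory / generation, each span with its own input, output, and latency, and token cost on the LLM span.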
Prompt management
Prompts stored as versioned objects with labels (production/staging/dev). Reference by name from code; deploy new versions without redeploy.
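A hedged sketch of the fetch-and-use loop with the Python SDK; the prompt name "support-greeting" and its {{name}} variable are illustrative assumptions, not prompts that ship with Langfuse:

from langfuse import Langfuse
from langfuse.openai import openai

langfuse = Langfuse()  # reads the LANGFUSE_* env vars
client = openai.OpenAI()

# Fetch the version currently tagged "production"; swap the label for staging/dev.
prompt = langfuse.get_prompt("support-greeting", label="production")

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt.compile(name="William")}],
    langfuse_prompt=prompt,  # links this generation to the prompt version in the UI
)
print(resp.choices[0].message.content)

Publishing a new prompt version and moving the production label in the Langfuse UI changes what this code fetches on its next cache-expired call, with no redeploy.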
LLM-as-judge evals
Configure eval prompts that score outputs along axes (helpfulness, factuality, format adherence). Scores attach to traces automatically.
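Managed LLM-as-judge evaluators are typically configured in the Langfuse UI; as an illustration of the underlying pattern, here is a hedged SDK sketch that runs a judge prompt yourself and attaches the result as a score. The 1-5 factuality rubric and the trace_id argument are assumptions for the example:

from langfuse import Langfuse
from openai import OpenAI

langfuse = Langfuse()
judge = OpenAI()  # plain client: usually you don't want to trace the judge itself

def judge_factuality(trace_id: str, source: str, summary: str) -> int:
    resp = judge.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"Rate the factuality of this summary against the source, 1-5. "
                       f"Reply with only the number.\nSource: {source}\nSummary: {summary}",
        }],
    )
    score = int(resp.choices[0].message.content.strip())
    # Attach the score to the trace; it shows up in dashboards and trace detail views.
    langfuse.score(trace_id=trace_id, name="factuality", value=score)
    return score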
Datasets from production
Promote real traces to datasets. Replay them on new prompts or models to measure regression before deploy.
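A hedged sketch of the promote-and-replay loop; the dataset name, the source_trace_id, and the agent() function (from the quick start above) are illustrative:

from langfuse import Langfuse

langfuse = Langfuse()

# 1. Curate: promote a production trace's input/output into a dataset.
langfuse.create_dataset(name="support-regression")  # hypothetical dataset name
langfuse.create_dataset_item(
    dataset_name="support-regression",
    input={"user_msg": "Where is my order?"},
    expected_output="A polite status update with a tracking link.",
    source_trace_id="trace-abc123",  # hypothetical id of the original trace
)

# 2. Replay: run every item through the current agent before deploying a change.
dataset = langfuse.get_dataset("support-regression")
for item in dataset.items:
    with item.observe(run_name="prompt-v2") as trace_id:  # links this run to the item
        agent(item.input["user_msg"])  # agent() as defined in the quick start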
User feedback capture
Attach thumbs-up/down or free-text feedback to trace IDs. Slice metrics by user sentiment to find regressions fast.
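A hedged sketch: expose the trace id to your frontend from inside an @observe()-traced function, then record the user's thumbs-up/down against it. The answer() and record_feedback() functions and the feedback flow belong to your own app, not to Langfuse:

from langfuse import Langfuse
from langfuse.decorators import observe, langfuse_context

langfuse = Langfuse()

@observe()
def answer(user_msg: str) -> dict:
    reply = "Here is your answer."  # placeholder for the real traced LLM call
    # Return the trace id so the frontend can reference it in feedback calls.
    return {"reply": reply, "trace_id": langfuse_context.get_current_trace_id()}

def record_feedback(trace_id: str, thumbs_up: bool, comment: str = "") -> None:
    # Called by your own feedback endpoint when the user clicks thumbs-up / thumbs-down.
    langfuse.score(
        trace_id=trace_id,
        name="user-feedback",
        value=1 if thumbs_up else 0,
        comment=comment,
    )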
OpenTelemetry compatible
Ingests OTEL traces from any instrumented framework (LangChain, LlamaIndex, CrewAI, custom). Works alongside existing APM tooling.
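If you would rather not use the Langfuse SDK at all, you can point a standard OTLP exporter at Langfuse. A hedged sketch with the stock OpenTelemetry Python packages; the /api/public/otel endpoint path and Basic-auth scheme are taken from Langfuse's OTLP docs, so verify them against your Langfuse version:

# pip install opentelemetry-sdk opentelemetry-exporter-otlp-proto-http
import base64, os
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

# Langfuse authenticates OTLP with Basic auth over the public/secret key pair.
auth = base64.b64encode(
    f"{os.environ['LANGFUSE_PUBLIC_KEY']}:{os.environ['LANGFUSE_SECRET_KEY']}".encode()
).decode()

exporter = OTLPSpanExporter(
    endpoint=os.environ["LANGFUSE_HOST"] + "/api/public/otel/v1/traces",
    headers={"Authorization": f"Basic {auth}"},
)
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)

# Any OTEL-instrumented framework (LangChain, LlamaIndex, CrewAI, custom) now ships
# its spans to Langfuse alongside whatever APM backend you already export to.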
Comparison
| Tool | Trace Depth | Evals | Prompt Mgmt | Deployment |
|---|---|---|---|---|
| Langfuse | Nested spans + OTEL | Built-in LLM-as-judge | First-class | Cloud + self-host (free OSS) |
| Helicone | Per-request (flat) | Via experiments | Yes | Cloud + self-host |
| Arize Phoenix | Span-level (OTEL native) | Strong eval suite | Via playground | Cloud + self-host |
| Portkey | Request-level + metadata | Limited | Yes (strong) | Cloud + self-host gateway |
Use Cases
01. Production agent debugging
Nested traces are invaluable when a multi-step agent produces wrong output — you can see which tool call returned bad data, not just "the final answer is wrong".
02. Prompt engineering workflows
Promote a real production trace to a dataset, iterate on prompts in the Langfuse playground, and run evals before deploying. The round-trip from "bug report" to "fixed prompt" is much shorter than when stitching together separate tools.
03. Enterprise self-host
Full stack is MIT-licensed. Teams with data-residency requirements deploy Langfuse in their own cloud and point agents at it — no data leaves their perimeter.
Pricing & License
Langfuse: MIT open source. Self-host for free — full feature parity with cloud (except managed upgrades and support).
Langfuse Cloud: free tier for dev; paid plans by event volume. Enterprise adds SSO, SAML, SOC 2, dedicated support, and deployment automation. Pricing at langfuse.com/pricing.
Infra cost for self-host: Postgres + ClickHouse + Redis + a worker. Moderate ops load; the OSS docker-compose gets you running in about 15 minutes for dev. Production scale-out requires familiarity with ClickHouse.
Related Assets on TokRepo
Langfuse — Open Source LLM Observability
Langfuse is an open-source LLM engineering platform for tracing, prompt management, evaluation, and debugging AI apps. 24.1K+ GitHub stars. Self-hosted or cloud. MIT.
LangFuse — Open Source LLM Observability & Tracing
Trace, evaluate, and monitor LLM applications in production. Open-source alternative to LangSmith with prompt management, cost tracking, and evaluation pipelines.
Frequently Asked Questions
Do I need to rewrite my code to use Langfuse?
For OpenAI callers, no — just swap the import (langfuse.openai instead of openai) and all calls become traced. For custom providers or agent frameworks, add the @observe decorator to functions or configure OpenTelemetry. Less zero-touch than Helicone, more instrumentation depth.
Langfuse vs Arize Phoenix?
Both are open-source LLM observability tools. Langfuse is more of a full platform (traces + prompts + evals + datasets tightly linked); Phoenix is more focused on notebooks and eval experimentation. Langfuse for production ops; Phoenix for researcher / data-scientist workflows.
Does Langfuse work with LangChain / LlamaIndex / CrewAI?
Yes — first-class callback integrations for each. Teams already on LangSmith often run Langfuse alongside it (Langfuse as the open-source, self-hostable trace store; LangSmith as the managed offering).
How do LLM-as-judge evals work?
You define an eval prompt that scores outputs on some axis (e.g., "rate the factuality of this summary 1-5"). Langfuse runs the eval against traces (offline or online), attaches scores to traces, and surfaces aggregates in dashboards. Good for continuous quality monitoring.
Is self-hosted Langfuse production-ready?
Yes — used by many teams in production. Requires Postgres + ClickHouse + Redis. Docker-compose exists for dev; production deploys typically use Kubernetes with managed Postgres + ClickHouse Cloud or self-hosted ClickHouse.