Langfuse — Open-source LLM Engineering Platform
Langfuse is a leading open-source platform for LLM traces, prompts, evaluations, and datasets. Instrument your agent with the SDK or OpenTelemetry and get production-grade debugging and evaluation.
Why Langfuse
Langfuse wins on trace depth and eval integration. A multi-step agent with retrieval, tool calls, and LLM calls produces a nested span tree you can drill into — input/output at every level, cost and latency rolled up, errors attached to the right span. It’s the closest thing to "APM for LLM apps" that the ecosystem has.
The platform bundles four tightly-linked products: traces, prompt management (versioned prompts with deployment labels), evaluations (LLM-as-judge, user feedback, custom scores), and datasets (curate examples from real traces, replay them on new prompts). They fit together because they share the same trace model — no CSV exports and imports between tools.
Where Langfuse asks for more than Helicone: you must instrument. Either add the SDK decorator to your functions or configure OpenTelemetry. For existing codebases that can spare an afternoon of instrumentation, the payoff is far richer data than proxy-based observability provides.
Quick Start — Python SDK with OpenAI
The @observe() decorator creates a span for any function. langfuse.openai wraps the OpenAI SDK so every call becomes an automatic child span with prompt, response, usage, and cost. For non-OpenAI providers, use Langfuse’s generic SDK or the OpenTelemetry instrumentation.
# pip install langfuse openai
import os

os.environ["LANGFUSE_PUBLIC_KEY"] = "pk-lf-..."
os.environ["LANGFUSE_SECRET_KEY"] = "sk-lf-..."
os.environ["LANGFUSE_HOST"] = "https://cloud.langfuse.com"  # or self-host URL

# The drop-in: langfuse-wrapped OpenAI client → every call auto-traced
from langfuse.openai import openai  # instead of: from openai import OpenAI

client = openai.OpenAI()

def greet(name: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"Greet {name} in one sentence."}],
        name="greet",  # span name shown in Langfuse UI
    )
    return resp.choices[0].message.content

# Trace a whole agent turn with nested spans
from langfuse.decorators import observe

@observe()
def agent(user_msg: str) -> str:
    greeting = greet("William")  # nested LLM span
    return f"{greeting}\nYou said: {user_msg}"

print(agent("Hello from Langfuse"))

# Langfuse UI now shows a tree: agent → greet (LLM call, prompt/response, cost, latency).
Key Features
Nested trace model
Parent/child spans capture multi-step agents: retrievers, tool calls, chained LLM calls. Drill into each span for input/output/cost/latency.
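A minimal sketch of deeper nesting with the same @observe() decorator used in the quick start; retrieve_docs, check_inventory, and support_agent are hypothetical stand-ins for your own retrieval, tool, and orchestration functions:

from langfuse.decorators import observe
from langfuse.openai import openai  # traced OpenAI client, as in the quick start

client = openai.OpenAI()

@observe()  # child span: retrieval step
def retrieve_docs(query: str) -> list[str]:
    return ["doc-1 text...", "doc-2 text..."]  # placeholder for your vector store

@observe()  # child span: tool call
def check_inventory(sku: str) -> dict:
    return {"sku": sku, "in_stock": True}  # placeholder for a real API call

@observe()  # root span for the whole agent turn
def support_agent(question: str) -> str:
    docs = retrieve_docs(question)
    stock = check_inventory("SKU-42")
    resp = client.chat.completions.create(  # generation nested under support_agent
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"{question}\nContext: {docs}\nStock: {stock}"}],
    )
    return resp.choices[0].message.content

In the Langfuse UI this renders as support_agent → retrieve_docs / check_inventory / generation, each span with its own input, output, and latency, and token cost on the LLM span.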
Prompt management
Prompts stored as versioned objects with labels (production/staging/dev). Reference by name from code; deploy new versions without redeploy.
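A hedged sketch of the fetch-and-use loop with the Python SDK; the prompt name "support-greeting" and its {{name}} variable are illustrative assumptions, not prompts that ship with Langfuse:

from langfuse import Langfuse
from langfuse.openai import openai

langfuse = Langfuse()  # reads the LANGFUSE_* env vars
client = openai.OpenAI()

# Fetch the version currently tagged "production"; swap the label for staging/dev.
prompt = langfuse.get_prompt("support-greeting", label="production")

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt.compile(name="William")}],
    langfuse_prompt=prompt,  # links this generation to the prompt version in the UI
)
print(resp.choices[0].message.content)

Publishing a new prompt version and moving the production label in the Langfuse UI changes what this code fetches on its next cache-expired call, with no redeploy.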
LLM-as-judge evals
Configure eval prompts that score outputs along axes (helpfulness, factuality, format adherence). Scores attach to traces automatically.
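Managed LLM-as-judge evaluators are typically configured in the Langfuse UI; as an illustration of the underlying pattern, here is a hedged SDK sketch that runs a judge prompt yourself and attaches the result as a score. The 1-5 factuality rubric and the trace_id argument are assumptions for the example:

from langfuse import Langfuse
from openai import OpenAI

langfuse = Langfuse()
judge = OpenAI()  # plain client: usually you don't want to trace the judge itself

def judge_factuality(trace_id: str, source: str, summary: str) -> int:
    resp = judge.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"Rate the factuality of this summary against the source, 1-5. "
                       f"Reply with only the number.\nSource: {source}\nSummary: {summary}",
        }],
    )
    score = int(resp.choices[0].message.content.strip())
    # Attach the score to the trace; it shows up in dashboards and trace detail views.
    langfuse.score(trace_id=trace_id, name="factuality", value=score)
    return score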
Datasets from production
Promote real traces to datasets. Replay them on new prompts or models to measure regression before deploy.
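A hedged sketch of the promote-and-replay loop; the dataset name, the source_trace_id, and the agent() function (from the quick start above) are illustrative:

from langfuse import Langfuse

langfuse = Langfuse()

# 1. Curate: promote a production trace's input/output into a dataset.
langfuse.create_dataset(name="support-regression")  # hypothetical dataset name
langfuse.create_dataset_item(
    dataset_name="support-regression",
    input={"user_msg": "Where is my order?"},
    expected_output="A polite status update with a tracking link.",
    source_trace_id="trace-abc123",  # hypothetical id of the original trace
)

# 2. Replay: run every item through the current agent before deploying a change.
dataset = langfuse.get_dataset("support-regression")
for item in dataset.items:
    with item.observe(run_name="prompt-v2") as trace_id:  # links this run to the item
        agent(item.input["user_msg"])  # agent() as defined in the quick start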
User feedback capture
Attach thumbs-up/down or free-text feedback to trace IDs. Slice metrics by user sentiment to find regressions fast.
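A hedged sketch: expose the trace id to your frontend from inside an @observe()-traced function, then record the user's thumbs-up/down against it. The answer() and record_feedback() functions and the feedback flow belong to your own app, not to Langfuse:

from langfuse import Langfuse
from langfuse.decorators import observe, langfuse_context

langfuse = Langfuse()

@observe()
def answer(user_msg: str) -> dict:
    reply = "Here is your answer."  # placeholder for the real traced LLM call
    # Return the trace id so the frontend can reference it in feedback calls.
    return {"reply": reply, "trace_id": langfuse_context.get_current_trace_id()}

def record_feedback(trace_id: str, thumbs_up: bool, comment: str = "") -> None:
    # Called by your own feedback endpoint when the user clicks thumbs-up / thumbs-down.
    langfuse.score(
        trace_id=trace_id,
        name="user-feedback",
        value=1 if thumbs_up else 0,
        comment=comment,
    )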
OpenTelemetry compatible
Ingests OTEL traces from any instrumented framework (LangChain, LlamaIndex, CrewAI, custom). Works alongside existing APM tooling.
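If you would rather not use the Langfuse SDK at all, you can point a standard OTLP exporter at Langfuse. A hedged sketch with the stock OpenTelemetry Python packages; the /api/public/otel endpoint path and Basic-auth scheme are taken from Langfuse's OTLP docs, so verify them against your Langfuse version:

# pip install opentelemetry-sdk opentelemetry-exporter-otlp-proto-http
import base64, os
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

# Langfuse authenticates OTLP with Basic auth over the public/secret key pair.
auth = base64.b64encode(
    f"{os.environ['LANGFUSE_PUBLIC_KEY']}:{os.environ['LANGFUSE_SECRET_KEY']}".encode()
).decode()

exporter = OTLPSpanExporter(
    endpoint=os.environ["LANGFUSE_HOST"] + "/api/public/otel/v1/traces",
    headers={"Authorization": f"Basic {auth}"},
)
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)

# Any OTEL-instrumented framework (LangChain, LlamaIndex, CrewAI, custom) now ships
# its spans to Langfuse alongside whatever APM backend you already export to.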
Comparison
| Tool | Trace Depth | Evals | Prompt Mgmt | Deployment |
|---|---|---|---|---|
| Langfuse | Nested spans + OTEL | Built-in LLM-as-judge | First-class | Cloud + self-host (free OSS) |
| Helicone | Per-request (flat) | Via experiments | Yes | Cloud + self-host |
| Arize Phoenix | Span-level (OTEL native) | Strong eval suite | Via playground | Cloud + self-host |
| Portkey | Request-level + metadata | Limited | Yes (strong) | Cloud + self-host gateway |
Use Cases
01. Production agent debugging
Nested traces are invaluable when a multi-step agent produces wrong output — you can see which tool call returned bad data, not just "the final answer is wrong".
02. Prompt engineering workflows
Promote a real production trace to a dataset, iterate on prompts in the Langfuse playground, and run evals before deploying. The round-trip from "bug report" to "fixed prompt" is much shorter than when stitching together separate tools.
03. Enterprise self-host
Full stack is MIT-licensed. Teams with data-residency requirements deploy Langfuse in their own cloud and point agents at it — no data leaves their perimeter.
Pricing & License
Langfuse: MIT open source. Self-host for free — full feature parity with cloud (except managed upgrades and support).
Langfuse Cloud: free tier for dev; paid plans by event volume. Enterprise adds SSO, SAML, SOC 2, dedicated support, and deployment automation. Pricing at langfuse.com/pricing.
Infra cost for self-host: Postgres + ClickHouse + Redis + a worker. Moderate ops load; the OSS docker-compose gets you running in about 15 minutes for dev. Production scale-out requires familiarity with ClickHouse.
Related Assets on TokRepo
Langfuse — Open Source LLM Observability
Langfuse is an open-source LLM engineering platform for tracing, prompt management, evaluation, and debugging AI apps. 24.1K+ GitHub stars. Self-hosted or cloud. MIT.
LangFuse — Open Source LLM Observability & Tracing
Trace, evaluate, and monitor LLM applications in production. Open-source alternative to LangSmith with prompt management, cost tracking, and evaluation pipelines.
Frequently Asked Questions
Do I need to rewrite my code to use Langfuse?
For OpenAI callers, no — just swap the import (langfuse.openai instead of openai) and all calls become traced. For custom providers or agent frameworks, add the @observe decorator to functions or configure OpenTelemetry. Less zero-touch than Helicone, more instrumentation depth.
Langfuse vs Arize Phoenix?
Both are open-source LLM observability tools. Langfuse is more of a full platform (traces + prompts + evals + datasets tightly linked); Phoenix is more focused on notebooks and eval experimentation. Langfuse for production ops; Phoenix for researcher / data-scientist workflows.
Does Langfuse work with LangChain / LlamaIndex / CrewAI?
Yes — first-class callback integrations for each. Teams already on LangSmith often run Langfuse alongside it (Langfuse as the open-source, self-hostable trace store; LangSmith as the managed offering).
How do LLM-as-judge evals work?
You define an eval prompt that scores outputs on some axis (e.g., "rate the factuality of this summary 1-5"). Langfuse runs the eval against traces (offline or online), attaches scores to traces, and surfaces aggregates in dashboards. Good for continuous quality monitoring.
Is self-hosted Langfuse production-ready?
Yes — used by many teams in production. Requires Postgres + ClickHouse + Redis. Docker-compose exists for dev; production deploys typically use Kubernetes with managed Postgres + ClickHouse Cloud or self-hosted ClickHouse.