Langfuse — Open Source LLM Observability
Langfuse is an open-source LLM engineering platform for tracing, prompt management, evaluation, and debugging AI apps. 24.1K+ GitHub stars. Self-hosted or cloud. MIT-licensed.
What it is
Langfuse is an open-source LLM engineering platform that provides observability for AI applications. It traces every LLM call, tracks token usage and latency, manages prompt versions, and supports evaluation workflows. You integrate it with a few lines of code and get a dashboard showing how your AI application performs in production.
Langfuse targets AI engineers, ML teams, and product developers who build LLM-powered applications and need to understand cost, quality, and performance. It is available as a cloud service or self-hosted under the MIT license.
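A minimal configuration sketch, assuming you have created a Langfuse project and copied its API keys; the key values below are placeholders:
import os

# Credentials from your Langfuse project settings (placeholder values)
os.environ['LANGFUSE_PUBLIC_KEY'] = 'pk-lf-...'
os.environ['LANGFUSE_SECRET_KEY'] = 'sk-lf-...'
# Point at Langfuse Cloud or your self-hosted instance
os.environ['LANGFUSE_HOST'] = 'https://cloud.langfuse.com'
The SDK reads these variables automatically when you construct a client or use any of the integrations.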
How it saves time or tokens
Without observability, debugging LLM applications means adding print statements, manually counting tokens, and guessing why outputs degrade. Langfuse automatically traces every call, records inputs/outputs, measures latency, and calculates costs. Prompt management lets you version and A/B test prompts without code changes. This visibility helps you identify expensive or slow calls and optimize them, directly reducing token waste.
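As a sketch of how little instrumentation this takes, the v2 Python SDK's @observe decorator wraps any function and records its inputs, outputs, and latency as a trace; the function below is illustrative, not part of the Langfuse API:
from langfuse.decorators import observe

@observe()
def answer_question(question: str) -> str:
    # Inputs, outputs, and timing are captured automatically as a trace
    # ... call your retriever and LLM here ...
    return 'answer'

answer_question('What does Langfuse trace?')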
How to use
- Install the SDK:
pip install langfuse openai
- Add tracing with a one-line import swap:
from langfuse.openai import openai
client = openai.OpenAI()
response = client.chat.completions.create(
    model='gpt-4o',
    messages=[{'role': 'user', 'content': 'Hello'}]
)
- View traces in the Langfuse dashboard at cloud.langfuse.com or your self-hosted instance.
- Use the prompt management UI to version and deploy prompts without redeploying code; fetch them at runtime as shown in the sketch below.
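A minimal sketch of runtime prompt fetching; the prompt name 'answer-prompt' and its {{question}} template variable are placeholders for whatever you define in the UI:
from langfuse import Langfuse

langfuse = Langfuse()  # reads the LANGFUSE_* credentials from the environment

# Fetch the currently deployed version of a prompt managed in the UI
prompt = langfuse.get_prompt('answer-prompt')

# Fill template variables to get the final prompt text
compiled = prompt.compile(question='What is observability?')
Because the prompt is resolved at request time, deploying a new version in the UI changes behavior without a code release.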
Example
from langfuse import Langfuse

langfuse = Langfuse()

# Create a trace for a multi-step workflow
trace = langfuse.trace(name='rag-pipeline')

# Span for retrieval step
retrieval = trace.span(name='retrieval')
# ... your retrieval logic ...
retrieval.end(output={'docs_found': 5})

# Generation span for LLM call
generation = trace.generation(
    name='answer-generation',
    model='gpt-4o',
    input=[{'role': 'user', 'content': 'question'}]
)
# ... your LLM call ...
generation.end(output='answer text', usage={'input': 150, 'output': 200})

# Events are sent asynchronously; flush before a short-lived script exits
langfuse.flush()
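To nest the drop-in OpenAI client's calls under the trace above, the wrapper accepts Langfuse-specific keyword arguments alongside the OpenAI ones; a sketch assuming the v2 wrapper's trace_id parameter:
from langfuse.openai import openai

client = openai.OpenAI()
# trace_id attaches this generation to the 'rag-pipeline' trace created above
response = client.chat.completions.create(
    model='gpt-4o',
    messages=[{'role': 'user', 'content': 'question'}],
    trace_id=trace.id,
)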
Related on TokRepo
- Langfuse on AI Gateway -- Langfuse as an observability layer
- Monitoring Tools -- Observability and monitoring tools
Common pitfalls
- Tracing adds a small latency overhead per call. For latency-sensitive applications, use async flushing (enabled by default) and batch spans (see the sketch after this list).
- Self-hosted Langfuse requires PostgreSQL and ClickHouse. Plan for database maintenance and storage growth as trace volume increases.
- Prompt management works best when prompts are fetched at runtime. Hardcoded prompts in code bypass the versioning system entirely.
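For the first pitfall, batching is tunable at client construction; a sketch assuming the v2 Python SDK's flush_at and flush_interval arguments (verify the names for your SDK version):
from langfuse import Langfuse

# Send events in larger, less frequent batches to cut network overhead
langfuse = Langfuse(flush_at=50, flush_interval=5.0)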
Frequently Asked Questions
Is Langfuse free?
Langfuse is open source under the MIT license. Self-hosting is completely free. The cloud-hosted version has a free tier with usage limits and paid plans for higher volume.
Which providers and frameworks does Langfuse integrate with?
Langfuse integrates with OpenAI, Anthropic, Google, AWS Bedrock, Azure OpenAI, and any provider via the generic SDK. Framework integrations exist for LangChain, LlamaIndex, and Haystack.
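For example, the LangChain integration is a callback handler; a minimal sketch, assuming the langchain_openai package and the same LANGFUSE_* environment variables:
from langchain_openai import ChatOpenAI
from langfuse.callback import CallbackHandler

handler = CallbackHandler()

llm = ChatOpenAI(model='gpt-4o')
# Every step of the invocation is recorded as one Langfuse trace
response = llm.invoke('Hello', config={'callbacks': [handler]})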
Can I self-host Langfuse?
Yes. Langfuse provides Docker images and Helm charts for self-hosting. It requires PostgreSQL and ClickHouse. The self-hosted version has full feature parity with the cloud version.
How does Langfuse compare to LangSmith?
LangSmith is LangChain's proprietary observability platform. Langfuse is open source, framework-agnostic, and self-hostable. If you use LangChain exclusively, LangSmith has deeper integration. If you want vendor independence, Langfuse is the better choice.
Does Langfuse support evaluation?
Yes. Langfuse supports manual annotation, model-based evaluation, and custom scoring functions. You can define evaluation criteria and score traces programmatically or through the UI.
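A sketch of programmatic scoring with the v2 Python SDK; the score name and value are illustrative:
from langfuse import Langfuse

langfuse = Langfuse()
trace = langfuse.trace(name='qa-run')
# ... run your pipeline and judge the output ...

# Attach a numeric score to the trace for filtering and dashboards
langfuse.score(trace_id=trace.id, name='accuracy', value=1.0)
langfuse.flush()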
Citations (3)
- Langfuse GitHub -- Open-source LLM engineering platform, MIT license
- Langfuse Documentation -- Tracing, prompt management, and evaluation features
- Anthropic Observability Guide -- LLM observability best practices
Source & Thanks
Created by Langfuse. Licensed under MIT. langfuse/langfuse -- 24,100+ GitHub stars