Configs · Apr 3, 2026 · 2 min read

Opik — Debug, Evaluate & Monitor LLM Apps

Trace LLM calls, run automated evaluations, and monitor RAG and agent quality in production. By Comet. 18K+ GitHub stars.

Introduction

Opik is an open-source LLM evaluation and observability platform by Comet with 18,600+ GitHub stars. It provides end-to-end tracing for LLM calls, automated evaluation with 20+ built-in metrics, dataset management for regression testing, and production monitoring dashboards. A single @opik.track decorator captures everything — inputs, outputs, latency, token usage, and costs. Opik integrates with LangChain, LlamaIndex, OpenAI, Anthropic, and major agent frameworks, giving teams full visibility into their AI application quality.

Works with: OpenAI, Anthropic, LangChain, LlamaIndex, CrewAI, Haystack, Bedrock. Best for teams running LLM apps in production who need evaluation and monitoring. Setup time: under 3 minutes.
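Setup is a pip install plus one configuration step. The `opik configure` command comes from the Opik CLI and prompts for a Comet API key or a local server address; check the official docs if your version differs:

```shell
pip install opik
opik configure   # points the SDK at Comet cloud or a self-hosted server
```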


Opik Features

Tracing

import opik

@opik.track
def rag_pipeline(query: str):
    docs = retrieve(query)             # Traced as child span
    context = format_docs(docs)        # Traced as child span
    answer = generate(query, context)  # Traced as child span
    return answer

# Dashboard shows full trace tree:
# rag_pipeline (2.3s)
#   ├─ retrieve (0.5s) - 8 docs found
#   ├─ format_docs (0.1s)
#   └─ generate (1.7s) - 342 tokens, $0.005

Automated Evaluation (20+ Metrics)

import opik
from opik.evaluation import evaluate
from opik.evaluation.metrics import Hallucination, AnswerRelevance, ContextPrecision

# Evaluate your RAG pipeline against a stored dataset
dataset = opik.Opik().get_dataset(name="qa-test-set")
results = evaluate(
    dataset=dataset,
    task=rag_pipeline,  # the task receives a dataset item dict and returns a dict
    scoring_metrics=[
        Hallucination(),
        AnswerRelevance(),
        ContextPrecision(),
    ],
)
# Hallucination: 0.12 | Relevance: 0.89 | Precision: 0.85

Built-in metrics:

  • Hallucination — Detects fabricated information
  • Answer Relevance — Does the answer match the question?
  • Context Precision — Is retrieved context relevant?
  • Faithfulness — Is the answer supported by context?
  • Moderation — Toxicity, bias, PII detection
  • Custom — Write your own Python scoring functions
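As a sketch of the last bullet, here is a plain-Python scoring function of the kind you could wrap in an Opik custom metric. The keyword-overlap heuristic and the function name are illustrative assumptions, not an Opik built-in:

```python
def keyword_overlap_score(output: str, expected: str) -> float:
    """Fraction of the expected answer's keywords found in the model output.

    Illustrative heuristic only; wrap scoring logic like this in an
    Opik custom metric class to use it in evaluation runs.
    """
    expected_kw = {w.lower().strip(".,") for w in expected.split() if len(w) > 3}
    if not expected_kw:
        return 1.0
    output_words = {w.lower().strip(".,") for w in output.split()}
    return len(expected_kw & output_words) / len(expected_kw)

# keyword_overlap_score(
#     "Retrieval Augmented Generation grounds answers in documents",
#     "Retrieval Augmented Generation",
# ) -> 1.0
```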

Dataset Management

import opik
from opik.evaluation import evaluate

# Create evaluation datasets from production traces
client = opik.Opik()
dataset = client.get_or_create_dataset(name="regression-tests")
dataset.insert([
    {"input": "What is RAG?", "expected": "Retrieval Augmented Generation..."},
    {"input": "How does fine-tuning work?", "expected": "Fine-tuning adjusts..."},
])

# Run evaluations on every deployment
results = evaluate(dataset=dataset, task=my_pipeline)
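Since the point of running evaluations on every deployment is to block regressions, a minimal CI gate might look like the sketch below. The dict shape and threshold values are assumptions for illustration; Opik's results object exposes scores through its own API:

```python
# Assumed shape: per-metric average scores collected into a plain dict.
THRESHOLDS = {"hallucination": 0.20, "answer_relevance": 0.80}

def passes_gate(scores: dict) -> bool:
    """Fail the deploy if hallucination rises or relevance drops past thresholds."""
    if scores["hallucination"] > THRESHOLDS["hallucination"]:
        return False  # hallucination is a lower-is-better metric
    if scores["answer_relevance"] < THRESHOLDS["answer_relevance"]:
        return False  # relevance is a higher-is-better metric
    return True

# passes_gate({"hallucination": 0.12, "answer_relevance": 0.89}) -> True
```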

Framework Integrations

# LangChain
from opik.integrations.langchain import OpikTracer
callbacks = [OpikTracer()]
chain.invoke(input, config={"callbacks": callbacks})

# LlamaIndex
from llama_index.core import Settings
from llama_index.core.callbacks import CallbackManager
from opik.integrations.llama_index import LlamaIndexCallbackHandler
Settings.callback_manager = CallbackManager([LlamaIndexCallbackHandler()])

# OpenAI directly
from openai import OpenAI
from opik.integrations.openai import track_openai
client = track_openai(OpenAI())  # use exactly like a normal OpenAI client

FAQ

Q: What is Opik? A: Opik is an open-source LLM evaluation and observability platform by Comet with 18,600+ GitHub stars. It provides tracing, 20+ automated evaluation metrics, dataset management, and production monitoring for LLM applications.

Q: How is Opik different from Langfuse? A: Both provide LLM tracing and observability. Opik has stronger evaluation features (20+ built-in metrics, automated eval pipelines), while Langfuse focuses more on prompt management. Opik is backed by Comet, an established MLOps company.

Q: Is Opik free? A: Yes, open-source under Apache-2.0. Self-host for free. Comet also offers a managed cloud version.
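For self-hosting, the repository ships a Docker-based setup. The exact entry-point script may change between releases, so treat this as a sketch and verify against the repo README:

```shell
git clone https://github.com/comet-ml/opik.git
cd opik
./opik.sh   # starts the local stack via Docker (script name per repo README; verify)
```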



Source and acknowledgements

Created by Comet ML. Licensed under Apache-2.0.

opik — ⭐ 18,600+
