Opik Features
Tracing
```python
import opik

@opik.track
def rag_pipeline(query: str):
    docs = retrieve(query)             # Traced as child span
    context = format(docs)            # Traced as child span
    answer = generate(query, context)  # Traced as child span
    return answer

# Dashboard shows the full trace tree:
# rag_pipeline (2.3s)
# ├─ retrieve (0.5s) - 8 docs found
# ├─ format (0.1s)
# └─ generate (1.7s) - 342 tokens, $0.005
```

Automated Evaluation (20+ Metrics)
```python
from opik.evaluation.metrics import Hallucination, AnswerRelevance, ContextPrecision

# Evaluate your RAG pipeline
results = opik.evaluate(
    dataset="qa-test-set",
    task=rag_pipeline,
    scoring_metrics=[
        Hallucination(),
        AnswerRelevance(),
        ContextPrecision(),
    ],
)

print(results.summary())
# Hallucination: 0.12 | Relevance: 0.89 | Precision: 0.85
```

Built-in metrics:
- Hallucination — Detects fabricated information
- Answer Relevance — Does the answer match the question?
- Context Precision — Is retrieved context relevant?
- Faithfulness — Is the answer supported by context?
- Moderation — Toxicity, bias, PII detection
- Custom — Write your own Python scoring functions
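To illustrate the last bullet: the scoring logic of a custom metric is plain Python. To plug it into `opik.evaluate` you would wrap it in a metric class (in Opik's SDK, a subclass of `base_metric.BaseMetric` whose `score()` returns a `score_result.ScoreResult` — check your SDK version). A minimal sketch, with illustrative names:

```python
def keyword_coverage(output: str, required_keywords: list[str]) -> float:
    """Fraction of required keywords that appear in the model output.

    Illustrative scoring logic only; wrap it in an Opik metric class
    (e.g. a BaseMetric subclass whose score() returns a ScoreResult)
    to use it as a scoring_metric in opik.evaluate.
    """
    if not required_keywords:
        return 1.0
    found = sum(1 for kw in required_keywords if kw.lower() in output.lower())
    return found / len(required_keywords)
```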
Dataset Management
```python
# Create evaluation datasets from production traces
dataset = opik.Dataset(name="regression-tests")
dataset.insert([
    {"input": "What is RAG?", "expected": "Retrieval Augmented Generation..."},
    {"input": "How does fine-tuning work?", "expected": "Fine-tuning adjusts..."},
])

# Run evaluations on every deployment
results = opik.evaluate(dataset=dataset, task=my_pipeline)
```

Framework Integrations
```python
# LangChain
from opik.integrations.langchain import OpikTracer

callbacks = [OpikTracer()]
chain.invoke(input, config={"callbacks": callbacks})

# LlamaIndex
from opik.integrations.llama_index import LlamaIndexCallbackHandler

handler = LlamaIndexCallbackHandler()

# OpenAI directly
from openai import OpenAI
from opik.integrations.openai import track_openai

client = track_openai(OpenAI())
```

FAQ
Q: What is Opik?
A: Opik is an open-source LLM evaluation and observability platform by Comet with 18,600+ GitHub stars. It provides tracing, 20+ automated evaluation metrics, dataset management, and production monitoring for LLM applications.

Q: How is Opik different from Langfuse?
A: Both provide LLM tracing and observability. Opik has stronger evaluation features (20+ built-in metrics, automated eval pipelines), while Langfuse focuses more on prompt management. Opik is backed by Comet, an established MLOps company.

Q: Is Opik free?
A: Yes. It is open source under the Apache-2.0 license, and you can self-host it for free. Comet also offers a managed cloud version.
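For the self-hosted route, the Python SDK is pointed at your local instance with a one-time configuration call. A minimal sketch based on Opik's quickstart docs — verify the `use_local` flag against your SDK version:

```python
import opik

# One-time setup: direct the SDK to a locally self-hosted Opik instance
# instead of the Comet-hosted cloud.
opik.configure(use_local=True)
```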