Configs · Apr 3, 2026 · 2 min read

Opik — Debug, Evaluate & Monitor LLM Apps

Trace LLM calls, run automated evaluations, and monitor RAG and agent quality in production. By Comet. 18K+ GitHub stars.

TL;DR
Opik traces LLM calls, runs evals, and monitors RAG quality in production.
§01

What it is

Opik is an open-source LLM observability platform by Comet that provides tracing, evaluation, and production monitoring for AI applications. It instruments LLM calls with a single decorator, runs automated quality evaluations on your outputs, and monitors RAG retrieval quality and agent behavior in production.

It targets AI engineers building production LLM applications who need to debug issues, measure output quality systematically, and catch regressions before users report them.

§02

How it saves time or tokens

Opik surfaces the root cause of quality issues faster than manual debugging. The tracing view shows exactly which step in a multi-step chain produced a bad output, with token counts, latency, and cost for each step. Automated evaluations run continuously, so you know when prompt changes improve or degrade quality. For RAG applications, Opik evaluates retrieval relevance and generation faithfulness, identifying where tokens are wasted on irrelevant context.
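
A minimal sketch of how that per-step visibility falls out of nested tracing (the function bodies here are placeholders):

import opik

@opik.track()
def retrieve(query: str) -> str:
    # Placeholder retrieval step; each traced function becomes its
    # own span with inputs, outputs, and latency in the trace view
    return 'retrieved context for: ' + query

@opik.track()
def answer(query: str) -> str:
    context = retrieve(query)  # nested call appears as a child span
    return f'Answer grounded in: {context}'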

§03

How to use

  1. Install and configure:
pip install opik
opik configure
  2. Add tracing with one decorator:
import opik

@opik.track()
def generate_answer(question: str):
    from openai import OpenAI
    client = OpenAI()
    response = client.chat.completions.create(
        model='gpt-4o',
        messages=[{'role': 'user', 'content': question}]
    )
    return response.choices[0].message.content
  3. Run evaluations on your dataset:
from opik.evaluation import evaluate
from opik.evaluation.metrics import Hallucination, AnswerRelevance

# Opik's evaluate passes each dataset item to the task as a dict and
# expects a dict of outputs back, so wrap generate_answer accordingly
def evaluation_task(item: dict) -> dict:
    return {'output': generate_answer(item['input'])}

results = evaluate(
    experiment_name='qa-v2',
    dataset=my_dataset,
    task=evaluation_task,
    scoring_metrics=[Hallucination(), AnswerRelevance()]
)
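The call above assumes my_dataset already exists. A hedged sketch of creating one with the Opik client (field names are illustrative; check the SDK docs for your version):

import opik

client = opik.Opik()
dataset = client.get_or_create_dataset(name='qa-dataset')
# Items are plain dicts; the 'input' key is what evaluation_task reads
dataset.insert([
    {'input': 'What is retrieval-augmented generation?'},
    {'input': 'How does Opik trace nested LLM calls?'},
])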
§04

Example

import opik
from opik.evaluation import evaluate
from opik.evaluation.metrics import Hallucination, AnswerRelevance

@opik.track()
def rag_pipeline(query: str):
    # Retrieve relevant documents (vector_store is any pre-built store
    # with a similarity_search method, e.g. a LangChain vector store)
    docs = vector_store.similarity_search(query, k=3)
    context = '\n'.join([d.page_content for d in docs])

    # Generate answer with context
    from openai import OpenAI
    client = OpenAI()
    response = client.chat.completions.create(
        model='gpt-4o',
        messages=[
            {'role': 'system', 'content': f'Context: {context}'},
            {'role': 'user', 'content': query}
        ]
    )
    return response.choices[0].message.content

# Opik passes each dataset item to the task as a dict and expects
# a dict of outputs back, so wrap the pipeline accordingly
def evaluation_task(item: dict) -> dict:
    return {'output': rag_pipeline(item['input'])}

# Evaluate the RAG pipeline (test_questions is an Opik dataset
# whose items have an 'input' field)
results = evaluate(
    experiment_name='rag-eval-v1',
    dataset=test_questions,
    task=evaluation_task,
    scoring_metrics=[Hallucination(), AnswerRelevance()]
)
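
Once the run finishes, per-item scores and experiment-level aggregates appear in the Opik dashboard under rag-eval-v1, making it straightforward to compare against later experiment runs.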
§05

Common pitfalls

  • Evaluation metrics like Hallucination use an LLM judge, which adds API costs. Run evaluations on representative samples rather than entire datasets to control costs.
  • Tracing in production generates significant data volume. Configure sampling rates for high-traffic applications to keep storage and costs manageable.
  • Custom metrics require understanding of the scoring API. Start with built-in metrics (Hallucination, AnswerRelevance, Moderation) before writing custom evaluators; see the sketch below.
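
When you do outgrow the built-ins, the custom-metric pattern looks roughly like this (a sketch assuming the BaseMetric/ScoreResult API from the Opik docs; the AnswerLength heuristic is purely illustrative):

from opik.evaluation.metrics import base_metric, score_result

class AnswerLength(base_metric.BaseMetric):
    """Toy heuristic metric: rewards concise answers."""

    def __init__(self, name: str = 'answer_length'):
        super().__init__(name=name)

    def score(self, output: str, **kwargs) -> score_result.ScoreResult:
        # 1.0 for answers under 500 characters, scaled down beyond that
        value = min(1.0, 500 / max(len(output), 1))
        return score_result.ScoreResult(name=self.name, value=value)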

Frequently Asked Questions

How does Opik compare to LangSmith?

Both provide LLM tracing and evaluation. Opik is open-source and self-hostable, while LangSmith is a commercial product. Opik works with any LLM library and does not require LangChain. Both offer trace visualization, evaluation frameworks, and production monitoring.

Can I self-host Opik?

Yes. Opik is open-source and can be self-hosted using Docker. The self-hosted version includes the full tracing, evaluation, and dashboard functionality. Comet also offers a cloud-hosted version with additional features and managed infrastructure.
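
A rough sketch of local startup (the launch script and paths change between releases, so treat the repo README as authoritative):

git clone https://github.com/comet-ml/opik.git
cd opik
./opik.sh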

What evaluation metrics does Opik provide?

Opik includes built-in metrics for Hallucination, AnswerRelevance, ContextPrecision, ContextRecall, and Moderation. These use LLM-as-judge patterns to score outputs. You can also define custom metrics using Python functions for domain-specific quality criteria.

Does Opik work with RAG applications?

Yes. Opik is designed with RAG in mind. It traces both the retrieval and generation steps, evaluates retrieval relevance (ContextPrecision, ContextRecall), and checks generation faithfulness (Hallucination). This gives you end-to-end visibility into RAG pipeline quality.

Which LLM providers does Opik support?

Opik works with any LLM provider. The @opik.track() decorator wraps your existing code regardless of provider. It also provides direct integrations with LangChain, LlamaIndex, and OpenAI for automatic tracing without decorators.
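
For example, the OpenAI integration wraps the client so completion calls are traced without any decorator (a minimal sketch):

from openai import OpenAI
from opik.integrations.openai import track_openai

# The wrapped client logs every chat.completions call as a trace,
# including token usage and model metadata
client = track_openai(OpenAI())
response = client.chat.completions.create(
    model='gpt-4o',
    messages=[{'role': 'user', 'content': 'Hello'}]
)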

Source & Thanks

Created by Comet ML. Licensed under Apache-2.0.

opik — ⭐ 18,600+
