
Skills · Apr 3, 2026 · 2 min read

Opik — Debug, Evaluate & Monitor LLM Apps

Trace LLM calls, run automated evaluations, and monitor RAG and agent quality in production. By Comet. 18K+ GitHub stars.

AI Open Source · Community

Agent-ready installation

This asset can be installed after you choose the runtime, review the plan, and run the matching command.

Native · 98/100 · Policy: allow

Agent surface: Any MCP/CLI agent
Type: Skill
Installation: Single
Trust: Established
Entry: opik.md

Direct install command

npx -y tokrepo@latest install a543eba5-fe14-46f3-9aa5-96a5a23b72d0 --target codex

Run this after confirming the plan with a dry run.

TL;DR

Opik traces LLM calls, runs evals, and monitors RAG quality in production.

§01

What it is

Opik is an open-source LLM observability platform by Comet that provides tracing, evaluation, and production monitoring for AI applications. It instruments LLM calls with a single decorator, runs automated quality evaluations on your outputs, and monitors RAG retrieval quality and agent behavior in production.

It targets AI engineers building production LLM applications who need to debug issues, measure output quality systematically, and catch regressions before users report them.

§02

How it saves time or tokens

Opik surfaces the root cause of quality issues faster than manual debugging. The tracing view shows exactly which step in a multi-step chain produced a bad output, with token counts, latency, and cost for each step. Automated evaluations run continuously, so you know when prompt changes improve or degrade quality. For RAG applications, Opik evaluates retrieval relevance and generation faithfulness, identifying where tokens are wasted on irrelevant context.
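To make the per-step view concrete, here is a minimal sketch of the kind of breakdown a trace exposes. It does not call Opik: the `StepRecord` fields, the step names, the numbers, and the per-token prices are all hypothetical, chosen only to show how token counts and latency per step point at the step worth investigating.

```python
from dataclasses import dataclass

# Hypothetical prices in USD per 1,000 tokens; real prices depend on the model.
PRICE_PER_1K_INPUT = 0.0025
PRICE_PER_1K_OUTPUT = 0.01


@dataclass
class StepRecord:
    """One step of a multi-step chain, as it might appear in a trace."""
    name: str
    input_tokens: int
    output_tokens: int
    latency_s: float

    @property
    def cost_usd(self) -> float:
        # Cost is derived from token counts and the assumed prices above.
        return (self.input_tokens * PRICE_PER_1K_INPUT
                + self.output_tokens * PRICE_PER_1K_OUTPUT) / 1000


def most_expensive(steps):
    """Return the step with the highest estimated cost."""
    return max(steps, key=lambda s: s.cost_usd)


def slowest(steps):
    """Return the step with the highest latency."""
    return max(steps, key=lambda s: s.latency_s)


# Made-up records for a three-step RAG chain.
steps = [
    StepRecord("retrieve", 0, 0, 0.12),
    StepRecord("rerank", 1800, 40, 0.60),
    StepRecord("generate", 3200, 450, 2.10),
]

for step in steps:
    print(f"{step.name}: {step.latency_s:.2f}s, ${step.cost_usd:.4f}")
```

In this made-up chain the generation step dominates both cost and latency, so that is where trimming irrelevant context would pay off first.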

§03

How to use

Install and configure:

pip install opik
opik configure

Add tracing with one decorator:

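import opik
from openai import OpenAI

# Create the client once; it reads OPENAI_API_KEY from the environment
client = OpenAI()

@opik.track()
def generate_answer(question: str) -> str:
    # Each call is logged to Opik as a trace with its input and output
    response = client.chat.completions.create(
        model='gpt-4o',
        messages=[{'role': 'user', 'content': question}]
    )
    return response.choices[0].message.content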

Run evaluations on your dataset:

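from opik.evaluation import evaluate
from opik.evaluation.metrics import Hallucination, AnswerRelevance

# evaluate() calls the task with one dataset item (a dict) and expects a dict back.
# The 'input' key is an assumption about how my_dataset is structured; metrics
# that score against retrieved context also expect a 'context' key in the result.
def evaluation_task(item: dict) -> dict:
    return {'input': item['input'], 'output': generate_answer(item['input'])}

results = evaluate(
    experiment_name='qa-v2',
    dataset=my_dataset,
    task=evaluation_task,
    scoring_metrics=[Hallucination(), AnswerRelevance()]
)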

§04

Example

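import opik
from openai import OpenAI
from opik.evaluation import evaluate
from opik.evaluation.metrics import Hallucination, AnswerRelevance

client = OpenAI()

# vector_store and test_questions are assumed to exist already: a vector store
# with a similarity_search() method and an Opik dataset whose items carry an
# 'input' key. Adjust the key to match your own dataset.

@opik.track()
def rag_pipeline(query: str) -> dict:
    # Retrieve relevant documents
    docs = vector_store.similarity_search(query, k=3)
    context = [d.page_content for d in docs]

    # Generate answer with context
    joined = '\n'.join(context)
    response = client.chat.completions.create(
        model='gpt-4o',
        messages=[
            {'role': 'system', 'content': f'Context: {joined}'},
            {'role': 'user', 'content': query}
        ]
    )
    # Return the context too, so the metrics can check the answer against it
    return {'output': response.choices[0].message.content, 'context': context}

# evaluate() passes each dataset item as a dict and expects a dict back
def evaluation_task(item: dict) -> dict:
    return {'input': item['input'], **rag_pipeline(item['input'])}

# Evaluate the RAG pipeline
results = evaluate(
    experiment_name='rag-eval-v1',
    dataset=test_questions,
    task=evaluation_task,
    scoring_metrics=[Hallucination(), AnswerRelevance()]
)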

§05

Related on TokRepo

AI tools for monitoring -- LLM monitoring and observability platforms
AI tools for testing -- Testing and evaluation frameworks for AI

§06

Common pitfalls

Evaluation metrics like Hallucination use an LLM judge, which adds API costs. Run evaluations on representative samples rather than entire datasets to control costs.
Tracing in production generates significant data volume. Configure sampling rates for high-traffic applications to keep storage and costs manageable.
Custom metrics require understanding of the scoring API. Start with built-in metrics (Hallucination, AnswerRelevance, Moderation) before writing custom evaluators.
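The first two pitfalls both come down to sampling, which can be sketched with the standard library alone. The helpers below are hypothetical and are not part of Opik's API: `sample_for_eval` picks a reproducible subset of evaluation items, and `should_trace` decides per request whether to record a trace at a given rate. How you wire either one into your own pipeline is left open.

```python
import hashlib
import random


def sample_for_eval(items, n, seed=0):
    """Return up to n items, chosen reproducibly for a given seed."""
    items = list(items)
    if n >= len(items):
        return items
    # A seeded generator keeps the subset stable across evaluation runs,
    # so scores from different prompt versions stay comparable.
    return random.Random(seed).sample(items, n)


def should_trace(request_id: str, rate: float) -> bool:
    """Deterministically keep roughly `rate` (0.0 to 1.0) of requests."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


subset = sample_for_eval(range(1000), 50, seed=42)
print(len(subset))
print(should_trace("request-123", 1.0))
```

Hashing the request ID rather than calling a random generator means the same request always gets the same decision, which helps when one request fans out across several services.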

Frequently asked questions

How does Opik compare to LangSmith?

Both provide LLM tracing and evaluation. Opik is open-source and self-hostable, while LangSmith is a commercial product. Opik works with any LLM library and does not require LangChain. Both offer trace visualization, evaluation frameworks, and production monitoring.

Can I self-host Opik?

Yes. Opik is open-source and can be self-hosted using Docker. The self-hosted version includes the full tracing, evaluation, and dashboard functionality. Comet also offers a cloud-hosted version with additional features and managed infrastructure.

What evaluation metrics does Opik provide?

Opik includes built-in metrics for Hallucination, AnswerRelevance, ContextPrecision, ContextRecall, and Moderation. These use LLM-as-judge patterns to score outputs. You can also define custom metrics using Python functions for domain-specific quality criteria.
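As an illustration of a domain-specific criterion, here is a minimal sketch of a scoring function written in plain Python. The function name, its parameters, and the example terms are hypothetical, and registering it with Opik's metric classes is not shown; check the Opik documentation for the exact custom-metric interface.

```python
def keyword_coverage(output: str, required_terms: list[str]) -> float:
    """Score in [0, 1]: the fraction of required terms found in the output."""
    if not required_terms:
        # Nothing is required, so the output trivially passes.
        return 1.0
    text = output.lower()
    hits = sum(1 for term in required_terms if term.lower() in text)
    return hits / len(required_terms)


# Hypothetical support-bot check: answers should mention both terms.
score = keyword_coverage("Your refund arrives within 5 days.", ["refund", "days"])
print(score)
```

A deterministic check like this costs nothing to run, so it can cover an entire dataset while LLM-judged metrics run on a smaller sample.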

Does Opik work with RAG applications?

Yes. Opik is designed with RAG in mind. It traces both the retrieval and generation steps, evaluates retrieval relevance (ContextPrecision, ContextRecall), and checks generation faithfulness (Hallucination). This gives you end-to-end visibility into RAG pipeline quality.
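For intuition about what retrieval precision and recall measure, here is a simplified sketch based on document IDs. It is not how Opik computes ContextPrecision or ContextRecall, which are LLM-judged; it assumes you have hand-labeled the relevant document IDs for each query, and the function name is hypothetical.

```python
def retrieval_precision_recall(retrieved_ids, relevant_ids):
    """Return (precision, recall) for one query, using set overlap of IDs."""
    retrieved = set(retrieved_ids)
    relevant = set(relevant_ids)
    hits = len(retrieved & relevant)
    # Precision: how much of what was retrieved is actually relevant.
    precision = hits / len(retrieved) if retrieved else 0.0
    # Recall: how much of the relevant material was retrieved.
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


precision, recall = retrieval_precision_recall(["a", "b", "c"], ["a", "d"])
print(precision, recall)
```

Low precision means tokens are being spent on irrelevant context, while low recall means the generator never sees material it needs.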

Which LLM providers does Opik support?

Opik works with any LLM provider. The @opik.track() decorator wraps your existing code regardless of provider. It also provides direct integrations with LangChain, LlamaIndex, and OpenAI for automatic tracing without decorators.



Source and acknowledgments

Created by Comet ML. Licensed under Apache-2.0.

opik — ⭐ 18,600+

