Opik — Debug, Evaluate & Monitor LLM Apps
Trace LLM calls, run automated evaluations, and monitor RAG and agent quality in production. By Comet. 18K+ GitHub stars.
Agent-ready installation
This asset can be installed after choosing the runtime, reviewing the plan, and running the corresponding command.
npx -y tokrepo@latest install a543eba5-fe14-46f3-9aa5-96a5a23b72d0 --target codex
Run the command after confirming the plan with a dry run.
What it is
Opik is an open-source LLM observability platform by Comet that provides tracing, evaluation, and production monitoring for AI applications. It instruments LLM calls with a single decorator, runs automated quality evaluations on your outputs, and monitors RAG retrieval quality and agent behavior in production.
It targets AI engineers building production LLM applications who need to debug issues, measure output quality systematically, and catch regressions before users report them.
How it saves time or tokens
Opik surfaces the root cause of quality issues faster than manual debugging. The tracing view shows exactly which step in a multi-step chain produced a bad output, with token counts, latency, and cost for each step. Automated evaluations run continuously, so you know when prompt changes improve or degrade quality. For RAG applications, Opik evaluates retrieval relevance and generation faithfulness, identifying where tokens are wasted on irrelevant context.
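To make the per-step view concrete, here is a minimal standard-library sketch of what a tracing decorator records. It is a toy stand-in, not Opik's implementation or API: `toy_track`, `SPANS`, and the three pipeline functions are hypothetical names, and the `time.sleep` calls simulate retrieval and LLM latency.

```python
import functools
import time
from contextvars import ContextVar

# Hypothetical in-memory span store; a real tracer sends spans to a backend.
SPANS: list[dict] = []
_depth: ContextVar[int] = ContextVar("depth", default=0)


def toy_track(func):
    """Record name, nesting depth, and latency of each call (toy stand-in)."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        depth = _depth.get()
        token = _depth.set(depth + 1)  # calls made inside are one level deeper
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000
            _depth.reset(token)
            SPANS.append(
                {"name": func.__name__, "depth": depth, "latency_ms": elapsed_ms}
            )
    return wrapper


@toy_track
def retrieve(query: str) -> list[str]:
    time.sleep(0.01)  # simulate a vector store lookup
    return ["doc about " + query]


@toy_track
def generate(query: str, docs: list[str]) -> str:
    time.sleep(0.03)  # simulate a slower LLM call
    return f"Answer to {query!r} using {len(docs)} document(s)"


@toy_track
def pipeline(query: str) -> str:
    return generate(query, retrieve(query))


def slowest_step(spans: list[dict]) -> str:
    """Return the name of the slowest nested step (depth > 0)."""
    steps = [s for s in spans if s["depth"] > 0]
    return max(steps, key=lambda s: s["latency_ms"])["name"]


answer = pipeline("tracing")
print(slowest_step(SPANS))
```

Each span keeps its own latency and nesting depth, which is the information a trace view needs to point at one step in a chain rather than at the whole request. Token counts and cost would be further fields on the same record.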
How to use
- Install and configure:
pip install opik
opik configure
- Add tracing with one decorator:
import opik
from openai import OpenAI

@opik.track()
def generate_answer(question: str):
    client = OpenAI()
    response = client.chat.completions.create(
        model='gpt-4o',
        messages=[{'role': 'user', 'content': question}]
    )
    return response.choices[0].message.content
- Run evaluations on your dataset:
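from opik.evaluation import evaluate
from opik.evaluation.metrics import Hallucination, AnswerRelevance

# evaluate() calls the task with one dataset item (a dict) and expects a dict back.
# Assumption: each item in my_dataset stores the question under the 'input' key.
def evaluation_task(item: dict) -> dict:
    return {'output': generate_answer(item['input'])}

results = evaluate(
    experiment_name='qa-v2',
    dataset=my_dataset,
    task=evaluation_task,
    scoring_metrics=[Hallucination(), AnswerRelevance()]
)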
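The data flow of an evaluation run can be sketched without any external service. The code below is a simplified illustration, not Opik's `evaluate`: `run_evaluation`, `exact_match`, `stub_task`, and the two-item dataset are hypothetical, and the stub replaces a real LLM call so the example runs offline.

```python
from statistics import mean
from typing import Callable


def exact_match(output: str, expected_output: str, **_ignored) -> float:
    """Toy metric: 1.0 when the answer equals the reference, ignoring case and spaces."""
    return float(output.strip().lower() == expected_output.strip().lower())


def run_evaluation(
    dataset: list[dict],
    task: Callable[[dict], dict],
    metrics: list[Callable[..., float]],
) -> dict[str, float]:
    """Apply the task to every item, score each result, and average per metric."""
    scores: dict[str, list[float]] = {m.__name__: [] for m in metrics}
    for item in dataset:
        # Each metric sees the dataset fields plus whatever the task returned.
        result = {**item, **task(item)}
        for metric in metrics:
            scores[metric.__name__].append(metric(**result))
    return {name: mean(values) for name, values in scores.items()}


# Hypothetical dataset and canned answers standing in for an LLM.
dataset = [
    {"input": "Capital of France?", "expected_output": "Paris"},
    {"input": "Capital of Japan?", "expected_output": "Tokyo"},
]
canned = {"Capital of France?": "paris", "Capital of Japan?": "Kyoto"}


def stub_task(item: dict) -> dict:
    return {"output": canned[item["input"]]}


summary = run_evaluation(dataset, stub_task, [exact_match])
print(summary)
```

Keeping the task as a function from one dataset item to a dict of outputs lets the same dataset be reused across prompt versions, so only the task changes between experiments.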
Example
import opik
from openai import OpenAI
from opik.evaluation import evaluate
from opik.evaluation.metrics import Hallucination, AnswerRelevance

# vector_store and test_questions are placeholders for your own retriever and dataset.

@opik.track()
def rag_pipeline(query: str):
    # Retrieve relevant documents
    docs = vector_store.similarity_search(query, k=3)
    context = '\n'.join([d.page_content for d in docs])
    # Generate answer with context
    client = OpenAI()
    response = client.chat.completions.create(
        model='gpt-4o',
        messages=[
            {'role': 'system', 'content': f'Context: {context}'},
            {'role': 'user', 'content': query}
        ]
    )
    return response.choices[0].message.content

# evaluate() passes each dataset item (a dict) to the task and expects a dict back.
# Assumption: each item in test_questions stores the question under the 'input' key.
def evaluation_task(item: dict) -> dict:
    return {'output': rag_pipeline(item['input'])}

# Evaluate the RAG pipeline
results = evaluate(
    experiment_name='rag-eval-v1',
    dataset=test_questions,
    task=evaluation_task,
    scoring_metrics=[Hallucination(), AnswerRelevance()]
)
Related on TokRepo
- AI tools for monitoring -- LLM monitoring and observability platforms
- AI tools for testing -- Testing and evaluation frameworks for AI
Common pitfalls
- Evaluation metrics like Hallucination use an LLM judge, which adds API costs. Run evaluations on representative samples rather than entire datasets to control costs.
- Tracing in production generates significant data volume. Configure sampling rates for high-traffic applications to keep storage and costs manageable.
- Custom metrics require understanding of the scoring API. Start with built-in metrics (Hallucination, AnswerRelevance, Moderation) before writing custom evaluators.
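The first two pitfalls both come down to sampling. The sketch below shows one way to do it in application code, under stated assumptions: `sample_for_evaluation` and `should_trace` are hypothetical helpers, not Opik settings, and a simple random sample is only representative if the dataset is not heavily skewed.

```python
import hashlib
import random


def sample_for_evaluation(items: list[dict], k: int, seed: int = 0) -> list[dict]:
    """Pick a reproducible random subset so LLM-judge metrics run on k items."""
    if k >= len(items):
        return list(items)
    return random.Random(seed).sample(items, k)


def should_trace(request_id: str, sample_rate: float) -> bool:
    """Deterministically keep about `sample_rate` of requests, keyed on the id."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform value in [0, 1)
    return bucket < sample_rate


items = [{"input": f"question {i}"} for i in range(1000)]
subset = sample_for_evaluation(items, k=50, seed=42)
kept = sum(should_trace(f"req-{i}", 0.1) for i in range(10_000))
print(len(subset), kept)
```

Hashing the request id, rather than calling a random number generator per request, means every service that sees the same id makes the same keep-or-drop decision, so sampled traces stay complete across steps.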
Frequently asked questions
How does Opik compare to LangSmith?
Both provide LLM tracing and evaluation, with trace visualization, evaluation frameworks, and production monitoring. Opik is open-source and self-hostable, while LangSmith is a commercial product. Opik works with any LLM library and does not require LangChain.
Can Opik be self-hosted?
Yes. Opik is open-source and can be self-hosted using Docker. The self-hosted version includes the full tracing, evaluation, and dashboard functionality. Comet also offers a cloud-hosted version with additional features and managed infrastructure.
Which evaluation metrics does Opik include?
Opik includes built-in metrics for Hallucination, AnswerRelevance, ContextPrecision, ContextRecall, and Moderation. These use LLM-as-judge patterns to score outputs. You can also define custom metrics using Python functions for domain-specific quality criteria.
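As an illustration of a domain-specific criterion written as a plain Python function, the sketch below scores how many sentences in an answer carry a bracketed citation such as `[1]`. The function name and the citation convention are assumptions made for this example, and the code does not show how a metric is registered with Opik's scoring API.

```python
import re


def citation_coverage(output: str, **_ignored) -> float:
    """Fraction of sentences in `output` that end with a citation like [1]."""
    sentences = [
        s.strip()
        for s in re.split(r"(?<=[.!?])\s+", output.strip())
        if s.strip()
    ]
    if not sentences:
        return 0.0
    cited = sum(1 for s in sentences if re.search(r"\[\d+\]\s*[.!?]?$", s))
    return cited / len(sentences)


text = "Opik is open source [1]. It supports tracing. Evaluations use judges [2]."
score = citation_coverage(text)
print(round(score, 2))
```

A deterministic check like this costs no judge-model calls, so it can run on every output while LLM-judged metrics run on a sample.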
Does Opik support RAG evaluation?
Yes. Opik is designed with RAG in mind. It traces both the retrieval and generation steps, evaluates retrieval relevance (ContextPrecision, ContextRecall), and checks generation faithfulness (Hallucination). This gives you end-to-end visibility into RAG pipeline quality.
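The idea behind retrieval precision and recall can be shown with labeled document ids. This is a simplified, set-based sketch with hypothetical function names and ids; it is not how Opik's LLM-judged ContextPrecision and ContextRecall metrics are computed.

```python
def context_precision(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    """Share of retrieved chunks that are actually relevant."""
    if not retrieved_ids:
        return 0.0
    hits = sum(1 for doc_id in retrieved_ids if doc_id in relevant_ids)
    return hits / len(retrieved_ids)


def context_recall(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    """Share of relevant chunks that were retrieved."""
    if not relevant_ids:
        return 0.0  # recall is undefined with no relevant chunks; reported as 0.0 here
    return len(set(retrieved_ids) & relevant_ids) / len(relevant_ids)


retrieved = ["d1", "d7", "d3"]          # what the retriever returned for k=3
relevant = {"d1", "d3", "d4", "d9"}     # hand-labeled ground truth
precision = context_precision(retrieved, relevant)
recall = context_recall(retrieved, relevant)
print(round(precision, 2), recall)
```

Low precision means tokens are spent on irrelevant context, while low recall means the generator never sees material it needs, so the two point at different fixes.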
Which LLM providers does Opik work with?
Opik works with any LLM provider. The @opik.track() decorator wraps your existing code regardless of provider. It also provides direct integrations with LangChain, LlamaIndex, and OpenAI for automatic tracing without decorators.
References (3)
- Opik GitHub Repository: Opik is an open-source LLM observability platform by Comet
- Opik Documentation: Opik provides built-in evaluation metrics for hallucination and relevance
- Judging LLM-as-a-Judge Paper: LLM-as-judge evaluation patterns for automated output quality assessment
Related assets
TruLens — Evaluate and Track LLM Apps
Instrument LLM apps and run systematic evals for RAG quality and regressions to find failure modes fast. Combine tracing and scorecards in one workflow.
Ragas — Evaluate RAG & LLM Applications
Ragas evaluates LLM applications with objective metrics, test data generation, and data-driven insights. 13.2K+ GitHub stars. RAG evaluation, auto test generation. Apache 2.0.
Evidently — ML & LLM Monitoring with 100+ Metrics
Evaluate, test, and monitor AI systems with 100+ built-in metrics for data drift, model quality, and LLM output. 7.3K+ stars.
Weave — Trace and Debug LLM Apps
Weave adds tracing to LLM apps with `@weave.op`. Install `weave`, call `weave.init()`, then track inputs/outputs across API calls and validation steps.