Workflows · Apr 3, 2026 · 2 min read

Evidently — ML & LLM Monitoring with 100+ Metrics

Evaluate, test, and monitor AI systems with 100+ built-in metrics for data drift, model quality, and LLM output. 7.3K+ stars.

TL;DR
Evidently provides 100+ metrics for monitoring ML and LLM application quality.
§01

What it is

Evidently is an open-source Python library for monitoring machine learning models and LLM applications. It provides over 100 built-in metrics covering data drift, model quality, classification and regression performance, and text analysis. You can generate reports as interactive HTML dashboards, run tests as part of CI/CD pipelines, and monitor production models in real time.

It targets ML engineers and data scientists who need to track model performance after deployment and catch degradation early.

§02

How it saves time or tokens

Evidently automates the monitoring work that teams often do manually with custom scripts. Instead of writing drift detection, quality metrics, and visualization code, you configure a preset and get a comprehensive report. For LLM applications, the text metrics analyze output quality (length, sentiment, toxicity, patterns) without building custom evaluation pipelines. The test suite integration catches regressions in CI before they reach production.

§03

How to use

  1. Install the library:
pip install evidently
  2. Generate a data drift report:
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset
import pandas as pd

reference = pd.read_csv('training_data.csv')
current = pd.read_csv('production_data.csv')

report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference, current_data=current)
report.save_html('drift_report.html')
  3. Run automated tests:
from evidently.test_suite import TestSuite
from evidently.test_preset import DataDriftTestPreset

suite = TestSuite(tests=[DataDriftTestPreset()])
suite.run(reference_data=reference, current_data=current)
print(suite)  # Pass/Fail for each test
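To fail a CI build on these results, a common pattern is to inspect the suite's dictionary output. A minimal sketch, assuming the `tests`/`status` shape that `TestSuite.as_dict()` has returned in recent Evidently releases (treat the field names here as an assumption, not a contract):

```python
import sys

def gate_on_results(results: dict) -> int:
    """Map an Evidently-style results dict to a CI exit code.

    Assumes a 'tests' list whose entries carry 'status'
    ('SUCCESS' or 'FAIL'); the field names are an assumption.
    """
    failed = [t for t in results.get("tests", []) if t.get("status") == "FAIL"]
    for t in failed:
        print(f"FAILED: {t.get('name', '<unnamed>')}")
    return 1 if failed else 0

# Hypothetical results dict for illustration
sample = {"tests": [
    {"name": "Drift per Column", "status": "SUCCESS"},
    {"name": "Share of Drifted Columns", "status": "FAIL"},
]}
exit_code = gate_on_results(sample)
# In a real pipeline: sys.exit(gate_on_results(suite.as_dict()))
```

A nonzero exit code is what CI runners such as GitHub Actions or Jenkins treat as a failed step, so this single call is enough to block a merge on drift.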
§04

Example

from evidently.report import Report
from evidently.metric_preset import TextEvals
from evidently import ColumnMapping
import pandas as pd

# Monitor LLM output quality
llm_outputs = pd.DataFrame({
    'prompt': ['Summarize this article', 'Write a haiku', 'Explain recursion'],
    'response': ['The article discusses...', 'Autumn leaves falling...', 'Recursion is when...']
})

column_mapping = ColumnMapping(
    text_features=['response']
)

report = Report(metrics=[TextEvals(column_name='response')])
report.run(current_data=llm_outputs, column_mapping=column_mapping)
report.save_html('llm_quality_report.html')
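Under the hood, text presets compute per-row descriptors such as length and pattern matches. A hand-rolled sketch of the same idea (plain Python for illustration, not Evidently's implementation; the refusal-pattern regex is an example of my own):

```python
import re

def basic_text_descriptors(responses):
    """Compute simple per-response descriptors in the spirit of
    the length and pattern checks in text evaluation presets."""
    rows = []
    for text in responses:
        rows.append({
            "chars": len(text),                     # character count
            "words": len(text.split()),             # whitespace-split word count
            "mentions_refusal": bool(               # example pattern check
                re.search(r"\b(cannot|can't|won't)\b", text, re.I)
            ),
        })
    return rows

stats = basic_text_descriptors([
    "The article discusses...",
    "I cannot help with that request.",
])
```

Tracking even descriptors this simple over time surfaces regressions like responses suddenly getting shorter or refusal rates climbing after a prompt change.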
§05

Common pitfalls

  • Data drift detection requires a reference dataset. Without a good reference (typically your training data), drift alerts may be noisy or misleading.
  • The 100+ metrics can be overwhelming. Start with presets (DataDriftPreset, DataQualityPreset) and add individual metrics only when you need specific insights.
  • Interactive HTML reports grow large for big datasets. For production monitoring, use the Evidently monitoring UI or export metrics to Grafana instead of HTML files.

Frequently Asked Questions

What types of drift does Evidently detect?

Evidently detects data drift (changes in feature distributions), target drift (changes in label distribution), concept drift (changes in the relationship between features and targets), and prediction drift. It uses statistical tests (KS test, PSI, Jensen-Shannon divergence) with configurable thresholds.
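For intuition, PSI can be computed by hand: bin both samples, then sum (current − reference) × log(current / reference) over the bin frequencies. A minimal sketch assuming equal-width bins over the reference range (Evidently's own binning and defaults may differ):

```python
import math

def psi(reference, current, bins=10):
    """Population Stability Index between two numeric samples.

    Equal-width bins over the reference range; a small epsilon
    avoids log(0). Illustrative only, not Evidently's implementation.
    """
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0

    def bin_fracs(sample):
        counts = [0] * bins
        for x in sample:
            i = min(max(int((x - lo) / width), 0), bins - 1)
            counts[i] += 1
        return [max(c / len(sample), 1e-6) for c in counts]

    ref, cur = bin_fracs(reference), bin_fracs(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))

same = [i / 100 for i in range(100)]
shifted = [x + 0.5 for x in same]
# Identical samples yield PSI of 0; the shifted sample yields a large PSI
```

A common rule of thumb treats PSI below 0.1 as stable and above 0.2 as significant drift; as noted above, Evidently exposes such thresholds as configurable parameters.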

Can Evidently monitor LLM applications?

Yes. Evidently includes text-specific metrics that analyze LLM outputs for length, sentiment, toxicity, and pattern detection. You can monitor prompt-response pairs over time, detect output quality degradation, and set up alerts when text metrics cross thresholds.

Does Evidently integrate with CI/CD pipelines?

Yes. The TestSuite API returns pass/fail results for each metric test. You can run test suites in CI/CD pipelines (GitHub Actions, Jenkins, GitLab CI) and fail the build if data quality or model performance drops below thresholds.

How does Evidently compare to Weights and Biases?

Weights and Biases focuses on experiment tracking during model development. Evidently focuses on post-deployment monitoring and data quality testing. They complement each other: use W&B during training and Evidently for production monitoring.

Can I visualize Evidently metrics in Grafana?

Yes. Evidently provides a monitoring UI and can export metrics to Prometheus format for Grafana dashboards. This lets you integrate ML monitoring alongside your existing infrastructure monitoring in a single dashboard.


Source & Thanks

Created by Evidently AI. Licensed under Apache-2.0.

evidently — ⭐ 7,300+
