Evidently — ML & LLM Monitoring with 100+ Metrics
Evaluate, test, and monitor AI systems with 100+ built-in metrics for data drift, model quality, and LLM outputs. 7.3K+ stars.
What it is
Evidently is an open-source Python library for monitoring machine learning models and LLM applications. It provides over 100 built-in metrics covering data drift, model quality, classification and regression performance, and text analysis. You can generate reports as interactive HTML dashboards, run tests as part of CI/CD pipelines, and monitor production models in real time.
It targets ML engineers and data scientists who need to track model performance after deployment and catch degradation early.
How it saves time or tokens
Evidently automates monitoring work that teams often do by hand with custom scripts. Instead of writing your own drift detection, quality metrics, and visualization code, you configure a preset and get a comprehensive report. For LLM applications, the text metrics analyze output quality (length, sentiment, toxicity, patterns) without requiring a custom evaluation pipeline. The test-suite integration catches regressions in CI before they reach production (see the exit-code sketch under "How to use" below).
How to use
- Install the library:
pip install evidently
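These snippets use the classic Report / TestSuite imports; later Evidently releases reorganized the top-level API, so pin a compatible version if the imports below fail (the 0.7 boundary here is an assumption, not confirmed by this page):
pip install "evidently<0.7"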
- Generate a data drift report:
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset
import pandas as pd
reference = pd.read_csv('training_data.csv')    # baseline the model was trained on
current = pd.read_csv('production_data.csv')    # recent production traffic
report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference, current_data=current)
report.save_html('drift_report.html')
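Beyond the HTML file, the same report can be read programmatically. A minimal sketch, assuming the legacy as_dict() layout, where the preset's dataset-level drift metric comes first (metric order and key names are worth verifying on your version):
summary = report.as_dict()
drift_result = summary['metrics'][0]['result']       # dataset-level drift metric
print(drift_result.get('dataset_drift'))             # overall drift verdict (bool)
print(drift_result.get('share_of_drifted_columns'))  # fraction of columns flagged as drifted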
- Run automated tests:
from evidently.test_suite import TestSuite
from evidently.test_preset import DataDriftTestPreset
suite = TestSuite(tests=[DataDriftTestPreset()])
suite.run(reference_data=reference, current_data=current)
suite.save_html('test_report.html')  # pass/fail result for each test
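To gate a CI job on the result, turn the suite into an exit code. A sketch assuming the legacy as_dict() output, where each test entry carries name and status fields (status values include SUCCESS, FAIL, WARNING, ERROR):
import sys
results = suite.as_dict()
# Treat anything other than SUCCESS or WARNING as a build failure.
failed = [t['name'] for t in results['tests'] if t['status'] not in ('SUCCESS', 'WARNING')]
if failed:
    print('Failing tests:', failed)
    sys.exit(1)  # non-zero exit fails the CI step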
Example
from evidently.report import Report
from evidently.metric_preset import TextEvals
from evidently import ColumnMapping
import pandas as pd
# Monitor LLM output quality
llm_outputs = pd.DataFrame({
'prompt': ['Summarize this article', 'Write a haiku', 'Explain recursion'],
'response': ['The article discusses...', 'Autumn leaves falling...', 'Recursion is when...']
})
column_mapping = ColumnMapping(
text_features=['response']
)
report = Report(metrics=[TextEvals(column_name='response')])
report.run(reference_data=None, current_data=llm_outputs, column_mapping=column_mapping)  # no reference set: evaluate current outputs only
report.save_html('llm_quality_report.html')
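By default TextEvals computes a standard set of text statistics; you can also pass explicit descriptors. A minimal sketch, assuming the TextLength and Sentiment descriptors from the legacy evidently.descriptors module:
from evidently.descriptors import TextLength, Sentiment
report = Report(metrics=[
    TextEvals(column_name='response', descriptors=[
        TextLength(),  # character count per response
        Sentiment(),   # sentiment score per response
    ])
])
report.run(reference_data=None, current_data=llm_outputs, column_mapping=column_mapping)
report.save_html('llm_descriptor_report.html')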
Related on TokRepo
- AI tools for monitoring -- ML and LLM monitoring platforms
- AI tools for testing -- Automated testing and evaluation tools
Common pitfalls
- Data drift detection requires a reference dataset. Without a good reference (typically your training data), drift alerts may be noisy or misleading (a windowing sketch follows this list).
- The 100+ metrics can be overwhelming. Start with presets (DataDriftPreset, DataQualityPreset) and add individual metrics only when you need specific insights.
- Interactive HTML reports grow large for big datasets. For production monitoring, use the Evidently monitoring UI or export metrics to Grafana instead of HTML files.
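For the reference-data pitfall above, a common pattern is to freeze the reference window (typically the training period) and slide the current window over recent traffic. A plain-pandas sketch; the file and column names are illustrative, not part of Evidently:
import pandas as pd
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

log = pd.read_csv('inference_log.csv', parse_dates=['timestamp'])  # hypothetical prediction log
reference = log[log['timestamp'] < '2024-01-01']   # frozen reference: the training period
cutoff = log['timestamp'].max() - pd.Timedelta(days=7)
current = log[log['timestamp'] >= cutoff]          # rolling window: the last 7 days

report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference, current_data=current)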
Frequently Asked Questions
What types of drift does Evidently detect?
Evidently detects data drift (changes in feature distributions), target drift (changes in the label distribution), concept drift (changes in the relationship between features and targets), and prediction drift. It uses statistical tests (KS test, PSI, Jensen-Shannon divergence) with configurable thresholds.
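The test and threshold are configurable per preset. A sketch, assuming the legacy DataDriftPreset accepts stattest and stattest_threshold arguments:
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

# Force PSI with a 0.2 threshold instead of the auto-selected test.
report = Report(metrics=[DataDriftPreset(stattest='psi', stattest_threshold=0.2)])
report.run(reference_data=reference, current_data=current)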
Can Evidently monitor LLM applications?
Yes. Evidently includes text-specific metrics that analyze LLM outputs for length, sentiment, toxicity, and pattern detection. You can monitor prompt-response pairs over time, detect output quality degradation, and set up alerts when text metrics cross thresholds.
Can Evidently run in CI/CD pipelines?
Yes. The TestSuite API returns pass/fail results for each metric test. You can run test suites in CI/CD pipelines (GitHub Actions, Jenkins, GitLab CI) and fail the build if data quality or model performance drops below thresholds.
How does Evidently compare to Weights & Biases?
Weights & Biases focuses on experiment tracking during model development. Evidently focuses on post-deployment monitoring and data quality testing. They complement each other -- use W&B during training and Evidently for production monitoring.
Does Evidently integrate with Grafana and Prometheus?
Yes. Evidently provides a monitoring UI and can export metrics to Prometheus format for Grafana dashboards. This lets you integrate ML monitoring alongside your existing infrastructure monitoring in a single dashboard.
Citations (3)
- Evidently GitHub Repository -- Evidently provides 100+ metrics for ML and LLM monitoring
- Evidently Documentation -- Evidently supports data drift, model quality, and text analysis metrics
- Google ML Best Practices -- Data drift detection is essential for maintaining ML model quality in production
Source & Thanks
Created by Evidently AI. Licensed under Apache-2.0.
evidently — ⭐ 7,300+
Related Assets
DTM — Distributed Transaction Manager for Microservices
A cross-language distributed transaction framework supporting Saga, TCC, XA, and two-phase message patterns for reliable microservice coordination.
WatermelonDB — Reactive Database for React Native Apps
A high-performance reactive database framework for React Native and React web apps, built on top of SQLite with lazy loading and sync primitives.
Dexie.js — Minimalist IndexedDB Wrapper for the Web
A lightweight wrapper around IndexedDB that provides a clean Promise-based API for client-side storage in web applications.