Great Expectations — Test Your Data Like You Test Code
The Problem
Bad data silently breaks ML models. A training dataset with null values, outliers, or schema changes can waste days of compute and produce unreliable models. Most teams don't catch data issues until after the damage is done.
The Solution
Great Expectations brings software testing practices to data. Write expectations (assertions) about your data, run them automatically in your pipeline, and get clear reports when something is wrong.
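The core idea can be pictured in plain Python (illustrative names only, not the real Great Expectations API): instead of a hard `assert` that crashes mid-pipeline, an expectation reports a structured result your pipeline can inspect, log, and act on.

```python
# Conceptual sketch only -- illustrative names, not the Great Expectations API.
def expect_no_nulls(column):
    """A data 'unit test': report a structured result instead of raising."""
    null_positions = [i for i, value in enumerate(column) if value is None]
    return {"success": not null_positions, "unexpected_count": len(null_positions)}

# A failing expectation yields a report rather than an exception:
print(expect_no_nulls(["a@x.com", None, "b@y.com"]))
# {'success': False, 'unexpected_count': 1}
```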
Key Features
- 300+ built-in expectations — null checks, range validation, regex matching, statistical tests
- Auto-profiling — automatically generate expectations from sample data
- Data Docs — auto-generated HTML documentation of your data quality
- Multiple backends — Pandas, Spark, SQLAlchemy (PostgreSQL, MySQL, BigQuery, etc.)
- Pipeline integration — works with Airflow, Dagster, Prefect, dbt
- Checkpoint system — schedule validation runs and get alerts on failures
- Custom expectations — write your own domain-specific validations
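The "custom expectations" bullet can be sketched standalone (plain Python, not the real GX custom-expectation plugin API): a domain-specific check that returns the kind of result payload GX validation reports carry, such as element counts and a sample of offending values.

```python
# Standalone sketch of a domain-specific check -- not the real GX plugin API.
def expect_values_to_be_valid_ports(values):
    """Flag entries outside the valid TCP/UDP port range 1-65535."""
    unexpected = [v for v in values if not (isinstance(v, int) and 1 <= v <= 65535)]
    element_count = len(values)
    return {
        "success": not unexpected,
        "result": {
            "element_count": element_count,
            "unexpected_count": len(unexpected),
            "unexpected_percent": 100 * len(unexpected) / element_count if element_count else 0.0,
            "partial_unexpected_list": unexpected[:5],  # sample, not the full list
        },
    }
```

The fields here mirror the general shape of GX validation results (a success flag plus counts and a partial list of unexpected values), which is what makes failures easy to debug from a report.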
Quick Start
```shell
pip install great_expectations
great_expectations init
```

Common Expectations
```python
import great_expectations as gx

# `batch` is a validator/batch object obtained from your Data Context
# (e.g. via a batch request); see the Great Expectations docs for setup.

# Column-level checks
batch.expect_column_values_to_not_be_null("email")
batch.expect_column_values_to_be_unique("user_id")
batch.expect_column_values_to_be_between("price", min_value=0)
batch.expect_column_values_to_match_regex("email", r"^[\w.]+@[\w.]+\.\w+$")

# Table-level checks
batch.expect_table_row_count_to_be_between(min_value=1000, max_value=1000000)
batch.expect_table_columns_to_match_ordered_list(["id", "name", "email", "created_at"])

# Statistical checks
batch.expect_column_mean_to_be_between("age", min_value=18, max_value=65)
batch.expect_column_stdev_to_be_between("score", min_value=0, max_value=30)
```

Integration with AI/ML Pipelines
```python
# In your training pipeline: gate model training on data quality
checkpoint = context.add_or_update_checkpoint(
    name="training_data_check",
    validations=[{
        "batch_request": training_batch_request,
        "expectation_suite_name": "training_data_suite",
    }],
)

result = checkpoint.run()
if not result.success:
    raise ValueError("Training data failed quality checks!")
```

FAQ
Q: What is Great Expectations? A: An open-source data validation framework that lets you write expressive assertions about your data, catching quality issues before they break AI/ML pipelines.
Q: Is Great Expectations free? A: The open-source core is free under Apache-2.0. There is also a managed cloud version (GX Cloud) with additional features.
Q: What data sources does it support? A: Pandas DataFrames, Spark, PostgreSQL, MySQL, BigQuery, Snowflake, Databricks, Redshift, and more via SQLAlchemy.