# Great Expectations: Data Validation for AI Pipelines

> Test your data like you test code. Validate data quality in AI/ML pipelines with expressive assertions, auto-profiling, and data docs. Apache-2.0, 11,400+ stars.

## Install

Save the content below to `.claude/skills/` or append to your `CLAUDE.md`:

## Quick Use

1. Install. The examples on this page use the `great_expectations` CLI and the `context.sources` entry point, both of which were removed or renamed in GX 1.0, so pin a pre-1.0 release:

```bash
pip install "great_expectations<1.0"
```

2. Initialize in your project:

```bash
great_expectations init
```

3. Create your first expectation:

```python
import great_expectations as gx

context = gx.get_context()

# Register an in-memory pandas DataFrame (`your_dataframe`) as a data asset
ds = context.sources.add_pandas("my_datasource")
asset = ds.add_dataframe_asset(name="my_data")
batch_request = asset.build_batch_request(dataframe=your_dataframe)

# Expectations are declared on a Validator bound to the batch and a suite
context.add_or_update_expectation_suite(expectation_suite_name="my_suite")
validator = context.get_validator(batch_request=batch_request, expectation_suite_name="my_suite")
validator.expect_column_values_to_not_be_null("user_id")
validator.expect_column_values_to_be_between("age", min_value=0, max_value=150)
```

---

## Intro

Great Expectations is a widely used open-source data validation framework with 11,400+ GitHub stars. It lets you write expressive tests for your data, much like unit tests for code, so that data quality issues are caught before they break your AI/ML pipelines. It features auto-profiling, 300+ built-in expectations, and auto-generated data documentation.

Best for data engineers and ML practitioners building production data pipelines who need reliable data quality checks. Works with Pandas, Spark, SQL databases, and cloud data warehouses.

See also: [AI pipeline tools on TokRepo](https://tokrepo.com/en/@Script%20Depot).

---

## Great Expectations: Test Your Data Like You Test Code

### The Problem

Bad data silently breaks ML models. A training dataset with null values, outliers, or schema changes can waste days of compute and produce unreliable models. Most teams don't catch data issues until after the damage is done.

### The Solution

Great Expectations brings software testing practices to data. Write expectations (assertions) about your data, run them automatically in your pipeline, and get clear reports when something is wrong.
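
To make the idea of an expectation concrete, here is a minimal plain-Python sketch of the concept: a named check that returns a structured result instead of raising straight away. It does not use or reproduce the Great Expectations implementation. The helper names `expect_not_null`, `expect_between`, and `run_suite`, along with the sample rows, are hypothetical and exist only for illustration.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]
Result = Dict[str, Any]


def expect_not_null(rows: List[Row], column: str) -> Result:
    """Succeeds when no row has None (or a missing key) in `column`."""
    bad = [i for i, row in enumerate(rows) if row.get(column) is None]
    return {"check": f"not_null({column})", "success": not bad, "failed_rows": bad}


def expect_between(rows: List[Row], column: str, min_value: float, max_value: float) -> Result:
    """Succeeds when every non-null value lies in [min_value, max_value]."""
    bad = [
        i for i, row in enumerate(rows)
        if row.get(column) is not None and not (min_value <= row[column] <= max_value)
    ]
    return {
        "check": f"between({column}, {min_value}, {max_value})",
        "success": not bad,
        "failed_rows": bad,
    }


def run_suite(rows: List[Row], checks: List[Callable[[List[Row]], Result]]) -> List[Result]:
    """Runs every check and collects the results rather than stopping at the first failure."""
    return [check(rows) for check in checks]


# Hypothetical sample data: rows[1] has a missing user_id, rows[2] an out-of-range age.
rows = [
    {"user_id": 1, "age": 34},
    {"user_id": None, "age": 29},
    {"user_id": 3, "age": 212},
]

results = run_suite(rows, [
    lambda r: expect_not_null(r, "user_id"),
    lambda r: expect_between(r, "age", 0, 150),
])

# A tiny text report: one line per check with the offending row indices.
for result in results:
    status = "PASS" if result["success"] else "FAIL"
    print(f"{status} {result['check']} failed_rows={result['failed_rows']}")
```

Collecting results instead of raising on the first failure is what makes a readable report possible: the pipeline sees every failed check at once and can decide afterwards whether to stop.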

### Key Features

- **300+ built-in expectations**: null checks, range validation, regex matching, statistical tests
- **Auto-profiling**: automatically generate expectations from sample data
- **Data Docs**: auto-generated HTML documentation of your data quality
- **Multiple backends**: Pandas, Spark, SQLAlchemy (PostgreSQL, MySQL, BigQuery, etc.)
- **Pipeline integration**: works with Airflow, Dagster, Prefect, dbt
- **Checkpoint system**: bundle validation runs and trigger actions, such as alerts, on failures
- **Custom expectations**: write your own domain-specific validations

### Quick Start

```bash
# The CLI and the examples below target pre-1.0 releases
pip install "great_expectations<1.0"
great_expectations init
```

### Common Expectations

```python
import great_expectations as gx

# `validator` is a Great Expectations Validator for the data under test,
# for example the object returned by context.get_validator(...) in the pre-1.0 API.

# Column-level checks
validator.expect_column_values_to_not_be_null("email")
validator.expect_column_values_to_be_unique("user_id")
validator.expect_column_values_to_be_between("price", min_value=0)
validator.expect_column_values_to_match_regex("email", r"^[\w.]+@[\w.]+\.\w+$")

# Table-level checks
validator.expect_table_row_count_to_be_between(min_value=1000, max_value=1000000)
validator.expect_table_columns_to_match_ordered_list(["id", "name", "email", "created_at"])

# Statistical checks
validator.expect_column_mean_to_be_between("age", min_value=18, max_value=65)
validator.expect_column_stdev_to_be_between("score", min_value=0, max_value=30)
```

### Integration with AI/ML Pipelines

```python
# In your training pipeline.
# Assumes `context` comes from gx.get_context(), `training_batch_request` points at
# the training data, and the suite "training_data_suite" already exists.
checkpoint = context.add_or_update_checkpoint(
    name="training_data_check",
    validations=[{
        "batch_request": training_batch_request,
        "expectation_suite_name": "training_data_suite",
    }],
)

# Stop the pipeline before training if any expectation fails
result = checkpoint.run()
if not result.success:
    raise ValueError("Training data failed quality checks!")
```

### FAQ

**Q: What is Great Expectations?**
A: An open-source data validation framework that lets you write expressive assertions about your data, catching quality issues before they break AI/ML pipelines.

**Q: Is Great Expectations free?**
A: The open-source core is free under Apache-2.0. There is also a managed cloud version (GX Cloud) with additional features.

**Q: What data sources does it support?**
A: Pandas DataFrames and Spark, plus SQL backends via SQLAlchemy such as PostgreSQL, MySQL, BigQuery, Snowflake, Databricks, and Redshift.

---

## Source & Thanks

> Created by [Great Expectations](https://github.com/great-expectations). Licensed under Apache-2.0.
>
> [great_expectations](https://github.com/great-expectations/great_expectations) (⭐ 11,400+)

Thanks to the Great Expectations team for bringing software engineering rigor to data quality.

---

Source: https://tokrepo.com/en/workflows/great-expectations-data-validation-ai-pipelines-153bb8e0
Author: Script Depot