# Great Expectations — Data Validation for AI Pipelines

> Test your data like you test code. Validate data quality in AI/ML pipelines with expressive assertions, auto-profiling, and data docs. Apache-2.0, 11,400+ stars.

## Install

```bash
pip install great_expectations
```

## Quick Use

1. Install:

```bash
pip install great_expectations
```

2. Initialize in your project:

```bash
great_expectations init
```

3. Create your first expectations:

```python
import great_expectations as gx

# Fluent-style API (GX 0.17/0.18); names differ somewhat across GX versions.
context = gx.get_context()
datasource = context.sources.add_pandas("my_datasource")
asset = datasource.add_dataframe_asset(name="my_data")
batch_request = asset.build_batch_request(dataframe=your_dataframe)
validator = context.get_validator(batch_request=batch_request)

validator.expect_column_values_to_not_be_null("user_id")
validator.expect_column_values_to_be_between("age", min_value=0, max_value=150)
```

---

## Intro

Great Expectations is the leading data validation framework, with 11,400+ GitHub stars. It lets you write expressive tests for your data, just like unit tests for code, catching data quality issues before they break your AI/ML pipelines. It features auto-profiling, 300+ built-in expectations, and auto-generated data documentation.

Best for data engineers and ML practitioners building production data pipelines who need reliable data quality checks. Works with Pandas, Spark, SQL databases, and cloud data warehouses.

See also: [AI pipeline tools on TokRepo](https://tokrepo.com/en/@Script%20Depot).

---

## Great Expectations — Test Your Data Like You Test Code

### The Problem

Bad data silently breaks ML models. A training dataset with null values, outliers, or schema changes can waste days of compute and produce unreliable models. Most teams don't catch data issues until after the damage is done.

### The Solution

Great Expectations brings software testing practices to data. Write expectations (assertions) about your data, run them automatically in your pipeline, and get clear reports when something is wrong.
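The core idea is that an expectation is an assertion which returns a structured pass/fail result instead of raising. That can be sketched in plain Python. This is a toy illustration: the function name mirrors GX's, but the implementation and result-dict shape are simplified assumptions, not the library's API.

```python
def expect_column_values_to_be_between(rows, column, min_value, max_value):
    """Toy GX-style expectation: every value in `column` must fall in range.

    Returns a result dict loosely modeled on a GX validation result
    (illustrative only, not the real library's schema).
    """
    unexpected = [row[column] for row in rows
                  if not (min_value <= row[column] <= max_value)]
    return {
        "success": not unexpected,
        "result": {
            "element_count": len(rows),
            "unexpected_count": len(unexpected),
            "unexpected_list": unexpected,
        },
    }

# A batch with one bad record: age 200 is outside the expected [0, 150] range.
batch = [
    {"user_id": 1, "age": 25},
    {"user_id": 2, "age": 41},
    {"user_id": 3, "age": 200},
]
outcome = expect_column_values_to_be_between(batch, "age", min_value=0, max_value=150)
print(outcome["success"])                    # False
print(outcome["result"]["unexpected_list"])  # [200]
```

A pipeline can branch on `outcome["success"]` (alert, quarantine the batch, or halt training) rather than crashing on the first bad row, which is what makes the report-style result more useful than a bare `assert`.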
### Key Features

- **300+ built-in expectations** — null checks, range validation, regex matching, statistical tests
- **Auto-profiling** — automatically generate expectations from sample data
- **Data Docs** — auto-generated HTML documentation of your data quality
- **Multiple backends** — Pandas, Spark, SQLAlchemy (PostgreSQL, MySQL, BigQuery, etc.)
- **Pipeline integration** — works with Airflow, Dagster, Prefect, dbt
- **Checkpoint system** — schedule validation runs and get alerts on failures
- **Custom expectations** — write your own domain-specific validations

### Quick Start

```bash
pip install great_expectations
great_expectations init
```

### Common Expectations

```python
# `validator` is a GX Validator, e.g. from context.get_validator(batch_request=...)

# Column-level checks
validator.expect_column_values_to_not_be_null("email")
validator.expect_column_values_to_be_unique("user_id")
validator.expect_column_values_to_be_between("price", min_value=0)
validator.expect_column_values_to_match_regex("email", r"^[\w.]+@[\w.]+\.\w+$")

# Table-level checks
validator.expect_table_row_count_to_be_between(min_value=1000, max_value=1000000)
validator.expect_table_columns_to_match_ordered_list(["id", "name", "email", "created_at"])

# Statistical checks
validator.expect_column_mean_to_be_between("age", min_value=18, max_value=65)
validator.expect_column_stdev_to_be_between("score", min_value=0, max_value=30)
```

### Integration with AI/ML Pipelines

```python
# In your training pipeline
checkpoint = context.add_or_update_checkpoint(
    name="training_data_check",
    validations=[{
        "batch_request": training_batch_request,
        "expectation_suite_name": "training_data_suite",
    }],
)

result = checkpoint.run()
if not result.success:
    raise ValueError("Training data failed quality checks!")
```

### FAQ

**Q: What is Great Expectations?**
A: An open-source data validation framework that lets you write expressive assertions about your data, catching quality issues before they break AI/ML pipelines.
**Q: Is Great Expectations free?**
A: The open-source core is free under Apache-2.0. There is also a managed cloud version (GX Cloud) with additional features.

**Q: What data sources does it support?**
A: Pandas DataFrames, Spark, PostgreSQL, MySQL, BigQuery, Snowflake, Databricks, Redshift, and more via SQLAlchemy.

---

## Source & Thanks

> Created by [Great Expectations](https://github.com/great-expectations). Licensed under Apache-2.0.
>
> [great_expectations](https://github.com/great-expectations/great_expectations) — ⭐ 11,400+

Thanks to the Great Expectations team for bringing software engineering rigor to data quality.

---

Source: https://tokrepo.com/en/workflows/153bb8e0-33d7-11f1-9bc6-00163e2b0d79
Author: Script Depot