# ydata-profiling — Automated Data Quality Profiling for DataFrames

> ydata-profiling (formerly pandas-profiling) generates comprehensive HTML reports from pandas or Spark DataFrames, covering statistics, correlations, missing values, duplicates, and data type analysis.

## Quick Use
```bash
pip install ydata-profiling
python -c "
import pandas as pd
from ydata_profiling import ProfileReport
df = pd.read_csv('data.csv')
report = ProfileReport(df, title='Data Report')
report.to_file('report.html')
"
```

## Introduction
ydata-profiling generates detailed exploratory data analysis reports from a single function call. Originally known as pandas-profiling, it automates the repetitive first step of any data science workflow by surfacing statistics, distributions, correlations, and data quality issues in an interactive HTML report.

## What ydata-profiling Does
- Computes descriptive statistics for every column including mean, median, quantiles, and unique counts
- Detects missing values, zeros, and constant columns with visual indicators
- Calculates pairwise correlations using Pearson, Spearman, Kendall, and Phik methods
- Identifies duplicate rows and lists the most frequently repeated ones
- Renders an interactive HTML report with collapsible sections and histograms

## Architecture Overview
The library iterates over DataFrame columns, infers types (numeric, categorical, datetime, text, image), and dispatches type-specific analysis routines. Results are collected into a description dictionary, then rendered through a Jinja2 HTML template. For large datasets, it supports minimal mode to skip expensive computations and Spark DataFrames for distributed profiling.
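
The infer-then-dispatch flow described above can be illustrated with a small, library-free sketch. This is not ydata-profiling's actual code: `infer_type`, `SUMMARIZERS`, and `describe` are hypothetical names, the type rules are deliberately simplified, and the Jinja2 rendering step is left out.

```python
from collections import Counter
from datetime import date, datetime
from statistics import mean, median


def infer_type(values):
    """Guess a coarse column type from the non-missing values."""
    present = [v for v in values if v is not None]
    if not present:
        return "unsupported"
    if all(isinstance(v, bool) for v in present):
        return "boolean"
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in present):
        return "numeric"
    if all(isinstance(v, (date, datetime)) for v in present):
        return "datetime"
    return "categorical"


def summarize_common(values):
    """Statistics that apply to every column regardless of type."""
    present = [v for v in values if v is not None]
    return {
        "count": len(present),
        "missing": len(values) - len(present),
        "distinct": len(set(present)),
    }


def summarize_numeric(values):
    present = [v for v in values if v is not None]
    return {
        "mean": mean(present),
        "median": median(present),
        "min": min(present),
        "max": max(present),
    }


def summarize_categorical(values):
    present = [v for v in values if v is not None]
    top, top_count = Counter(present).most_common(1)[0]
    return {"top": top, "top_count": top_count}


# Dispatch table: inferred type -> type-specific analysis routine.
SUMMARIZERS = {
    "numeric": summarize_numeric,
    "categorical": summarize_categorical,
}


def describe(columns):
    """Build a description dictionary from a mapping of column name -> values."""
    description = {}
    for name, values in columns.items():
        col_type = infer_type(values)
        stats = {"type": col_type, **summarize_common(values)}
        routine = SUMMARIZERS.get(col_type)
        if routine is not None:
            stats.update(routine(values))
        description[name] = stats
    return description


if __name__ == "__main__":
    table = {"age": [31, 45, None, 28], "city": ["Oslo", "Lima", "Oslo", None]}
    for column, stats in describe(table).items():
        print(column, stats)
```

Keeping the per-type routines in a dispatch table means a new column type only needs one new function and one new table entry, which is the property the architecture paragraph attributes to the real library.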

## Installation & Configuration
- Install via pip; optional extras add Spark support (ydata-profiling[pyspark]) and Jupyter widget support (ydata-profiling[notebook])
- Pass a config object to control which analyses run and set thresholds
- Use minimal=True for datasets with more than one million rows to reduce runtime
- Export reports as HTML, JSON, or inline widgets in Jupyter notebooks
- Integrate with Great Expectations by converting profiles to expectation suites
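
As a rough sketch of how the size rule and export options above might be combined: the helper names `choose_options` and `export_report`, the `reports` directory, and the `ROW_THRESHOLD` constant are assumptions made for illustration. Only `ProfileReport`, its `minimal` and `title` arguments, and `to_file` come from the library.

```python
from pathlib import Path

# Rule of thumb taken from the list above; this is not a library constant.
ROW_THRESHOLD = 1_000_000


def choose_options(n_rows, title="Data Report"):
    """Build ProfileReport keyword arguments based on dataset size."""
    return {"title": title, "minimal": n_rows > ROW_THRESHOLD}


def export_report(df, out_dir="reports"):
    """Profile a pandas DataFrame and write HTML and JSON copies of the report."""
    # Imported lazily so this helper module still loads where
    # ydata-profiling is not installed.
    from ydata_profiling import ProfileReport

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    report = ProfileReport(df, **choose_options(len(df)))
    targets = [out / "report.html", out / "report.json"]
    for target in targets:
        # to_file picks the output format from the file extension.
        report.to_file(target)
    return targets
```

Calling `export_report(df)` on a small DataFrame would run the full analysis; the same call on a frame with more than a million rows would switch to minimal mode automatically.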

## Key Features
- One-line report generation from any pandas or Spark DataFrame
- Type inference engine adapts analysis to numeric, categorical, datetime, boolean, text, and URL columns
- Comparison mode diffs two datasets side by side for drift detection
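- Sensitive mode (sensitive=True) keeps individual record values out of the report so that only aggregate statistics are shown; URL and file path columns are recognized as their own types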
- Time-series mode adds autocorrelation and seasonality analysis for temporal data
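
Comparison mode and time-series mode from the list above might be used roughly as follows. The wrapper functions, output file names, and report titles are hypothetical; the library calls shown (`ProfileReport.compare`, `tsmode=True`, `sortby=`) are the ones documented by the project, but check them against the version you have installed.

```python
def compare_datasets(reference_df, current_df, out_path="comparison.html"):
    """Write a side-by-side report for two pandas DataFrames."""
    # Imported lazily so this module loads even without the dependency.
    from ydata_profiling import ProfileReport

    reference = ProfileReport(reference_df, title="Reference")
    current = ProfileReport(current_df, title="Current")

    # compare() returns a new report that shows both datasets together.
    comparison = reference.compare(current)
    comparison.to_file(out_path)
    return comparison


def profile_time_series(df, time_column, out_path="timeseries.html"):
    """Profile a DataFrame as a time series ordered by time_column."""
    from ydata_profiling import ProfileReport

    report = ProfileReport(
        df,
        title="Time-series report",
        tsmode=True,          # enable time-series specific analysis
        sortby=time_column,   # column that defines the temporal order
    )
    report.to_file(out_path)
    return report
```

For drift checks, a typical pairing would be a training snapshot as `reference_df` and a recent production extract as `current_df`.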

## Comparison with Similar Tools
- **pandas describe()** — basic statistics only; ydata-profiling adds visualizations, correlations, and alerts
- **Sweetviz** — similar HTML reports; ydata-profiling supports Spark and offers more configuration
- **D-Tale** — interactive DataFrame browser; ydata-profiling produces static shareable reports
- **Great Expectations** — focuses on validation rules; ydata-profiling focuses on exploratory analysis
- **Lux** — auto-visualization in Jupyter; ydata-profiling generates comprehensive standalone reports

## FAQ
**Q: How long does a report take to generate?**
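A: It depends on row count, column count, and which analyses are enabled. Datasets under 100K rows with a moderate number of columns usually finish in seconds; correlations and interactions dominate the cost on wide tables. Use minimal=True or profile a sample for million-row datasets.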

**Q: Can I customize which sections appear in the report?**
A: Yes. Pass keyword arguments or a YAML config file (config_file=...) to ProfileReport to disable correlations, duplicates, interactions, missing-value diagrams, or sample tables.
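
A minimal sketch of switching sections off, assuming the `None` shorthands described in the project's settings documentation behave the same way in your installed version. `DISABLED_SECTIONS`, `build_options`, and `slim_report` are illustrative names, not part of the library.

```python
# Passing None for a section is a documented shorthand for turning it off.
DISABLED_SECTIONS = {
    "correlations": None,
    "duplicates": None,
    "interactions": None,
    "missing_diagrams": None,
    "samples": None,
}


def build_options(**overrides):
    """Start from everything disabled, then re-enable selected sections."""
    return {**DISABLED_SECTIONS, **overrides}


def slim_report(df, title="Slim report", **overrides):
    """Create a report that contains only per-variable analysis by default."""
    # Imported lazily so this module loads even without the dependency.
    from ydata_profiling import ProfileReport

    return ProfileReport(df, title=title, **build_options(**overrides))
```

For example, `slim_report(df).to_file("slim.html")` would produce a report limited to the overview and variable sections.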

**Q: Does it work with PySpark DataFrames?**
A: Yes. Install the pyspark extra (pip install "ydata-profiling[pyspark]") and pass a Spark DataFrame directly to ProfileReport. The statistics are computed on the Spark cluster rather than on the driver alone.

**Q: Can I embed the report in a web application?**
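A: Export to HTML and serve it statically, or call to_html() and embed the returned string in your page (for example, through Streamlit's HTML component). Inside Jupyter, use to_widgets() or to_notebook_iframe().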

## Sources
- https://github.com/ydataai/ydata-profiling
- https://docs.profiling.ydata.ai/

---
Source: https://tokrepo.com/en/workflows/asset-6c8a5473
Author: AI Open Source