ydata-profiling — Automated Data Quality Profiling for DataFrames

Introduction

ydata-profiling generates detailed exploratory data analysis reports from a single function call. Originally known as pandas-profiling, it automates the repetitive first step of any data science workflow by surfacing statistics, distributions, correlations, and data quality issues in an interactive HTML report.
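
As a minimal sketch of that single call, assuming the package is installed under its import name ydata_profiling and that df is an existing pandas DataFrame (the helper name build_report, the default title, and the output path are illustrative and not part of the library):

```python
def build_report(df, title="Profiling Report", output_path="report.html"):
    """Profile a pandas DataFrame and write the report to output_path."""
    # Imported inside the function so this module still loads on machines
    # where ydata-profiling is not installed.
    from ydata_profiling import ProfileReport

    # Constructing ProfileReport is the "single call"; saving the report
    # triggers the analysis and renders the HTML file.
    profile = ProfileReport(df, title=title)
    profile.to_file(output_path)
    return profile
```

For example, build_report(pd.read_csv("orders.csv"), title="Orders") would write report.html to the working directory; the file name orders.csv is a placeholder.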

What ydata-profiling Does

Computes descriptive statistics for every column including mean, median, quantiles, and unique counts
Detects missing values, zeros, and constant columns with visual indicators
Calculates pairwise correlations using Pearson, Spearman, Kendall, Phik, and Cramér's V methods
Identifies duplicate rows and lists the most frequent ones
Renders an interactive HTML report with collapsible sections and histograms

Architecture Overview

The library iterates over DataFrame columns, infers each column's type (numeric, categorical, datetime, text, image), and dispatches type-specific analysis routines. Results are collected into a description dictionary, then rendered through a Jinja2 HTML template. For large datasets, a minimal mode skips expensive computations, and Spark DataFrames can be profiled in a distributed fashion.
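
The infer-then-dispatch flow described above can be illustrated with a deliberately simplified, standard-library-only sketch. This is not the library's actual code: the function names, the type rules, and the statistics chosen here are all assumptions made for illustration, and the Jinja2 rendering step is left out.

```python
from collections import Counter
from datetime import datetime
from statistics import mean, median


def infer_type(values):
    """Guess a column type from its non-missing values (toy rules)."""
    present = [v for v in values if v is not None]
    if not present:
        return "categorical"
    # bool is a subclass of int, so it has to be checked first.
    if all(isinstance(v, bool) for v in present):
        return "boolean"
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in present):
        return "numeric"
    if all(isinstance(v, datetime) for v in present):
        return "datetime"
    return "categorical"


def describe_numeric(values):
    return {"mean": mean(values), "median": median(values),
            "min": min(values), "max": max(values)}


def describe_categorical(values):
    counts = Counter(values)
    return {"n_distinct": len(counts), "top": counts.most_common(1)[0][0]}


def describe_datetime(values):
    return {"min": min(values), "max": max(values)}


# Dispatch table: inferred type -> type-specific analysis routine.
ANALYZERS = {
    "numeric": describe_numeric,
    "boolean": describe_categorical,
    "categorical": describe_categorical,
    "datetime": describe_datetime,
}


def describe(table):
    """Build a description dictionary for a {column name: list of values} table."""
    description = {}
    for name, values in table.items():
        present = [v for v in values if v is not None]
        kind = infer_type(values)
        summary = {"type": kind, "n_missing": len(values) - len(present)}
        if present:
            summary.update(ANALYZERS[kind](present))
        description[name] = summary
    return description
```

Calling describe({"price": [10.0, 12.5, None, 20.0], "city": ["Oslo", "Oslo", "Rome", None]}) returns one summary per column, each carrying the inferred type, the missing-value count, and the statistics of the matching routine. In the real library, a dictionary of this kind is what the HTML template consumes.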

Installation & Configuration

Install via pip (pip install ydata-profiling); optional extras include Spark support and image analysis
Pass keyword arguments or a YAML configuration file (config_file) to ProfileReport to control which analyses run and to set thresholds
Use minimal=True for datasets with more than one million rows to reduce runtime
Export reports as HTML, JSON, or inline widgets in Jupyter notebooks
Integrate with Great Expectations by converting profiles to expectation suites
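
The configuration points above could be combined roughly as follows. This is a sketch under assumptions: the helper names choose_options and profile_dataframe, the one-million-row threshold parameter, and the particular analyses switched off are illustrative, and the keyword arguments follow the library's documented settings, which may differ between versions.

```python
def choose_options(n_rows, row_threshold=1_000_000):
    """Return ProfileReport keyword arguments sized to the dataset."""
    if n_rows > row_threshold:
        # Minimal mode switches off the most expensive computations.
        return {"minimal": True}
    return {
        # Compute only the cheaper correlation measures.
        "correlations": {
            "pearson": {"calculate": True},
            "spearman": {"calculate": True},
            "kendall": {"calculate": False},
            "phi_k": {"calculate": False},
            "cramers": {"calculate": False},
        },
        # None disables the duplicate-rows section.
        "duplicates": None,
    }


def profile_dataframe(df, html_path="report.html", json_path="report.json"):
    """Profile df and export the result as both HTML and JSON."""
    # Imported lazily so the helper above works without the library.
    from ydata_profiling import ProfileReport

    profile = ProfileReport(df, title="Data profile", **choose_options(len(df)))
    # to_file picks the output format from the file extension.
    profile.to_file(html_path)
    profile.to_file(json_path)
    return profile
```

Inside a Jupyter notebook, the returned profile can also be displayed inline with profile.to_widgets() or profile.to_notebook_iframe().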

Key Features

One-line report generation from any pandas or Spark DataFrame
Type inference engine adapts analysis to numeric, categorical, datetime, boolean, text, and URL columns
Comparison mode diffs two datasets side by side for drift detection
Sensitive mode (sensitive=True) keeps raw values such as sample rows out of the report, while type inference can recognize URL and file-path columns
Time-series mode adds autocorrelation and seasonality analysis for temporal data
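
Comparison mode and time-series mode might be used as in the sketch below. The helper names, report titles, and output paths are assumptions for illustration; compare, tsmode, and sortby follow the library's documented usage, which is worth checking against the installed version.

```python
def compare_datasets(reference_df, current_df, output_path="comparison.html"):
    """Profile two DataFrames and write a side-by-side comparison report."""
    from ydata_profiling import ProfileReport

    reference = ProfileReport(reference_df, title="Reference")
    current = ProfileReport(current_df, title="Current")
    # compare() returns a new report that shows both datasets side by side.
    comparison = reference.compare(current)
    comparison.to_file(output_path)
    return comparison


def profile_time_series(df, time_column, output_path="timeseries.html"):
    """Profile df as a time series ordered by time_column."""
    from ydata_profiling import ProfileReport

    # tsmode enables the time-series analyses; sortby orders the rows
    # by the given column before they are computed.
    profile = ProfileReport(df, tsmode=True, sortby=time_column,
                            title="Time-series profile")
    profile.to_file(output_path)
    return profile
```

For drift detection, reference_df would typically be a training or baseline snapshot and current_df a more recent extract with the same columns.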

Comparison with Similar Tools

pandas describe() — basic statistics only; ydata-profiling adds visualizations, correlations, and alerts
Sweetviz — similar HTML reports; ydata-profiling supports Spark and offers more configuration
D-Tale — interactive DataFrame browser; ydata-profiling produces static shareable reports
Great Expectations — focuses on validation rules; ydata-profiling focuses on exploratory analysis
Lux — auto-visualization in Jupyter; ydata-profiling generates comprehensive standalone reports

FAQ

Q: How long does a report take to generate? A: Typically a few seconds for datasets under 100K rows, though runtime also grows with the number of columns. Use minimal=True or sampling for million-row datasets.

Q: Can I customize which sections appear in the report? A: Yes. Pass configuration options to ProfileReport to disable correlations, duplicates, or specific variable analyses.

Q: Does it work with PySpark DataFrames? A: Yes. Install the spark extra and pass a Spark DataFrame directly. Profiling runs distributed across the cluster.

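Q: Can I embed the report in a web application? A: Export to HTML and serve it statically, or pass the string returned by to_html() to your application's HTML component (for example, in Streamlit). Inside Jupyter, use to_widgets() or to_notebook_iframe().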
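
A small sketch of the static-export route, assuming profile is an existing ProfileReport (anything with a to_html() method that returns a string works here); the helper name export_static and the index.html file name are illustrative choices:

```python
from pathlib import Path


def export_static(profile, directory):
    """Write the report's HTML to <directory>/index.html and return the path."""
    out_dir = Path(directory)
    out_dir.mkdir(parents=True, exist_ok=True)
    target = out_dir / "index.html"
    # to_html() returns the whole report as a single HTML string.
    target.write_text(profile.to_html(), encoding="utf-8")
    return target
```

The resulting directory can be served by any static file server, and the same HTML string can be handed to a web framework's HTML component instead of being written to disk.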
