Introduction
ydata-profiling generates detailed exploratory data analysis reports from a single function call. Formerly named pandas-profiling, it automates the repetitive first pass over a new dataset by surfacing statistics, distributions, correlations, and data quality issues in an interactive HTML report.
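A minimal sketch of that single-call workflow, assuming the package is installed and importable as ydata_profiling. The helper name write_report, the default filename, and the early extension check are illustrative choices, not part of the library.

```python
from pathlib import Path


def write_report(df, out_path="report.html", title="Profiling Report"):
    """Profile a pandas DataFrame and write the report to out_path."""
    out_path = Path(out_path)
    # to_file() writes JSON for a .json path and HTML otherwise,
    # so this sketch rejects any other extension up front.
    if out_path.suffix not in {".html", ".json"}:
        raise ValueError(f"expected a .html or .json path, got {out_path.name!r}")

    # Imported lazily so this module still loads where the package is absent.
    from ydata_profiling import ProfileReport

    profile = ProfileReport(df, title=title)  # the call that builds the profile
    profile.to_file(out_path)
    return out_path
```

Typical use would be something like write_report(pd.read_csv("data.csv")), where data.csv stands in for your own file.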
What ydata-profiling Does
- Computes descriptive statistics for every column including mean, median, quantiles, and unique counts
- Detects missing values, zeros, and constant columns with visual indicators
- Calculates pairwise correlations using Pearson, Spearman, Kendall, and Phik methods
- Identifies duplicate rows and lists the most frequently repeated ones
- Renders an interactive HTML report with collapsible sections and histograms
Architecture Overview
The library iterates over the DataFrame's columns, infers a type for each (such as numeric, categorical, datetime, text, or image), and dispatches type-specific analysis routines. Results are collected into a description dictionary and then rendered through a Jinja2 HTML template. For large datasets, minimal mode skips expensive computations, and Spark DataFrames allow distributed profiling.
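The infer-then-dispatch loop can be sketched in plain Python. This is a simplified illustration that uses only the standard library, not the library's actual code: the function names (infer_type, describe_table), the ANALYZERS registry, and the coarse type rules are all assumptions made for the example, and the Jinja2 rendering step is left out.

```python
from collections import Counter
from datetime import datetime
from statistics import mean, median


def _present(values):
    """Drop missing entries (represented here as None)."""
    return [v for v in values if v is not None]


def infer_type(values):
    """Guess a coarse column type from the non-missing values."""
    present = _present(values)
    if not present:
        return "categorical"
    if all(isinstance(v, bool) for v in present):
        return "boolean"
    # bool is a subclass of int, so exclude it from the numeric check.
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in present):
        return "numeric"
    if all(isinstance(v, datetime) for v in present):
        return "datetime"
    return "categorical"


def describe_numeric(values):
    present = _present(values)
    return {
        "mean": mean(present),
        "median": median(present),
        "n_zeros": sum(v == 0 for v in present),
    }


def describe_categorical(values):
    counts = Counter(_present(values))
    return {"top": counts.most_common(1)[0][0] if counts else None}


# Type-specific routines, looked up by the inferred type.
ANALYZERS = {"numeric": describe_numeric, "categorical": describe_categorical}


def describe_table(columns):
    """Build a description dictionary: one summary per column."""
    description = {}
    for name, values in columns.items():
        present = _present(values)
        col_type = infer_type(values)
        summary = {
            "type": col_type,
            "n_missing": len(values) - len(present),
            "n_unique": len(set(present)),
        }
        analyzer = ANALYZERS.get(col_type)
        if analyzer is not None:
            summary.update(analyzer(values))
        description[name] = summary
    return description
```

Calling describe_table({"age": [31, 0, None, 45], "city": ["Oslo", "Oslo", "Lima", None]}) yields one summary per column. A renderer would then turn that dictionary into report sections.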
Installation & Configuration
- Install with pip install ydata-profiling; optional extras include Spark support and image analysis
- Pass keyword arguments or a YAML configuration file to ProfileReport to control which analyses run and to set thresholds
- Use minimal=True for datasets with more than one million rows to reduce runtime
- Export reports to HTML or JSON, or display them as inline widgets in Jupyter notebooks
- Integrate with Great Expectations by converting profiles to expectation suites
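The configuration and export points above might be combined as follows. This is a sketch under assumptions: the helper names profile_options and export_report are hypothetical, the one-million-row cutoff is the rule of thumb from the list rather than a library default, and the option names (minimal, correlations, duplicates, progress_bar) follow the library's documented settings as I understand them, so check them against your installed version.

```python
def profile_options(n_rows, large_threshold=1_000_000):
    """Return ProfileReport keyword arguments sized to the dataset."""
    options = {"title": "Data Profile", "progress_bar": False}
    if n_rows > large_threshold:
        # Minimal mode skips the expensive computations.
        options["minimal"] = True
    else:
        # Keep the full report but switch off two analyses as an example.
        options["correlations"] = {"kendall": {"calculate": False}}
        options["duplicates"] = None
    return options


def export_report(df, stem="profile"):
    """Profile df and write both an HTML and a JSON export."""
    # Imported lazily so this module still loads where the package is absent.
    from ydata_profiling import ProfileReport

    profile = ProfileReport(df, **profile_options(len(df)))
    profile.to_file(f"{stem}.html")  # shareable interactive report
    profile.to_file(f"{stem}.json")  # machine-readable summary
    return profile
```

Keeping the option selection in its own function makes the size rule easy to adjust and to check without generating a report. In a notebook, the returned profile can be displayed with profile.to_widgets().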
Key Features
- One-line report generation from any pandas or Spark DataFrame
- Type inference engine adapts analysis to numeric, categorical, datetime, boolean, text, and URL columns
- Comparison mode diffs two datasets side by side for drift detection
- Sensitive mode (sensitive=True) keeps individual values out of the report when a dataset contains private data
- Time-series mode adds autocorrelation and seasonality analysis for temporal data
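Comparison mode and time-series mode from the list above could be invoked roughly like this. It is a sketch, assuming the compare() method and the tsmode and sortby options behave as described in the library's documentation; the helper names, report titles, and output paths are illustrative.

```python
def compare_datasets(reference_df, current_df, out_path="comparison.html"):
    """Profile two DataFrames and write a side-by-side comparison report."""
    # Imported lazily so this module still loads where the package is absent.
    from ydata_profiling import ProfileReport

    reference = ProfileReport(reference_df, title="Reference")
    current = ProfileReport(current_df, title="Current")
    comparison = reference.compare(current)
    comparison.to_file(out_path)
    return comparison


def profile_time_series(df, time_column, out_path="timeseries.html"):
    """Profile df as a time series ordered by time_column."""
    from ydata_profiling import ProfileReport

    profile = ProfileReport(
        df,
        tsmode=True,         # enable the time-series analyses
        sortby=time_column,  # column that orders the observations
        title="Time-Series Profile",
    )
    profile.to_file(out_path)
    return profile
```

For drift checks, a natural pairing is a training split as the reference and a more recent batch as the current dataset, both sharing the same columns.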
Comparison with Similar Tools
- pandas describe() — basic statistics only; ydata-profiling adds visualizations, correlations, and alerts
- Sweetviz — similar HTML reports; ydata-profiling supports Spark and offers more configuration
- D-Tale — interactive DataFrame browser; ydata-profiling produces static shareable reports
- Great Expectations — focuses on validation rules; ydata-profiling focuses on exploratory analysis
- Lux — auto-visualization in Jupyter; ydata-profiling generates comprehensive standalone reports
FAQ
Q: How long does a report take to generate? A: It depends on both row and column count. Datasets under roughly 100K rows usually finish quickly; for million-row datasets, use minimal=True or profile a sample.
Q: Can I customize which sections appear in the report? A: Yes. Pass configuration options to ProfileReport to disable correlations, duplicates, or specific variable analyses.
Q: Does it work with PySpark DataFrames? A: Yes. Install the spark extra and pass a Spark DataFrame directly. Profiling runs distributed across the cluster.
Q: Can I embed the report in a web application? A: Export to HTML and serve it statically, or call to_html() and embed the returned markup in a page (for example, in a Streamlit app). Inside Jupyter, use to_widgets() or to_notebook_iframe().