Configs · May 10, 2026 · 3 min read

ydata-profiling — Automated Data Quality Profiling for DataFrames

ydata-profiling (formerly pandas-profiling) generates comprehensive HTML reports from pandas or Spark DataFrames, covering statistics, correlations, missing values, duplicates, and data type analysis.

Introduction

ydata-profiling generates detailed exploratory data analysis reports from a single function call. Originally known as pandas-profiling, it automates the repetitive first step of any data science workflow by surfacing statistics, distributions, correlations, and data quality issues in an interactive HTML report.
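As a minimal sketch of that single call, assuming the package is installed as ydata-profiling and imported as ydata_profiling. The wrapper name write_profile, the default title, and the output path are illustrative choices, not part of the library; ProfileReport and to_file are the library's own entry points.

```python
from pathlib import Path


def write_profile(df, output_path="report.html", title="Profiling Report"):
    """Profile a pandas DataFrame and save the report to disk.

    `write_profile` is a hypothetical helper written for this example.
    """
    # Imported inside the function so this module still loads on machines
    # where ydata-profiling is not installed.
    from ydata_profiling import ProfileReport

    report = ProfileReport(df, title=title)
    output_path = Path(output_path)
    # The file extension (.html or .json) selects the export format.
    report.to_file(output_path)
    return output_path
```

Calling write_profile(df) on any pandas DataFrame should leave a standalone report.html in the working directory.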

What ydata-profiling Does

  • Computes descriptive statistics for every column including mean, median, quantiles, and unique counts
  • Detects missing values, zeros, and constant columns with visual indicators
  • Calculates pairwise correlations using Pearson, Spearman, Kendall, and Phik methods
  • Identifies duplicate rows and lists the most frequently repeated ones
  • Renders an interactive HTML report with collapsible sections and histograms
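To make the list above concrete, here is a standard-library sketch of the kind of per-column figures such a report surfaces. It is not the library's implementation: the function names, the returned keys, and the convention that None marks a missing value are assumptions made for illustration.

```python
import statistics
from collections import Counter


def summarize_numeric(values):
    """Summarize one numeric column; None marks a missing value.

    Needs at least two non-missing values, because
    statistics.quantiles cannot split a single point.
    """
    present = [v for v in values if v is not None]
    q1, _, q3 = statistics.quantiles(present, n=4)  # quartile cut points
    distinct = len(set(present))
    return {
        "count": len(present),
        "missing": len(values) - len(present),
        "zeros": sum(1 for v in present if v == 0),
        "distinct": distinct,
        "constant": distinct == 1,  # a single distinct value means a constant column
        "mean": statistics.fmean(present),
        "median": statistics.median(present),
        "q1": q1,
        "q3": q3,
    }


def count_duplicate_rows(rows):
    """Count rows that exactly repeat an earlier row."""
    counts = Counter(tuple(row) for row in rows)
    return sum(n - 1 for n in counts.values())
```

A profiling report computes figures like these for every column at once and attaches alerts (for example, a high share of missing values) to the ones that look suspicious.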

Architecture Overview

The library iterates over DataFrame columns, infers types (numeric, categorical, datetime, text, image), and dispatches type-specific analysis routines. Results are collected into a description dictionary, then rendered through a Jinja2 HTML template. For large datasets, it supports minimal mode to skip expensive computations and Spark DataFrames for distributed profiling.
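The infer-then-dispatch flow described above can be sketched in plain Python. This is a toy model of the pattern, not the library's code: infer_type, the analyzer functions, and the shape of the description dictionary are all hypothetical, and the real type system covers more types than the four shown here.

```python
from collections import Counter
from datetime import datetime


def infer_type(values):
    """Guess a column type from its non-missing values (None = missing)."""
    present = [v for v in values if v is not None]
    if present and all(isinstance(v, bool) for v in present):
        return "boolean"
    # bool is a subclass of int, so exclude it explicitly from numeric.
    if present and all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
    ):
        return "numeric"
    if present and all(isinstance(v, datetime) for v in present):
        return "datetime"
    return "categorical"


def describe_numeric(values):
    present = [v for v in values if v is not None]
    return {"min": min(present), "max": max(present), "mean": sum(present) / len(present)}


def describe_categorical(values):
    counts = Counter(v for v in values if v is not None)
    top = counts.most_common(1)[0][0] if counts else None
    return {"distinct": len(counts), "top": top}


def describe_generic(values):
    return {"count": sum(1 for v in values if v is not None)}


# Type-specific routines; types without an entry fall back to describe_generic.
ANALYZERS = {"numeric": describe_numeric, "categorical": describe_categorical}


def describe(table):
    """Build a description dictionary: column name -> type plus statistics."""
    description = {}
    for name, values in table.items():
        kind = infer_type(values)
        analyzer = ANALYZERS.get(kind, describe_generic)
        description[name] = {"type": kind, **analyzer(values)}
    return description
```

Keeping analysis and rendering separate in this way is what lets the same description be exported as HTML, JSON, or notebook widgets.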

Installation & Configuration

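  • Install with pip (pip install ydata-profiling); optional extras include pyspark for Spark support, notebook for Jupyter widgets, and unicode for Unicode text analysis
  • Pass keyword arguments or a YAML configuration file (config_file=...) to ProfileReport to control which analyses run and to set thresholds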
  • Use minimal=True for datasets with more than one million rows to reduce runtime
  • Export reports as HTML, JSON, or inline widgets in Jupyter notebooks
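  • Older releases can convert a profile into a Great Expectations suite with to_expectation_suite(); check that your installed version still includes this integration before relying on it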
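The minimal-mode rule of thumb and the export options above might be combined as follows. This is a sketch under assumptions: the one-million-row threshold comes from the text rather than from the library, and report_kwargs and export_report are hypothetical helper names. ProfileReport, its minimal and title arguments, and to_file are existing library calls.

```python
# Rule of thumb taken from the text above, not a constant defined by the library.
MINIMAL_ROW_THRESHOLD = 1_000_000


def report_kwargs(n_rows, title="Profiling Report"):
    """Choose ProfileReport keyword arguments based on dataset size."""
    return {"title": title, "minimal": n_rows > MINIMAL_ROW_THRESHOLD}


def export_report(df, stem="report"):
    """Profile a pandas DataFrame and write both HTML and JSON exports."""
    # Imported lazily so the helpers above work without the package installed.
    from ydata_profiling import ProfileReport

    report = ProfileReport(df, **report_kwargs(len(df)))
    report.to_file(f"{stem}.html")  # interactive, shareable report
    report.to_file(f"{stem}.json")  # machine-readable description
    return report
```

Inside a Jupyter notebook, the returned report can instead be rendered inline with report.to_widgets() or report.to_notebook_iframe().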

Key Features

  • One-line report generation from any pandas or Spark DataFrame
  • Type inference engine adapts analysis to numeric, categorical, datetime, boolean, text, and URL columns
  • Comparison mode diffs two datasets side by side for drift detection
  • Sensitive mode (sensitive=True) keeps individual records out of the report so that only aggregate information is shown
  • Time-series mode adds autocorrelation and seasonality analysis for temporal data
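Comparison mode and time-series mode from the list above could be used roughly like this. The function names, report titles, and output paths are illustrative assumptions; compare, tsmode, and sortby are the library's names, though their behavior may differ between versions.

```python
def compare_datasets(reference_df, current_df, output_path="comparison.html"):
    """Write a side-by-side report for two DataFrames with the same columns."""
    from ydata_profiling import ProfileReport  # lazy import: optional dependency

    reference = ProfileReport(reference_df, title="Reference")
    current = ProfileReport(current_df, title="Current")
    comparison = reference.compare(current)  # returns a combined report
    comparison.to_file(output_path)
    return comparison


def profile_time_series(df, time_column, output_path="timeseries.html"):
    """Profile temporal data, ordering rows by `time_column` first."""
    from ydata_profiling import ProfileReport  # lazy import: optional dependency

    report = ProfileReport(
        df,
        tsmode=True,          # enable time-series specific analysis
        sortby=time_column,   # column that defines the temporal order
        title="Time-Series Report",
    )
    report.to_file(output_path)
    return report
```

For drift checks, a typical pairing is a training set as the reference and a recent production sample as the current dataset.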

Comparison with Similar Tools

  • pandas describe() — basic statistics only; ydata-profiling adds visualizations, correlations, and alerts
  • Sweetviz — similar HTML reports; ydata-profiling supports Spark and offers more configuration
  • D-Tale — interactive DataFrame browser; ydata-profiling produces static shareable reports
  • Great Expectations — focuses on validation rules; ydata-profiling focuses on exploratory analysis
  • Lux — auto-visualization in Jupyter; ydata-profiling generates comprehensive standalone reports

FAQ

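Q: How long does a report take to generate? A: Often only a few seconds for datasets under 100K rows, though runtime grows with the number of columns and with the analyses enabled (correlations and interactions are usually the most expensive). For million-row datasets, use minimal=True or profile a sample.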

Q: Can I customize which sections appear in the report? A: Yes. Pass configuration options to ProfileReport, as keyword arguments or through a config file, to disable correlations, duplicates, or specific variable analyses.
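One way to switch sections off, sketched under the assumption that the installed version accepts None as a shorthand for disabling a section. The helper names disabled_sections and lean_report are hypothetical, and the list of section names should be checked against the settings documentation for your version.

```python
# Section names assumed to accept None as a "turn this off" shorthand.
SECTION_SHORTHANDS = (
    "correlations",
    "duplicates",
    "interactions",
    "missing_diagrams",
    "samples",
)


def disabled_sections(*names):
    """Build ProfileReport keyword arguments that disable the named sections."""
    unknown = set(names) - set(SECTION_SHORTHANDS)
    if unknown:
        raise ValueError(f"unknown section(s): {sorted(unknown)}")
    return {name: None for name in names}


def lean_report(df, title="Lean Report"):
    """Profile `df` without the correlation and duplicate-row sections."""
    from ydata_profiling import ProfileReport  # lazy import: optional dependency

    return ProfileReport(df, title=title, **disabled_sections("correlations", "duplicates"))
```

Validating the names up front turns a typo into an immediate error instead of a setting that is silently ignored or rejected deeper in the library.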

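Q: Does it work with PySpark DataFrames? A: Yes. Install the pyspark extra (pip install ydata-profiling[pyspark]) and pass a Spark DataFrame directly. Profiling runs distributed across the cluster, although Spark support may cover fewer analyses than the pandas backend.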

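Q: Can I embed the report in a web application? A: Export to HTML and serve it statically, or embed the HTML string returned by to_html() in your application (for example, in a Streamlit component). Inside Jupyter, to_widgets() and to_notebook_iframe() render the report inline.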
