# ydata-profiling: Automated Data Quality Profiling for DataFrames

> ydata-profiling (formerly pandas-profiling) generates comprehensive HTML reports from pandas or Spark DataFrames, covering statistics, correlations, missing values, duplicates, and data type analysis.

## Quick Use

```bash
# Install the library
pip install ydata-profiling

# Profile a CSV file and write the report to report.html
python -c "
import pandas as pd
from ydata_profiling import ProfileReport

df = pd.read_csv('data.csv')
report = ProfileReport(df, title='Data Report')
report.to_file('report.html')
"
```

## Introduction

ydata-profiling generates detailed exploratory data analysis reports from a single function call. Originally known as pandas-profiling, it automates the repetitive first step of any data science workflow by surfacing statistics, distributions, correlations, and data quality issues in an interactive HTML report.

## What ydata-profiling Does

- Computes descriptive statistics for every column, including mean, median, quantiles, and unique counts
- Detects missing values, zeros, and constant columns with visual indicators
- Calculates pairwise correlations using Pearson, Spearman, Kendall, and Phik methods
- Identifies duplicate rows and near-duplicate patterns
- Renders an interactive HTML report with collapsible sections and histograms

## Architecture Overview

The library iterates over DataFrame columns, infers types (numeric, categorical, datetime, text, image), and dispatches type-specific analysis routines. Results are collected into a description dictionary, then rendered through a Jinja2 HTML template. For large datasets, it supports a minimal mode that skips expensive computations, and Spark DataFrames for distributed profiling.

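
The infer-then-dispatch flow described above can be illustrated with a toy, standard-library-only sketch. This is not the library's actual code: the function names (`infer_type`, `describe_numeric`, `describe_categorical`, `describe`), the two-type rule set, and the dict-of-lists table are all hypothetical stand-ins chosen to keep the example small.

```python
from statistics import mean, median


def infer_type(values):
    """Return a coarse type label for one column (illustrative rules only)."""
    present = [v for v in values if v is not None]
    is_number = lambda v: isinstance(v, (int, float)) and not isinstance(v, bool)
    if present and all(is_number(v) for v in present):
        return "numeric"
    return "categorical"


def describe_numeric(values):
    """Type-specific routine for numeric columns."""
    present = [v for v in values if v is not None]
    return {"mean": mean(present), "median": median(present)}


def describe_categorical(values):
    """Type-specific routine for categorical columns."""
    present = [v for v in values if v is not None]
    return {"n_unique": len(set(present))}


# Dispatch table: inferred type label -> analysis routine (hypothetical).
ANALYZERS = {"numeric": describe_numeric, "categorical": describe_categorical}


def describe(table):
    """Build a description dictionary with one entry per column."""
    description = {}
    for name, values in table.items():
        kind = infer_type(values)
        stats = {"type": kind, "n_missing": sum(v is None for v in values)}
        stats.update(ANALYZERS[kind](values))
        description[name] = stats
    return description


# A tiny table stored as a dict of columns; None marks a missing value.
table = {"age": [31, 45, None, 28], "city": ["Oslo", "Lima", "Oslo", None]}
description = describe(table)
print(description)
```

In the real library the resulting description would then be passed to a rendering step; here it is simply printed.
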
## Self-Hosting & Configuration

- Install via pip; optional extras include Spark support and image analysis
- Pass a config object to control which analyses run and set thresholds
- Use `minimal=True` for datasets with more than one million rows to reduce runtime
- Export reports as HTML, JSON, or inline widgets in Jupyter notebooks
- Integrate with Great Expectations by converting profiles to expectation suites

## Key Features

- One-line report generation from any pandas or Spark DataFrame
- Type inference engine adapts analysis to numeric, categorical, datetime, boolean, text, and URL columns
- Comparison mode diffs two datasets side by side for drift detection
- Sensitive data detection flags columns containing emails, URLs, or file paths
- Time-series mode adds autocorrelation and seasonality analysis for temporal data

## Comparison with Similar Tools

- **pandas describe()**: basic statistics only; ydata-profiling adds visualizations, correlations, and alerts
- **Sweetviz**: similar HTML reports; ydata-profiling supports Spark and offers more configuration
- **D-Tale**: interactive DataFrame browser; ydata-profiling produces static shareable reports
- **Great Expectations**: focuses on validation rules; ydata-profiling focuses on exploratory analysis
- **Lux**: auto-visualization in Jupyter; ydata-profiling generates comprehensive standalone reports

## FAQ

**Q: How long does a report take to generate?**
A: A few seconds for datasets under 100K rows. Use `minimal=True` or sampling for million-row datasets.

**Q: Can I customize which sections appear in the report?**
A: Yes. Pass a ProfileReport config to disable correlations, duplicates, or specific variable analyses.

**Q: Does it work with PySpark DataFrames?**
A: Yes. Install the spark extra and pass a Spark DataFrame directly. Profiling runs distributed across the cluster.

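
The large-dataset advice in this FAQ (switch to `minimal=True` above roughly one million rows) can be wrapped in a small helper. The sketch below is one possible arrangement, not a recommended or official pattern: `report_kwargs`, `make_report`, and the `threshold` parameter are hypothetical names, and only `ProfileReport(df, title=..., minimal=...)` and `to_file()` come from the library itself.

```python
def report_kwargs(n_rows, title="Data Report", threshold=1_000_000):
    """Choose ProfileReport keyword arguments from the row count.

    The threshold mirrors the one-million-row guideline in the text;
    it is an assumption to tune, not a library default.
    """
    return {"title": title, "minimal": n_rows > threshold}


def make_report(df, output="report.html"):
    """Profile a pandas DataFrame, using minimal mode for large inputs."""
    # Imported lazily so the helper above can be used without the library.
    from ydata_profiling import ProfileReport

    report = ProfileReport(df, **report_kwargs(len(df)))
    report.to_file(output)
    return output
```

Keeping the size decision in a separate pure function makes it easy to unit-test and to adjust the threshold per project without touching the profiling call.
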
**Q: Can I embed the report in a web application?**
A: Export to HTML and serve it statically, use `to_widgets()` inside a Jupyter notebook, or embed the HTML string returned by `to_html()` in a framework such as Streamlit.

## Sources

- https://github.com/ydataai/ydata-profiling
- https://docs.profiling.ydata.ai/

---

Source: https://tokrepo.com/en/workflows/asset-6c8a5473
Author: AI Open Source