Introduction
statsmodels complements scikit-learn by focusing on classical statistical inference rather than prediction. It provides detailed model summaries with coefficients, standard errors, p-values, and confidence intervals — the output statisticians and economists expect from tools like R or Stata.
What statsmodels Does
- Fits linear and generalized linear models with comprehensive diagnostic output
- Implements time-series analysis including ARIMA, VAR, state-space models, and seasonal decomposition
- Provides nonparametric methods like kernel density estimation and lowess smoothing
- Runs hypothesis tests (t-test, F-test, Granger causality, unit root tests)
- Generates publication-ready regression tables and diagnostic plots
Architecture Overview
statsmodels follows a model-fit-results pattern. You specify a model class (OLS, Logit, ARIMA), call .fit() to estimate parameters, and receive a results object whose attributes and methods expose coefficients, residuals, information criteria, and statistical tests. Under the hood, linear models are solved with NumPy linear algebra routines, while likelihood-based models are typically optimized through scipy.optimize.
Installation & Configuration
- Install via pip: pip install statsmodels
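- Depends on NumPy, SciPy, and pandas, plus patsy for formula-based model specification
- Use R-style formulas: sm.OLS.from_formula("y ~ x1 + x2", data=df), or the equivalent lowercase helpers in statsmodels.formula.api such as smf.ols
- Configure optimizer parameters and covariance estimators per model through arguments to .fit(), such as method, maxiter, and cov_type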
- Works in Jupyter notebooks with rich HTML output for model summaries
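The configuration points above might look like this in practice. This is a minimal sketch on simulated data; the column names (x1, x2, y, z) and the specific optimizer settings are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data (illustrative only).
rng = np.random.default_rng(1)
n = 300
df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
df["y"] = 0.5 + 1.5 * df["x1"] - 1.0 * df["x2"] + rng.normal(size=n)

# Binary outcome for the logistic regression.
prob = 1.0 / (1.0 + np.exp(-(0.3 + 0.8 * df["x1"])))
df["z"] = (rng.uniform(size=n) < prob).astype(int)

# Formula interface: the intercept is added automatically.
# cov_type selects the covariance estimator used for standard errors.
ols_res = smf.ols("y ~ x1 + x2", data=df).fit(cov_type="HC3")
print(ols_res.bse)

# Optimizer settings are passed to .fit() for likelihood-based models.
logit_res = smf.logit("z ~ x1", data=df).fit(
    method="bfgs", maxiter=200, disp=False
)
print(logit_res.params)
```

In a Jupyter notebook, leaving ols_res.summary() as the last expression in a cell renders the table as HTML.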
Key Features
- Comprehensive model summaries matching R/Stata output with AIC, BIC, R-squared, and residual diagnostics
- Time-series toolbox with ARIMA, SARIMAX, VAR, and exponential smoothing
- Robust covariance estimators (HC0-HC3, HAC, clustered) for valid inference under heteroscedasticity, autocorrelation, or within-cluster correlation
- Mixed-effects models for hierarchical and panel data
- Survival analysis with Kaplan-Meier and Cox proportional hazards
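To make one of these features concrete, here is a sketch of a random-intercept mixed-effects model fitted through the formula interface. The grouped data is simulated, and the column names (y, x, g) are hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated hierarchical data (illustrative only):
# 30 groups, 20 observations each, with a group-level random intercept.
rng = np.random.default_rng(6)
n_groups, n_per_group = 30, 20
g = np.repeat(np.arange(n_groups), n_per_group)
group_effect = rng.normal(scale=1.0, size=n_groups)
x = rng.normal(size=g.size)
y = 1.0 + 2.0 * x + group_effect[g] + rng.normal(scale=0.5, size=g.size)
df = pd.DataFrame({"y": y, "x": x, "g": g})

# Fixed effects come from the formula; `groups` defines the clusters
# that share a random intercept.
mixed_res = smf.mixedlm("y ~ x", data=df, groups=df["g"]).fit()

print(mixed_res.fe_params)   # fixed-effect estimates
print(mixed_res.cov_re)      # estimated random-intercept variance
print(mixed_res.summary())
```

Random slopes can be added with the re_formula argument; that is omitted here to keep the sketch short.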
Comparison with Similar Tools
- scikit-learn — focused on prediction accuracy; statsmodels provides inference statistics (p-values, confidence intervals)
- R (stats package) — the gold standard for statistical computing; statsmodels brings similar functionality to the Python ecosystem
- SciPy (scipy.stats) — provides individual tests; statsmodels offers full model estimation and diagnostics
- linearmodels — extends statsmodels with panel data and IV models; statsmodels covers the broader foundation
FAQ
Q: When should I use statsmodels instead of scikit-learn? A: Use statsmodels when you need to understand relationships (coefficients, significance, confidence intervals) rather than just predict outcomes.
Q: Does statsmodels support regularized regression? A: Yes. OLS and GLM classes support elastic net regularization via fit_regularized(), though scikit-learn may be more convenient for pure prediction tasks.
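Q: Can I use statsmodels for time-series forecasting? A: Yes. ARIMA, SARIMAX, and other state-space models are well supported, and helpers such as arma_order_select_ic and ar_select_order compare candidate lag orders by information criteria. You still choose the search range and the differencing order yourself.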
Q: How does the formula API work? A: Use patsy-style formulas like "y ~ x1 + x2 + x1:x2" to specify models declaratively from a DataFrame, similar to R.
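As a sketch of the formula from the answer above, the example below fits a model with an interaction term on simulated data. The DataFrame and its coefficients are made up for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data (illustrative only) with a true interaction effect of 2.0.
rng = np.random.default_rng(7)
n = 400
df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
df["y"] = (
    1.0
    + 0.5 * df["x1"]
    - 0.5 * df["x2"]
    + 2.0 * df["x1"] * df["x2"]
    + rng.normal(size=n)
)

# "x1:x2" adds only the interaction term; main effects are listed separately.
inter_res = smf.ols("y ~ x1 + x2 + x1:x2", data=df).fit()
print(inter_res.params)

# "x1 * x2" is shorthand for "x1 + x2 + x1:x2".
star_res = smf.ols("y ~ x1 * x2", data=df).fit()
print(star_res.params)
```

Parameter names in the results come from the formula terms, so the interaction coefficient is available as inter_res.params["x1:x2"].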