# imbalanced-learn — Handle Imbalanced Datasets in Python

> imbalanced-learn is a scikit-learn-compatible Python library providing over- and under-sampling techniques, ensemble methods, and pipeline utilities for learning from imbalanced datasets.

## Quick Use
```bash
pip install imbalanced-learn
```
```python
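from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Build a synthetic 90/10 binary dataset (seeded for reproducibility)
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

# Synthesize minority samples until both classes are the same size
smote = SMOTE(random_state=42)
X_res, y_res = smote.fit_resample(X, y)
print(f"Before: {(y == 1).sum()}, After: {(y_res == 1).sum()}")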
```

## Introduction
imbalanced-learn is a Python package that extends scikit-learn with resampling techniques designed for imbalanced classification problems. When one class vastly outnumbers the others (as in fraud detection, medical diagnosis, or anomaly detection), standard classifiers tend to favor the majority class and miss the minority class. imbalanced-learn provides tools to rebalance the training data before fitting a model.

## What imbalanced-learn Does
- Over-sampling methods: SMOTE, ADASYN, BorderlineSMOTE, and RandomOverSampler
- Under-sampling methods: TomekLinks, EditedNearestNeighbours, and RandomUnderSampler
- Combination methods that chain over- and under-sampling: SMOTEENN and SMOTETomek
- Ensemble classifiers: BalancedRandomForestClassifier, EasyEnsembleClassifier, and BalancedBaggingClassifier
- Pipeline integration with scikit-learn for clean preprocessing workflows

## Architecture Overview
The library mirrors scikit-learn's API: samplers expose fit_resample(), and ensemble classifiers use the standard fit/predict methods. Samplers share a common base class, so over-samplers (which duplicate or synthesize minority examples), under-samplers (which remove majority examples), and combined methods all present the same interface. Samplers accept NumPy arrays and pandas DataFrames and plug into imblearn.pipeline.Pipeline for end-to-end workflows.

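## Installation & Configuration
- Install via pip: `pip install imbalanced-learn`
- Requires Python 3 plus scikit-learn, NumPy, SciPy, and joblib; minimum versions depend on the release, so check the install guide
- No external services or GPU needed
- Configure how much each class is resampled via the `sampling_strategy` parameter (a string such as `"auto"`, a float, or a dict)
- imblearn.pipeline.Pipeline is a drop-in replacement for sklearn.pipeline.Pipeline that also accepts sampler steps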

## Key Features
- scikit-learn-compatible API built around the fit_resample pattern
- Multiple SMOTE variants for different data distributions
- Ensemble methods designed specifically for imbalanced learning
- Works with NumPy arrays and pandas DataFrames
- Extensive documentation with practical examples and benchmarks

## Comparison with Similar Tools
- **scikit-learn**: offers the class_weight parameter on many estimators but no dedicated samplers; imbalanced-learn adds them
- **Standalone SMOTE implementations**: cover a single algorithm; imbalanced-learn bundles SMOTE with many variants and under-sampling methods
- **PyOD**: focuses on outlier detection; imbalanced-learn targets supervised classification
- **XGBoost scale_pos_weight**: a model-level fix that reweights the positive class; imbalanced-learn operates at the data level

## FAQ
**Q: When should I use oversampling vs undersampling?**
A: Oversampling (for example SMOTE) suits small datasets where discarding samples is costly. Undersampling shrinks the training set, so it trains faster and suits datasets with abundant majority samples.

**Q: Can I use imbalanced-learn in a scikit-learn pipeline?**
A: Yes. Use imblearn.pipeline.Pipeline instead of sklearn.pipeline.Pipeline. It calls fit_resample on sampler steps while fitting and skips them at prediction time, so only training data is resampled.

**Q: Does SMOTE work with categorical features?**
A: Use SMOTENC (for mixed numeric/categorical) or SMOTEN (for purely categorical). Standard SMOTE handles only numeric features.

**Q: Does it work with multi-class problems?**
A: Yes. Most samplers support multi-class targets by resampling each class toward the target set by the sampling_strategy parameter.

## Sources
- https://github.com/scikit-learn-contrib/imbalanced-learn
- https://imbalanced-learn.org/

---
Source: https://tokrepo.com/en/workflows/asset-1cf4ecef
Author: AI Open Source