# scikit-learn — Machine Learning in Python Made Simple

> scikit-learn is the most widely used machine learning library in Python. It provides simple and efficient tools for classification, regression, clustering, dimensionality reduction, model selection, and preprocessing — all with a consistent API.

## Quick Use

```bash
# Install scikit-learn
pip install scikit-learn

# Quick classification example
python3 -c "
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
clf = RandomForestClassifier(n_estimators=100)
clf.fit(X_train, y_train)
print(f'Accuracy: {accuracy_score(y_test, clf.predict(X_test)):.2f}')
"
```

## Introduction

scikit-learn is the foundational machine learning library for Python. Built on NumPy, SciPy, and matplotlib, it provides a unified API for classical machine learning algorithms — from simple linear regression to complex ensemble methods. Its consistent fit/predict interface makes it possible to swap algorithms with a single line of code.

With over 66,000 GitHub stars and nearly two decades of development since its start in 2007, scikit-learn is used by data scientists, researchers, and engineers worldwide. It remains the go-to library for traditional ML tasks even as deep learning frameworks have emerged.

## What scikit-learn Does

scikit-learn provides tools for every step of the machine learning workflow: data preprocessing (scaling, encoding), feature selection, model training, hyperparameter tuning, cross-validation, and evaluation metrics. Every algorithm follows the same API pattern: instantiate, fit, predict.
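The instantiate/fit/predict pattern means estimators are interchangeable. A minimal sketch (the two model choices here are illustrative, not a recommendation):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Any estimator drops into the same three-step workflow:
# instantiate, fit, then predict/score.
for model in (LogisticRegression(max_iter=1000), KNeighborsClassifier(n_neighbors=5)):
    model.fit(X_train, y_train)
    print(type(model).__name__, model.score(X_test, y_test))
```

Swapping algorithms only changes the constructor call; the fit and score lines stay identical.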
## Architecture Overview

```
[scikit-learn Pipeline]

Data --> [Preprocessing]        [Model Selection]
         StandardScaler         GridSearchCV
         LabelEncoder           cross_val_score
         OneHotEncoder          train_test_split
              |                        |
     [Feature Engineering]             |
     PCA, SelectKBest                  |
     PolynomialFeatures                |
              |                        |
         [Estimators] <----------------+
         Classification: SVM, RF, KNN, LogReg
         Regression: Linear, Ridge, SVR, GBR
         Clustering: KMeans, DBSCAN, Hierarchical
         Decomposition/Manifold: PCA, NMF, t-SNE
              |
         [Evaluation]
         accuracy, f1, ROC-AUC, confusion_matrix, MSE
```

## Pipelines & Configuration

```python
# Complete ML pipeline example
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC
import joblib

# Load example data (digits has 64 features, so PCA can search up to 20 components)
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Build a pipeline: scale, reduce dimensionality, classify
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=10)),
    ("classifier", SVC())
])

# Hyperparameter search; "step__param" names target individual pipeline steps
param_grid = {
    "pca__n_components": [5, 10, 20],
    "classifier__C": [0.1, 1, 10],
    "classifier__kernel": ["rbf", "linear"]
}

grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
grid_search.fit(X_train, y_train)

print(f"Best params: {grid_search.best_params_}")
print(f"Best score: {grid_search.best_score_:.3f}")

# Save the best model (preprocessing steps included) for later reuse
joblib.dump(grid_search.best_estimator_, "model.pkl")
```

## Key Features

- **Consistent API** — every algorithm follows the fit/predict/transform pattern
- **Classification** — SVM, Random Forest, Gradient Boosting, KNN, Logistic Regression
- **Regression** — Linear, Ridge, Lasso, SVR, Gradient Boosting Regressor
- **Clustering** — KMeans, DBSCAN, Hierarchical, Spectral Clustering
- **Dimensionality Reduction** — PCA, t-SNE, UMAP (via umap-learn)
- **Model Selection** — cross-validation, grid search, randomized search
- **Preprocessing** — scaling, encoding, imputation, feature extraction
- **Pipeline System** — chain preprocessing and models into reproducible workflows

## Comparison with Similar Tools
| Feature | scikit-learn | XGBoost | PyTorch | TensorFlow | LightGBM |
|---|---|---|---|---|---|
| Focus | Classical ML | Gradient Boosting | Deep Learning | Deep Learning | Gradient Boosting |
| Learning Curve | Very Low | Low | Moderate | Moderate | Low |
| GPU Support | No | Yes | Yes | Yes | Yes |
| API Style | fit/predict | fit/predict | Training loop | Keras/fit | fit/predict |
| Best For | Tabular data, prototyping | Competitions, tabular | Vision, NLP, research | Production DL | Large datasets |
| Interpretability | High | Moderate | Low | Low | Moderate |

## FAQ

**Q: When should I use scikit-learn vs deep learning?**
A: Use scikit-learn for tabular/structured data, small-to-medium datasets, and when interpretability matters. Use deep learning (PyTorch/TensorFlow) for images, text, audio, and very large datasets where neural networks excel.

**Q: How do I handle large datasets that do not fit in memory?**
A: Use `partial_fit` for incremental learning (supported by SGDClassifier, MiniBatchKMeans, etc.), or use libraries like Dask-ML that extend scikit-learn to distributed computing.

**Q: Can I use scikit-learn in production?**
A: Yes. Serialize models with joblib or pickle, serve them via FastAPI or Flask, and use scikit-learn pipelines to ensure consistent preprocessing. For high throughput, consider ONNX export.

**Q: What is the best algorithm for my problem?**
A: Start with the scikit-learn algorithm cheat sheet. For tabular classification, try Random Forest or Gradient Boosting first; for regression, start with Ridge Regression. Always cross-validate.

## Sources

- GitHub: https://github.com/scikit-learn/scikit-learn
- Documentation: https://scikit-learn.org
- Created by David Cournapeau, maintained by the community
- License: BSD 3-Clause

---

Source: https://tokrepo.com/en/workflows/0fe55648-366d-11f1-9bc6-00163e2b0d79
Author: AI Open Source