# scikit-learn — Machine Learning in Python Made Simple

> scikit-learn is the most widely used machine learning library in Python. It provides simple and efficient tools for classification, regression, clustering, dimensionality reduction, model selection, and preprocessing — all with a consistent API.

## Quick Use

```bash
# Install scikit-learn
pip install scikit-learn

# Quick classification example
python3 -c "
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
clf = RandomForestClassifier(n_estimators=100)
clf.fit(X_train, y_train)
print(f'Accuracy: {accuracy_score(y_test, clf.predict(X_test)):.2f}')
"
```

## Introduction

scikit-learn is the foundational machine learning library for Python. Built on NumPy, SciPy, and matplotlib, it provides a unified API for classical machine learning algorithms — from simple linear regression to complex ensemble methods. Its consistent fit/predict interface makes it possible to swap algorithms with a single line of code.

With over 66,000 GitHub stars and nearly two decades of development since its start in 2007, scikit-learn is used by data scientists, researchers, and engineers worldwide. It remains the go-to library for traditional ML tasks even as deep learning frameworks have emerged.

## What scikit-learn Does

scikit-learn provides tools for every step of the machine learning workflow: data preprocessing (scaling, encoding), feature selection, model training, hyperparameter tuning, cross-validation, and evaluation metrics. Every algorithm follows the same API pattern: instantiate, fit, predict.
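The instantiate/fit/predict pattern means estimators are interchangeable. A minimal sketch (the two model choices here are illustrative, not a recommendation):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Any estimator drops into the same three-step workflow:
# instantiate, fit, then predict/score.
for model in (LogisticRegression(max_iter=1000), KNeighborsClassifier(n_neighbors=5)):
    model.fit(X_train, y_train)
    print(type(model).__name__, model.score(X_test, y_test))
```

Swapping algorithms only changes the constructor call; the fit and score lines stay identical.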
## Architecture Overview

```
[scikit-learn Pipeline]

Data --> [Preprocessing]        [Model Selection]
         StandardScaler         GridSearchCV
         LabelEncoder           cross_val_score
         OneHotEncoder          train_test_split
              |                        |
     [Feature Engineering]             |
     PCA, SelectKBest                  |
     PolynomialFeatures                |
              |                        |
         [Estimators] <----------------+
         Classification: SVM, RF, KNN, LogReg
         Regression: Linear, Ridge, SVR, GBR
         Clustering: KMeans, DBSCAN, Hierarchical
         Decomposition/Manifold: PCA, NMF, t-SNE
              |
         [Evaluation]
         accuracy, f1, ROC-AUC, confusion_matrix, MSE
```

## Pipelines & Configuration

```python
# Complete ML pipeline example
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC
import joblib

# Load example data (digits has 64 features, so PCA can search up to 20 components)
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Build a pipeline: scale, reduce dimensionality, classify
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=10)),
    ("classifier", SVC())
])

# Hyperparameter search; "step__param" names target individual pipeline steps
param_grid = {
    "pca__n_components": [5, 10, 20],
    "classifier__C": [0.1, 1, 10],
    "classifier__kernel": ["rbf", "linear"]
}

grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
grid_search.fit(X_train, y_train)

print(f"Best params: {grid_search.best_params_}")
print(f"Best score: {grid_search.best_score_:.3f}")

# Save the best model (preprocessing steps included) for later reuse
joblib.dump(grid_search.best_estimator_, "model.pkl")
```

## Key Features

- **Consistent API** — every algorithm follows the fit/predict/transform pattern
- **Classification** — SVM, Random Forest, Gradient Boosting, KNN, Logistic Regression
- **Regression** — Linear, Ridge, Lasso, SVR, Gradient Boosting Regressor
- **Clustering** — KMeans, DBSCAN, Hierarchical, Spectral Clustering
- **Dimensionality Reduction** — PCA, t-SNE, UMAP (via umap-learn)
- **Model Selection** — cross-validation, grid search, randomized search
- **Preprocessing** — scaling, encoding, imputation, feature extraction
- **Pipeline System** — chain preprocessing and models into reproducible workflows

## Comparison with Similar Tools
| Feature | scikit-learn | XGBoost | PyTorch | TensorFlow | LightGBM |
|---|---|---|---|---|---|
| Focus | Classical ML | Gradient Boosting | Deep Learning | Deep Learning | Gradient Boosting |
| Learning Curve | Very Low | Low | Moderate | Moderate | Low |
| GPU Support | No | Yes | Yes | Yes | Yes |
| API Style | fit/predict | fit/predict | Training loop | Keras/fit | fit/predict |
| Best For | Tabular data, prototyping | Competitions, tabular | Vision, NLP, research | Production DL | Large datasets |
| Interpretability | High | Moderate | Low | Low | Moderate |

## FAQ

**Q: When should I use scikit-learn vs deep learning?**
A: Use scikit-learn for tabular/structured data, small-to-medium datasets, and when interpretability matters. Use deep learning (PyTorch/TensorFlow) for images, text, audio, and very large datasets where neural networks excel.

**Q: How do I handle large datasets that do not fit in memory?**
A: Use `partial_fit` for incremental learning (supported by SGDClassifier, MiniBatchKMeans, etc.), or use libraries like Dask-ML that extend scikit-learn to distributed computing.

**Q: Can I use scikit-learn in production?**
A: Yes. Serialize models with joblib or pickle, serve them via FastAPI or Flask, and use scikit-learn pipelines to ensure consistent preprocessing. For high throughput, consider ONNX export.

**Q: What is the best algorithm for my problem?**
A: Start with the scikit-learn algorithm cheat sheet. For tabular classification, try Random Forest or Gradient Boosting first; for regression, start with Ridge Regression. Always cross-validate.

## Sources

- GitHub: https://github.com/scikit-learn/scikit-learn
- Documentation: https://scikit-learn.org
- Created by David Cournapeau, maintained by the community
- License: BSD 3-Clause

---

Source: https://tokrepo.com/en/workflows/0fe55648-366d-11f1-9bc6-00163e2b0d79
Author: AI Open Source