# TPOT — Automated Machine Learning with Genetic Programming

> TPOT uses genetic programming to automatically design and optimize machine learning pipelines, selecting the best models and preprocessing steps from scikit-learn.

## Install

```bash
pip install tpot
```

## Quick Use

```python
from tpot import TPOTClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    *load_iris(return_X_y=True), test_size=0.2)

clf = TPOTClassifier(generations=5, population_size=20, verbosity=2)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
clf.export('best_pipeline.py')
```

## Introduction

TPOT (Tree-based Pipeline Optimization Tool) automates the most tedious parts of machine learning by intelligently exploring thousands of possible pipeline configurations. It uses genetic programming to evolve scikit-learn pipelines, freeing data scientists from manual feature engineering and model selection.

## What TPOT Does

- Evolves complete ML pipelines using genetic programming
- Automatically selects preprocessing, feature engineering, and model steps
- Exports the best pipeline as a standalone Python script
- Supports classification and regression tasks out of the box
- Integrates with scikit-learn estimators and transformers

## Architecture Overview

TPOT represents each pipeline as a tree whose nodes are scikit-learn operators. A genetic algorithm mutates, crosses over, and selects pipelines across generations, evaluating fitness via cross-validation. The final champion pipeline is exported as clean Python code built from scikit-learn primitives.
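Because each evolved tree is built from scikit-learn operators, a champion maps directly onto an ordinary scikit-learn `Pipeline`. The following is a hypothetical illustration of what an exported champion can look like, not the output of a real TPOT run; the operator choices (`PolynomialFeatures` feeding a `RandomForestClassifier`) are assumptions for the sketch:

```python
# Hypothetical shape of a TPOT champion: a plain scikit-learn pipeline
# with a preprocessing step followed by a tuned estimator.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

X_train, X_test, y_train, y_test = train_test_split(
    *load_iris(return_X_y=True), test_size=0.2, random_state=42)

# The tree (PolynomialFeatures -> RandomForestClassifier) flattens
# into a two-step Pipeline that fits and scores like any estimator.
exported_pipeline = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    RandomForestClassifier(n_estimators=100, random_state=42),
)
exported_pipeline.fit(X_train, y_train)
print(exported_pipeline.score(X_test, y_test))
```

Because the result is pure scikit-learn, it can be versioned, reviewed, and deployed without TPOT installed.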
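The operator search space can also be narrowed with a custom configuration dictionary, where keys are import paths of scikit-learn operators and values are hyperparameter grids the genetic search may draw from. A minimal sketch, assuming the classic TPOT (0.x) `config_dict` format; the specific operators and grids here are illustrative:

```python
# A restricted search space: two classifiers and one preprocessor.
# Each key is a scikit-learn import path; each value is the
# hyperparameter grid the genetic search is allowed to sample.
tpot_config = {
    'sklearn.naive_bayes.GaussianNB': {},
    'sklearn.tree.DecisionTreeClassifier': {
        'criterion': ['gini', 'entropy'],
        'max_depth': range(1, 11),
    },
    'sklearn.preprocessing.StandardScaler': {},
}

# Passed to the estimator as, e.g.:
#   TPOTClassifier(generations=5, population_size=20,
#                  config_dict=tpot_config)
```

Restricting the space this way trades breadth for speed: fewer candidate operators means each generation evaluates faster.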
## Self-Hosting & Configuration

- Install via pip, with optional dependencies for XGBoost and Dask
- Set `generations` and `population_size` to control search thoroughness
- Use `n_jobs=-1` to parallelize fitness evaluation across all cores
- Enable the Dask backend for distributed pipeline search on clusters
- Set the `scoring` parameter to match your evaluation metric

## Key Features

- Zero-config AutoML that finds competitive pipelines automatically
- Exports reproducible Python code rather than opaque model objects
- Supports custom operator sets and search constraints
- Built-in stacking ensemble capabilities
- Warm start to resume optimization from a previous run

## Comparison with Similar Tools

- **AutoGluon** — broader scope (tabular, text, and image); TPOT focuses on scikit-learn pipeline optimization
- **auto-sklearn** — also optimizes scikit-learn pipelines but uses Bayesian optimization; TPOT uses genetic programming
- **FLAML** — faster search via cost-frugal tuning; TPOT explores more pipeline structures
- **H2O AutoML** — requires the H2O server; TPOT runs in pure Python

## FAQ

**Q: How long does TPOT take to run?**
A: It depends on dataset size and the `generations` setting. Small datasets can finish in minutes; large ones may need hours. Use `max_time_mins` to set a budget.

**Q: Can TPOT use GPUs?**
A: TPOT itself is CPU-based, but you can include XGBoost with GPU support as a custom operator.

**Q: Does TPOT support deep learning?**
A: TPOT focuses on traditional ML pipelines. For neural architecture search, consider other tools.

**Q: How do I interpret the exported pipeline?**
A: TPOT exports a plain Python script with scikit-learn imports that you can read, modify, and run independently.

## Sources

- https://github.com/EpistasisLab/tpot
- http://epistasislab.github.io/tpot/

---

Source: https://tokrepo.com/en/workflows/asset-6c515b5a
Author: AI Open Source