# LightGBM — Light Gradient Boosting Framework by Microsoft

> LightGBM is a fast, distributed gradient boosting framework by Microsoft that uses tree-based learning algorithms. It is designed for efficiency and scalability, handling large datasets with lower memory usage while maintaining high accuracy for classification, regression, and ranking tasks.

## Quick Use

```bash
pip install lightgbm
python -c "
import lightgbm as lgb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = lgb.LGBMClassifier(n_estimators=100, learning_rate=0.1)
model.fit(X_train, y_train)
print(f'Accuracy: {model.score(X_test, y_test):.3f}')
"
```

## Introduction

LightGBM is a gradient boosting framework that uses histogram-based algorithms and leaf-wise tree growth to train models faster than traditional approaches. Developed by Microsoft Research, it excels on large-scale tabular datasets and is widely used in Kaggle competitions, financial modeling, and recommendation systems.

## What LightGBM Does

- Trains gradient boosted decision trees using a leaf-wise growth strategy for deeper, more accurate trees
- Handles large datasets efficiently with histogram-based split finding that bins continuous features
- Supports categorical features natively without one-hot encoding via optimal split algorithms
- Provides distributed and GPU-accelerated training for datasets with millions of rows
- Offers classification, regression, ranking (LambdaRank), and cross-entropy objectives

## Architecture Overview

LightGBM grows trees leaf-wise rather than level-wise, choosing the leaf with the maximum delta loss to split at each step. This produces deeper trees with fewer leaves for the same number of splits, often improving accuracy.
LightGBM also uses Gradient-based One-Side Sampling (GOSS) to focus on under-trained instances and Exclusive Feature Bundling (EFB) to reduce the number of features, together enabling faster training with minimal accuracy loss.

## Self-Hosting & Configuration

- Install via pip: `pip install lightgbm` or conda: `conda install -c conda-forge lightgbm`
- GPU build: requires OpenCL, e.g. `pip install lightgbm --config-settings=cmake.define.USE_GPU=ON` (the older `--install-option=--gpu` flag is no longer supported by modern pip)
- Key parameters: `num_leaves` (default 31), `learning_rate`, `n_estimators`, `min_child_samples`
- Distributed training via sockets, MPI, or Dask, configured with `tree_learner` and `num_machines`
- Save models with `model.booster_.save_model('model.txt')` in human-readable text format

## Key Features

- Leaf-wise growth produces more accurate models than level-wise approaches given the same compute budget
- Histogram binning reduces memory from 8 bytes per feature value to 1 byte, enabling larger datasets in RAM
- Native categorical feature support with optimal category-to-node assignment
- GOSS and EFB algorithms for 10-20x speedup on large datasets with negligible accuracy loss
- Scikit-learn compatible API plus a native training API with callbacks

## Comparison with Similar Tools

- **XGBoost** — level-wise growth is more robust on small data, but LightGBM is often faster on large datasets
- **CatBoost** — better default handling of categoricals and less prone to overfitting, but slower training
- **scikit-learn GBM** — simpler but lacks histogram binning, GPU support, and distributed training
- **Random Forest** — easier to tune but generally less accurate than boosted tree ensembles
- **TabNet** — deep learning for tabular data with attention, but harder to train and less consistent

## FAQ

**Q: When should I choose LightGBM over XGBoost?**
A: LightGBM tends to train faster on large datasets (100K+ rows) due to histogram binning and leaf-wise growth. XGBoost may be more robust on smaller datasets.
**Q: How do I prevent overfitting with leaf-wise growth?**
A: Limit `num_leaves` (start with 31-127), use `min_child_samples` (20+), and enable early stopping with a validation set.

**Q: Does LightGBM support GPU training?**
A: Yes, LightGBM has a GPU-accelerated histogram builder. Install the GPU build and set `device='gpu'` in parameters.

**Q: Can LightGBM handle missing values?**
A: Yes, LightGBM handles missing values natively by learning the optimal direction for missing values at each split.

## Sources

- https://github.com/microsoft/LightGBM
- https://lightgbm.readthedocs.io