Configs · Apr 29, 2026 · 3 min read

CatBoost — Gradient Boosting with Native Categorical Support

High-performance gradient boosting library by Yandex that handles categorical features natively without manual encoding and provides state-of-the-art accuracy on tabular data.

Introduction

CatBoost is a gradient boosting library developed by Yandex that handles categorical features natively without one-hot encoding or manual preprocessing. It uses ordered target statistics and oblivious decision trees to deliver high accuracy on tabular datasets while being robust to overfitting, even with default hyperparameters.

What CatBoost Does

  • Trains gradient-boosted decision tree models for classification, regression, and ranking tasks
  • Handles categorical features natively using ordered target encoding during training
  • Supports GPU-accelerated training for faster iteration on large datasets
  • Provides built-in tools for model analysis, feature importance, and SHAP values
  • Exports models to CoreML, ONNX, C++, and Python for production deployment

Architecture Overview

CatBoost builds an ensemble of oblivious decision trees, where all nodes at the same depth use the same split condition. This symmetric structure enables vectorized prediction and reduces overfitting. During training, it uses ordered boosting: each sample's residual is computed using a model trained only on preceding samples, preventing target leakage. Categorical features are encoded using per-category statistics computed with a random permutation ordering, eliminating the need for manual feature engineering.
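The ordered target statistics described above can be illustrated with a toy sketch. This is not CatBoost's actual implementation (the library uses multiple permutations and additional smoothing internally); it is a minimal, pure-Python illustration of the core idea: each sample is encoded using only the labels of samples that precede it in a random permutation, so no sample ever sees its own target.

```python
import random

def ordered_target_encoding(categories, targets, prior=0.5, seed=0):
    """Toy illustration of ordered target statistics.

    Each sample's encoding is a smoothed mean of the targets of
    *preceding* samples (in a random permutation) that share its
    category, which prevents target leakage.
    """
    n = len(categories)
    order = list(range(n))
    random.Random(seed).shuffle(order)

    sums = {}    # running sum of targets seen so far, per category
    counts = {}  # running count of samples seen so far, per category
    encoded = [0.0] * n
    for idx in order:
        cat = categories[idx]
        s = sums.get(cat, 0.0)
        c = counts.get(cat, 0)
        # Smoothed mean over preceding targets; `prior` acts as a
        # pseudo-observation so unseen categories get a sane default.
        encoded[idx] = (s + prior) / (c + 1)
        sums[cat] = s + targets[idx]
        counts[cat] = c + 1
    return encoded
```

Note that the first sample of each category (in permutation order) is encoded purely from the prior, since no labels precede it.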

Self-Hosting & Configuration

  • Install via pip or conda; GPU support requires CUDA toolkit
  • Pass cat_features parameter to specify which columns are categorical
  • Configure iterations, learning_rate, and depth via constructor or YAML config files
  • Use early_stopping_rounds with a validation set to prevent overfitting automatically
  • Enable GPU training with task_type='GPU' for datasets with millions of rows

Key Features

  • Native categorical feature handling eliminates manual encoding pipelines
  • Oblivious decision trees for fast and cache-friendly inference
  • Ordered boosting reduces prediction shift and target leakage
  • Built-in cross-validation, grid search, and feature importance analysis
  • Multi-platform deployment: export to ONNX, CoreML, C++ code, or a standalone evaluator

Comparison with Similar Tools

  • XGBoost — pioneered scalable gradient boosting; CatBoost adds native categorical support and ordered boosting
  • LightGBM — leaf-wise growth for speed; CatBoost's symmetric trees provide more stable default performance
  • scikit-learn GBM — simpler API, but slower training and no native categorical handling; CatBoost scales better to large datasets
  • Random Forest — ensemble of independent trees; CatBoost's sequential boosting typically achieves higher accuracy
  • TabNet — deep learning for tabular data; CatBoost remains more accurate on most tabular benchmarks with less tuning

FAQ

Q: Do I need to encode categorical features before training? A: No. Pass the indices of categorical columns via cat_features, and CatBoost handles encoding internally using ordered target statistics.

Q: How does CatBoost compare to XGBoost and LightGBM on accuracy? A: On benchmarks with many categorical features, CatBoost often outperforms both. On purely numerical datasets, performance is comparable. All three are strong choices for tabular data.

Q: Does CatBoost support GPU training? A: Yes. Set task_type to GPU and CatBoost uses CUDA for training. GPU mode is especially beneficial for datasets with millions of rows.

Q: Can CatBoost handle missing values? A: Yes. CatBoost handles missing values natively by learning optimal split directions for absent feature values during training.
