Configs · Apr 22, 2026 · 3 min read

CatBoost — High-Performance Gradient Boosting with Native Categorical Support

CatBoost is an open-source gradient boosting library by Yandex that handles categorical features natively and often delivers strong accuracy on tabular data with little tuning.

Introduction

CatBoost (Categorical Boosting) is a gradient boosting library developed by Yandex that stands out for handling categorical features natively, without manual encoding. It is a frequent choice in machine learning competitions on tabular data and is used in production systems.

What CatBoost Does

  • Trains gradient boosted decision tree models for classification, regression, and ranking tasks
  • Handles categorical features natively using ordered target statistics, eliminating the need for one-hot encoding
  • Supports GPU training for large datasets with significant speedups over CPU
  • Provides built-in model analysis tools: feature importance, SHAP values, and object importance
  • Offers a ranking mode (YetiRank, PairLogit) for learning-to-rank applications
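
The ordered target statistics idea from the list above can be illustrated with a short, simplified sketch. It is not CatBoost's implementation: it assumes the rows are already in one random permutation order, uses a hypothetical prior of 0.5, and the function name is made up for this example. The point it shows is that each row is encoded using only the targets of the rows before it, so a row's own label never leaks into its encoding.

```python
from collections import defaultdict


def ordered_target_statistics(categories, targets, prior=0.5):
    """Encode each categorical value using only the rows that precede it.

    Simplified illustration: one pass over the given row order, with the
    encoding (sum_of_previous_targets + prior) / (previous_count + 1).
    """
    sums = defaultdict(float)  # running sum of targets seen so far, per category
    counts = defaultdict(int)  # running number of rows seen so far, per category
    encoded = []
    for category, target in zip(categories, targets):
        # The current row's target is excluded; only earlier rows contribute.
        encoded.append((sums[category] + prior) / (counts[category] + 1))
        sums[category] += target
        counts[category] += 1
    return encoded


cities = ["paris", "rome", "paris", "paris", "rome"]
clicked = [1, 0, 1, 0, 1]
print(ordered_target_statistics(cities, clicked))
```

The first occurrence of each category falls back to the prior, and later occurrences drift toward the mean target observed so far for that category.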

Architecture Overview

CatBoost builds oblivious (symmetric) decision trees: every node at a given depth uses the same feature and threshold, so a leaf can be located with one comparison per level and inference reduces to fast bit operations. Training uses ordered boosting to reduce the prediction shift caused by target leakage in categorical encoding. The library is written in C++ with Python, R, Java, and command-line interfaces. GPU training uses CUDA kernels for histogram computation and tree construction, which helps it scale to large datasets.
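
To make the "inference via bit operations" point concrete, here is a minimal sketch of evaluating a single oblivious tree. The data layout (a list of per-level splits plus a flat list of leaf values), the bit order, and the use of a strict `>` comparison are assumptions made for illustration, not CatBoost's internal format.

```python
def oblivious_tree_predict(row, splits, leaf_values):
    """Evaluate one oblivious tree on one row.

    splits: list of (feature_index, threshold), one entry per depth level.
    leaf_values: list of 2 ** len(splits) leaf outputs.
    """
    leaf_index = 0
    for level, (feature, threshold) in enumerate(splits):
        # Every node on this level shares the same test, so the whole level
        # contributes exactly one bit to the leaf index.
        bit = 1 if row[feature] > threshold else 0
        leaf_index |= bit << level
    return leaf_values[leaf_index]


splits = [(0, 0.5), (2, 10.0)]        # depth-2 tree: 4 leaves
leaf_values = [0.1, 0.4, -0.2, 0.9]
print(oblivious_tree_predict([0.7, 3.0, 12.0], splits, leaf_values))  # → 0.9
```

Because there is no branching on node identity, the same loop can be applied to many rows at once, which is what makes this structure friendly to vectorized batch prediction.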

Self-Hosting & Configuration

  • Install via pip: pip install catboost (the same package covers CPU and GPU; enable GPU training with task_type='GPU' on a CUDA-capable machine)
  • Specify categorical columns: model.fit(X, y, cat_features=[0, 3, 7])
  • Key hyperparameters: iterations, learning_rate, depth, l2_leaf_reg
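  • Save models: model.save_model('model.cbm'); other formats such as ONNX and CoreML are selected with the format argument and come with restrictions (for example around categorical features)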
  • Use the Pool class (catboost.Pool) to bundle data, labels, and categorical metadata efficiently
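
The configuration points above can be put together roughly as follows. This is a sketch rather than a recommended setup: the helper names `cat_feature_indices` and `train_and_save`, the column names, and the hyperparameter values are all hypothetical, and the example uses the `Pool` class from the catboost package to carry the categorical metadata. The catboost import sits inside the training function so that the index helper can be used on its own.

```python
def cat_feature_indices(columns, categorical_names):
    """Map column names to the positional indices expected by cat_features."""
    wanted = set(categorical_names)
    return [i for i, name in enumerate(columns) if name in wanted]


def train_and_save(rows, labels, columns, categorical_names, path="model.cbm"):
    """Train a small classifier and save it in CatBoost's native format."""
    # Imported here so the helper above works without catboost installed.
    from catboost import CatBoostClassifier, Pool

    pool = Pool(
        data=rows,
        label=labels,
        cat_features=cat_feature_indices(columns, categorical_names),
    )
    model = CatBoostClassifier(
        iterations=200,       # number of trees (illustrative value)
        learning_rate=0.1,    # illustrative value
        depth=4,              # depth of each oblivious tree
        l2_leaf_reg=3.0,      # L2 regularization on leaf values
        verbose=False,
        # task_type="GPU",    # assumption: only on a CUDA-capable machine
    )
    model.fit(pool)
    model.save_model(path)
    return model


columns = ["city", "device", "age"]
print(cat_feature_indices(columns, ["city", "device"]))  # → [0, 1]
```

Calling `train_and_save(rows, labels, columns, ["city", "device"])` with `rows` as a list of lists in the same column order trains the model and writes `model.cbm`; the categorical columns are passed as raw strings, with no manual encoding step.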

Key Features

  • Native categorical feature support without preprocessing or encoding
  • Ordered boosting, which reduces the prediction shift (and the resulting overfitting) of standard gradient boosting
  • Oblivious tree structure enables fast inference and compact model files
  • Built-in cross-validation, early stopping, and overfitting detection
  • SHAP integration for interpretable model explanations out of the box

Comparison with Similar Tools

  • XGBoost: historically required manual categorical encoding (recent versions add native categorical support via enable_categorical); CatBoost handles categories natively and often needs less tuning
  • LightGBM: leaf-wise tree growth vs. CatBoost's oblivious trees; LightGBM is often faster to train, while CatBoost typically needs less preprocessing
  • scikit-learn GBM: the classic gradient boosting estimators are slower and offer fewer features; CatBoost adds GPU support and categorical handling
  • AutoGluon: AutoML framework that can use CatBoost as one of its base models
  • TabNet: deep learning approach to tabular data; CatBoost is typically faster to train and needs less tuning

FAQ

Q: Do I need to one-hot encode categorical features for CatBoost? A: No. Pass column indices via cat_features and CatBoost applies ordered target statistics internally.

Q: How does CatBoost compare to XGBoost and LightGBM in accuracy? A: On tabular benchmarks, all three are competitive. CatBoost often wins with less hyperparameter tuning, especially when the data has many categorical columns.

Q: Can CatBoost handle text features? A: Yes. CatBoost includes text feature processing: columns passed via text_features are tokenized and converted into numeric features (for example, bag-of-words estimates) during training.

Q: Is CatBoost suitable for real-time inference? A: Yes. Oblivious trees enable batch-level vectorized inference, and the C++ prediction library is fast enough for latency-sensitive applications.
