Bonsai: Gradient Boosted Trees + Bayesian Optimization

Overview

Bonsai: Gradient Boosted Trees + Bayesian Optimization

Bonsai is a wrapper for the XGBoost and Catboost model training pipelines that leverages Bayesian optimization for computationally efficient hyperparameter tuning.

Despite being a very small package, it has access to nearly all of the configurable parameters in XGBoost and CatBoost as well as the BayesianOptimization package allowing users to specify unique objectives, metrics, parameter search ranges, and search policies. This is made possible thanks to the strong similarities between both libraries.

$ pip install bonsai-tree

References/Dependencies:

Why use Bonsai?

Grid search and random search are the most commonly used algorithms for exploring the hyperparameter space for a wide range of machine learning models. While effective for optimizing over low dimensional hyperparameter spaces (ex: few regularization terms), these methods do not scale well to models with a large number of hyperparameters such as gradient boosted trees.

Bayesian optimization on the other hand dynamically samples from the hyperparameter space with the goal of minimizing uncertaintly about the underlying objective function. For the case of model optimization, this consists of iteratively building a prior distribution of functions over the hyperparameter space and sampling with the goal of minimizing the posterior variance of the loss surface (via Gaussian Processes).

Model Configuration

Since Bonsai is simply a wrapper for both XGBoost and CatBoost, the model_params dict is synonymous with the params argument for both catboost.fit() and xgboost.fit(). Additionally, you must encode your categorical features as usual depending on which library you are using (XGB: One-Hot, CB: Label).

Below is a simple example of binary classification using CatBoost:

# label encoded training data
X = train.drop(target, axis = 1)
y = train[target]

# same args as catboost.train(...)
model_params = dict(objective = 'Logloss', verbose = False)

# same args as catboost.cv(...)
cv_params = dict(nfold = 5)

The pbounds dict as seen below specifies the hyperparameter bounds over which the optimizer will search. Additionally, the opt_config dictionary is for configuring the optimizer itself. Refer to the BayesianOptimization documentation to learn more.

# defining parameter search ranges
pbounds = dict(
  eta = (0.15, 0.4), 
  n_estimators = (200,2000), 
  max_depth = (4, 8)
)

# 10 warm up samples + 10 optimizing steps
n_iter, init_points= 10, 10

# to learn more about customizing your search policy:
# BayesianOptimization/examples/exploitation_vs_exploration.ipynb
opt_config = dict(acq = 'ei', xi = 1e-2)

Tuning and Prediction

All that is left is to initialize and optimize.

from bonsai.tune import CB_Tuner

# note that 'cats' is a list of categorical feature names
tuner = CB_Tuner(X, y, cats, model_params, cv_params, pbounds)
tuner.optimize(n_iter, init_points, opt_config, bounds_transformer)

After the optimal parameters are found, the model is trained and stored internally giving full access to the CatBoost model.

test_pool = catboost.Pool(test, cat_features = cats)
preds = tuner.model.predict(test_pool, prediction_type = 'Probability')

Bonsai also comes with a parallel coordinates plotting functionality allowing users to further narrow down their parameter search ranges as needed.

from bonsai.utils import parallel_coordinates

# DataFrame with hyperparams and observed loss
results = tuner.opt_results
parallel_coordinates(results)

Owner
Landon Buechner
Scikit-Learn useful pre-defined Pipelines Hub

Scikit-Pipes Scikit-Learn useful pre-defined Pipelines Hub Usage: Install scikit-pipes It's advised to install sklearn-genetic using a virtual env, in

Rodrigo Arenas 1 Apr 26, 2022
Diabetes Prediction with Logistic Regression

Diabetes Prediction with Logistic Regression Exploratory Data Analysis Data Preprocessing Model & Prediction Model Evaluation Model Validation: Holdou

AZİZE SULTAN PALALI 2 Oct 23, 2021
Machine Learning toolbox for Humans

Reproducible Experiment Platform (REP) REP is ipython-based environment for conducting data-driven research in a consistent and reproducible way. Main

Yandex 663 Dec 31, 2022
jaxfg - Factor graph-based nonlinear optimization library for JAX.

Factor graphs + nonlinear optimization in JAX

Brent Yi 134 Dec 21, 2022
The Ultimate FREE Machine Learning Study Plan

The Ultimate FREE Machine Learning Study Plan

Patrick Loeber (Python Engineer) 2.5k Jan 05, 2023
A demo project to elaborate how Machine Learn Models are deployed on production using Flask API

This is a salary prediction website developed with the help of machine learning, this makes prediction of salary on basis of few parameters like interview score, experience test score.

1 Feb 10, 2022
Markov bot - A Writing bot based on Markov Chain for Data Structure Lab

基于马尔可夫链的写作机器人 前端 用html/css完成 Demo展示(已给出文本的相应展示) 用户提供相关的语料库后训练的成果 后端 要完成的几个接口 解析文

DysprosiumDy 9 May 05, 2022
Decision tree is the most powerful and popular tool for classification and prediction

Diabetes Prediction Using Decision Tree Introduction Decision tree is the most powerful and popular tool for classification and prediction. A Decision

Arjun U 1 Jan 23, 2022
Predicting India’s COVID-19 Third Wave with LSTM

Predicting India’s COVID-19 Third Wave with LSTM Complete project of predicting new COVID-19 cases in the next 90 days with LSTM India is seeing a ste

Samrat Dutta 4 Jan 27, 2022
Python Extreme Learning Machine (ELM) is a machine learning technique used for classification/regression tasks.

Python Extreme Learning Machine (ELM) Python Extreme Learning Machine (ELM) is a machine learning technique used for classification/regression tasks.

Augusto Almeida 84 Nov 25, 2022
LILLIE: Information Extraction and Database Integration Using Linguistics and Learning-Based Algorithms

LILLIE: Information Extraction and Database Integration Using Linguistics and Learning-Based Algorithms Based on the work by Smith et al. (2021) Query

5 Aug 06, 2022
A collection of interactive machine-learning experiments: 🏋️models training + 🎨models demo

🤖 Interactive Machine Learning experiments: 🏋️models training + 🎨models demo

Oleksii Trekhleb 1.4k Jan 06, 2023
ThunderGBM: Fast GBDTs and Random Forests on GPUs

Documentations | Installation | Parameters | Python (scikit-learn) interface What's new? ThunderGBM won 2019 Best Paper Award from IEEE Transactions o

Xtra Computing Group 648 Dec 16, 2022
Adaptive: parallel active learning of mathematical functions

adaptive Adaptive: parallel active learning of mathematical functions. adaptive is an open-source Python library designed to make adaptive parallel fu

741 Dec 27, 2022
A Time Series Library for Apache Spark

Flint: A Time Series Library for Apache Spark The ability to analyze time series data at scale is critical for the success of finance and IoT applicat

Two Sigma 970 Jan 04, 2023
AutoTabular automates machine learning tasks enabling you to easily achieve strong predictive performance in your applications.

AutoTabular automates machine learning tasks enabling you to easily achieve strong predictive performance in your applications. With just a few lines of code, you can train and deploy high-accuracy m

Robin 55 Dec 27, 2022
GAM timeseries modeling with auto-changepoint detection. Inspired by Facebook Prophet and implemented in PyMC3

pm-prophet Pymc3-based universal time series prediction and decomposition library (inspired by Facebook Prophet). However, while Faceook prophet is a

Luca Giacomel 314 Dec 25, 2022
A toolbox to iNNvestigate neural networks' predictions!

iNNvestigate neural networks! Table of contents Introduction Installation Usage and Examples More documentation Contributing Releases Introduction In

Maximilian Alber 1.1k Jan 05, 2023
Xeasy-ml is a packaged machine learning framework.

xeasy-ml 1. What is xeasy-ml Xeasy-ml is a packaged machine learning framework. It allows a beginner to quickly build a machine learning model and use

9 Mar 14, 2022
Feature-engine is a Python library with multiple transformers to engineer and select features for use in machine learning models.

Feature-engine is a Python library with multiple transformers to engineer and select features for use in machine learning models. Feature-engine's transformers follow scikit-learn's functionality wit

Soledad Galli 33 Dec 27, 2022