UpliftML: A Python Package for Scalable Uplift Modeling

Overview

UpliftML: A Python Package for Scalable Uplift Modeling

upliftml

UpliftML is a Python package for scalable unconstrained and constrained uplift modeling from experimental data. To accommodate working with big data, the package uses PySpark and H2O models as base learners for the uplift models. Evaluation functions expect a PySpark dataframe as input.

Uplift modeling is a family of techniques for estimating the Conditional Average Treatment Effect (CATE) from experimental or observational data using machine learning. In particular, we are interested in estimating the causal effect of a treatment T on the outcome Y of an individual characterized by features X. In experimental data with binary treatments and binary outcomes, this is equivalent to estimating Pr(Y=1 | T=1, X=x) - Pr(Y=1 | T=0, X=x).

In many practical use cases the goal is to select which users to target in order to maximize the overall uplift without exceeding a specified budget or ROI constraint. In those cases, estimating uplift alone is not sufficient to make optimal decisions and we need to take into account the costs and monetary benefit incurred by the treatment.

Uplift modeling is an emerging tool for various personalization applications. Example use cases include marketing campaigns personalization and optimization, personalized pricing in e-commerce, and clinical treatment personalization.

The UpliftML library includes PySpark/H2O implementations for the following:

  • 6 metalearner approaches for uplift modeling: T-learner[1], S-learner[1], X-learner[1], R-learner[2], class variable transformation[3], transformed outcome approach[4].
  • The Retrospective Estimation[5] technique for uplift modeling under ROI constraints.
  • Uplift and iROI-based evaluation and plotting functions with bootstrapped confidence intervals. Currently implemented: ATE, ROI, iROI, CATE per category/quantile, CATE lift, Qini/AUUC curves[6], Qini/AUUC score[6], cumulative iROI curves.

For detailed information about the package, read the UpliftML documentation.

Installation

Install the latest release from PyPI:

$ pip install upliftml

Quick Start

from upliftml.models.pyspark import TLearnerEstimator
from upliftml.evaluation import estimate_and_plot_qini
from upliftml.datasets import simulate_randomized_trial
from pyspark.ml.classification import LogisticRegression


# Read/generate the dataset and convert it to Spark if needed
df_pd = simulate_randomized_trial(n=2000, p=6, sigma=1.0, binary_outcome=True)
df_spark = spark.createDataFrame(df_pd)

# Split the data into train, validation, and test sets
df_train, df_val, df_test = df_spark.randomSplit([0.5, 0.25, 0.25])

# Preprocess the datasets (for implementation of get_features_vector, see the full example notebook)
num_features = [col for col in df_spark.columns if col.startswith('feature')]
cat_features = []
df_train_assembled = get_features_vector(df_train, num_features, cat_features)
df_val_assembled = get_features_vector(df_val, num_features, cat_features)
df_test_assembled = get_features_vector(df_test, num_features, cat_features)

# Build a two-model estimator
model = TLearnerEstimator(base_model_class=LogisticRegression,
                          base_model_params={'maxIter': 15},
                          predictors_colname='features',
                          target_colname='outcome',
                          treatment_colname='treatment',
                          treatment_value=1,
                          control_value=0)
model.fit(df_train_assembled, df_val_assembled)

# Apply the model to test data
df_test_eval = model.predict(df_test_assembled)

# Evaluate performance on the test set
qini_values, ax = estimate_and_plot_qini(df_test_eval)

For complete examples with more estimators and evaluation functions, see the demo notebooks in the examples folder.

Contributing

If interested in contributing to the package, get started by reading our contributor guidelines.

License

The project is licensed under Apache 2.0 License

Citation

If you use UpliftML, please cite it as follows:

Irene Teinemaa, Javier Albert, Nam Pham. UpliftML: A Python Package for Scalable Uplift Modeling. https://github.com/bookingcom/upliftml, 2021. Version 0.0.1.

@misc{upliftml,
  author={Irene Teinemaa, Javier Albert, Nam Pham},
  title={{UpliftML}: {A Python Package for Scalable Uplift Modeling}},
  howpublished={https://github.com/bookingcom/upliftml},
  note={Version 0.0.1},
  year={2021}
}

Resources

Documentation:

Tutorials and blog posts:

Related packages:

  • CausalML: a Python package for uplift modeling and causal inference with machine learning
  • EconML: a Python package for estimating heterogeneous treatment effects from observational data via machine learning

References

  1. Sören R. Künzel, Jasjeet S. Sekhon, Peter J. Bickel, and Bin Yu. Metalearners for estimating heterogeneous treatment effects using machine learning. Proceedings of the National Academy of Sciences, 2019.
  2. Xinkun Nie and Stefan Wager. Quasi-oracle estimation of heterogeneous treatment effects. arXiv preprint arXiv:1712.04912, 2017.
  3. Maciej Jaskowski and Szymon Jaroszewicz. Uplift modeling for clinical trial data. ICML Workshop on Clinical Data Analysis, 2012.
  4. Susan Athey and Guido W. Imbens. Machine learning methods for estimating heterogeneous causal effects. stat, 1050(5), 2015.
  5. Dmitri Goldenberg, Javier Albert, Lucas Bernardi, Pablo Estevez Castillo. Free Lunch! Retrospective Uplift Modeling for Dynamic Promotions Recommendation within ROI Constraints. In Fourteenth ACM Conference on Recommender Systems (pp. 486-491), 2020.
  6. Nicholas J Radcliffe and Patrick D Surry. Real-world uplift modelling with significance based uplift trees. White Paper tr-2011-1, Stochastic Solutions, 2011.
Owner
Booking.com
Open source projects and forks of projects we use internally (for better upstream collaboration)
Booking.com
Evaluate on three different ML model for feature selection using Breast cancer data.

Anomaly-detection-Feature-Selection Evaluate on three different ML model for feature selection using Breast cancer data. ML models: SVM, KNN and MLP.

Tarek idrees 1 Mar 17, 2022
Tools for Optuna, MLflow and the integration of both.

HPOflow - Sphinx DOC Tools for Optuna, MLflow and the integration of both. Detailed documentation with examples can be found here: Sphinx DOC Table of

Telekom Open Source Software 17 Nov 20, 2022
distfit - Probability density fitting

Python package for probability density function fitting of univariate distributions of non-censored data

Erdogan Taskesen 187 Dec 30, 2022
Getting Profit and Loss Make Easy From Binance

Getting Profit and Loss Make Easy From Binance I have been in Binance Automated Trading for some time and have generated a lot of transaction records,

17 Dec 21, 2022
A concept I came up which ditches the idea of "layers" in a neural network.

Dynet A concept I came up which ditches the idea of "layers" in a neural network. Install Copy Dynet.py to your project. Run the example Install matpl

Anik Patel 4 Dec 05, 2021
Probabilistic programming framework that facilitates objective model selection for time-varying parameter models.

Time series analysis today is an important cornerstone of quantitative science in many disciplines, including natural and life sciences as well as eco

Christoph Mark 129 Dec 24, 2022
A collection of machine learning examples and tutorials.

machine_learning_examples A collection of machine learning examples and tutorials.

LazyProgrammer.me 7.1k Jan 01, 2023
Datetimes for Humans™

Maya: Datetimes for Humans™ Datetimes are very frustrating to work with in Python, especially when dealing with different locales on different systems

Timo Furrer 3.4k Dec 28, 2022
A unified framework for machine learning with time series

Welcome to sktime A unified framework for machine learning with time series We provide specialized time series algorithms and scikit-learn compatible

The Alan Turing Institute 6k Jan 06, 2023
Project to deploy a machine learning model based on Titanic dataset from Kaggle

kaggle_titanic_deploy Project to deploy a machine learning model based on Titanic dataset from Kaggle In this project we used the Titanic dataset from

Vivian Yamassaki 8 May 23, 2022
PySurvival is an open source python package for Survival Analysis modeling

PySurvival What is Pysurvival ? PySurvival is an open source python package for Survival Analysis modeling - the modeling concept used to analyze or p

Square 265 Dec 27, 2022
A Python step-by-step primer for Machine Learning and Optimization

early-ML Presentation General Machine Learning tutorials A Python step-by-step primer for Machine Learning and Optimization This github repository gat

Dimitri Bettebghor 8 Dec 01, 2022
Python module for machine learning time series:

seglearn Seglearn is a python package for machine learning time series or sequences. It provides an integrated pipeline for segmentation, feature extr

David Burns 536 Dec 29, 2022
Microsoft contributing libraries, tools, recipes, sample codes and workshop contents for machine learning & deep learning.

Microsoft contributing libraries, tools, recipes, sample codes and workshop contents for machine learning & deep learning.

Microsoft 366 Jan 03, 2023
This machine learning model was developed for House Prices

This machine learning model was developed for House Prices - Advanced Regression Techniques competition in Kaggle by using several machine learning models such as Random Forest, XGBoost and LightGBM.

serhat_derya 1 Mar 02, 2022
High performance, easy-to-use, and scalable machine learning (ML) package, including linear model (LR), factorization machines (FM), and field-aware factorization machines (FFM) for Python and CLI interface.

What is xLearn? xLearn is a high performance, easy-to-use, and scalable machine learning package that contains linear model (LR), factorization machin

Chao Ma 3k Jan 08, 2023
Simple, fast, and parallelized symbolic regression in Python/Julia via regularized evolution and simulated annealing

Parallelized symbolic regression built on Julia, and interfaced by Python. Uses regularized evolution, simulated annealing, and gradient-free optimization.

Miles Cranmer 924 Jan 03, 2023
A complete guide to start and improve in machine learning (ML)

A complete guide to start and improve in machine learning (ML), artificial intelligence (AI) in 2021 without ANY background in the field and stay up-to-date with the latest news and state-of-the-art

Louis-François Bouchard 3.3k Jan 04, 2023
Automatically create Faiss knn indices with the most optimal similarity search parameters.

It selects the best indexing parameters to achieve the highest recalls given memory and query speed constraints.

Criteo 419 Jan 01, 2023
A statistical library designed to fill the void in Python's time series analysis capabilities, including the equivalent of R's auto.arima function.

pmdarima Pmdarima (originally pyramid-arima, for the anagram of 'py' + 'arima') is a statistical library designed to fill the void in Python's time se

alkaline-ml 1.3k Jan 06, 2023