CinnaMon is a Python library which offers a number of tools to detect, explain, and correct data drift in a machine learning system

Last update: Dec 28, 2022

Overview

CinnaMon

CinnaMon is a Python library that offers tools to detect, explain, and correct data drift in a machine learning system. At its core, CinnaMon lets you study data drift between two given datasets. It is particularly useful in a monitoring context, where the first dataset is the training (or validation) data and the second is the production data.

⚡️ Quickstart

As a quick example, let's illustrate the use of CinnaMon on the breast cancer data, where we deliberately introduce some data drift.

Set up the data and build a model

>>> import pandas as pd
>>> from sklearn import datasets
>>> from sklearn.model_selection import train_test_split
>>> from xgboost import XGBClassifier

# load breast cancer data
>>> dataset = datasets.load_breast_cancer()
>>> X = pd.DataFrame(dataset.data, columns = dataset.feature_names)
>>> y = dataset.target

# split data into train and valid datasets
>>> X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.3, random_state=2021)

# introduce some data drift in valid by filtering on the 'worst symmetry' feature
>>> y_valid = y_valid[X_valid['worst symmetry'].values > 0.3]
>>> X_valid = X_valid.loc[X_valid['worst symmetry'].values > 0.3, :].copy()

# fit an XGBClassifier on the training data
>>> clf = XGBClassifier(use_label_encoder=False)
>>> clf.fit(X=X_train, y=y_train, verbose=10)

Initialize ModelDriftExplainer and fit it on train and validation data

>>> from cinnamon.drift import ModelDriftExplainer

# initialize a drift explainer with the built XGBClassifier and fit it on train
# and valid data
>>> drift_explainer = ModelDriftExplainer(model=clf)
>>> drift_explainer.fit(X1=X_train, X2=X_valid, y1=y_train, y2=y_valid)

Detect data drift by looking at main graphs and metrics

# Distribution of logit predictions
>>> drift_explainer.plot_prediction_drift(bins=15)

We can see on this graph that, because of the data drift we introduced in the validation data, the distributions of predictions differ (they do not overlap well). We can also compute the corresponding drift metrics:

# Corresponding metrics
>>> drift_explainer.get_prediction_drift()
[{'mean_difference': -3.643428434667366,
  'wasserstein': 3.643428434667366,
  'kolmogorov_smirnov': KstestResult(statistic=0.2913775225333014, pvalue=0.00013914094110123454)}]
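
To make the three metrics more concrete, here is a minimal pure-Python sketch of what each one measures. It is not CinnaMon's implementation: the helper names and the toy prediction values are hypothetical, and the Wasserstein helper is simplified to samples of equal size.

```python
from bisect import bisect_right


def mean_difference(sample1, sample2):
    """Mean of sample2 minus mean of sample1."""
    return sum(sample2) / len(sample2) - sum(sample1) / len(sample1)


def wasserstein_equal_size(sample1, sample2):
    """1-D Wasserstein-1 distance for two samples of the same size."""
    if len(sample1) != len(sample2):
        raise ValueError("this simplified version needs equal sample sizes")
    pairs = zip(sorted(sample1), sorted(sample2))
    return sum(abs(a - b) for a, b in pairs) / len(sample1)


def ks_statistic(sample1, sample2):
    """Largest gap between the two empirical CDFs."""
    s1, s2 = sorted(sample1), sorted(sample2)
    gaps = (
        abs(bisect_right(s1, x) / len(s1) - bisect_right(s2, x) / len(s2))
        for x in s1 + s2
    )
    return max(gaps)


# Hypothetical prediction scores: the second set is the first shifted down by 2.
preds_train = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
preds_prod = [p - 2.0 for p in preds_train]

print(mean_difference(preds_train, preds_prod))         # -2.0
print(wasserstein_equal_size(preds_train, preds_prod))  # 2.0
print(ks_statistic(preds_train, preds_prod))
```

In this toy case the Wasserstein distance equals the absolute mean difference, as it does in the output above. That happens whenever every quantile of the predictions moves in the same direction.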

Comparing the distributions of predictions for the two datasets is one of the main indicators we use to detect data drift. The two other indicators are:

- distribution of the target (see get_target_drift)
- performance metrics (see get_performance_metrics_drift)
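
As a rough illustration of these two indicators, the sketch below compares the share of positive labels and the accuracy between two datasets. The labels, predictions, and helper names are hypothetical, and the code does not call get_target_drift or get_performance_metrics_drift; it only shows the kind of comparison involved.

```python
def positive_rate(labels):
    """Share of samples whose label is 1."""
    return sum(labels) / len(labels)


def accuracy(labels, predictions):
    """Fraction of predictions that match the labels."""
    hits = sum(1 for y, p in zip(labels, predictions) if y == p)
    return hits / len(labels)


# Hypothetical labels and hard predictions for two datasets.
y1, pred1 = [1, 0, 1, 1, 0, 1, 0, 1], [1, 0, 1, 1, 0, 1, 1, 1]
y2, pred2 = [0, 0, 1, 0, 0, 1, 0, 0], [1, 0, 1, 0, 1, 1, 0, 1]

# Negative values mean the quantity is lower on the second dataset.
target_drift = positive_rate(y2) - positive_rate(y1)
accuracy_drift = accuracy(y2, pred2) - accuracy(y1, pred1)

print(f"positive rate: {positive_rate(y1):.3f} -> {positive_rate(y2):.3f}")
print(f"accuracy:      {accuracy(y1, pred1):.3f} -> {accuracy(y2, pred2):.3f}")
```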

Explain data drift by computing the drift values

Drift values can be thought of as the equivalent of feature importance, but in terms of data drift.

# plot drift values
>>> drift_explainer.plot_tree_based_drift_values(n=7)

Here the feature worst symmetry is rightly identified as the one which contributes the most to the data drift.
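
For intuition only, the sketch below ranks features by a much cruder, model-free proxy: the absolute shift of each feature's mean, in units of its standard deviation on the reference data. This is not the tree-based drift value computed by CinnaMon. The synthetic features "a" and "b" and the helper name are hypothetical.

```python
import random
from statistics import mean, stdev


def standardized_mean_shift(values1, values2):
    """Absolute mean shift, in units of the first sample's standard deviation."""
    return abs(mean(values2) - mean(values1)) / stdev(values1)


rng = random.Random(0)
# Hypothetical reference data with two independent features.
reference = [{"a": rng.random(), "b": rng.gauss(0, 1)} for _ in range(1000)]
# Simulate drift the same way as the quickstart: keep only rows above a threshold.
drifted = [row for row in reference if row["a"] > 0.5]

scores = {
    name: standardized_mean_shift(
        [row[name] for row in reference], [row[name] for row in drifted]
    )
    for name in ("a", "b")
}
# Features sorted from most to least drifted according to this proxy.
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Unlike drift values, this proxy ignores the model entirely, so it cannot tell whether a shifted feature actually affects the predictions.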

See "notes" below to explore all the functionalities of CinnaMon.

🛠 Installation

CinnaMon is intended to work with Python 3.9 or above. Installation can be done with pip:

pip install cinnamon
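
After installing, a quick check that the package and the modules it imports can be found may save some debugging. The traceback reported in the comments further down shows that importing cinnamon.drift also imports xgboost and treelib, at least in the version used there. The helper name below is hypothetical; the sketch relies only on the standard library.

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]


# Module list taken from the traceback in the comments; adjust as needed.
missing = missing_modules(["cinnamon", "xgboost", "treelib"])
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All modules found.")
```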

🔗 Notes

The two main classes of CinnaMon are ModelDriftExplainer and AdversarialDriftExplainer.
ModelDriftExplainer currently only supports XGBoost models (both regression and classification are supported).
See notebooks in the examples/ directory to have an overview of all functionalities. Notably:
- Covariate shift example with IEEE data
- Concept drift example with IEEE data
These two notebooks also go deeper into the topic of how to correct data drift, making use of AdversarialDriftExplainer
See also the slide presentation of the CinnaMon library.
There is no formal documentation for CinnaMon yet, but docstrings are up to date for the two main classes.
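
As background on drift correction, one common approach to covariate shift is importance weighting: each training sample is weighted by how much more (or less) frequent similar samples are in production. The sketch below estimates such weights from histogram bins on a single feature. It is an assumption-laden illustration, not the method used by AdversarialDriftExplainer; the data, the bin edge, and the function name are hypothetical.

```python
from collections import Counter


def bin_weights(train_values, prod_values, edges):
    """Per-sample weights for train: share of prod in the bin / share of train."""
    def bin_index(x):
        return sum(1 for e in edges if x >= e)

    train_bins = [bin_index(x) for x in train_values]
    train_counts = Counter(train_bins)
    prod_counts = Counter(bin_index(x) for x in prod_values)
    n_train, n_prod = len(train_values), len(prod_values)
    return [
        (prod_counts[b] / n_prod) / (train_counts[b] / n_train)
        for b in train_bins
    ]


# Hypothetical 1-D feature: production over-represents large values.
train = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7]
prod = [0.15, 0.55, 0.65, 0.75, 0.8, 0.9]
weights = bin_weights(train, prod, edges=[0.5])

# After weighting, the share of "large" training values matches production.
weighted_high_share = sum(w for x, w in zip(train, weights) if x >= 0.5) / sum(weights)
print(weighted_high_share)
```

Here 5 of the 6 production values fall above 0.5, and the weighted training data reproduces that share.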

👍 Contributing

Check out the contribution section.

📝 License

CinnaMon is free and open-source software licensed under the MIT License.

Comments

Some feedback and some questions

Hi!

This looks like a great project! I have a few concerns about using a hypothesis-based test for comparison of drift. The reason being: how do you account for the multiple comparisons problem? https://en.wikipedia.org/wiki/Multiple_comparisons_problem

You do get some more explanatory power by looking at the plots, to be sure. I was thinking maybe you could include some permutation tests to deal with this, instead of relying on KS? Here is a reference: http://sia.webpopix.org/statisticalTests2.html and here is some in Python: https://ericschles.github.io/cuny_intro_to_ds_book/12/1/AB_Testing.html?highlight=permutation (important to note: even though this is my teaching resource, it is lifted from some content from Berkeley).

Anyway, great job!
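
For readers unfamiliar with the permutation test suggested in the comment above, a minimal sketch for a difference in means could look like the following. The function name, its parameters, and the synthetic samples are hypothetical, and this is not part of CinnaMon.

```python
import random


def permutation_test(sample1, sample2, n_permutations=2000, seed=0):
    """Two-sided permutation p-value for a difference in means."""
    rng = random.Random(seed)
    observed = abs(sum(sample2) / len(sample2) - sum(sample1) / len(sample1))
    pooled = list(sample1) + list(sample2)
    n1 = len(sample1)
    extreme = 0
    for _ in range(n_permutations):
        # Shuffle the pooled values and split them back into two groups.
        rng.shuffle(pooled)
        left, right = pooled[:n1], pooled[n1:]
        diff = abs(sum(right) / len(right) - sum(left) / len(left))
        if diff >= observed:
            extreme += 1
    # Add-one correction keeps the estimate strictly positive.
    return (extreme + 1) / (n_permutations + 1)


data_rng = random.Random(1)
baseline = [data_rng.gauss(0, 1) for _ in range(40)]
shifted = [data_rng.gauss(3, 1) for _ in range(40)]

p_shifted = permutation_test(baseline, shifted)
p_same = permutation_test(baseline, baseline)
print(p_shifted, p_same)
```

When many features are tested this way, the resulting p-values still need a multiple-testing adjustment of some kind; the permutation test alone does not provide one.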

opened by EricSchles 3
error after trying to execute the command: "from cinnamon.drift import ModelDriftExplainer"

[1] I am getting the following error when trying to execute code from Quickstart or [breast_cancer_xgboost_binary_classif.ipynb] in a section containing "from cinnamon.drift import ModelDriftExplainer":

ModuleNotFoundError                       Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_10348/627594479.py in <module>
      1 # Initialize ModelDriftExplainer and fit it on train and validation data
----> 2 from cinnamon.drift import ModelDriftExplainer
      3
      4 # initialize a drift explainer with the built XGBClassifier and fit it on train
      5 # and valid data

~\AppData\Roaming\Python\Python39\site-packages\cinnamon\drift\__init__.py in <module>
      1 from .adversarial_drift_explainer import AdversarialDriftExplainer
----> 2 from .model_drift_explainer import ModelDriftExplainer

~\AppData\Roaming\Python\Python39\site-packages\cinnamon\drift\model_drift_explainer.py in <module>
      7 from ..model_parser.i_model_parser import IModelParser
      8 from .adversarial_drift_explainer import AdversarialDriftExplainer
----> 9 from ..model_parser.xgboost_parser import XGBoostParser
     10
     11 from .drift_utils import compute_drift_num, plot_drift_num

~\AppData\Roaming\Python\Python39\site-packages\cinnamon\model_parser\xgboost_parser.py in <module>
      2 import pandas as pd
      3 from typing import Tuple
----> 4 from .single_tree import BinaryTree
      5 import xgboost
      6 from .abstract_tree_ensemble_parser import AbstractTreeEnsembleParser

~\AppData\Roaming\Python\Python39\site-packages\cinnamon\model_parser\single_tree.py in <module>
      1 import numpy as np
----> 2 from treelib import Tree
      3 from ..common.constants import TreeBasedDriftValueType
      4
      5 class BinaryTree:

ModuleNotFoundError: No module named 'treelib'

[2] When I'm executing the code chunk "# fit an XGBClassifier on the training data" from "Quickstart" I've got this warning:

[20:53:12] WARNING: C:/Users/Administrator/workspace/xgboost-win64_release_1.5.1/src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, enable_categorical=False,
              gamma=0, gpu_id=-1, importance_type=None,
              interaction_constraints='', learning_rate=0.300000012,
              max_delta_step=0, max_depth=6, min_child_weight=1, missing=nan,
              monotone_constraints='()', n_estimators=100, n_jobs=6,
              num_parallel_tree=1, predictor='auto', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', use_label_encoder=False,
              validate_parameters=1, verbosity=None)

I use Python 3.8.8 on Windows 10, on an AMD Ryzen machine with integrated AMD graphics. Environment: Anaconda.

opened by tomaszek0 2
build(deps): bump pillow from 8.4.0 to 9.0.0
Bumps pillow from 8.4.0 to 9.0.0.

Release notes

Sourced from pillow's releases.

9.0.0

https://pillow.readthedocs.io/en/stable/releasenotes/9.0.0.html

Changes

Restrict builtins for ImageMath.eval() #5923 [@radarhere]

Ensure JpegImagePlugin stops at the end of a truncated file #5921 [@radarhere]

Fixed ImagePath.Path array handling #5920 [@radarhere]

Remove consecutive duplicate tiles that only differ by their offset #5919 [@radarhere]

Removed redundant part of condition #5915 [@radarhere]

Explicitly enable strip chopping for large uncompressed TIFFs #5517 [@kmilos]

Use the Windows method to get TCL functions on Cygwin #5807 [@DWesl]

Changed error type to allow for incremental WebP parsing #5404 [@radarhere]

Improved I;16 operations on big endian #5901 [@radarhere]

Ensure that BMP pixel data offset does not ignore palette #5899 [@radarhere]

Limit quantized palette to number of colors #5879 [@radarhere]

Use latin1 encoding to decode bytes #5870 [@radarhere]

Fixed palette index for zeroed color in FASTOCTREE quantize #5869 [@radarhere]

When saving RGBA to GIF, make use of first transparent palette entry #5859 [@radarhere]

Pass SAMPLEFORMAT to libtiff #5848 [@radarhere]

Added rounding when converting P and PA #5824 [@radarhere]

Improved putdata() documentation and data handling #5910 [@radarhere]

Exclude carriage return in PDF regex to help prevent ReDoS #5912 [@radarhere]

Image.NONE is only used for resampling and dithers #5908 [@radarhere]

Fixed freeing pointer in ImageDraw.Outline.transform #5909 [@radarhere]

Add Tidelift alignment action and badge #5763 [@aclark4life]

Replaced further direct invocations of setup.py #5906 [@radarhere]

Added ImageShow support for xdg-open #5897 [@m-shinder]

Fixed typo #5902 [@radarhere]

Switched from deprecated "setup.py install" to "pip install ." #5896 [@radarhere]

Support 16-bit grayscale ImageQt conversion #5856 [@cmbruns]

Fixed raising OSError in _safe_read when size is greater than SAFEBLOCK #5872 [@radarhere]

Convert subsequent GIF frames to RGB or RGBA #5857 [@radarhere]

WebP: Fix memory leak during decoding on failure #5798 [@ilai-deutel]

Do not prematurely return in ImageFile when saving to stdout #5665 [@infmagic2047]

Added support for top right and bottom right TGA orientations #5829 [@radarhere]

Corrected ICNS file length in header #5845 [@radarhere]

Block tile TIFF tags when saving #5839 [@radarhere]

Added line width argument to ImageDraw polygon #5694 [@radarhere]

Do not redeclare class each time when converting to NumPy #5844 [@radarhere]

Only prevent repeated polygon pixels when drawing with transparency #5835 [@radarhere]

Fix pushes_fd method signature #5833 [@hoodmane]

Add support for pickling TrueType fonts #5826 [@hugovk]

Only prefer command line tools SDK on macOS over default MacOSX SDK #5828 [@radarhere]

Fix compilation on 64-bit Termux #5793 [@landfillbaby]

Replace 'setup.py sdist' with '-m build --sdist' #5785 [@hugovk]

Use declarative package configuration #5784 [@hugovk]

Use title for display in ImageShow #5788 [@radarhere]

Fix for PyQt6 #5775 [@hugovk]

... (truncated)

Changelog

Sourced from pillow's changelog.

9.0.0 (2022-01-02)

Restrict builtins for ImageMath.eval(). CVE-2022-22817 #5923 [radarhere]

Ensure JpegImagePlugin stops at the end of a truncated file #5921 [radarhere]

Fixed ImagePath.Path array handling. CVE-2022-22815, CVE-2022-22816 #5920 [radarhere]

Remove consecutive duplicate tiles that only differ by their offset #5919 [radarhere]

Improved I;16 operations on big endian #5901 [radarhere]

Limit quantized palette to number of colors #5879 [radarhere]

Fixed palette index for zeroed color in FASTOCTREE quantize #5869 [radarhere]

When saving RGBA to GIF, make use of first transparent palette entry #5859 [radarhere]

Pass SAMPLEFORMAT to libtiff #5848 [radarhere]

Added rounding when converting P and PA #5824 [radarhere]

Improved putdata() documentation and data handling #5910 [radarhere]

Exclude carriage return in PDF regex to help prevent ReDoS #5912 [hugovk]

Fixed freeing pointer in ImageDraw.Outline.transform #5909 [radarhere]

Added ImageShow support for xdg-open #5897 [m-shinder, radarhere]

Support 16-bit grayscale ImageQt conversion #5856 [cmbruns, radarhere]

Convert subsequent GIF frames to RGB or RGBA #5857 [radarhere]

... (truncated)

Commits

82541b6 9.0.0 version bump

cae5ac4 Merge pull request #5924 from radarhere/cves

ed4cf78 CVEs TBD

d7f60d1 Merge pull request #5923 from radarhere/imagemath_eval

8531b01 Restrict builtins for ImageMath.eval

1efb1d9 Merge pull request #5922 from radarhere/releasenotes

f6c7871 Added release notes for #5919, #5920 and #5921

032d2dc Update CHANGES.rst [ci skip]

baae9ec Merge pull request #5921 from radarhere/jpeg_eoi

1059eb5 If appended EOI did not work, do not keep trying

Additional commits viewable in compare view

dependencies
opened by dependabot[bot] 1
TypeError: predict() got an unexpected keyword argument 'iteration_range'
Hi cinnamon team, Firstly, thanks for bringing such a cool package!

I was working with your package and I have come across the following error. Then, I checked your example notebook examples/boston_XGBoost_ModelDriftExplainer.ipynb, to be sure whether I used it correctly, but got the same error:

TypeError: predict() got an unexpected keyword argument 'iteration_range'

Could you please let me know how to overcome this issue (maybe I am using an obsolete version of a package)?

Environment details:

macOS v.12.1

Python 3.8.8

cinnamon==0.1.2

xgboost==1.4.2

Thanks for your help in advance!
opened by furkanmtorun 0

Releases(0.2)

0.2(Dec 9, 2022)
Update to “ModelDriftExplainer”:

Add model agnostic support (deals with black box models / pipelines)

Add model specific support for CatBoost

Add support for categorical features

Add support for prediction_type = “class”

Create a documentation website.

Owner

Zelros

IA for Augmented Insurers
