pure-predict: Machine learning prediction in pure Python

pure-predict speeds up and slims down machine learning prediction applications. It is a foundational tool for serverless inference or small batch prediction with popular machine learning frameworks like scikit-learn and fasttext. It implements the predict methods of these frameworks in pure Python.

Primary Use Cases

The primary use case for pure-predict is the following scenario:

  1. A model is trained in an environment without strong container footprint constraints. Perhaps a long running "offline" job on one or many machines where installing a number of python packages from PyPI is not at all problematic.
  2. At prediction time the model needs to be served behind an API. Typical access patterns are to request a prediction for one "record" (one "row" in a numpy array or one string of text to classify) per request or a mini-batch of records per request.
  3. Preferred infrastructure for the prediction service is either serverless (AWS Lambda) or a container service where the memory footprint of the container is constrained.
  4. The fitted model object's artifacts needed for prediction (coefficients, weights, vocabulary, decision tree artifacts, etc.) are relatively small (10s to 100s of MBs).

In this scenario, a container service with a large dependency footprint can be overkill for a microservice, particularly if the access patterns favor the pricing model of a serverless application. Additionally, for smaller models and single record predictions per request, the numpy and scipy functionality in the prediction methods of popular machine learning frameworks can work against the application's latency, in some cases underperforming pure python.
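As a rough illustration of the scenario above, a serverless function might load a pickled pure-predict model once and reuse it across requests. This is only a sketch under assumptions: the `handler(event, context)` signature follows the usual AWS Lambda convention for Python, while the `model.pkl` path, the JSON body layout, and the `records` key are hypothetical choices made for this example.

```python
import json
import pickle

_MODEL = None  # cached so that warm invocations skip the unpickling step


def load_model(path="model.pkl"):
    """Load the pickled model once (the path is a hypothetical example)."""
    global _MODEL
    if _MODEL is None:
        with open(path, "rb") as f:
            _MODEL = pickle.load(f)
    return _MODEL


def handler(event, context=None, model=None):
    """Predict for the mini-batch of records found in the request body.

    Assumes a JSON body shaped like {"records": [[...], [...]]}; this layout
    is an assumption of the sketch, not something pure-predict prescribes.
    """
    model = model if model is not None else load_model()
    records = json.loads(event["body"])["records"]
    predictions = model.predict(records)
    return {
        "statusCode": 200,
        "body": json.dumps({"predictions": list(predictions)}),
    }
```

The optional `model` argument exists only so the handler can be exercised locally with a stand-in object before a real model file is deployed.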

Check out the blog post for more information on the motivation and use cases of pure-predict.

Package Details

pure-predict is a Python package for machine learning prediction distributed under the Apache 2.0 software license. It contains multiple subpackages which mirror their open source counterparts (scikit-learn, fasttext, etc.). Each subpackage has utilities to convert a fitted machine learning model into a custom object whose prediction methods mirror their native counterparts but are implemented in pure python. Additionally, all relevant model artifacts needed for prediction are converted to pure python.

A pure-predict model object can then be pickled and later unpickled without any 3rd party dependencies other than pure-predict.

This eliminates the need to have large dependency packages installed in order to make predictions with fitted machine learning models using popular open source packages for training models. These dependencies (numpy, scipy, scikit-learn, fasttext, etc.) are large in size and not always necessary to make fast and accurate predictions. Additionally, they rely on C extensions that may not be ideal for serverless applications with a python runtime.
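To make the idea of "prediction methods converted to pure python" concrete, here is a minimal sketch of a linear classifier whose artifacts are plain lists. It is not pure-predict's actual implementation; the class name `PureLinearClassifier` and its layout (one weight row per class) are assumptions made for illustration.

```python
class PureLinearClassifier:
    """Illustrative linear classifier that stores only built-in Python types."""

    def __init__(self, coef, intercept, classes):
        self.coef = coef            # list of weight rows, one per class
        self.intercept = intercept  # list of biases, one per class
        self.classes = classes      # list of class labels

    def decision_function(self, X):
        # One score per class for every record: dot(weights, x) + bias.
        return [
            [
                sum(w * v for w, v in zip(weights, x)) + bias
                for weights, bias in zip(self.coef, self.intercept)
            ]
            for x in X
        ]

    def predict(self, X):
        # Pick the label whose score is highest for each record.
        labels = []
        for scores in self.decision_function(X):
            best = max(range(len(scores)), key=scores.__getitem__)
            labels.append(self.classes[best])
        return labels
```

Because such an object holds only lists, floats, and strings, unpickling it needs nothing beyond the module that defines the class, which is the property the paragraph above describes.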

Quick Start Example

In a python environment with scikit-learn and its dependencies installed:

import pickle

from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from pure_sklearn.map import convert_estimator

# fit sklearn estimator
X, y = load_iris(return_X_y=True)
clf = RandomForestClassifier()
clf.fit(X, y)

# convert to pure python estimator
clf_pure_predict = convert_estimator(clf)
with open("model.pkl", "wb") as f:
    pickle.dump(clf_pure_predict, f)

# make prediction with sklearn estimator
y_pred = clf.predict([[0.25, 2.0, 8.3, 1.0]])
print(y_pred)
[2]

In a python environment with only pure-predict installed:

import pickle

# load pickled model
with open("model.pkl", "rb") as f:
    clf = pickle.load(f)

# make prediction with pure-predict object
y_pred = clf.predict([[0.25, 2.0, 8.3, 1.0]])
print(y_pred)
[2]
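To check whether the converted object actually helps with single record latency in a given setup, a small timing helper can be applied to both objects. The sketch below uses only the standard library; the name `mean_latency_ms` is hypothetical, and it only assumes the object exposes a `predict` method that accepts a list of records.

```python
import timeit


def mean_latency_ms(model, record, repeats=1000):
    """Average wall-clock time of one single-record predict call, in ms."""
    total_seconds = timeit.timeit(
        lambda: model.predict([record]), number=repeats
    )
    return 1000.0 * total_seconds / repeats
```

In the first environment above, calling `mean_latency_ms(clf, [0.25, 2.0, 8.3, 1.0])` and `mean_latency_ms(clf_pure_predict, [0.25, 2.0, 8.3, 1.0])` would give a rough comparison; results depend on the model, hardware, and library versions.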

Subpackages

pure_sklearn

Prediction in pure python for a subset of scikit-learn estimators and transformers.

  • estimators
    • linear models - supports the majority of linear models for classification
    • trees - decision trees, random forests, gradient boosting and xgboost
    • naive bayes - a number of popular naive bayes classifiers
    • svm - linear SVC
  • transformers
    • preprocessing - normalization and onehot/ordinal encoders
    • impute - simple imputation
    • feature extraction - text (tfidf, count vectorizer, hashing vectorizer) and dictionary vectorization
    • pipeline - pipelines and feature unions
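As an illustration of what tree prediction without numpy can look like, the sketch below walks a decision tree stored as parallel lists. The array names follow the attributes scikit-learn exposes on a fitted tree (`children_left`, `children_right`, `feature`, `threshold`, `value`), with `-1` marking a leaf; the simplified `value` layout (one list of class counts per node) and the function name `predict_tree` are assumptions of this example, and pure_sklearn's internals may differ.

```python
def predict_tree(x, children_left, children_right, feature, threshold, value):
    """Return the index of the majority class at the leaf reached by x."""
    node = 0
    # A node is a leaf when it has no left child (marked with -1).
    while children_left[node] != -1:
        if x[feature[node]] <= threshold[node]:
            node = children_left[node]
        else:
            node = children_right[node]
    counts = value[node]
    return max(range(len(counts)), key=counts.__getitem__)


# A one-split tree: feature 0 <= 0.5 goes left (class 0), otherwise right (class 1).
children_left = [1, -1, -1]
children_right = [2, -1, -1]
feature = [0, -2, -2]
threshold = [0.5, -2.0, -2.0]
value = [[5, 5], [5, 0], [0, 5]]
```

A random forest or boosted ensemble would repeat this walk once per tree and then combine the per-tree results.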

Sparse data - pure_sklearn supports a custom pure python sparse data object; the relevant transformers and estimators handle sparse data as would be expected.
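One simple way to picture a pure python sparse row is a dictionary that maps column indices to non-zero values. The sketch below is a hypothetical representation chosen for illustration, not the actual sparse object used by pure_sklearn; `to_sparse` and `sparse_dot` are made-up helper names.

```python
def to_sparse(dense_row):
    """Keep only the non-zero entries as {column_index: value}."""
    return {i: v for i, v in enumerate(dense_row) if v != 0}


def sparse_dot(sparse_row, weights):
    """Dot product that touches only the stored (non-zero) entries."""
    return sum(value * weights[index] for index, value in sparse_row.items())
```

For text features, where most columns of a count or tfidf vector are zero, this kind of structure keeps both memory use and the work per prediction proportional to the number of non-zero entries.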

pure_fasttext

Prediction in pure python for fasttext.

  • supervised - predicts labels for supervised models; no support for quantized models (blocked by this issue)
  • unsupervised - lookup of word or sentence embeddings given input text
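For intuition about the unsupervised case, a sentence embedding can be approximated by averaging the vectors of the words in the text. The sketch below is a simplified illustration, not pure_fasttext's implementation: it splits on whitespace, skips unknown words, L2-normalizes each word vector before averaging, and ignores subword n-grams. The function name and the `word_vectors` dictionary are hypothetical.

```python
import math


def sentence_vector(text, word_vectors, dim):
    """Average the L2-normalized vectors of the known words in text."""
    normalized = []
    for token in text.split():
        vec = word_vectors.get(token)
        if vec is None:
            continue  # unknown word: skipped in this simplified sketch
        norm = math.sqrt(sum(v * v for v in vec))
        if norm > 0:
            normalized.append([v / norm for v in vec])
    if not normalized:
        return [0.0] * dim
    # zip(*normalized) iterates over dimensions; average each one.
    return [sum(column) / len(normalized) for column in zip(*normalized)]
```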

Installation

Dependencies

pure-predict requires:

Dependency Notes

  • pure_sklearn has been tested with scikit-learn versions >= 0.20 -- certain functionality may work with lower versions but is not guaranteed. Some functionality is explicitly not supported for certain scikit-learn versions and exceptions will be raised as appropriate.
  • xgboost requires version >= 0.82 for support with pure_sklearn.
  • pure-predict is not supported with Python 2.
  • fasttext versions <= 0.9.1 have been tested.

User Installation

The easiest way to install pure-predict is with pip:

pip install --upgrade pure-predict

You can also download the source code:

git clone https://github.com/Ibotta/pure-predict.git

Testing

With pytest installed, you can run tests locally:

pytest pure-predict

Examples

The package contains examples of how to use pure-predict in practice.

Calls for Contributors

Contributions to pure-predict are welcome from anyone. Specific calls for contribution are as follows:

  1. Examples, tests and documentation -- particularly more detailed examples with performance testing of various estimators under various constraints.
  2. Adding more pure_sklearn estimators. The scikit-learn package is extensive and only partially covered by pure_sklearn. Regression tasks in particular are missing from pure_sklearn. Clustering, dimensionality reduction, nearest neighbors, feature selection, non-linear SVM, and more are also omitted and would be good candidates for extending pure_sklearn.
  3. General efficiency. There is likely low hanging fruit for improving the efficiency of the numpy and scipy functionality that has been ported to pure-predict.
  4. Threading could be considered to improve performance -- particularly for making predictions with multiple records.
  5. A public AWS lambda layer containing pure-predict.

Background

The project was started at Ibotta Inc. on the machine learning team and open sourced in 2020. It is currently maintained by the machine learning team at Ibotta.

Acknowledgements

Thanks to David Mitchell and Andrew Tilley for internal review before open source. Thanks to James Foley for logo artwork.
