pywFM is a Python wrapper for Steffen Rendle's factorization machines library libFM

pywFM

pywFM is a Python wrapper for Steffen Rendle's libFM. libFM is a Factorization Machine library:

Factorization machines (FM) are a generic approach that allows to mimic most factorization models by feature engineering. This way, factorization machines combine the generality of feature engineering with the superiority of factorization models in estimating interactions between categorical variables of large domain. libFM is a software implementation for factorization machines that features stochastic gradient descent (SGD) and alternating least squares (ALS) optimization as well as Bayesian inference using Markov Chain Monte Carlo (MCMC).
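For reference, the second-order factorization machine model that libFM estimates can be written (in the notation of Rendle's paper) as:

\hat{y}(\mathbf{x}) = w_0 + \sum_{j=1}^{p} w_j x_j + \sum_{j=1}^{p} \sum_{j'=j+1}^{p} \langle \mathbf{v}_j, \mathbf{v}_{j'} \rangle \, x_j x_{j'}

where w_0 is the global bias, the w_j are the 1-way feature weights, and each v_j is a k-dimensional factor vector used for the pairwise interactions; these correspond to the k0, k1, and k2 options of pywFM.FM described below.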

For more information regarding factorization machines and libFM, read Steffen Rendle's paper: Factorization Machines with libFM, in ACM Trans. Intell. Syst. Technol., 3(3), May 2012.

Don't forget to acknowledge libFM (i.e. cite the paper Factorization Machines with libFM) if you publish results produced with this software.

Motivation

While using Python implementations of factorization machines, I felt that the current implementations (pyFM and fastFM) had many flaws. Then I thought: why re-invent the wheel? Why not use the original libFM?

Sure, it's not native Python, yada yada... But at least we have a bulletproof, battle-tested implementation to build on.

Installing

First, you have to clone and compile the libFM repository and set an environment variable pointing to the libFM bin folder:

git clone https://github.com/srendle/libfm /home/libfm
cd /home/libfm/
# taking advantage of a bug to allow us to save model #ShameShame
git reset --hard 91f8504a15120ef6815d6e10cc7dee42eebaab0f
make all
export LIBFM_PATH=/home/libfm/bin/

Make sure you compile the source from the libfm repository at this specific commit, since pywFM needs the save_model option. Beware that both the installers and the source code on libfm.org predate this commit. I know this is extremely hacky, but a later fix restricts the save_model option to SGD and ALS only. I don't know exactly why, because it was working well before.

If you use Jupyter, take a look at the following issue for some extra notes on getting libFM to work.
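As a quick sanity check before moving on, you can verify that LIBFM_PATH is set and points at the compiled binaries. This is a minimal sketch; it assumes the binary produced by make all is named libFM, as in the libfm repository:

import os

libfm_path = os.environ.get('LIBFM_PATH')
if not libfm_path:
    raise RuntimeError("LIBFM_PATH is not set; export it to your libfm bin/ folder")
if not os.path.isfile(os.path.join(libfm_path, 'libFM')):
    raise RuntimeError("libFM binary not found in %s -- did 'make all' succeed?" % libfm_path)
print("libFM found at", libfm_path)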

Then, install pywFM using pip:

pip install pywFM

Binary installers for the latest released version are available at the Python package index.

Dependencies

  • numpy
  • scipy
  • sklearn
  • pandas

Example

A very simple example, taken from Steffen Rendle's paper Factorization Machines with libFM.

import pywFM
import numpy as np
import pandas as pd

features = np.matrix([
#     Users  |     Movies     |    Movie Ratings   | Time | Last Movies Rated
#    A  B  C | TI  NH  SW  ST | TI   NH   SW   ST  |      | TI  NH  SW  ST
    [1, 0, 0,  1,  0,  0,  0,   0.3, 0.3, 0.3, 0,     13,   0,  0,  0,  0 ],
    [1, 0, 0,  0,  1,  0,  0,   0.3, 0.3, 0.3, 0,     14,   1,  0,  0,  0 ],
    [1, 0, 0,  0,  0,  1,  0,   0.3, 0.3, 0.3, 0,     16,   0,  1,  0,  0 ],
    [0, 1, 0,  0,  0,  1,  0,   0,   0,   0.5, 0.5,   5,    0,  0,  0,  0 ],
    [0, 1, 0,  0,  0,  0,  1,   0,   0,   0.5, 0.5,   8,    0,  0,  1,  0 ],
    [0, 0, 1,  1,  0,  0,  0,   0.5, 0,   0.5, 0,     9,    0,  0,  0,  0 ],
    [0, 0, 1,  0,  0,  1,  0,   0.5, 0,   0.5, 0,     12,   1,  0,  0,  0 ]
])
target = [5, 3, 1, 4, 5, 1, 5]

fm = pywFM.FM(task='regression', num_iter=5)

# split features and target for train/test
# first 5 are train, last 2 are test
model = fm.run(features[:5], target[:5], features[5:], target[5:])
print(model.predictions)
# you can also get the model weights
print(model.weights)

You can also use numpy arrays, scipy sparse matrices, and even pandas DataFrames as feature input.
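For instance, a minimal sketch reusing the features and target from the example above (the sparse matrix and DataFrame are just alternative containers for the same data):

import scipy.sparse as sp
import pandas as pd

# same data, different containers
sparse_features = sp.csr_matrix(features)
df_features = pd.DataFrame(features)

model = fm.run(sparse_features[:5], target[:5], sparse_features[5:], target[5:])
model = fm.run(df_features[:5], target[:5], df_features[5:], target[5:])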

Prediction on new data

The current approach is to pass the new data as x_test, y_test in the run method call. libFM only uses the test values to report some statistics about its predictions; they are not used when training the model. So y_test can be set to dummy values, and you simply collect the predictions with model.predictions (disregard the reported prediction statistics, since those will be wrong). For more info check the libFM manual.
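A minimal sketch of that workaround; x_train, y_train and new_features are hypothetical names for your training data and the unlabeled rows you want scored:

import numpy as np

# dummy targets: the values are ignored for the predictions themselves,
# only the reported test statistics become meaningless
dummy_y = np.zeros(new_features.shape[0])

model = fm.run(x_train, y_train, new_features, dummy_y)
predictions = model.predictions  # one prediction per row of new_features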

Running against a new dataset using something like a predict method is not supported yet. Pending feature request: https://github.com/jfloff/pywFM/issues/7

Feel free to PR that change ;)

Usage

Don't forget to acknowledge libFM (i.e. cite the paper Factorization Machines with libFM) if you publish results produced with this software.

FM: class that wraps libFM parameters. For more information, read the libFM manual.
Parameters
----------
task : string, MANDATORY
        regression: for regression
        classification: for binary classification
num_iter: int, optional
    Number of iterations
    Defaults to 100
init_stdev : double, optional
    Standard deviation for initialization of 2-way factors
    Defaults to 0.1
k0 : bool, optional
    Use bias.
    Defaults to True
k1 : bool, optional
    Use 1-way interactions.
    Defaults to True
k2 : int, optional
    Dimensionality of 2-way interactions.
    Defaults to 8
learning_method: string, optional
    sgd: parameter learning with SGD
    sgda: parameter learning with adaptive SGD
    als: parameter learning with ALS
    mcmc: parameter learning with MCMC
    Defaults to 'mcmc'
learn_rate: double, optional
    Learning rate for SGD
    Defaults to 0.1
r0_regularization: int, optional
    Bias regularization for SGD and ALS
    Defaults to 0
r1_regularization: int, optional
    1-way regularization for SGD and ALS
    Defaults to 0
r2_regularization: int, optional
    2-way regularization for SGD and ALS
    Defaults to 0
rlog: bool, optional
    Enable/disable rlog output
    Defaults to True.
verbose: bool, optional
    How much information to print
    Defaults to False.
seed: int, optional
    Seed used to reproduce the results
    Defaults to None.
silent: bool, optional
    Completely silences all libFM output
    Defaults to False.
temp_path: string, optional
    Sets the path for libFM temporary files. Useful when dealing with large data.
    Defaults to None (default mkstemp behaviour)
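For example, a small sketch (using only the parameters documented above, with illustrative values) configuring an ALS model with regularization and a fixed seed:

import pywFM

fm = pywFM.FM(task='regression',
              num_iter=50,
              k2=10,                   # dimensionality of the 2-way factors
              learning_method='als',
              r0_regularization=0,     # bias regularization
              r1_regularization=5,     # 1-way regularization
              r2_regularization=10,    # 2-way regularization
              seed=42)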
FM.run: runs the factorization machine model against train and test data

Parameters
----------
x_train : {array-like, matrix}, shape = [n_train, n_features]
    Training data
y_train : numpy array of shape [n_train]
    Target values
x_test: {array-like, matrix}, shape = [n_test, n_features]
    Testing data
y_test : numpy array of shape [n_test]
    Testing target values
x_validation_set: optional, {array-like, matrix}, shape = [n_validation, n_features]
    Validation data (only for SGDA)
y_validation_set: optional, numpy array of shape [n_validation]
    Validation target data (only for SGDA)

Return
-------
Returns `namedtuple` with the following properties:

predictions: array [n_samples of x_test]
   Predicted target values per element in x_test.
global_bias: float
    If k0 is True, returns the model's global bias w0
weights: array [n_features]
    If k1 is True, returns the model's weights for each feature Wj
pairwise_interactions: numpy matrix [n_features x k2]
    Matrix with pairwise interactions Vj,f
rlog: pandas dataframe [nrow = num_iter]
    `pandas` DataFrame with measurements about each iteration
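Continuing the simple example above, a minimal sketch of inspecting the returned namedtuple:

model = fm.run(features[:5], target[:5], features[5:], target[5:])

print(model.predictions)            # predicted targets for the test rows
print(model.global_bias)            # w0 (if k0 is True)
print(model.weights)                # per-feature weights Wj (if k1 is True)
print(model.pairwise_interactions)  # n_features x k2 matrix of Vj,f
print(model.rlog)                   # per-iteration measurements as a DataFrame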

Docker

This repository includes Dockerfiles both for running pywFM and for development.

  • Run pywFM examples (Dockerfile): if you are only interested in running the examples, you can use the pre-built image available on Docker Hub:
# to run examples/simple.py (the one in this README).
docker run --rm -v "$(pwd)":/home/pywfm -w /home/pywfm -ti jfloff/pywfm python examples/simple.py
  • Development of pywFM (Dockerfile): useful if you want to make changes to the repo. The Dockerfile defaults to bash.
# to build image
docker build --rm=true -t jfloff/pywfm-dev .
# to run image
docker run --rm -v "$(pwd)":/home/pywfm-dev -w /home/pywfm-dev -ti jfloff/pywfm-dev

Future work

  • Improve the save_model / load_model so we can have a more defined init-fit-predict cycle (perhaps we could inherit from sklearn.BaseEstimator)
  • Can we contribute to libFM repo so save_model is enabled for all learning methods (namely MCMC)?
  • Look into a shared-library solution to reduce I/O overhead

I'm no factorization machine expert, so this library was just an effort to get libFM working from Python as quickly as possible. Feel free to suggest features or enhancements, to point out issues, and of course, to post PRs.

License

MIT (see LICENSE.txt file)
