[ WSDM '22 ] On Sampling Collaborative Filtering Datasets

Last update: Dec 08, 2022

Related tags

Overview

On Sampling Collaborative Filtering Datasets

This repository contains the implementation of many popular sampling strategies, along with various explicit/implicit/sequential feedback recommendation algorithms. The code accompanies the paper "On Sampling Collaborative Filtering Datasets" [ACM] [Public PDF] where we compare the utility of different sampling strategies for preserving the performance of various recommendation algorithms.

We also provide code for Data-Genie which can automatically predict the performance of how good any sampling strategy will be for a given collaborative filtering dataset. We refer the reader to the full paper for more details. Kindly send me an email if you're interested in obtaining access to the pre-trained weights of Data-Genie.

If you find any module of this repository helpful for your own research, please consider citing the below WSDM'22 paper. Thanks!

@inproceedings{sampling_cf,
  author = {Noveen Sachdeva and Carole-Jean Wu and Julian McAuley},
  title = {On Sampling Collaborative Filtering Datasets},
  url = {https://doi.org/10.1145/3488560.3498439},
  booktitle = {Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining},
  series = {WSDM '22},
  year = {2022}
}

Code Author: Noveen Sachdeva ([email protected])

Setup

Environment Setup

$ pip install -r requirements.txt

Data Setup

Once you've correctly setup the python environments and downloaded the dataset of your choice (Amazon: http://jmcauley.ucsd.edu/data/amazon/), the following steps need to be run:

The following command will create the required data/experiment directories as well as download & preprocess the Amazon magazine and the MovieLens-100K datasets. Feel free to download more datasets from the following web-page http://jmcauley.ucsd.edu/data/amazon/ and adjust the setup.sh and preprocess.py files accordingly.

$ ./setup.sh

How to train a model on a sampled/complete CF-dataset?

Edit the hyper_params.py file which lists all config parameters, including what type of model to run. Currently supported models:

Sampling Strategy	What is sampled?	Paper Link
Random	Interactions
Stratified	Interactions
Temporal	Interactions
SVP-CF w/ MF	Interactions	LINK & LINK
SVP-CF w/ Bias-only	Interactions	LINK & LINK
SVP-CF-Prop w/ MF	Interactions	LINK & LINK
SVP-CF-Prop w/ Bias-only	Interactions	LINK & LINK
Random	Users
Head	Users
SVP-CF w/ MF	Users	LINK & LINK
SVP-CF w/ Bias-only	Users	LINK & LINK
SVP-CF-Prop w/ MF	Users	LINK & LINK
SVP-CF-Prop w/ Bias-only	Users	LINK & LINK
Centrality	Graph	LINK
Random-Walk	Graph	LINK
Forest-Fire	Graph	LINK

Finally, type the following command to run:

$ CUDA_VISIBLE_DEVICES=<SOME_GPU_ID> python main.py

Alternatively, to train various possible recommendation algorithm on various CF datasets/subsets, please edit the configuration in grid_search.py and then run:

$ python grid_search.py

How to train Data-Genie?

Edit the data_genie/data_genie_config.py file which lists all config parameters, including what datasets/CF-scenarios/samplers etc. to train Data-Genie on
Finally, use the following command to train Data-Genie:

$ python data_genie.py

License

MIT

[ WSDM '22 ] On Sampling Collaborative Filtering Datasets

Related tags

Overview

On Sampling Collaborative Filtering Datasets

Setup

Environment Setup

Data Setup

How to train a model on a sampled/complete CF-dataset?

How to train Data-Genie?

License

Owner

Noveen Sachdeva

This repository contains the segmentation user interface from the OpenSurfaces project, extracted as a lightweight tool

Code, Models and Datasets for OpenViDial Dataset

Forest R-CNN: Large-Vocabulary Long-Tailed Object Detection and Instance Segmentation (ACM MM 2020)

Joint Unsupervised Learning (JULE) of Deep Representations and Image Clusters.

Towards Flexible Blind JPEG Artifacts Removal (FBCNN, ICCV 2021)

Repository containing the PhD Thesis "Formal Verification of Deep Reinforcement Learning Agents"

The code for "Deep Level Set for Box-supervised Instance Segmentation in Aerial Images".

Caffe-like explicit model constructor. C(onfig)Model

Code for "Share With Thy Neighbors: Single-View Reconstruction by Cross-Instance Consistency" paper

GeDML is an easy-to-use generalized deep metric learning library

This repository focus on Image Captioning & Video Captioning & Seq-to-Seq Learning & NLP

Code for "CloudAAE: Learning 6D Object Pose Regression with On-line Data Synthesis on Point Clouds" @ICRA2021

Lecture materials for Cornell CS5785 Applied Machine Learning (Fall 2021)

Audio Domain Adaptation for Acoustic Scene Classification using Disentanglement Learning

Neural Surface Maps

Numenta Platform for Intelligent Computing is an implementation of Hierarchical Temporal Memory (HTM), a theory of intelligence based strictly on the neuroscience of the neocortex.

A real world application of a Recurrent Neural Network on a binary classification of time series data

Automated Hyperparameter Optimization Competition

Diffgram - Supervised Learning Data Platform

YOLTv5 rapidly detects objects in arbitrarily large aerial or satellite images that far exceed the ~600×600 pixel size typically ingested by deep learning object detection frameworks