The code submitted for the Analytics Vidhya Jobathon - February 2022

Overview

Introduction

On February 11th, 2022, Analytics Vidhya conducted a 3-day hackathon in data science. The top candidates had the chance to be selected by various participating companies across a multitude of roles, in the field of data science, analytics and business intelligence.

The objective of the hackathon was to develop a machine learning approach to predict the engagement between a user and a video. The training data already had an engagement score feature, which was a floating-point number between 0 and 5. This considerably simplified matters, as in recommender systems, calculating the engagement score is often more challenging than predicting them. The challenge, therefore, was to predict this score based on certain user related and video related features.

The list of features in the dataset is given below:

Variable Description
row_id Unique identifier of the row
user_id Unique identifier of the user
category_id Category of the video
video_id Unique identifier of the video
age Age of the user
gender Gender of the user (Male and Female)
profession Profession of the user (Student, Working Professional, Other)
followers No. of users following a particular category
views Total views of the videos present in the particular category
engagement_score Engagement score of the video for a user

Initial Ideas

Two main approaches were considered:

  • Regression Models : Since the engagement score was a continuous variable, one could use a regression model. There were several reasons that I did not use this method:

    1. The lack of features on which to build the model. The features "user_id", "category_id" and "video_id" were discrete features that would need to be encoded. Since each of these features had many unique values, using a simple One-Hot Encoding would not work due to the increase in the number of dimensions. I also considered using other categorical encoders, like the CatBoost encoder, however, I've found that most categorical encoders only do well if the target value is a categorical variable, not a continuous variable like the case here. This would leave us with only 5 features on which to make the model, which never felt enough. The 5 features also had very little scope for feature engineering, apart from some ideas on combining gender and profession and age and profession.
    2. The fact that traditional regression models have never done well as recommender systems. The now famous Netflix challenge clearly showed the advantage collaborative filtering methods had over simple regressor models.
  • Collaborative Filtering : This was the option I eventually went with. There was enough evidence to see that Collaborative Filtering was in almost all cases better than regression models. There are many different ways of implementing collaborative filtering, of which I finally decided to use the Matrix Factorization method (SVD).

Final Model

The first few runs of collaborative filtering were not very successful, with a low r2 score on the test set. However, running a Grid Search over the 3 main hyperparameters – the number of factors, the number of epochs and the learning rate, soon gave the optimal SVD. The final hyperparameters were:

  • Number of Factors: 100
  • Number of Epochs: 500,
  • Learning Rate: 0.05

The r2 score on the test set at this point was 0.506, which took me to the top of the leaderboard.

The next step was to try to improve the model further. I decided to model the errors of the SVD model, so that the predictions of the SVD could be further adjusted by the error estimates. After trying a few different models, I selected the Linear Regression model to predict the errors.

To generate the error, the SVD model first had to run on a subset of the training set, so that the model could predict the error on the validation set. On this validation set, the Linear Regression was trained. The SVD was run on 95% of the training set, therefore, the regression was done on only 5% of the entire training set. The steps in the process were:

  1. Get the engagement score predictions using the SVD model for the validation set.
  2. Calculate the error.
  3. Train the model on this validation set, using the features – "age", "followers", "views", "gender", "profession" and "initial_estimate". The target variable was the error.
  4. Finally, run both models on the actual test set, first the SVD, then the Linear Regression.
  5. The final prediction is the difference between the initial estimates and the weighted error estimates. The error estimates were given a weight of 5%, since that was the proportion of data on which the Linear Regression model was trained.
  6. There could be scenarios of the final prediction going above 5 or below 0. In such cases, adjust the prediction to either 5 or 0.

The final r2 score was 0.532, an increase of 2.9 points.

Ideas for Improvement

There are many ways I feel the model can be further improved. Some of them are:

  1. Choosing the Correct Regression Model to Predict the Error : It was quite unexpected that a weak learner like Linear Regression did better than stronger models like Random Forest and XGBoost. I feel that the main reason for this is that dataset used to train these regressors were quite small, only 5% of the entire training set. While the linear regression model worked well with such a small dataset, the more complicated models did not.
  2. Setting the Correct Subset for the SVD : After trying a few different values, the SVD subset was set at 95% while the error subset set at only 5%. The reason for setting such a high percentage was that the SVD was the more powerful algorithm and I wanted that to be as accurate as possible. However, this severely compromised the error predictor. Finding the perfect balance here could improve the model performance.
  3. Selecting the Correct Weights for the Final Prediction : The final prediction was the difference between the initial estimate and the weighted error estimate. Further analysis is needed to get the most optimum weights. Ideally, the weights should not be needed at all.
  4. Feature Engineering : The error estimator had no feature engineering at all, in fact, I removed the feature "category_id" as well. Adding new features could potentially help in improving the error estimates, however, the benefits would be low, as it accounts for only 5% of the final prediction.
RFDesign - Protein hallucination and inpainting with RoseTTAFold

RFDesign: Protein hallucination and inpainting with RoseTTAFold Jue Wang (juewan

139 Jan 06, 2023
Simple and easy to use python API for the COVID registration booking system of the math department @ unipd (torre archimede)

Simple and easy to use python API for the COVID registration booking system of the math department @ unipd (torre archimede). This API creates an interface with the official browser, with more useful

Guglielmo Camporese 4 Dec 24, 2021
A basic DIY-project made using Python and MySQL

Banking-Using-Python-MySQL This is a basic DIY-project made using Python and MySQL. Pre-Requisite needed:-- MySQL command Line:- creating a database

ABHISHEK 0 Jul 03, 2022
Powerful virtual assistant in python

Virtual assistant in python Powerful virtual assistant in python Set up Step 1: download repo and unzip Step 2: pip install requirements.txt (if py au

Arkal 3 Jan 23, 2022
Collection of Beginner to Intermediate level Python scripts contributed by members and participants.

Hacktoberfest2021-Python Hello there! This repository contains a 'Collection of Beginner to Intermediate level Python projects', created specially for

12 May 25, 2022
Providing a working, flexible, easier and faster installer than the one officially provided by Arch Linux

Purpose The purpose is to bring more people to Arch Linux by providing a working, flexible, easier and faster installer than the one officially provid

André Luís 0 Nov 09, 2022
ASCII-Wordle - A port of the game Wordle to terminal emulators/CMD

ASCII-Wordle A 'port' of Wordle to text-based interfaces A near-feature complete

32 Jun 11, 2022
Keep your company's passwords behind the firewall

TeamVault TeamVault is an open-source web-based shared password manager for behind-the-firewall installation. It requires Python 3.3+ and Postgres (wi

//SEIBERT/MEDIA GmbH 38 Feb 20, 2022
Source code for Learn Programming: Python

This repository contains the source code of the game engine behind Learn Programming: Python. The two key files are game.py (the main source of the ga

Niema Moshiri 25 Apr 24, 2022
Anki for desktop computers

Anki This repo contains the source code for the computer version of Anki. If you'd like to try development builds of Anki but don't feel comfortable b

Ankitects 12.9k Jan 09, 2023
bamboo-engine 是一个通用的流程引擎,他可以解析,执行,调度由用户创建的流程任务,并提供了如暂停,撤销,跳过,强制失败,重试和重入等等灵活的控制能力和并行、子流程等进阶特性,并可通过水平扩展来进一步提升任务的并发处理能力。

bamboo-engine 是一个通用的流程引擎,他可以解析,执行,调度由用户创建的流程任务,并提供了如暂停,撤销,跳过,强制失败,重试和重入等等灵活的控制能力和并行、子流程等进阶特性,并可通过水平扩展来进一步提升任务的并发处理能力。 整体设计 Quick start 1. 安装依赖 2. 项目初始

腾讯蓝鲸 96 Dec 15, 2022
Get a list of content on your Netflix My List that is expiring in the next month or two.

Netflix My List Expiring Movies Annoyed at Netflix for taking away your movies? Now you don't have to be! Installation instructions Install selenium C

24 Aug 06, 2022
Developing a python based app prototype with KivyMD framework for a competition :))

Developing a python based app prototype with KivyMD framework for a competition :))

Jay Desale 1 Jan 10, 2022
CD for MachineLearnia

Codebase supporting my talk on CI/CD for MachineLearnia (Nov 12 2021) The dataset used is available here. The point of the talk is to demonstrate a si

0 Feb 23, 2022
A program that lets you use your tablet's tilting to emulate an actual joystick on a Linux computer.

Tablet Tilt Joystick A program that lets you use your tablet's tilting to emulate an actual joystick on a Linux computer. It's called tablet tilt joys

1 Feb 07, 2022
An open-source Python project series where beginners can contribute and practice coding.

Python Mini Projects A collection of easy Python small projects to help you improve your programming skills. Table Of Contents Aim Of The Project Cont

Leah Nguyen 491 Jan 04, 2023
Цифрова збрoя проти xуйлoвської пропаганди.

Паляниця Цифрова зброя проти xуйлoвської пропаганди. Щоб негайно почати шкварити рашистські сайти – мерщій у швидкий старт! ⚡️ А коли ворожі сервери в

8 Mar 22, 2022
simple password manager.

simple password manager.

1 Nov 18, 2021
PressurePlate is a multi-agent environment that requires agents to cooperate during the traversal of a gridworld.

PressurePlate is a multi-agent environment that requires agents to cooperate during the traversal of a gridworld. The grid is partitioned into several rooms, and each room contains a plate and a clos

Autonomous Agents Research Group (University of Edinburgh) 6 Dec 03, 2022
Discovering local read-level DNA methylation patterns and DNA methylation heterogeneity in intermediately methylated regions

Discovering local read-level DNA methylation patterns and DNA methylation heterogeneity in intermediately methylated regions

1 Jan 11, 2022