Scrapping malaysianpaygap & Extracting data from the Instagram posts

Overview

Scrapping malaysianpaygap & Extracting data from the posts

Recently @malaysianpaygap has gotten quite famous as a platform that enables workers throughout Malaysia to anonymously share their salaries amongst other Malaysians. Its a great initiative and I am fully supportive behind ensuring that Malaysians are not taken advantage of by companies and get a liveable wage(especially when inflation is sky high).

NOTE: If you just want the data then you can download the zipped folder from here.

How to run

  1. Run the following to get conda environment setup
  conda create --name pay python=3.7
  conda activate pay
  pip install -r requirements.txt
  1. Next we will need to scrape all the data from Instagram manually using BeautifulSoup! Just kidding I am too lazy so I will be using InstaLoader to do all the heavy lifting for me. The conda environment will have it installed for you already.
# you might need to pass in your username to login
instaloader --login=USERNAME profile malaysianpaygap --dirname-pattern={profile} --comments --no-profile-pic --post-metadata-txt="Caption: {caption}\n{likes} likes\n{comments} comments\n" --filename-pattern={date_utc:%Y}/{shortcode}

This should create the following directory structure:

|-- malaysianpaygap
|   |-- 2022
|   |   |-- CaRp-1uPh8l.jpg                    # image
|   |   |-- CaRp-1uPh8l.json.xz
|   |   |-- CaRp-1uPh8l.txt                    # text data which was specified under --post-metadata-txt
|   |   |-- CaRp-1uPh8l_comments.json          # all the comments
|   |   |-- CaT5MguPpDI.jpg
|   |   |-- CaT5MguPpDI.json.xz
|   |-- 2022-02-27_04-58-58_UTC_profile_pic.jpg
|   |-- id
|   `-- malaysianpaygap_47523401972.json.xz
|-- requirements.txt
|-- scripts
|   `-- entrypoint.sh
`-- src
    |-- __init__.py
    |-- extract_text_images.py
    |-- main.py
    |-- preprocess_comments.py
    `-- preprocess_images.py

NOTE: Please do NOT change the directory structure, it will break the entire pipeline.

  1. You should have everything ready to run the preprocessing scripts that I have made! I have a bash script that runs everything in the correct order.
# make bash script runnable
chmod +x scripts/entrypoint.sh
bash scripts/entrypoint.sh

You should see the following output:

2022-03-02 22:59:54.012 | INFO     | src.preprocess_comments:main_preprocess_comments:83 - Running preprocess_comments
2022-03-02 22:59:56.276 | INFO     | src.preprocess_comments:main_preprocess_comments:110 - DataFrame saved to /Users/yravindranath/pay/data/comments.csv
2022-03-02 22:59:56.277 | INFO     | src.preprocess_comments:main_preprocess_comments:111 - Completed preprocess_comments
2022-03-02 22:59:57.537 | INFO     | src.preprocess_images:main_preprocess_images:140 - Running preprocess_images
2022-03-02 22:59:57.840 | INFO     | src.preprocess_images:main_preprocess_images:160 - DataFrame saved to /Users/yravindranath/pay/data/posts.csv
2022-03-02 22:59:57.841 | INFO     | src.preprocess_images:main_preprocess_images:161 - Completed preprocess_images
2022-03-02 22:59:59.099 | INFO     | src.extract_text_images:main_extract_text_images:54 - Running extract_text_images
Pandas Apply: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 159/159 [02:09<00:00,  1.23it/s]
2022-03-02 23:02:25.087 | INFO     | src.extract_text_images:main_extract_text_images:70 - DataFrame saved to /Users/yravindranath/pay/data/posts_text.csv
2022-03-02 23:02:25.088 | INFO     | src.extract_text_images:main_extract_text_images:71 - Completed extract_text_images

A new directory data will be created like so:

|-- data
|   |-- comments.csv
|   |-- comments.json
|   |-- posts.csv
|   |-- posts_text.csv
|   `-- processed_images
|       |-- CaRp-1uPh8l.jpg
|       |-- CaT5MguPpDI.jpg
|       |-- CaT6d2Yve5X.jpg

In the next section I will go over the data that was created.

Data

comments.csv - Contains all the comments under a post

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2816 entries, 0 to 2815
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   image_ids        2816 non-null   object
 1   comment_paths    2816 non-null   object
 2   id               2814 non-null   float64
 3   created_at       2814 non-null   float64
 4   text             2814 non-null   object
 5   likes_count      2814 non-null   float64
 6   answers          2814 non-null   object
 7   id.1             2814 non-null   float64 # ID of the user who commented
 8   is_verified      2814 non-null   object
 9   profile_pic_url  2814 non-null   object
 10  username         2814 non-null   object
dtypes: float64(4), object(7)
memory usage: 242.1+ KB

posts_text.csv - Contains all the posts with their text extracted through their image using OCR(Optical Character Recognition)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159 entries, 0 to 158
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   hashtags     159 non-null    object
 1   captions     139 non-null    object
 2   likes        159 non-null    int64
 3   comments     159 non-null    int64
 4   image_ids    159 non-null    object
 5   image_paths  159 non-null    object
 6   image_text   159 non-null    object
dtypes: int64(2), object(5)
memory usage: 8.8+ KB

FAQ

I am getting a ModuleNotFoundError: No module named 'src' error what can I do?

This is an issue with your PYTHONPATH, setting it to something like export PYTHONPATH="${PYTHONPATH}:/Users/yravindranath/REPO" should fix it.

Optimizations

  1. So currently the entire project isn't repoducible therefore I will dockerise it soon and allow anyone to run it locally without any issues.
  2. If you notice there is a slow apply() used for binarizing the images and extracting the text from it using OCR. I am using swifter to speed it up as it is.
Owner
Yudhiesh Ravindranath
Data Scientist @MoneyLion
Yudhiesh Ravindranath
A self-hosted Discord music bot.

Cassette A self-hosted Discord music bot. Requirements py-cord pynacl pytube Setup Intended to be hosted on Heroku. Fork or clone this repo. Create a

Lohan 8 Apr 28, 2022
M3U Playlist for free TV channels

Free TV This is an M3U playlist for free TV channels around the World. Either free locally (over the air): Or free on the Internet: Plex TV Pluto TV P

Free TV 964 Jan 08, 2023
A youtube videos or channels tag finder python module

A youtube videos or channels tag finder python module

Fayas Noushad 4 Dec 03, 2021
Ethereum transactions and wallet information for people you follow on Twitter.

ethFollowing Ethereum transactions and wallet information for people you follow on Twitter. Set up Setup python environment (requires python 3.8): vir

Brian Donohue 2 Dec 28, 2021
Clipboard-watcher - Keep an eye on the apps that are using your clipboard

clipboard-watcher This repository contains the code of an experiment, in order t

Gonçalo Valério 48 Oct 13, 2022
Cleaning Tiktok Hacks With Python

Cleaning Tiktok Hacks With Python

13 Jan 06, 2023
Versatile async-friendly library to retry failed operations with configurable backoff strategies

riprova riprova (meaning retry in Italian) is a small, general-purpose and versatile Python library that provides retry mechanisms with multiple backo

Tom 108 Apr 27, 2022
Python functions to run WASS stereo wave processing executables, and load and post process WASS output files.

wass-pyfuns Python functions to run the WASS stereo wave processing executables, and load and post process the WASS output files. General WASS (Waves

Mika Malila 3 May 13, 2022
A basic API to scrape Craigslist.

CLAPI A basic API to scrape Craigslist. Most useful for viewing posts across a broad geographic area or for viewing posts within a specific timeframe.

45 Jan 05, 2023
The Main Pythonic Version Of Twig Using Nextcord

The Main Pythonic Version Of Twig Using Nextcord

8 Mar 21, 2022
Gera um PDF, logo depois de você responder um questionário simples, e envia para o e-mail que você informar.

PDF generator and send it for your email Criador: Francisco Robson de O. Dutra Filho Repositório criado no dia 18/09/2021 Instagram: @robsondutra_ Sob

8 Nov 22, 2021
Unfollows Users You're Following

Github-Unfollow-Bot Info It unfollows users you're following, it runs in the background so you can still do what you do without it bothering you. It's

ExT 4 Sep 03, 2022
Unencrypted Story View Botter is a helpful tool that allows thousands of people to watch your posts.

Unencrypted Story View Botter is a helpful tool that allows thousands of people to watch your posts.

8 Aug 05, 2022
BioThings API framework - Making high-performance API for biological annotation data

BioThings SDK Quick Summary BioThings SDK provides a Python-based toolkit to build high-performance data APIs (or web services) from a single data sou

BioThings 39 Jan 04, 2023
Twitter bot code can be found in twitterBotAPI.py

NN Twitter Bot This github repository is BASED and is yanderedev levels of spaghetti Neural net code can be found in alexnet.py. Despite the name, it

167 Dec 19, 2022
A fully responsive interface to manage all your favorite software on your HTPC.

Python 3 port of Hellowlol's HTPC Manager fork We made this an organization repository to be more independent from single developers. If you want to j

26 Jan 04, 2023
Configure your linux server and check for vulnerabilities with serverlla

serverlla Configure your linux server and check for vulnerabilities with serverlla. Serverlla has a menu with options and allows you to configure your

Dylan Meca 10 Feb 01, 2022
A telegram bot writen in python for mirroring files on the internet to our beloved Google Drive

[] Mirror Bot This is a telegram bot writen in python for mirroring files on the internet to our beloved Google Drive. Deploying on Heroku Give Star &

43 Mar 06, 2022
Backend.AI Client Library for Python

Backend.AI Client The official API client library for Backend.AI Usage (KeyPair mode) You should set the access key and secret key as environment vari

Lablup 10 Feb 10, 2022
Webservice that notifies users on Slack when a change in GitLab concern them.

Gitlab Slack Notifier Webservice that notifies users on Slack when a change in GitLab concern them. Setup Slack Create a Slack app, go to "OAuth & Per

Heuritech 2 Nov 04, 2021