Scrapping malaysianpaygap & Extracting data from the Instagram posts

Overview

Scrapping malaysianpaygap & Extracting data from the posts

Recently @malaysianpaygap has gotten quite famous as a platform that enables workers throughout Malaysia to anonymously share their salaries amongst other Malaysians. Its a great initiative and I am fully supportive behind ensuring that Malaysians are not taken advantage of by companies and get a liveable wage(especially when inflation is sky high).

NOTE: If you just want the data then you can download the zipped folder from here.

How to run

  1. Run the following to get conda environment setup
  conda create --name pay python=3.7
  conda activate pay
  pip install -r requirements.txt
  1. Next we will need to scrape all the data from Instagram manually using BeautifulSoup! Just kidding I am too lazy so I will be using InstaLoader to do all the heavy lifting for me. The conda environment will have it installed for you already.
# you might need to pass in your username to login
instaloader --login=USERNAME profile malaysianpaygap --dirname-pattern={profile} --comments --no-profile-pic --post-metadata-txt="Caption: {caption}\n{likes} likes\n{comments} comments\n" --filename-pattern={date_utc:%Y}/{shortcode}

This should create the following directory structure:

|-- malaysianpaygap
|   |-- 2022
|   |   |-- CaRp-1uPh8l.jpg                    # image
|   |   |-- CaRp-1uPh8l.json.xz
|   |   |-- CaRp-1uPh8l.txt                    # text data which was specified under --post-metadata-txt
|   |   |-- CaRp-1uPh8l_comments.json          # all the comments
|   |   |-- CaT5MguPpDI.jpg
|   |   |-- CaT5MguPpDI.json.xz
|   |-- 2022-02-27_04-58-58_UTC_profile_pic.jpg
|   |-- id
|   `-- malaysianpaygap_47523401972.json.xz
|-- requirements.txt
|-- scripts
|   `-- entrypoint.sh
`-- src
    |-- __init__.py
    |-- extract_text_images.py
    |-- main.py
    |-- preprocess_comments.py
    `-- preprocess_images.py

NOTE: Please do NOT change the directory structure, it will break the entire pipeline.

  1. You should have everything ready to run the preprocessing scripts that I have made! I have a bash script that runs everything in the correct order.
# make bash script runnable
chmod +x scripts/entrypoint.sh
bash scripts/entrypoint.sh

You should see the following output:

2022-03-02 22:59:54.012 | INFO     | src.preprocess_comments:main_preprocess_comments:83 - Running preprocess_comments
2022-03-02 22:59:56.276 | INFO     | src.preprocess_comments:main_preprocess_comments:110 - DataFrame saved to /Users/yravindranath/pay/data/comments.csv
2022-03-02 22:59:56.277 | INFO     | src.preprocess_comments:main_preprocess_comments:111 - Completed preprocess_comments
2022-03-02 22:59:57.537 | INFO     | src.preprocess_images:main_preprocess_images:140 - Running preprocess_images
2022-03-02 22:59:57.840 | INFO     | src.preprocess_images:main_preprocess_images:160 - DataFrame saved to /Users/yravindranath/pay/data/posts.csv
2022-03-02 22:59:57.841 | INFO     | src.preprocess_images:main_preprocess_images:161 - Completed preprocess_images
2022-03-02 22:59:59.099 | INFO     | src.extract_text_images:main_extract_text_images:54 - Running extract_text_images
Pandas Apply: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 159/159 [02:09<00:00,  1.23it/s]
2022-03-02 23:02:25.087 | INFO     | src.extract_text_images:main_extract_text_images:70 - DataFrame saved to /Users/yravindranath/pay/data/posts_text.csv
2022-03-02 23:02:25.088 | INFO     | src.extract_text_images:main_extract_text_images:71 - Completed extract_text_images

A new directory data will be created like so:

|-- data
|   |-- comments.csv
|   |-- comments.json
|   |-- posts.csv
|   |-- posts_text.csv
|   `-- processed_images
|       |-- CaRp-1uPh8l.jpg
|       |-- CaT5MguPpDI.jpg
|       |-- CaT6d2Yve5X.jpg

In the next section I will go over the data that was created.

Data

comments.csv - Contains all the comments under a post

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2816 entries, 0 to 2815
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   image_ids        2816 non-null   object
 1   comment_paths    2816 non-null   object
 2   id               2814 non-null   float64
 3   created_at       2814 non-null   float64
 4   text             2814 non-null   object
 5   likes_count      2814 non-null   float64
 6   answers          2814 non-null   object
 7   id.1             2814 non-null   float64 # ID of the user who commented
 8   is_verified      2814 non-null   object
 9   profile_pic_url  2814 non-null   object
 10  username         2814 non-null   object
dtypes: float64(4), object(7)
memory usage: 242.1+ KB

posts_text.csv - Contains all the posts with their text extracted through their image using OCR(Optical Character Recognition)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159 entries, 0 to 158
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   hashtags     159 non-null    object
 1   captions     139 non-null    object
 2   likes        159 non-null    int64
 3   comments     159 non-null    int64
 4   image_ids    159 non-null    object
 5   image_paths  159 non-null    object
 6   image_text   159 non-null    object
dtypes: int64(2), object(5)
memory usage: 8.8+ KB

FAQ

I am getting a ModuleNotFoundError: No module named 'src' error what can I do?

This is an issue with your PYTHONPATH, setting it to something like export PYTHONPATH="${PYTHONPATH}:/Users/yravindranath/REPO" should fix it.

Optimizations

  1. So currently the entire project isn't repoducible therefore I will dockerise it soon and allow anyone to run it locally without any issues.
  2. If you notice there is a slow apply() used for binarizing the images and extracting the text from it using OCR. I am using swifter to speed it up as it is.
Owner
Yudhiesh Ravindranath
Data Scientist @MoneyLion
Yudhiesh Ravindranath
An advanced crypto trading bot written in Python

Jesse Jesse is an advanced crypto trading framework which aims to simplify researching and defining trading strategies. Why Jesse? In short, Jesse is

Jesse 4.4k Jan 09, 2023
A Code that can make your Discord Account 24/7!

Online-Forever Make your Discord Account Online 24/7! A Code written in Python that helps you to keep your account 24/7. The main.py is the main file.

Phantom 556 Dec 29, 2022
in-progress decompilation of Gauntlet Legends decompression code on the N64

Gauntlet-Legends A in-progress decompilation of Gauntlet-Legends (N64) decompression code. This project currently supports the US release. Building (L

6 Jul 23, 2022
The program for obtaining a horoscope in Python using API from rapidapi.com site.

Python horoscope The program allows you to get a horoscope for your zodiac sign and immediately translate it into almost any language. Step 1 The firs

Architect 0 Dec 25, 2021
Zero2 Discord bot is written with Discord.py using Python.

Zero2 Discord bot is written with Discord.py using Python.

Siva Avanish 4 Nov 08, 2021
PRAW, an acronym for "Python Reddit API Wrapper", is a python package that allows for simple access to Reddit's API.

PRAW: The Python Reddit API Wrapper PRAW, an acronym for "Python Reddit API Wrapper", is a Python package that allows for simple access to Reddit's AP

Python Reddit API Wrapper Development 3k Dec 29, 2022
A mass account list editor for python

Account-List-Editor This is an mass account list editor Usage Run the editor.py file with python (python3 ./editor.py) Press a button (1/2) and drag &

ExtremeDev 1 Dec 20, 2021
Migration Manager (MM) is a very small utility that can list source servers in a target account and apply mass launch template modifications.

Migration Manager Migration Manager (MM) is a very small utility that can list source servers in a target account and apply mass launch template modif

Cody 2 Nov 04, 2021
Discord bot for the IOTA Wiki

IOTA Wiki Bot Discord bot for the IOTA Wiki Report Bug · Request Feature About The Project This is a Discord bot for the IOTA Wiki. It's currently use

IOTA Community 2 Nov 14, 2021
A simple Discord Mass-Ban that's still working with Member Scraper.

Mass-Ban [!] This was made for education / you can use for revenge. Please don't skid it. [!] If you want to use it, please use member scraper before

WoahThatsHot 1 Nov 20, 2021
The best Discord bot, created for r/Jailbreak

Bloo Setup instructions These instructions assume you are on macOS or Linux. Windows users, good luck. With Docker (recommended!) You will need the fo

GIR 33 Dec 16, 2022
通过GitHub的actions 自动采集节点 生成订阅信息

VmessActions 通过GitHub的actions 自动采集节点 自动生成订阅信息 订阅内容自动更新再仓库的 clash.yml 和 v2ray.txt 中 然后PC端/手机端根据自己的软件支持的格式,订阅对应的链接即可

skywolf627 372 Jan 04, 2023
A tool for exporting Telegram group chats into static websites, preserving chat history like mailing list archives.

tg-archive is a tool for exporting Telegram group chats into static websites, preserving chat history like mailing list archives. Preview The @fossuni

Kailash Nadh 400 Dec 27, 2022
Irenedao-nft-generator - Original scripts used to generate IreneDAO NFTs

IreneDAO NFT Generator Scripts to generate IreneDAO NFT. Make sure you have Pill

libevm 60 Oct 27, 2022
Yes, it's true :purple_heart: This repository has 353 stars.

Yes, it's true! Inspired by a similar repository from @RealPeha, but implemented using a webhook on AWS Lambda and API Gateway, so it's serverless! If

510 Dec 28, 2022
SC4.0 - BEST EXPERIENCE · HEX EDITOR · Discord Nuker · Plugin Adder · Cheat Engine

smilecreator4 This site is for people who want to hack or want to learn it! Furthermore, this program does not work without turning off Antivirus or W

1 Jan 04, 2022
A simple bot discord in PY with moderation controls

Voila un bot discord en py avec les commandes simples de modération tout simplement faut changer les lignes 70 vous mettez votre token de votre bot 53

Ethan 1 Nov 20, 2021
EduuRobot Telegram bot source code.

EduuRobot A multipurpose Telegram Bot made with Pyrogram and asynchronous programming. Requirements Python 3.6+ An Unix-like operating system (Running

Amano Team 119 Dec 23, 2022
A python bot that will allow you to have maximum luck during Veve drops.

VeveBot You can follow me here Github | Twitter Features: - Click on the purchase at the time of the drop. - Be able to choose to do more than one tes

Rodz 1 Dec 04, 2021
🖥️ Windows Batch and powershell Discord Token grabber. Made for Troll (lmao)

Batched-Grabber Windows Batch and powershell Discord Token grabber. Made for Troll ! Setup. 1. pip(3) install numpy colored 2. python(3) Batched.py 3.

Ѵιcнч 41 Nov 01, 2022