Scrapping malaysianpaygap & Extracting data from the Instagram posts

Overview

Scrapping malaysianpaygap & Extracting data from the posts

Recently @malaysianpaygap has gotten quite famous as a platform that enables workers throughout Malaysia to anonymously share their salaries amongst other Malaysians. Its a great initiative and I am fully supportive behind ensuring that Malaysians are not taken advantage of by companies and get a liveable wage(especially when inflation is sky high).

NOTE: If you just want the data then you can download the zipped folder from here.

How to run

  1. Run the following to get conda environment setup
  conda create --name pay python=3.7
  conda activate pay
  pip install -r requirements.txt
  1. Next we will need to scrape all the data from Instagram manually using BeautifulSoup! Just kidding I am too lazy so I will be using InstaLoader to do all the heavy lifting for me. The conda environment will have it installed for you already.
# you might need to pass in your username to login
instaloader --login=USERNAME profile malaysianpaygap --dirname-pattern={profile} --comments --no-profile-pic --post-metadata-txt="Caption: {caption}\n{likes} likes\n{comments} comments\n" --filename-pattern={date_utc:%Y}/{shortcode}

This should create the following directory structure:

|-- malaysianpaygap
|   |-- 2022
|   |   |-- CaRp-1uPh8l.jpg                    # image
|   |   |-- CaRp-1uPh8l.json.xz
|   |   |-- CaRp-1uPh8l.txt                    # text data which was specified under --post-metadata-txt
|   |   |-- CaRp-1uPh8l_comments.json          # all the comments
|   |   |-- CaT5MguPpDI.jpg
|   |   |-- CaT5MguPpDI.json.xz
|   |-- 2022-02-27_04-58-58_UTC_profile_pic.jpg
|   |-- id
|   `-- malaysianpaygap_47523401972.json.xz
|-- requirements.txt
|-- scripts
|   `-- entrypoint.sh
`-- src
    |-- __init__.py
    |-- extract_text_images.py
    |-- main.py
    |-- preprocess_comments.py
    `-- preprocess_images.py

NOTE: Please do NOT change the directory structure, it will break the entire pipeline.

  1. You should have everything ready to run the preprocessing scripts that I have made! I have a bash script that runs everything in the correct order.
# make bash script runnable
chmod +x scripts/entrypoint.sh
bash scripts/entrypoint.sh

You should see the following output:

2022-03-02 22:59:54.012 | INFO     | src.preprocess_comments:main_preprocess_comments:83 - Running preprocess_comments
2022-03-02 22:59:56.276 | INFO     | src.preprocess_comments:main_preprocess_comments:110 - DataFrame saved to /Users/yravindranath/pay/data/comments.csv
2022-03-02 22:59:56.277 | INFO     | src.preprocess_comments:main_preprocess_comments:111 - Completed preprocess_comments
2022-03-02 22:59:57.537 | INFO     | src.preprocess_images:main_preprocess_images:140 - Running preprocess_images
2022-03-02 22:59:57.840 | INFO     | src.preprocess_images:main_preprocess_images:160 - DataFrame saved to /Users/yravindranath/pay/data/posts.csv
2022-03-02 22:59:57.841 | INFO     | src.preprocess_images:main_preprocess_images:161 - Completed preprocess_images
2022-03-02 22:59:59.099 | INFO     | src.extract_text_images:main_extract_text_images:54 - Running extract_text_images
Pandas Apply: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 159/159 [02:09<00:00,  1.23it/s]
2022-03-02 23:02:25.087 | INFO     | src.extract_text_images:main_extract_text_images:70 - DataFrame saved to /Users/yravindranath/pay/data/posts_text.csv
2022-03-02 23:02:25.088 | INFO     | src.extract_text_images:main_extract_text_images:71 - Completed extract_text_images

A new directory data will be created like so:

|-- data
|   |-- comments.csv
|   |-- comments.json
|   |-- posts.csv
|   |-- posts_text.csv
|   `-- processed_images
|       |-- CaRp-1uPh8l.jpg
|       |-- CaT5MguPpDI.jpg
|       |-- CaT6d2Yve5X.jpg

In the next section I will go over the data that was created.

Data

comments.csv - Contains all the comments under a post

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2816 entries, 0 to 2815
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   image_ids        2816 non-null   object
 1   comment_paths    2816 non-null   object
 2   id               2814 non-null   float64
 3   created_at       2814 non-null   float64
 4   text             2814 non-null   object
 5   likes_count      2814 non-null   float64
 6   answers          2814 non-null   object
 7   id.1             2814 non-null   float64 # ID of the user who commented
 8   is_verified      2814 non-null   object
 9   profile_pic_url  2814 non-null   object
 10  username         2814 non-null   object
dtypes: float64(4), object(7)
memory usage: 242.1+ KB

posts_text.csv - Contains all the posts with their text extracted through their image using OCR(Optical Character Recognition)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159 entries, 0 to 158
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   hashtags     159 non-null    object
 1   captions     139 non-null    object
 2   likes        159 non-null    int64
 3   comments     159 non-null    int64
 4   image_ids    159 non-null    object
 5   image_paths  159 non-null    object
 6   image_text   159 non-null    object
dtypes: int64(2), object(5)
memory usage: 8.8+ KB

FAQ

I am getting a ModuleNotFoundError: No module named 'src' error what can I do?

This is an issue with your PYTHONPATH, setting it to something like export PYTHONPATH="${PYTHONPATH}:/Users/yravindranath/REPO" should fix it.

Optimizations

  1. So currently the entire project isn't repoducible therefore I will dockerise it soon and allow anyone to run it locally without any issues.
  2. If you notice there is a slow apply() used for binarizing the images and extracting the text from it using OCR. I am using swifter to speed it up as it is.
Owner
Yudhiesh Ravindranath
Data Scientist @MoneyLion
Yudhiesh Ravindranath
A Python SDK for connecting devices to Microsoft Azure IoT services

V2 - We are now GA! This repository contains code for the Azure IoT SDKs for Python. This enables python developers to easily create IoT device soluti

Microsoft Azure 381 Dec 30, 2022
AWS Serverless Application Model (SAM) is an open-source framework for building serverless applications

AWS Serverless Application Model (AWS SAM) The AWS Serverless Application Model (SAM) is an open-source framework for building serverless applications

Amazon Web Services 8.9k Dec 31, 2022
Monitor your Binance portfolio

Binance Report Bot The intent of this bot is to take a snapshot of your binance wallet, e.g. the current balances and store it for further plotting. I

37 Oct 29, 2022
A continued fork of Disco

Orca Orca is an extensive and extendable Python 3.x library for the Discord API. orca boasts the following major features: Expressive, functional inte

RPS 4 Apr 03, 2022
Wrapper for vk_api lib for faster bot buliding

Welcome to VKBotPod repository! Wrapper for vk_api lib for faster bot buliding Features Simple syntax Rich functionality Special thanks to movpushmov

NullPointerException 3 Jan 14, 2022
Telegram bot to stream videos in telegram voicechat for both groups and channels

Telegram bot to stream videos in telegram voicechat for both groups and channels. Supports live streams, YouTube videos and telegram media. With record stream support, Schedule streams, and many more

ALBY 9 Feb 20, 2022
A Telegram bot to all media and documents files to web link .

FileStreamBot A Telegram bot to all media and documents files to web link . Report a Bug | Request Feature 🍁 About This Bot : This bot will give you

Code X Mania 129 Jan 03, 2023
Stack Overflow Error Parser

A python tool that executes python files and opens respective Stack Overflow threads in browser for errors encountered.

Raghavendra Khare 3 Jul 24, 2022
Palo Alto Networks PAN-OS SDK for Python

Palo Alto Networks PAN-OS SDK for Python The PAN-OS SDK for Python (pan-os-python) is a package to help interact with Palo Alto Networks devices (incl

Palo Alto Networks 281 Dec 09, 2022
SongBot2.0 With Python

SongBot2.0 Host 👨‍💻 Heroku 🚀 Manditary Vars BOT_TOKEN : Get It from @Botfather Special Feature Downloads Songs fastly and less errors as well as 0

Mr.Tanaji 5 Nov 19, 2021
Discord group chat spammer concept.

GC Spammer [Concept] GC-Spammer for https://discord.com/ Warning: This is purely a concept. In the past the script worked, however, Discord ratelimite

Roover 3 Feb 28, 2022
A fork of discord.py for anime enjoyers

A modern, easy to use, feature-rich, and async ready API wrapper for Discord written in Python. Key Features Modern Pythonic API using async and await

Senpai Development 4 Nov 05, 2021
You cant check for conflicts until course enrolment actually opens. I wanted to do it earlier.

AcornICS I noticed that Acorn it does not let you check if a timetable is valid based on the enrollment cart, it also does not let you visualize it ea

Isidor Kaplan 2 Sep 16, 2021
List of twitch bots n bigots

This is a collection of bot account names NamelistMASTER contains all the names we reccomend you ban in your channel Sometimes people get on that list

62 Sep 05, 2021
A tool to customize your discord tokens

Fastest Discord Token Manager - Features: Change Token Username Change Token Password Change Token Avatar Change Token Bio This tool is created by Ace

trey 15 Dec 27, 2022
Seamlessly Connecting Notion Database with Python Pandas DataFrame

notion-df: Seamlessly Connecting Notion Database with Pandas DataFrame Please Note: This project is currently in pre-alpha stage. The code are not app

Shannon Shen 38 Dec 28, 2022
Visual Weather api. Returns beautiful pictures with the current weather.

VWapi Visual Weather api. Returns beautiful pictures with the current weather. Installation: sudo apt update -y && sudo apt upgrade -y sudo apt instal

Hotaru 33 Nov 13, 2022
A discord bot wrapper for python have slash command

A discord bot wrapper for python have slash command

4 Dec 04, 2021
Python script to replace BTC adresses in the clipboard with similar looking ones, whose private key can be retrieved by a netcat listener or similar.

BTCStealer Python script to replace BTC adresses in the clipboard with similar looking ones, whose private key can be retrieved by a netcat listener o

Some Person 6 Jun 07, 2022
ignorant allows you to check if a phone number is used on different sites like snapchat, instagram.

Ignorant For BTC Donations : 1FHDM49QfZX6pJmhjLE5tB2K6CaTLMZpXZ ignorant does not alert the target phone number ignorant allows you to check if a phon

Palenath 513 Dec 31, 2022