A solution designed to extract, transform and load Chicago crime data from an RDS instance to other services in AWS.

Overview

Crime data- Batch Processing:

RDBMS Data Extraction Implementation

This project is intended to implement a solution designed to extract, transform and load Chicago crime data from an RDS instance to other services in AWS.

  • There is an airflow dag script, 2 pyspark application scripts, and a bootstrap actions script in this project which are explained below.

Deployment

Preparation:

  • An AWS RDS MySQL instance is created to store the batch of data.
    • An EC2 instance is created to communicate with the RDS instance.
    • The data is loaded onto the EC2 instance.
    • The database and table are created on the RDS instance with the help of the above created EC2 instance. The data is loaded in the table created above.
    • The create&Load.sql file contains the code for the above table data preparation step.
    • A secret on the Secrets Manager console is stored to communicate with the RDS instance secretly. Also, password rotation after 30 days has been configured for security purposes.
  • The following dag loads the data created from the above step into the AWS environment.

Implementation:

  • The airflow dag is put in the s3://yavula-da-capstone/dag/ location in the S3 bucket. An environment is created on the Amazon Managed Workflows for Apache Airflow(MWAA) console in a specific VPC.
  • The dag is scheduled to run on a daily basis along with SLA monitoring to trigger an alarm if the tasks take more than 36 minutes to finish the whole ETL process.
  • It usually takes 32-34 minutes to finish the dag processes. But if it takes, more than that, it means that something has interrupted the dag from finishing its process and we can check the logs accordingly.

emr_job_flow_manual_steps_dag.py

This script is used to create an airflow dag.

Description

  • The script has steps for the airflow to create an EMR cluster on AWS for a process which is explained later in the next steps.
  • It runs the STEPS that process the spark script on the EMR along with the bootstrap actions present in the bootstrap_actions.sh script which is in an s3 bucket that will install the required package like boto3 onto the EMR instance.
  • Then the step checker is also added to watch this process. This step sensor will periodically check if that last step is completed or skipped or terminated.

spark_ingest_script.py

The spark script which is put into S3 manually, is used to ingest the required data from a table which is present on an RDS isntance and store the data into a raw s3 bucket and catalog into Glue.

Description

  • The ingest script connects to the RDS instance using the mysql-connector.
  • It takes the required crime data from the table and puts it into a spark dataframe which is then written to the AWS S3 and Glue data catalog.
  • S3 File Structure where the snapshot data is saved
    • (bucket)
    • (key)
    • (db-name)
    • (table-name)
  • Glue Data Catalog table pointing to the latest partition

spark_process_script.py

The spark script which is put into S3 manually, is used to query the latest target table, filter required crime details from it, then store the query results into a new final table and further save it to a latest partition.

Description

  • The spark script uses the crime data and performs some query processing using it.
  • It queries the required crime data from the table, performs some processing and puts it into a spark dataframe which is then written to the AWS S3 and Glue data catalog.
  • S3 File Structure where the snapshot data is saved
    • (bucket)
    • (key)
    • (db-name)
    • (table-name)
  • Glue Data Catalog table pointing to the latest partition

bootstrap_actions.sh

Required for the bootstrap actions.

Description

Used to install the packages and dependencies on the cluster that are required for the processes inside the spark script to run.

Deployment

  • This bootstrap script is put manually in an S3 bucket.
  • The location of this bucket is used inside the airflow dag to mention in the bootstrap actions that the required actions are present in the script which is in this particular s3 location.

Business Analysis

The final processed table had the crime type details for all the crimes for which the arrest is not made yet. This business analysis can be viewed from Athena and also has been imported into QuickSight Spice to view the details of different types of crimes and their comparisions.

Owner
Yesaswi Avula
An Applied Data Science student with an escalating learning and performance graph Data analytics, Data engineering, Business Intelligence, ML, Big Data & Cloud
Yesaswi Avula
IOGen - An Open source discord token generator

_____ ____ _____ |_ _/ __ \ / ____| | || | | | |

0xVichy#1234 85 Nov 03, 2022
A simple Discord bot that can fetch definitions and post them in chat.

A simple Discord bot that can fetch definitions and post them in chat. If you are connected to a voice channel, the bot will also read out the definition to you.

Tycho Bellers 4 Sep 29, 2022
Python bot for send videos of a Youtube channel to a telegram group , channel or chat

py_youtube_to_telegram Usage: If you want to install ytt and use it, run this command: sudo sh -c "$(curl -fsSL https://raw.githubusercontent.com/nima

Nima Fanniasl 8 Nov 22, 2022
Display relevant information for the amazing Banano coin.

Display relevant information for the amazing Banano coin. It'll also show your current [email 

Ron Talman 4 Aug 14, 2022
A simple terminal UI for viewing fund P/L analysis through TEFAS

Tefas UI A simple terminal UI for viewing fund P/L analysis through TEFAS. Features (that my own bank's UI lack): Daily and weekly P/L FX comparisons

Batuhan Taskaya 4 Mar 14, 2022
Savecontentbot - Telegram Save Content Bot With Same more Features

Save Restricted Content Bot A simple telegram bot to save restricted content wit

Group Dc Bots 3 Jan 26, 2022
A simple but useful Discord Selfbot with essential features, made with discord.py-self.

Discord Selfbot Xyno Discord Selfbot Xyno is a simple but useful selfbot for Discord. It has currently limited useful features but it will be updated

Amit Pathak 7 Apr 24, 2022
Home Assistant custom integration for controlling Powered by Tuya (PBT) devices using Tuya Open API, officially maintained by the Tuya Developer Team.

Tuya Home Assistant Integration Home Assistant custom integration for controlling Powered by Tuya (PBT) devices using Tuya Open API, officially mainta

Tuya 704 Jan 03, 2023
Uses discords api to see if a token has a valid payment method.

Discord Payment Checker Uses discords api to see if a token has a valid payment method. Report Bug · Request Feature Features Checks tokens Checks all

dropout 10 Dec 01, 2022
Wedding website for July 2022.

Capstone Project: a real wedding website! User Stories A user should be able to signup for the website A user should be able to login to the website i

1 Nov 04, 2021
Benachrichtigungs-Bot für das niedersächische Impfportal / Notification bot for the lower saxony vaccination portal

Ein kleines Wochenend-Projekt von mir. Der Bot überwacht die REST-API des niedersächsischen Impfportals auf freie Impfslots und sendet eine Benachrichtigung mit deinem bevorzugtem Service. Ab da gilt

sibalzer 37 May 11, 2022
An advanced telegram country information finder bot.

Country-Info-Bot-V2 An advanced telegram country information finder bot Made with Python3 (C) @FayasNoushad Copyright permission under MIT License Lic

Fayas Noushad 16 Nov 12, 2022
Bootstrapping your personal Web3 info hub from more than 500 RSS Feeds.

RSS Aggregator for Web3 (or 🥩 RAW for short) Bootstrapping your personal Web3 info hub from more than 500 RSS Feeds. What is RSS or Reader Services?

ChainFeeds 1.8k Dec 29, 2022
Reads and prints information from the website MalAPI.io

MalAPIReader Reads and prints information from the website MalAPI.io optional arguments:

Squiblydoo 16 Nov 10, 2022
Financial portfolio optimisation in python, including classical efficient frontier, Black-Litterman, Hierarchical Risk Parity

PyPortfolioOpt has recently been published in the Journal of Open Source Software 🎉 PyPortfolioOpt is a library that implements portfolio optimizatio

Robert Martin 3.2k Jan 02, 2023
Auto Filter Bot V2 With Python

How To Deploy Video Subscribe YouTube Channel Added Features Imdb posters for autofilter. Imdb rating for autofilter. Custom captions for your files.

Milas 2 Mar 25, 2022
The best discord.py template with a changeable prefix

Discord.py Bot Template By noma4321#0035 With A Custom Prefix To Every Guild Function Features Has a custom prefix that is changeable for every guild

Noma4321 5 Nov 24, 2022
A telegram bot providing recon and research functions for bug bounty research

Bug Bounty Bot A telegram bot with commands to simplify bug bounty tasks Installation Use Road Map Installation BugBountyBot is open-source so you can

Tyler Butler 1 Oct 23, 2021
Bot interpretation of the carbon.now.sh site

📒 Source code of the @PicodeBot 🧸 Developer: @hoosnick Run $ git clone https://github.com/hoosnick/picodebot.git $ pip install -r requirements.txt P

Husniddin Murodov 13 Oct 02, 2022
Minimal API for the COVID Booking System of the Offices at the UniPD Math Dep

Simple and easy to use python BOT for the COVID registration booking system of the math department @ unipd (torre archimede). This API creates an interface with the official website, with more useful

Guglielmo Camporese 4 Dec 24, 2021