A solution designed to extract, transform and load Chicago crime data from an RDS instance to other services in AWS.

Overview

Crime data- Batch Processing:

RDBMS Data Extraction Implementation

This project is intended to implement a solution designed to extract, transform and load Chicago crime data from an RDS instance to other services in AWS.

  • There is an airflow dag script, 2 pyspark application scripts, and a bootstrap actions script in this project which are explained below.

Deployment

Preparation:

  • An AWS RDS MySQL instance is created to store the batch of data.
    • An EC2 instance is created to communicate with the RDS instance.
    • The data is loaded onto the EC2 instance.
    • The database and table are created on the RDS instance with the help of the above created EC2 instance. The data is loaded in the table created above.
    • The create&Load.sql file contains the code for the above table data preparation step.
    • A secret on the Secrets Manager console is stored to communicate with the RDS instance secretly. Also, password rotation after 30 days has been configured for security purposes.
  • The following dag loads the data created from the above step into the AWS environment.

Implementation:

  • The airflow dag is put in the s3://yavula-da-capstone/dag/ location in the S3 bucket. An environment is created on the Amazon Managed Workflows for Apache Airflow(MWAA) console in a specific VPC.
  • The dag is scheduled to run on a daily basis along with SLA monitoring to trigger an alarm if the tasks take more than 36 minutes to finish the whole ETL process.
  • It usually takes 32-34 minutes to finish the dag processes. But if it takes, more than that, it means that something has interrupted the dag from finishing its process and we can check the logs accordingly.

emr_job_flow_manual_steps_dag.py

This script is used to create an airflow dag.

Description

  • The script has steps for the airflow to create an EMR cluster on AWS for a process which is explained later in the next steps.
  • It runs the STEPS that process the spark script on the EMR along with the bootstrap actions present in the bootstrap_actions.sh script which is in an s3 bucket that will install the required package like boto3 onto the EMR instance.
  • Then the step checker is also added to watch this process. This step sensor will periodically check if that last step is completed or skipped or terminated.

spark_ingest_script.py

The spark script which is put into S3 manually, is used to ingest the required data from a table which is present on an RDS isntance and store the data into a raw s3 bucket and catalog into Glue.

Description

  • The ingest script connects to the RDS instance using the mysql-connector.
  • It takes the required crime data from the table and puts it into a spark dataframe which is then written to the AWS S3 and Glue data catalog.
  • S3 File Structure where the snapshot data is saved
    • (bucket)
    • (key)
    • (db-name)
    • (table-name)
  • Glue Data Catalog table pointing to the latest partition

spark_process_script.py

The spark script which is put into S3 manually, is used to query the latest target table, filter required crime details from it, then store the query results into a new final table and further save it to a latest partition.

Description

  • The spark script uses the crime data and performs some query processing using it.
  • It queries the required crime data from the table, performs some processing and puts it into a spark dataframe which is then written to the AWS S3 and Glue data catalog.
  • S3 File Structure where the snapshot data is saved
    • (bucket)
    • (key)
    • (db-name)
    • (table-name)
  • Glue Data Catalog table pointing to the latest partition

bootstrap_actions.sh

Required for the bootstrap actions.

Description

Used to install the packages and dependencies on the cluster that are required for the processes inside the spark script to run.

Deployment

  • This bootstrap script is put manually in an S3 bucket.
  • The location of this bucket is used inside the airflow dag to mention in the bootstrap actions that the required actions are present in the script which is in this particular s3 location.

Business Analysis

The final processed table had the crime type details for all the crimes for which the arrest is not made yet. This business analysis can be viewed from Athena and also has been imported into QuickSight Spice to view the details of different types of crimes and their comparisions.

Owner
Yesaswi Avula
An Applied Data Science student with an escalating learning and performance graph Data analytics, Data engineering, Business Intelligence, ML, Big Data & Cloud
Yesaswi Avula
A program that automates the boring parts of completing the Daily accounting spreadsheet at Taos Ski Valley

TSV_Daily_App A program that automates the boring parts of completing the Daily accounting spreadsheet at my old job. To see how it works you will nee

Devin Beck 2 Jan 01, 2022
Telegram Reporter

[Telegram Reporter v.3 ] 🇮🇷 AliCybeRR 🇮🇷 [ AliCybeRR.Reporter feature ] Login Your Telegram account 👽 support Termux ❕ No Limits ⚡ Secure 🔐 Free

AliCybeRR 1 Jun 08, 2022
❤️A next gen powerful telegram group manager bot for manage your groups and have fun with other cool modules

Natsuki Based on Python Telegram Bot Contributors Video Tutorial: Complete guide on deploying @TheNatsukiBot's clone on Heroku. ☆ Video by Sadew Jayas

Pawan Theekshana 8 Oct 06, 2022
Orca is an extensive and extendable Python 3.x library for the Discord API.

Orca is an extensive and extendable Python 3.x library for the Discord API.

RPS 4 Apr 03, 2022
Plazmix API wrapper for Python

An optimised, easy to use Plazmix API wrapper written in Python

Someone 2 Nov 16, 2021
Получение интересной информации о любой пиццерии Додо

dodopizza-abuse Получение инфорации о выбранной пиццерии Додо Установка и запуск на Linux Устанавливаем git и python: apt-get update && apt-get -y ins

Хозя 24 Nov 02, 2022
Auto like & auto followers facebook

Auto like & auto followers facebook

Fahmi Dev 23 Dec 08, 2022
SkyzoMusicBot - Bot Music Telegram By Skyzo

SKYZO MUSIC BOT Telegram Music Bot And Stream Feature New Version Ready to use m

Skyzo 19 Apr 08, 2022
Modular Telegram bot running on Python

Modular Telegram bot running on Python

Jefanya Efandchris 1 Dec 26, 2021
ClearML - Auto-Magical Suite of tools to streamline your ML workflow. Experiment Manager, MLOps and Data-Management

ClearML - Auto-Magical Suite of tools to streamline your ML workflow Experiment Manager, MLOps and Data-Management ClearML Formerly known as Allegro T

ClearML 3.9k Jan 01, 2023
A pixeldrain python package using pixeldrain official api

Made with Python3 (C) @FayasNoushad Copyright permission under MIT License License - https://github.com/FayasNoushad/Pixeldrain/blob/main/LICENSE In

Fayas Noushad 6 Jan 26, 2022
Personal Discord Python Bot based on Discord.py

Personal Discord bot using the discord.py library by Rapptz

2 Dec 14, 2022
A modular Telegram group management bot running with Python based on Pyrogram.

A modular Telegram group management bot running with Python based on Pyrogram.

Jefanya Efandchris 1 Nov 14, 2022
discord voice bot to stream radio

Radio-Id Bot (Discord Voice Bot) Radio-id-bot (Radio Indonesia) is a simple Discord Music Bot built with discord.py to play a radio from some Indonesi

Adi Fahmi 20 Sep 20, 2022
A small and fun Discord Bot that is written in Python and discord-interactions (with discord.py)

Articuno (discord-interactions) A small and fun Discord Bot that is written in Python and discord-interactions (with discord.py) Get started If you wa

Blue 8 Dec 26, 2022
A Telegram Repo For Devs To Controll The Bots Under Maintenance.This Bot Is For Developers, If Your Bot Is Down, Use This Repo To Give Your Dear Subscribers Some Support By Providing Them Response.

Maintenance Bot A Telegram Repo For Devs To Controll The Bots Under Maintenance About This Bot This Bot Is For Developers, If Your Bot Is Down, Use Th

Vɪᴠᴇᴋ 47 Dec 29, 2022
A python crypto trading bot on Binance using RSI in 25 Lines 🚀

RSI Crypto Trading Bot - Binance A Crypto Trading Bot on Binance trading BTCUSDT and ETHUSDT using RSI in 25 Lines of Code Getting Started Note Python

Blankly Finance 10 Dec 26, 2022
• Create Your Own YouTube Info Api.

youtube_data_api • Create Your Own YouTube Info Api. Deploy How to Use https://{ Heroku App Name }.herokuapp.com/api?link={YouTube link} In local Host

lokaman chendekar 12 Oct 02, 2022
A simple waybar module to display the status of the ICE you are currently in using the ICE Portals JSON API.

waybar-iceportal A simple waybar module to display the status of the ICE you are currently in using the ICE Portals JSON API. Installation Ensure pyth

Moritz 7 Aug 26, 2022
The successor of GeoSnipe, a pythonic Minecraft username sniper based on AsyncIO.

OneSnipe The successor of GeoSnipe, a pythonic Minecraft username sniper based on AsyncIO. Documentation View Documentation Features • Mojang & Micros

1 Jan 14, 2022