A solution designed to extract, transform and load Chicago crime data from an RDS instance to other services in AWS.

Overview

Crime data- Batch Processing:

RDBMS Data Extraction Implementation

This project is intended to implement a solution designed to extract, transform and load Chicago crime data from an RDS instance to other services in AWS.

  • There is an airflow dag script, 2 pyspark application scripts, and a bootstrap actions script in this project which are explained below.

Deployment

Preparation:

  • An AWS RDS MySQL instance is created to store the batch of data.
    • An EC2 instance is created to communicate with the RDS instance.
    • The data is loaded onto the EC2 instance.
    • The database and table are created on the RDS instance with the help of the above created EC2 instance. The data is loaded in the table created above.
    • The create&Load.sql file contains the code for the above table data preparation step.
    • A secret on the Secrets Manager console is stored to communicate with the RDS instance secretly. Also, password rotation after 30 days has been configured for security purposes.
  • The following dag loads the data created from the above step into the AWS environment.

Implementation:

  • The airflow dag is put in the s3://yavula-da-capstone/dag/ location in the S3 bucket. An environment is created on the Amazon Managed Workflows for Apache Airflow(MWAA) console in a specific VPC.
  • The dag is scheduled to run on a daily basis along with SLA monitoring to trigger an alarm if the tasks take more than 36 minutes to finish the whole ETL process.
  • It usually takes 32-34 minutes to finish the dag processes. But if it takes, more than that, it means that something has interrupted the dag from finishing its process and we can check the logs accordingly.

emr_job_flow_manual_steps_dag.py

This script is used to create an airflow dag.

Description

  • The script has steps for the airflow to create an EMR cluster on AWS for a process which is explained later in the next steps.
  • It runs the STEPS that process the spark script on the EMR along with the bootstrap actions present in the bootstrap_actions.sh script which is in an s3 bucket that will install the required package like boto3 onto the EMR instance.
  • Then the step checker is also added to watch this process. This step sensor will periodically check if that last step is completed or skipped or terminated.

spark_ingest_script.py

The spark script which is put into S3 manually, is used to ingest the required data from a table which is present on an RDS isntance and store the data into a raw s3 bucket and catalog into Glue.

Description

  • The ingest script connects to the RDS instance using the mysql-connector.
  • It takes the required crime data from the table and puts it into a spark dataframe which is then written to the AWS S3 and Glue data catalog.
  • S3 File Structure where the snapshot data is saved
    • (bucket)
    • (key)
    • (db-name)
    • (table-name)
  • Glue Data Catalog table pointing to the latest partition

spark_process_script.py

The spark script which is put into S3 manually, is used to query the latest target table, filter required crime details from it, then store the query results into a new final table and further save it to a latest partition.

Description

  • The spark script uses the crime data and performs some query processing using it.
  • It queries the required crime data from the table, performs some processing and puts it into a spark dataframe which is then written to the AWS S3 and Glue data catalog.
  • S3 File Structure where the snapshot data is saved
    • (bucket)
    • (key)
    • (db-name)
    • (table-name)
  • Glue Data Catalog table pointing to the latest partition

bootstrap_actions.sh

Required for the bootstrap actions.

Description

Used to install the packages and dependencies on the cluster that are required for the processes inside the spark script to run.

Deployment

  • This bootstrap script is put manually in an S3 bucket.
  • The location of this bucket is used inside the airflow dag to mention in the bootstrap actions that the required actions are present in the script which is in this particular s3 location.

Business Analysis

The final processed table had the crime type details for all the crimes for which the arrest is not made yet. This business analysis can be viewed from Athena and also has been imported into QuickSight Spice to view the details of different types of crimes and their comparisions.

Owner
Yesaswi Avula
An Applied Data Science student with an escalating learning and performance graph Data analytics, Data engineering, Business Intelligence, ML, Big Data & Cloud
Yesaswi Avula
fair-test is a library to build and deploy FAIR metrics tests APIs supporting the specifications used by the FAIRMetrics working group.

☑️ FAIR test fair-test is a library to build and deploy FAIR metrics tests APIs supporting the specifications used by the FAIRMetrics working group. I

Maastricht University IDS 6 Oct 30, 2022
API de mi aplicación de Biblioteca

BOOKSTORE API Instalación/Configuración Previo Es una buena idea crear un entorno virtual antes de instalar las dependencias. Puedes hacerlo con el si

Gabriel Morales 1 Jan 09, 2022
Awslogs - AWS CloudWatch logs for Humans™

awslogs awslogs is a simple command line tool for querying groups, streams and events from Amazon CloudWatch logs. One of the most powerful features i

Jorge Bastida 4.5k Dec 30, 2022
Discord Crypto Payment Cards Selfbot

A Discord selfbot that serves the purpose of displaying text and QR versions of your BTC, LTC & ETH payment information for easy and simple commercial or personal transactions.

2 Apr 12, 2022
Der Dischkort Bot für Andiismus

AndreOS Der Dischkort Bot für Andiismus Wichtigger Bot für den hauseigenen Discord-Server Indoktrinationsmechanismusleitungsprogramm der andiistischen

Leon Bartle 3 Jan 13, 2022
Python bindings for ArrayFire: A general purpose GPU library.

ArrayFire Python Bindings ArrayFire is a high performance library for parallel computing with an easy-to-use API. It enables users to write scientific

ArrayFire 402 Dec 20, 2022
✨ A Telegram mirror/leech bot By SparkXcloud Group ✨

SparkXcloud-Gdrive-MirrorBot SparkXcloud-Gdrive-MirrorBot is a multipurpose Telegram Bot writen in Python for mirroring files on the Internet to our b

119 Oct 23, 2022
Powerful and Advance Telegram Bot with soo many features😋🔥❤

Chat-Bot Reach this bot on Telegram Chat Bot New Features 🔥 ✨ Improved Chat Experience ✨ Removed Some Unnecessary Commands ✨ Added Facility to downlo

Sanila Ranatunga 10 Oct 21, 2022
Python wrapper for JeyyAPI

Async python wrapper for JeyyAPI

7 Dec 10, 2022
A discord token nuker With loads of options that will screw an account up real bad, also has inbuilt massreport, GroupChat Spammer and Token/Password/Creditcard grabber and so much more!

Installation | Important | Changelogs | Discord NOTE: Hazard is not finished! You can expect bugs, crashes, and non-working functions. Please make an

Rdimo 470 Aug 09, 2022
Python wrapper for Gmailnator

Python wrapper for Gmailnator

h0nda 11 Mar 19, 2022
Easy to use API Wrapper for somerandomapi.ml.

Overview somerandomapi is an API Wrapper for some-random-api.ml Examples Asynchronous from somerandomapi import Animal

Myxi 1 Dec 31, 2021
Cdk-python-crud-app - CDK Python CRUD App

Welcome to your CDK Python project! You should explore the contents of this proj

Shapon Sheikh 1 Jan 12, 2022
A lightweight Python wrapper for the IG Markets API

trading_ig A lightweight Python wrapper for the IG Markets API. Simplifies access to the IG REST and Streaming APIs with a live or demo account. What

IG Python 247 Dec 08, 2022
Build a better understanding of your data in PostgreSQL.

Data Fluent for PostgreSQL Build a better understanding of your data in PostgreSQL. The following shows an example report generated by this tool. It g

Mark Litwintschik 28 Aug 30, 2022
Wechat based auto reply with pyautogui

Python-微信 自动回复 练手~ 一直想做个给微信发个消息,就可以跑Python程序,并将结果发送给我的东西,之前看了 B站@不高兴就喝水 的视频,终于有了灵感~ 使用的是模拟点击方案,请求期间是不能操作了。 库 pyautogui 用于模拟鼠标键盘操作和定位操作位置 pyperclip 剪贴板

Vito Song 1 Oct 22, 2022
Repositorio dedicado a contener los archivos fuentes del bot de discord "Lector de Ejercicios".

Lector de Ejercicios Este bot de discord está pensado para usarse únicamente en el discord de la materia Algoritmos y Programación I, de la Facultad d

Franco Lighterman Reismann 3 Sep 17, 2022
Free and Open Source Machine Translation API. 100% self-hosted, no limits, no ties to proprietary services. Built on top of Argos Translate.

LibreTranslate Try it online! | API Docs Free and Open Source Machine Translation API, entirely self-hosted. Unlike other APIs, it doesn't rely on pro

UAV4GEO 3.5k Jan 03, 2023
Cloudkeeper is “housekeeping for clouds” - find leaky resources, manage quota limits, detect drift and clean up.

Cloudkeeper Housekeeping for Clouds! Table of contents Overview Docker based quick start Cloning this repository Component list Contact License Overvi

Some Engineering 1.2k Jan 03, 2023
Easy to use phishing tool with 63 website templates. Author is not responsible for any misuse.

PyPhisher [+] Created By KasRoudra [+] Description : Ultimate phishing tool in python. Includes popular websites like facebook, twitter, instagram, gi

KasRoudra 1.1k Jan 01, 2023