Educational project on how to build an ETL (Extract, Transform, Load) data pipeline, orchestrated with Airflow.

Overview

ETL Pipeline with Airflow, Spark, s3, MongoDB and Amazon Redshift

About | Scenario | Base Concepts | Prerequisites | Set-up | Installation | Airflow Interface | Pipeline Task by Task | Shut Down and Restart Airflow | Learning Resources


About

An AWS S3 bucket is used as a data lake in which JSON files are stored. The data is extracted from a JSON file and parsed (cleaned). It is then transformed/processed with Spark (PySpark) and loaded/stored in either a MongoDB database or an Amazon Redshift data warehouse.

The pipeline architecture - author's interpretation:

Note: Since this project was built for learning purposes and as an example, it functions only for a single scenario and data schema.

The project is built in Python and it has 2 main parts:

  1. The Airflow DAG file, dags/dagRun.py, which orchestrates the data pipeline tasks.
  2. The PySpark data transformation/processing script, sparkFiles/sparkProcess.py.

Note: The code, and especially the comments, in the Python files dags/dagRun.py and sparkFiles/sparkProcess.py are intentionally verbose to make the functionality easier to follow.

Scenario

The Romanian COVID-19 data, provided by https://datelazi.ro/, contains COVID-19 data for each county, including the cumulative COVID totals from one day to the next. It does not contain the difference in numbers between days (e.g. for county X there were 7 cases on day 1 and 37 cases on day 2). This data is loaded as a JSON file into the S3 bucket.

Find the differences between days for all counties (e.g. for county X there were 30 more cases on day 2 than on day 1). If the difference is smaller than 0 (e.g. because of a data recording error), then the difference for that day should be 0.
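
The rule above can be illustrated with a minimal plain-Python sketch. This is not the project's PySpark script; the function name daily_differences is hypothetical and the input is assumed to be one county's cumulative totals in date order.

```python
def daily_differences(totals):
    """Return day-over-day increases for a list of cumulative totals.

    Negative differences (e.g. from a recording error) are clamped to 0.
    The first day has no previous day, so the result has len(totals) - 1 items.
    """
    return [
        max(0, today - yesterday)
        for yesterday, today in zip(totals, totals[1:])
    ]


# County X from the scenario: 7 cases on day 1, 37 cases on day 2.
print(daily_differences([7, 37]))          # [30]

# A drop in the cumulative total is treated as a recording error.
print(daily_differences([7, 37, 35, 40]))  # [30, 0, 5]
```

In Spark the same idea is usually expressed with a window function that compares each row to the previous row of the same county, but the clamping rule stays the same.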

Base concepts

Prerequisites

Set-up

Download / pull the repo to your desired location.

You will have to create an AWS S3 user specifically for Airflow to interact with the S3 bucket. The credentials for that user have to be saved in the s3 file found in the directory /airflow-data/creds:

[airflow-spark1]
aws_access_key_id = 
aws_secret_access_key = 

On rows 17 and 18 in dags/dagRun.py you can choose which database system to use, MongoDB (NoSQL) or Amazon Redshift (RDBMS), by commenting/uncommenting one or the other:

# database = 'mongoDB'
database = 'Redshift'

If you want to use MongoDB, you will have to enter the MongoDB connection string (or an environment variable or file containing the string) in the dags/dagRun.py file, line 22:

client = pymongo.MongoClient('mongoDB_connection_string')
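
One possible way to avoid hard-coding the connection string is sketched below. The helper name resolve_connection_string, the environment variable MONGODB_URI and the optional file path are all assumptions made for illustration; none of them exist in the project.

```python
import os
from pathlib import Path


def resolve_connection_string(env_var="MONGODB_URI", file_path=None, environ=None):
    """Return the connection string from an environment variable,
    falling back to the contents of a file.

    env_var and file_path are hypothetical names, not part of the project.
    """
    environ = os.environ if environ is None else environ
    value = environ.get(env_var)
    if value:
        return value.strip()
    if file_path is not None and Path(file_path).is_file():
        return Path(file_path).read_text().strip()
    raise RuntimeError(
        f"No connection string found in ${env_var} or in {file_path!r}"
    )


# In dags/dagRun.py the result would replace the hard-coded string, e.g.:
# client = pymongo.MongoClient(resolve_connection_string())
```

Because Airflow runs inside Docker containers here, the environment variable (or the file) would also have to be made available to those containers for this to work.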

If you want to use a Redshift cluster, you will have to provide your Amazon Redshift database name, host and the rest of the credentials on rows 29 to 34 in dags/dagRun.py:

dbname = 'testairflow'
host = '*******************************.eu-central-1.redshift.amazonaws.com'
port = '****'
user = '*********'
password = '********************'
awsIAMrole = 'arn:aws:iam::************:role/*******'

You will have to change the S3 bucket name and file key (the name of the file saved in the S3 bucket) at lines 148 and 150 in dags/dagRun.py:

# name of the file in the AWS s3 bucket
key = 'countyData.json'
# name of the AWS s3 bucket
bucket = 'renato-airflow-raw'

In the repo directory, execute the following command, which will create the .env file containing the Airflow UID and GID needed by docker-compose:

echo -e "AIRFLOW_UID=$(id -u)\nAIRFLOW_GID=0" > .env

Installation

Start the installation with:

docker-compose up -d

This command will pull and create Docker images and containers for Airflow, according to the instructions in the docker-compose.yml file:

After everything has been installed, you can check the status of your containers (if they are healthy) with:

docker ps

Note: it might take up to 30 seconds for the containers to have the healthy flag after starting.

Airflow Interface

You can now access the Airflow web interface by going to http://localhost:8080/. If you have not changed them in the docker-compose.yml file, the default user is airflow and password is airflow:

After signing in, the Airflow home page is the DAGs list page. Here you will see all your DAGs and the Airflow example DAGs, sorted alphabetically.

Any DAG Python script saved in the directory dags/ will show up on the DAGs page (e.g. the first DAG, analyze_json_data, is the one built for this project).

Note: If you update the code in the Python DAG script, the Airflow DAGs page has to be refreshed.

Note: If you do not want to see any Airflow example DAGs, set the AIRFLOW__CORE__LOAD_EXAMPLES flag to False in the docker-compose.yml file before starting the installation.

Click on the name of the DAG to open the DAG details page:

On the Graph View page you can see the DAG running through each task (getLastProcessedDate, getDate, etc.) after it has been unpaused and triggered:

Pipeline Task by Task

Task getLastProcessedDate

Finds the last processed date in the MongoDB database and saves/pushes it to an Airflow XCom.

Task getDate

Grabs the data saved in the XCom and, depending on the value pulled, returns the task id parseJsonFile or the task id endRun.
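
The XCom hand-off between these two tasks can be sketched roughly as follows. This is not the code from dags/dagRun.py: the XCom key lastProcessedDate and the branching rule (a missing date ends the run) are assumptions, and FakeTaskInstance is only a stand-in so the sketch runs without Airflow. In a real DAG, Airflow supplies the task instance (ti), which provides xcom_push and xcom_pull.

```python
class FakeTaskInstance:
    """Tiny stand-in for Airflow's TaskInstance, used only for this sketch."""

    def __init__(self):
        self._store = {}

    def xcom_push(self, key, value):
        self._store[key] = value

    def xcom_pull(self, key=None, task_ids=None):
        return self._store.get(key)


def get_last_processed_date(ti, last_date):
    # In the real task, last_date would come from a database query.
    ti.xcom_push(key="lastProcessedDate", value=last_date)


def get_date(ti):
    last_date = ti.xcom_pull(key="lastProcessedDate",
                             task_ids="getLastProcessedDate")
    # Assumed rule: without a usable date there is nothing to parse.
    return "parseJsonFile" if last_date is not None else "endRun"


ti = FakeTaskInstance()
get_last_processed_date(ti, "2021-04-07")  # made-up date
print(get_date(ti))  # parseJsonFile
```

A function that returns a task id in this way is what a branching operator uses to decide which downstream task runs next.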

Task parseJsonFile

The JSON contains unnecessary data for this case, so it needs to be parsed to extract only the daily total numbers for each county.

If there is any new data to be processed (the date extracted in the task getLastProcessedDate is older than the dates in the data), it is saved in a temp file in the directory sparkFiles:

e.g.: for the county AB, on the 7th of April there were 19046 COVID cases and on the 8th of April there were 19150 cases

It also returns the task id endRun if there was no new data, or the task id processParsedData otherwise.
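
The "new data only" check can be sketched as below. The input shape ({ISO date: {county code: cumulative total}}), the function names and the sample numbers are all made up for illustration; the real file from datelazi.ro has a richer structure that is not reproduced here.

```python
from datetime import date


def select_new_days(daily_totals, last_processed):
    """Keep only the days that are newer than the last processed date.

    daily_totals: {ISO date string: {county code: cumulative total}}
    (a hypothetical, simplified shape).
    """
    return {
        day: counties
        for day, counties in sorted(daily_totals.items())
        if date.fromisoformat(day) > last_processed
    }


def next_task(new_days):
    # Mirrors the branching described above: stop when nothing is new.
    return "processParsedData" if new_days else "endRun"


# Made-up sample data, not real figures.
parsed = {
    "2021-04-07": {"AB": 10, "AR": 20},
    "2021-04-08": {"AB": 12, "AR": 25},
}

new_days = select_new_days(parsed, last_processed=date(2021, 4, 7))
print(new_days)             # {'2021-04-08': {'AB': 12, 'AR': 25}}
print(next_task(new_days))  # processParsedData
```

Note that computing the difference for the first new day also needs the total of the last processed day, so a real implementation would have to keep that day's totals available as well.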

Task processParsedData

Executes the PySpark script sparkFiles/sparkProcess.py.

The parsed data is processed and the result is saved in another temporary file in the sparkFiles directory:

e.g.: for the county AB, on the 8th of April there were 104 more cases than on the 7th of April

Task saveToDB

Saves the processed data either in the MongoDB database:

Or in Redshift:

Note: The Redshift column names are the full names of the counties, as the short versions of some of them conflict with SQL reserved words.

Task endRun

Dummy task used as the end of the pipeline.

Shut Down and Restart Airflow

If you want to make changes to any of the configuration files (docker-compose.yml, Dockerfile, requirements.txt), you will have to shut down the Airflow instance with:

docker-compose down

This command will shut down and delete any containers created/used by Airflow.

For any changes made in the configuration files to be applied, you will have to rebuild the Airflow images with the command:

docker-compose build

Recreate all the containers with:

docker-compose up -d

Learning Resources

These are some useful learning resources for anyone interested in Airflow and Spark:

License

You can check out the full license here

This project is licensed under the terms of the MIT license.

Owner
Renato