PrimaryBid - Transform Application Lifecycle Data and Design an ETL Pipeline Architecture for Ingesting Data from Multiple Sources into Redshift

Overview

This project transforms application lifecycle data and designs an ETL pipeline architecture for ingesting data from multiple sources into Redshift. It is composed of two parts: Part 1 and Part 2.

Part 1

This part ingests raw application lifecycle data in .csv format (“CC Application Lifecycle.csv”). Using Python, the data is transformed so that each application stage becomes a column name, with the time of stage completion as the value recorded against each customer ID.
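
A minimal sketch of that pivot with pandas follows; the column names customer_id, stage and completed_at are assumptions (the real names live in the CSV, and the project implements this inside the transformation class):

```python
import pandas as pd

# Hypothetical column names -- the real ones come from "CC Application Lifecycle.csv".
raw = pd.read_csv("CC Application Lifecycle.csv")

# Pivot: one row per customer ID, one column per application stage,
# with the stage-completion time as the cell value.
wide = raw.pivot_table(
    index="customer_id",
    columns="stage",
    values="completed_at",
    aggfunc="first",  # keep the first recorded completion time per stage
)

wide.to_csv("application_lifecycle_transformed.csv")
```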

Files included in this section:

  • Solution Directory:
    • application_etl.py (Contains the transformation class for the application lifecycle raw data)
    • run_application_etl.py (Ingests and executes the transformations for the application lifecycle raw data)
  • Test Directory:
    • test_application_etl.py (Runs a series of tests for objects in the transformation class)
    • Input Directory (Contains all the input test files)
    • Output Directory (Contains all the output test files)

Execution:

  1. Execute run_application_etl.py to obtain the output file of transformed application lifecycle data.

Modifications:

  1. Extra transformations, bug fixes and other modifications can be added to application_etl.py as objects.
  2. For new transformations (new functions), add a test for the function in test_application_etl.py (see the example after this list) and execute it with pytest -vv.
  3. Once the test passes, call the object in run_application_etl.py to return the desired output.
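
For illustration only, such a test might look like the following; the class name ApplicationETL, the method new_transformation and the fixture path are all hypothetical:

```python
import pandas as pd

from application_etl import ApplicationETL  # hypothetical class name


def test_new_transformation_returns_stage_columns():
    # Hypothetical fixture from the Test/Input directory.
    raw = pd.read_csv("Input/sample_lifecycle.csv")

    result = ApplicationETL().new_transformation(raw)  # hypothetical new function

    # Every application stage in the input should surface as a column.
    assert set(raw["stage"]) <= set(result.columns)
```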

Part 2

This part presents an architectural design for ingesting data from a MongoDB database into a Redshift data platform. The solution accommodates the addition of more data sources in the near future, and the DDL scripts that form part of the solution are reusable for ingesting and loading data into Redshift.

The files included in this section establish the target tables for the data ingestion process:

  • dwh.cfg (Infrastructure parameters and configuration)
  • DDL_queries.py (DDL queries to drop, create, and copy/insert data into Redshift; see the sketch after this list)
  • table_setup_load.py (Class that manages the database connection and the setup and teardown of tables in Redshift)
  • execute_ddl_process.py (script to execute the processes in the table_setup_load class)
  • test_execute_ddl_process.py (script to test the setup and teardown of resources)
  • requirement.txt (key libraries needed to execute the .py scripts)
  • makefile (file to automate installing the libraries and testing the .py scripts)
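
For illustration (the config sections, keys and table below are assumptions, not the actual contents of dwh.cfg or DDL_queries.py), the DDL and COPY queries can be parameterised from the config along these lines:

```python
import configparser

config = configparser.ConfigParser()
config.read("dwh.cfg")

# Hypothetical section/key names -- the real ones live in dwh.cfg.
S3_PATH = config["S3"]["applications_path"]
IAM_ROLE = config["IAM"]["role_arn"]

drop_applications = "DROP TABLE IF EXISTS applications;"

create_applications = """
    CREATE TABLE IF NOT EXISTS applications (
        customer_id  VARCHAR NOT NULL,
        stage        VARCHAR,
        completed_at TIMESTAMP
    );
"""

# Redshift COPY loads a target table directly from S3 via an attached IAM role.
copy_applications = f"""
    COPY applications
    FROM '{S3_PATH}'
    IAM_ROLE '{IAM_ROLE}'
    CSV IGNOREHEADER 1;
"""
```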

Execution:

  1. Execute execute_ddl_process.py to create the target tables and load data into them from S3, as sketched below.
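
A minimal sketch of that execution flow, assuming psycopg2 and the hypothetical query names from the sketch above (the real structure of execute_ddl_process.py and table_setup_load.py may differ):

```python
import configparser

import psycopg2

from DDL_queries import copy_applications, create_applications, drop_applications

config = configparser.ConfigParser()
config.read("dwh.cfg")

# Hypothetical [CLUSTER] keys -- the real names live in dwh.cfg.
conn = psycopg2.connect(
    host=config["CLUSTER"]["host"],
    dbname=config["CLUSTER"]["db_name"],
    user=config["CLUSTER"]["db_user"],
    password=config["CLUSTER"]["db_password"],
    port=config["CLUSTER"]["db_port"],
)

with conn.cursor() as cur:
    # Tear down, recreate, then load the target table from S3.
    for query in (drop_applications, create_applications, copy_applications):
        cur.execute(query)
        conn.commit()

conn.close()
```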

Modifications:

  1. Bucket file sources and other config parameters can be added in dwh.cfg.
  2. New DDL queries, including ones that ingest data from multiple tables via aggregations/joins, can be added in DDL_queries.py (see the example after this list).
  3. Custom functions not captured in this section can be added in table_setup_load.py.
  4. Before executing the scripts in a production environment, test the modifications by executing test_execute_ddl_process.py.
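
As an example of modification 2 (the table and column names are illustrative, not part of the project), an aggregated target table could be added to DDL_queries.py as:

```python
# Illustrative only: build a summary table from a join and aggregation
# over two existing target tables.
create_stage_summary = """
    CREATE TABLE stage_summary AS
    SELECT a.customer_id,
           COUNT(e.stage) AS stages_completed
    FROM applications a
    JOIN application_events e ON e.customer_id = a.customer_id
    GROUP BY a.customer_id;
"""
```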

The architecture below highlights the processes involved in ingesting data from the various data sources into Redshift.

Data Architecture (diagram)

Owner
Emmanuel Boateng Sifah
Computer scientist, Doctoral researcher, Solutions engineer, Data scientist, Data analyst and Data engineer