ETL Pipeline for AWS

Project Description

In this project, an ETL pipeline is built for a data warehouse hosted on AWS Redshift. The data is loaded from S3 into staging tables on Redshift, and SQL queries then build the analytics tables from those staging tables.

Dataset Structure

The dataset is composed of two parts, the song data and the log data, both stored in an S3 bucket.

Song Data

The song data is a dataset with millions of entries. Each file is in JSON format and holds metadata about one song and that song's artist. The files are partitioned by the first three letters of each song ID. A single entry of the song dataset looks like this:

  • {
       "num_songs":1,
       "artist_id":"ARJIE2Y1187B994AB7",
       "artist_latitude":null,
       "artist_longitude":null,
       "artist_location":"",
       "artist_name":"Line Renaud",
       "song_id":"SOUPIRU12A6D4FA1E1",
       "title":"Der Kleine Dompfaff",
       "duration":152.92036,
       "year":0
    }
    

Logs Data

The second dataset consists of log files in JSON format generated by an event simulator based on the songs in the dataset above. These files simulate activity logs from an imaginary music streaming app based on configuration settings. A single entry of the log dataset looks like this:

  • {
        "artist": "Pavement",
        "auth": "Logged in",
        "firstName": "Sylvie",
        "gender": "F",
        "iteminSession": 0,
        "lastName": "Cruz",
        "length": 99.16036,
        "level": "free",
        "location": "Kiamath Falls, OR",
        "method": "PUT",
        "page": "NextSong",
        "registration": 1.540266e+12,
        "sessionId": 345,
        "song": "Mercy: The Laundromat",
        "status": 200,
        "ts": 1541990258796,
        "userAgent": "Mozzilla/5.0...",
        "userId": 10
    }
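
The ts field is a Unix timestamp in milliseconds, which the time dimension table later breaks down into hour, day, week, month, year, and weekday. As an illustration only (this snippet is not part of the pipeline), the example value above can be decoded like this:

    # Illustration only: interpret the ts field (milliseconds since the Unix epoch).
    from datetime import datetime, timezone

    ts_ms = 1541990258796
    print(datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc))
    # -> 2018-11-12 02:37:38.796000+00:00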
    

Data Warehouse Schema

There are two staging tables:

Events table:
artist VARCHAR, auth VARCHAR, firstName VARCHAR, gender VARCHAR, itemInSession INT, lastName VARCHAR, length DOUBLE PRECISION, level VARCHAR, location VARCHAR, method VARCHAR, page VARCHAR, registration VARCHAR, sessionId INT, song VARCHAR, status INT, ts VARCHAR, userAgent VARCHAR, userId INT

Songs table:
num_songs INTEGER, artist_id VARCHAR, artist_latitude VARCHAR, artist_longitude VARCHAR, artist_location VARCHAR, artist_name VARCHAR, song_id VARCHAR, title VARCHAR, duration NUMERIC NOT NULL, year INTEGER
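
As a rough illustration of how these column lists translate into DDL, the events staging table might be declared in sql_queries.py along the following lines (the table name staging_events is an assumption; only the column list above comes from the project):

    # Hypothetical sketch of the events staging table DDL in sql_queries.py.
    # The table name is assumed; the columns and types mirror the list above.
    staging_events_table_create = """
    CREATE TABLE IF NOT EXISTS staging_events (
        artist        VARCHAR,
        auth          VARCHAR,
        firstName     VARCHAR,
        gender        VARCHAR,
        itemInSession INT,
        lastName      VARCHAR,
        length        DOUBLE PRECISION,
        level         VARCHAR,
        location      VARCHAR,
        method        VARCHAR,
        page          VARCHAR,
        registration  VARCHAR,
        sessionId     INT,
        song          VARCHAR,
        status        INT,
        ts            VARCHAR,
        userAgent     VARCHAR,
        userId        INT
    );
    """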

These staging tables are then used to build the dimension and fact tables:

Dimension Tables:
users:
*user_id, first_name, last_name, gender, level*
songs:
*song_id, title, artist_id, year, duration*
artists:
*artist_id, name, location, latitude, longitude*
time:
*start_time, hour, day, week, month, year, weekday*
Fact Tables:
songplays:
*songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent*

Every table has a primary key, so that each row can be uniquely identified.
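
For example, the songplays fact table could be created with a surrogate primary key roughly as follows (a sketch only; the IDENTITY column and NOT NULL constraints are assumptions, while the column list comes from above):

    # Sketch of the fact table DDL with a surrogate primary key.
    # IDENTITY(0,1), Redshift's auto-increment, is an assumption.
    songplay_table_create = """
    CREATE TABLE IF NOT EXISTS songplays (
        songplay_id INT IDENTITY(0,1) PRIMARY KEY,
        start_time  TIMESTAMP NOT NULL,
        user_id     INT NOT NULL,
        level       VARCHAR,
        song_id     VARCHAR,
        artist_id   VARCHAR,
        session_id  INT,
        location    VARCHAR,
        user_agent  VARCHAR
    );
    """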

ETL Process

The ETL process comprises two steps:

  • Copying the data from the S3 bucket into the staging tables
  • Inserting the data from the staging tables into the dimension and fact tables of the star schema (see the sketch below)
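
These two steps map onto two kinds of SQL statements, sketched here with placeholder S3 paths and IAM role (the real values live in dwh.cfg):

    # Step 1 (sketch): COPY raw JSON from S3 into a staging table.
    # The bucket path, JSONPaths file, and IAM role ARN are placeholders.
    staging_events_copy = """
    COPY staging_events
    FROM 's3://<log-data-bucket>/log_data'
    IAM_ROLE '<iam-role-arn>'
    FORMAT AS JSON 's3://<log-data-bucket>/log_json_path.json';
    """

    # Step 2 (sketch): build a dimension table of the star schema from the staged rows.
    user_table_insert = """
    INSERT INTO users (user_id, first_name, last_name, gender, level)
    SELECT DISTINCT userId, firstName, lastName, gender, level
    FROM staging_events
    WHERE page = 'NextSong' AND userId IS NOT NULL;
    """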

Files Description

- create_tables.py: Drops the tables if they already exist and then creates them, so the pipeline always starts from a clean schema.
- etl.py: Runs the ETL pipeline: copies the data from S3 into the staging tables and inserts it into the dimension and fact tables.
- dwh.cfg: Contains the data warehouse settings for AWS: the CLUSTER, IAM_ROLE and S3 sections used by the ETL pipeline.
- sql_queries.py: Contains the SQL queries for dropping, creating, and inserting into and selecting from the tables.
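
As a rough outline of how these files fit together, etl.py might look like the sketch below; the dwh.cfg section and key names and the query lists imported from sql_queries.py are assumptions:

    # Hypothetical outline of etl.py: read dwh.cfg, connect to the Redshift
    # cluster, then run the COPY and INSERT statements from sql_queries.py.
    # Config section/key names and the query-list names are assumptions.
    import configparser

    import psycopg2

    from sql_queries import copy_table_queries, insert_table_queries


    def run_queries(cur, conn, queries):
        """Execute each SQL statement and commit after it finishes."""
        for query in queries:
            cur.execute(query)
            conn.commit()


    def main():
        config = configparser.ConfigParser()
        config.read("dwh.cfg")

        # Connection parameters are read from the CLUSTER section of dwh.cfg.
        conn = psycopg2.connect(
            host=config.get("CLUSTER", "HOST"),
            dbname=config.get("CLUSTER", "DB_NAME"),
            user=config.get("CLUSTER", "DB_USER"),
            password=config.get("CLUSTER", "DB_PASSWORD"),
            port=config.get("CLUSTER", "DB_PORT"),
        )
        cur = conn.cursor()

        run_queries(cur, conn, copy_table_queries)    # S3 -> staging tables
        run_queries(cur, conn, insert_table_queries)  # staging -> star schema

        conn.close()


    if __name__ == "__main__":
        main()

With a layout like this, running create_tables.py once and then etl.py rebuilds the schema and loads the data end to end.
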
Owner
Mobeen Ahmed