An ETL pipeline for a large data set from Sparkify, a fictitious music streaming service.

Last update: Feb 11, 2022

Data Warehouse on AWS Redshift

ETL Pipeline in AWS Redshift and S3

Project Summary

In this project, I built an ETL pipeline for a large data set from Sparkify, a fictitious music streaming service. The pipeline first loads the raw data from AWS S3 into staging tables in AWS Redshift.

It then transforms the staged data with SQL and inserts it into analytics tables. These tables let Sparkify's analytics team query its customer base and listening activity more quickly.

File Descriptions

create_tables.py

Creates the fact and dimension tables for the star schema in Redshift.

sql_queries.py

Defines the SQL statements, which are imported by the other two files.

etl.py

Loads data from S3 into staging tables on Redshift, and then processes that data into analytics tables on Redshift.
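
The load-then-transform flow described for etl.py could look roughly like the sketch below. It is a minimal illustration, not the repository's actual code: the function names, the default region, and the COPY options are assumptions, and conn is assumed to be any DB-API connection (for example, one opened with psycopg2).

```python
def build_copy_query(table, s3_path, iam_role_arn, region="us-west-2"):
    """Return a Redshift COPY statement that loads JSON files from S3.

    All arguments are placeholders; the region default is an assumption.
    """
    return (
        f"COPY {table} FROM '{s3_path}' "
        f"IAM_ROLE '{iam_role_arn}' "
        f"REGION '{region}' "
        "FORMAT AS JSON 'auto';"
    )


def run_queries(conn, queries):
    """Execute each SQL statement in order, committing after each one."""
    cur = conn.cursor()
    for query in queries:
        cur.execute(query)
        conn.commit()


def run_etl(conn, copy_queries, insert_queries):
    """Load the staging tables first, then fill the analytics tables."""
    run_queries(conn, copy_queries)    # S3 -> staging tables
    run_queries(conn, insert_queries)  # staging -> fact and dimension tables
```

Running the COPY statements before the INSERT statements matters because the analytics tables are populated by selecting from the staging tables.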

Design Decisions

Star Schema

The data warehouse uses a star schema: a central fact table surrounded by the dimension tables it references.

Fact table: songplays -- every occurrence of a song being played is stored here.

Dimension tables:

users -- the users of the Sparkify music streaming app
songs -- the songs in Sparkify's music catalog
artists -- the artists who record the catalog's songs
time -- the timestamps of records in songplays, broken down into specific date and time units (year, day, hour, etc.)
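
To illustrate how the fact table and a dimension table work together, here is a small sketch that uses an in-memory SQLite database as a local stand-in for Redshift. The column names and the sample rows are made up for the example and are not taken from the project's actual schema.

```python
import sqlite3

# In-memory SQLite is used only as a local stand-in for Redshift.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users   (user_id INTEGER PRIMARY KEY, first_name TEXT, level TEXT);
CREATE TABLE songs   (song_id TEXT PRIMARY KEY, title TEXT, artist_id TEXT);
CREATE TABLE artists (artist_id TEXT PRIMARY KEY, name TEXT);
CREATE TABLE time    (start_time TEXT PRIMARY KEY, hour INTEGER, year INTEGER);
CREATE TABLE songplays (
    songplay_id INTEGER PRIMARY KEY,
    start_time  TEXT REFERENCES time(start_time),
    user_id     INTEGER REFERENCES users(user_id),
    song_id     TEXT REFERENCES songs(song_id),
    artist_id   TEXT REFERENCES artists(artist_id)
);
-- Hypothetical sample rows, for illustration only.
INSERT INTO users   VALUES (1, 'Ada', 'paid'), (2, 'Lin', 'free');
INSERT INTO artists VALUES ('A1', 'Example Artist');
INSERT INTO songs   VALUES ('S1', 'Example Song', 'A1');
INSERT INTO time    VALUES ('2018-11-01 21:00:00', 21, 2018);
INSERT INTO songplays VALUES
    (1, '2018-11-01 21:00:00', 1, 'S1', 'A1'),
    (2, '2018-11-01 21:00:00', 2, 'S1', 'A1'),
    (3, '2018-11-01 21:00:00', 1, 'S1', 'A1');
""")

# A typical analytics query: join the fact table to one dimension
# and aggregate. Here, the number of song plays per subscription level.
plays_by_level = conn.execute("""
    SELECT u.level, COUNT(*) AS plays
    FROM songplays sp
    JOIN users u ON u.user_id = sp.user_id
    GROUP BY u.level
    ORDER BY plays DESC
""").fetchall()

print(plays_by_level)
```

Each analytics question needs only one join from songplays to the relevant dimension, which is the main reason a star schema suits this kind of reporting workload.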

Run Instructions

Clone this repository, which will place the 3 .py files and the .cfg file into the same directory.
Duplicate the dwh_template.cfg file to create a new file named dwh.cfg. Because this will contain private login credentials, be sure it is added to the .gitignore file.
Fill in the [CLUSTER] and [IAM_ROLE] attributes from AWS, according to the IAM role and Redshift cluster already created. Please consult AWS's well-documented instructions as necessary.
Run python create_tables.py to create the tables in the Redshift cluster.
Run python etl.py. This copies the 2 large data sets from S3 into staging tables, and then populates the analytics tables (the fact table and the smaller dimension tables) from the staged data.
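
As a rough guide to the configuration step, the sketch below shows how the scripts might read dwh.cfg with Python's standard configparser module. Only the [CLUSTER] and [IAM_ROLE] section names come from this README; the key names (HOST, DB_NAME, and so on), the example values, and the build_dsn helper are assumptions, so check dwh_template.cfg for the real keys.

```python
import configparser

# Hypothetical dwh.cfg contents; the key names and values are assumptions.
EXAMPLE_CFG = """
[CLUSTER]
HOST = example-cluster.abc123.us-west-2.redshift.amazonaws.com
DB_NAME = dev
DB_USER = awsuser
DB_PASSWORD = change-me
DB_PORT = 5439

[IAM_ROLE]
ARN = arn:aws:iam::123456789012:role/example-redshift-role
"""


def build_dsn(config):
    """Build a libpq-style connection string from the [CLUSTER] section."""
    c = config["CLUSTER"]
    return (
        f"host={c['HOST']} dbname={c['DB_NAME']} user={c['DB_USER']} "
        f"password={c['DB_PASSWORD']} port={c['DB_PORT']}"
    )


config = configparser.ConfigParser()
config.read_string(EXAMPLE_CFG)  # in the project: config.read("dwh.cfg")

dsn = build_dsn(config)               # could be passed to psycopg2.connect(dsn)
role_arn = config["IAM_ROLE"]["ARN"]  # used in the COPY statements
```

Keeping the credentials in a separate file that is listed in .gitignore means the Python files themselves never contain secrets.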
