A data engineering project with Kafka, Spark Streaming, dbt, Docker, Airflow, Terraform, GCP and much more!

Overview

Streamify

A data pipeline with Kafka, Spark Streaming, dbt, Docker, Airflow, Terraform, GCP and much more!

Description

Objective

The project will stream events generated from a fake music streaming service (like Spotify) and create a data pipeline that consumes the real-time data. The data coming in would be similar to an event of a user listening to a song, navigating on the website, authenticating. The data would be processed in real-time and stored to the data lake periodically (every two minutes). The hourly batch job will then consume this data, apply transformations, and create the desired tables for our dashboard to generate analytics. We will try to analyze metrics like popular songs, active users, user demographics etc.

Dataset

Eventsim is a program that generates event data to replicate page requests for a fake music web site. The results look like real use data, but are totally fake. The docker image is borrowed from viirya's fork of it, as the original project has gone without maintenance for a few years now.

Eventsim uses song data from Million Songs Dataset to generate events. I have used a subset of 10000 songs.

Tools & Technologies

Architecture

streamify-architecture

Final Result

dashboard

Setup

WARNING: You will be charged for all the infra setup. You can avail 300$ in credit by creating a new account on GCP.

Pre-requisites

If you already have a Google Cloud account and a working terraform setup, you can skip the pre-requisite steps.

Get Going!

A video walkthrough of how I run my project - YouTube Video

  • Procure infra on GCP with Terraform - Setup
  • (Extra) SSH into your VMs, Forward Ports - Setup
  • Setup Kafka Compute Instance and start sending messages from Eventsim - Setup
  • Setup Spark Cluster for stream processing - Setup
  • Setup Airflow on Compute Instance to trigger the hourly data pipeline - Setup

Debug

If you run into issues, see if you find something in this debug guide.

How can I make this better?!

A lot can still be done :).

  • Choose managed Infra
    • Cloud Composer for Airflow
    • Confluent Cloud for Kafka
  • Create your own VPC network
  • Build dimensions and facts incrementally instead of full refresh
  • Write data quality tests
  • Create dimensional models for additional business processes
  • Include CI/CD
  • Add more visualizations

Special Mentions

I'd like to thank the DataTalks.Club for offering this Data Engineering course for completely free. All the things I learnt there, enabled me to come up with this project. If you want to upskill on Data Engineering technologies, please check out the course. :)

Owner
Ankur Chavda
Data Engineer-ish
Ankur Chavda
Learning objective: Use React.js, Axios, and CSS to build a responsive YouTube clone app

Learning objective: Use React.js, Axios, and CSS to build a responsive YouTube clone app to search for YouTube videos, channels, playlists, and live events via wrapper around Google YouTube API.

Dillon 0 May 03, 2022
Wordle Solver

Wordle Solver Installation Install the following onto your computer: Python 3.10.x Download Page Run pip install -r requirements.txt Instructions To r

John Bucknam 1 Feb 15, 2022
Hera is a Python framework for constructing and submitting Argo Workflows.

Hera is an Argo Workflows Python SDK. Hera aims to make workflow construction and submission easy and accessible to everyone! Hera abstracts away workflow setup details while still maintaining a cons

argoproj-labs 241 Jan 02, 2023
Performance data for WASM SIMD instructions.

WASM SIMD Data This repository contains code and data which can be used to generate a JSON file containing information about the WASM SIMD proposal. F

Evan Nemerson 5 Jul 24, 2022
This project intends to take the user's CEP (brazilian adress code) and return the local in which the CEP is placed.

This project aims to simply return the CEP's (the brazilian resident adress code) User of the application. The project uses a request and passes on to

Daniel Soares Saldanha 4 Nov 17, 2021
Load, explore and analyse data from Scotland and rest of the world related to Covid19.

Streamlit Examples This is my first attempt with Streamlit. It is an open-source framework, free, Python-based and easy to use tool to build and deplo

Eyad Elyan 12 Mar 01, 2021
Virtual webcam that takes real webcam footage and replaces the background in order to have Virtual Backgrounds in MS Teams for Linux where the feature is unimplemented.

Background Remover The Need It's been good long while since Microsoft first released a Teams version for Linux and yet, one of Teams' coolest features

Dylan Turner 80 Dec 20, 2022
Intelligent Employer Profiling Platform.

Intelligent Employer Profiling Platform Setup Instructions Generating Model Data Ensure that Python 3.9+ and pip is installed. Install project depende

Harvey Donnelly 2 Jan 09, 2022
GUI tool to manage the contents of chests in Botw

Botw chest manager is a small gui tool allowing to easily manage chests. Sometimes Ice Spear can be very time consuming when adding a simple chest. The purpose of this light tool is to add a new ches

3 Aug 25, 2022
Addons like multipages for streamlit webapp

streamlit_pages Installation $ pip install streamlit-pages Features Adding multiple pages to streamlit Sharing specific pages Usage import streamlit

36 Dec 25, 2022
BlackIP-Rep is a tool designed to gather the reputation and information of Bulk IP's.

BlackIP-Rep is a tool designed to gather the reputation and information of Bulk IP's. Focused on increasing the workflow of Security Operations(SOC) team during investigation.

0LiVEr 6 Dec 12, 2022
Just some mtk tool for exploitation, reading/writing flash and doing crazy stuff

Just some mtk tool for exploitation, reading/writing flash and doing crazy stuff. For linux, a patched kernel is needed (see Setup folder) (except for read/write flash). For windows, you need to inst

Bjoern Kerler 1.1k Dec 31, 2022
A small scale relica of bank management system using the MySQL queries in the python language.

Bank_Management_system This is a Bank Management System Database Project. Abstract: The main aim of the Bank Management Mini project is to keep record

Arun Singh Babal 1 Jan 27, 2022
Library to emulate the Sneakers movie effect

py-sneakers Port to python of the libnms C library To recreate the famous data decryption effect shown in the 1992 film Sneakers. Install pip install

Nicolas Rebagliati 11 Aug 27, 2021
Jarvis Python BOT acts like Google-assistance

Jarvis-Python-BOT Jarvis Python BOT acts like Google-assistance Setup Add Mail ID (Gmail) in the file at line no 82.

Ishan Jogalekar 1 Jan 08, 2022
A good Tool to comment on xmw

A good Tool to comment on xmw

1 Feb 10, 2022
Path of Exile Vendor Recipe Tracker (Chaos/Regal orb)

Path of Exile Vendor Trade Tracker Are you tired of manually keeping track of collected and missing items for farming Chaos or Regal Orbs in PoE? Me t

1 Nov 09, 2021
Python PID Controller and Process Simulator (FOPDT) with GUI.

PythonPID_Simulator Python PID Controller and Process Simulator (FOPDT) with GUI. Run the File. Then select Model Values and Tune PID.. Hit Refresh to

19 Oct 14, 2022
Python tools for working with Orbit Ephemeris Messages (OEMs).

Python Orbit Ephemeris Message tools Python tools for working with Orbit Ephemeris Messages (OEMs). Development Status Installation The oem package is

Brad Sease 4 Apr 06, 2022
Hashcrack - A non-object oriented open source, Software for Windows/Linux made in Python 3

Multi Force This project is a non-object oriented open source, Software for Wind

Radiationbolt 3 Jan 02, 2023