Overview

Project Scope

The scope of this project is to build a data warehouse on Google Cloud Platform that helps answer common business questions as well as power dashboards. To do that, a conceptual data model and a data pipeline will be defined.

Architecture

Data is uploaded to a Google Cloud Storage bucket. GCS acts as the data lake where all raw files are stored. The data is then loaded into staging tables on BigQuery. The ETL process takes data from those staging tables and creates the data mart tables. An Airflow instance can be deployed on Google Compute Engine or run locally to orchestrate the pipeline.

Here are the justifications for the technologies used:

  • Google Cloud Storage: acts as the data lake; storage scales practically without limit as the volume of raw data grows.
  • Google BigQuery: acts as the database engine for the data warehouse, data marts, and ETL processes. BigQuery is a serverless solution that can easily and effectively process petabyte-scale datasets.
  • Apache Airflow: orchestrates the workflow by issuing commands to load data into BigQuery and SQL queries for the ETL process. Airflow does not have to process any data by itself, which allows the architecture to scale (see the sketch after this list).
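
To make that division of labour concrete, here is a minimal sketch (not the project's actual DAG) of how Airflow only submits work to BigQuery: one GCS-to-BigQuery load job and one SQL transformation job. It assumes the apache-airflow-providers-google package is installed; the DAG id and staging table name are illustrative, and the schema is left to BigQuery's autodetection.

from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

with DAG(
    "gcs_to_bq_sketch",                  # illustrative DAG id
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    # BigQuery reads the file straight from GCS; Airflow only submits the load job.
    load_airports = GCSToBigQueryOperator(
        task_id="load_airports",
        bucket="cloud-data-lake-gcp",
        source_objects=["airports/airport-codes_csv.csv"],
        destination_project_dataset_table="cloud-data-lake.IMMIGRATION_DWH_STAGING.AIRPORTS",
        source_format="CSV",
        skip_leading_rows=1,
        autodetect=True,
        write_disposition="WRITE_TRUNCATE",
    )

    # The ETL step is a SQL job that runs entirely inside BigQuery.
    build_d_airport = BigQueryInsertJobOperator(
        task_id="build_d_airport",
        configuration={
            "query": {
                "query": (
                    "CREATE OR REPLACE TABLE `cloud-data-lake.IMMIGRATION_DWH.D_AIRPORT` AS "
                    "SELECT * FROM `cloud-data-lake.IMMIGRATION_DWH_STAGING.AIRPORTS`"
                ),
                "useLegacySql": False,
            }
        },
    )

    load_airports >> build_d_airport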

Data Model

The database is designed following the star-schema principle, with 1 fact table and 7 dimension tables (a sample query over this schema follows the table descriptions below).

image

  • F_IMMIGRATION_DATA: contains immigration information such as arrival date, departure date, visa type, gender, country of origin, etc.
  • D_TIME: contains the date dimension attributes for date columns
  • D_PORT: contains port_id and port_name
  • D_AIRPORT: contains airports within a state
  • D_STATE: contains state_id and state_name
  • D_COUNTRY: contains country_id and country_name
  • D_WEATHER: contains average weather for a state
  • D_CITY_DEMO: contains demographic information for a city
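
As an illustration of how this schema supports common business questions, the query below counts arrivals by origin country and visa type using the BigQuery Python client. This is only a sketch: the fact-table columns used for the join (country_id, visa_type) are assumptions and should be adjusted to the actual columns of F_IMMIGRATION_DATA.

from google.cloud import bigquery

# Uses the service account credentials configured for the project.
client = bigquery.Client(project="cloud-data-lake")

sql = """
    SELECT c.country_name, f.visa_type, COUNT(*) AS arrivals
    FROM `cloud-data-lake.IMMIGRATION_DWH.F_IMMIGRATION_DATA` AS f
    JOIN `cloud-data-lake.IMMIGRATION_DWH.D_COUNTRY` AS c
      ON f.country_id = c.country_id  -- assumed join key
    GROUP BY c.country_name, f.visa_type
    ORDER BY arrivals DESC
    LIMIT 10
"""

for row in client.query(sql).result():
    print(row.country_name, row.visa_type, row.arrivals)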

Data pipeline

This project uses Airflow for orchestration.

image

A DummyOperator start_pipeline kicks off the pipeline, followed by 4 load operations. Those operations load data from the GCS bucket into BigQuery tables. The immigration data is loaded as Parquet files while the others are CSV formatted. Row-count checks run after each load to BigQuery.

Next, the pipeline loads 3 master data objects from the I94 data dictionary. Then the F_IMMIGRATION_DATA table is created and checked to make sure there are no duplicates. The other dimension tables are also created, and the pipeline finishes.
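
One way the "no duplicates" check could be expressed is with a BigQueryCheckOperator, which fails the task when the first value returned by the query is false. This is a sketch rather than the project's exact task, and it assumes the fact table carries a unique record identifier (here called cicid).

from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.bigquery import BigQueryCheckOperator

with DAG(
    "dedup_check_sketch",                # illustrative DAG id
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    # Passes only when every assumed key appears exactly once.
    check_f_immigration_unique = BigQueryCheckOperator(
        task_id="check_f_immigration_unique",
        sql="""
            SELECT COUNT(*) = COUNT(DISTINCT cicid)
            FROM `cloud-data-lake.IMMIGRATION_DWH.F_IMMIGRATION_DATA`
        """,
        use_legacy_sql=False,
    )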

Scenarios

Data increases by 100x

The current infrastructure can easily support a 100x increase in data size. GCS and BigQuery can handle petabyte-scale data. Airflow is not a bottleneck since it only issues commands to other services.

The pipeline runs daily at 7 AM. How would the dashboards be updated? Would it still work?

Schedule the DAG to run daily at 7 AM. Set up DAG retries and email/Slack notifications on failures.
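
A minimal sketch of that scheduling and alerting configuration is shown below; the owner and e-mail address are placeholders, and a Slack alert could be attached through an on_failure_callback.

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.dummy import DummyOperator

default_args = {
    "owner": "data-engineering",          # placeholder
    "retries": 3,
    "retry_delay": timedelta(minutes=5),
    "email": ["alerts@example.com"],      # placeholder address
    "email_on_failure": True,
    # "on_failure_callback": send_slack_alert,  # optional Slack notification hook
}

with DAG(
    "cloud_data_lake_pipeline",
    default_args=default_args,
    start_date=datetime(2021, 1, 1),
    schedule_interval="0 7 * * *",        # every day at 7 AM
    catchup=False,
) as dag:
    start_pipeline = DummyOperator(task_id="start_pipeline")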

Make it available to 100+ people

BigQuery auto-scales, so it can easily handle 100+ people querying it. If more people or services need access to the data, we can add steps to write to a horizontally scalable NoSQL database such as Datastore, Cassandra, or Bigtable.

Project Instructions

GCP setup

Follow these steps:

  • Create a project on GCP
  • Enable billing by adding a credit card (you have free credits worth $300)
  • Navigate to IAM and create a service account
  • Grant the account the Project Owner role. This is convenient for this project but is not recommended for a production system. Create a key for the service account and keep it somewhere safe.

Create a bucket on your project and upload the data with the following structure:

gs://cloud-data-lake-gcp/airports/:
gs://cloud-data-lake-gcp/airports/airport-codes_csv.csv
gs://cloud-data-lake-gcp/airports/airport_codes.json

gs://cloud-data-lake-gcp/cities/:
gs://cloud-data-lake-gcp/cities/us-cities-demographics.csv
gs://cloud-data-lake-gcp/cities/us_cities_demo.json

gs://cloud-data-lake-gcp/immigration_data/:
gs://cloud-data-lake-gcp/immigration_data/part-00000-b9542815-7a8d-45fc-9c67-c9c5007ad0d4-c000.snappy.parquet
gs://cloud-data-lake-gcp/immigration_data/part-00001-b9542815-7a8d-45fc-9c67-c9c5007ad0d4-c000.snappy.parquet
gs://cloud-data-lake-gcp/immigration_data/part-00002-b9542815-7a8d-45fc-9c67-c9c5007ad0d4-c000.snappy.parquet
gs://cloud-data-lake-gcp/immigration_data/part-00003-b9542815-7a8d-45fc-9c67-c9c5007ad0d4-c000.snappy.parquet
gs://cloud-data-lake-gcp/immigration_data/part-00004-b9542815-7a8d-45fc-9c67-c9c5007ad0d4-c000.snappy.parquet
gs://cloud-data-lake-gcp/immigration_data/part-00005-b9542815-7a8d-45fc-9c67-c9c5007ad0d4-c000.snappy.parquet
gs://cloud-data-lake-gcp/immigration_data/part-00006-b9542815-7a8d-45fc-9c67-c9c5007ad0d4-c000.snappy.parquet
gs://cloud-data-lake-gcp/immigration_data/part-00007-b9542815-7a8d-45fc-9c67-c9c5007ad0d4-c000.snappy.parquet
gs://cloud-data-lake-gcp/immigration_data/part-00008-b9542815-7a8d-45fc-9c67-c9c5007ad0d4-c000.snappy.parquet
gs://cloud-data-lake-gcp/immigration_data/part-00009-b9542815-7a8d-45fc-9c67-c9c5007ad0d4-c000.snappy.parquet
gs://cloud-data-lake-gcp/immigration_data/part-00010-b9542815-7a8d-45fc-9c67-c9c5007ad0d4-c000.snappy.parquet
gs://cloud-data-lake-gcp/immigration_data/part-00011-b9542815-7a8d-45fc-9c67-c9c5007ad0d4-c000.snappy.parquet
gs://cloud-data-lake-gcp/immigration_data/part-00012-b9542815-7a8d-45fc-9c67-c9c5007ad0d4-c000.snappy.parquet
gs://cloud-data-lake-gcp/immigration_data/part-00013-b9542815-7a8d-45fc-9c67-c9c5007ad0d4-c000.snappy.parquet

gs://cloud-data-lake-gcp/master_data/:
gs://cloud-data-lake-gcp/master_data/I94ADDR.csv
gs://cloud-data-lake-gcp/master_data/I94CIT_I94RES.csv
gs://cloud-data-lake-gcp/master_data/I94PORT.csv

gs://cloud-data-lake-gcp/weather/:
gs://cloud-data-lake-gcp/weather/GlobalLandTemperaturesByCity.csv
gs://cloud-data-lake-gcp/weather/temperature_by_city.json

You can copy the data to your own bucket by running the following:

gsutil cp -r gs://cloud-data-lake-gcp/ gs://{your_bucket_name}

Local setup

Clone the project, then set up the environment and install the required packages as follows:

Install Docker if it's not already installed. You can find the resources to do that here.

Install the Astronomer CLI following the instructions here.

Run the following command to bring up the Airflow instance:

astro d start

You can look at the logs by running make logs if you need to debug something. You can access and manage the pipeline by entering the following address in a browser:

localhost:8080/admin/

If everything is set up correctly, you will see the following screen:

image

Navigate to Admin -> Connections and paste in the credentials for the following two connections: bigquery_default and google_cloud_default

image

Navigate to the main DAG at dags/cloud-data-lake-pipeline.py and change the following parameters to match your own setup:

project_id = 'cloud-data-lake'
staging_dataset = 'IMMIGRATION_DWH_STAGING'
dwh_dataset = 'IMMIGRATION_DWH'
gs_bucket = 'cloud-data-lake-gcp'
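
For orientation, these variables are combined into fully qualified table and GCS references inside the DAG, roughly like the following (illustrative only, not the exact code in the repository):

staging_table = f"{project_id}.{staging_dataset}.IMMIGRATION_DATA"   # assumed staging table name
fact_table = f"{project_id}.{dwh_dataset}.F_IMMIGRATION_DATA"
source_uri = f"gs://{gs_bucket}/immigration_data/"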

You can then trigger the DAG and the pipeline will run.

The data warehouse

The final data warehouse looks like this: img
