Challenge proposed by IGTI in its Cloud Data Engineer bootcamp

Last update: Jan 23, 2022

Related tags

Data Analysis igti-desafio-4-cde

Overview

Module 4 Challenge - Cloud Data Engineer Bootcamp - IGTI

Objectives

- Create the infrastructure as code
- Using a Kubernetes cluster on Azure:
  - Ingest the Enade 2017 data into the Azure data lake with Python
  - Transform the data from the bronze layer to the silver layer using the Delta format
  - Enrich the data from the silver layer to the gold layer using the Delta format
- Use Azure Synapse Serverless SQL Pool to serve the data

Architecture

Steps

Create the infrastructure

source infra/00-variables

bash infra/01-create-rg.sh

bash infra/02-create-cluster-k8s.sh

bash infra/03-create-lake.sh

bash infra/04-create-synapse.sh

bash infra/05-access-assignments.sh

Prepare k8s

Download the kubeconfig file

bash infra/02-get-kubeconfig.sh

To shorten the commands, use an alias

alias k=kubectl

Create the namespaces

k create namespace processing
k create namespace ingestion

Create the Service Account and Role Binding

k apply -f k8s/crb-spark.yaml

Create the secrets

k create secret generic azure-service-account --from-env-file=.env --namespace processing
k create secret generic azure-service-account --from-env-file=.env --namespace ingestion
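`--from-env-file` reads `KEY=VALUE` lines from the given file, so a missing or empty entry in `.env` only shows up later, when a pod fails to authenticate. A minimal sketch of a pre-flight check follows. The key names in `REQUIRED_KEYS` are hypothetical (this README does not list them) and should be replaced with whatever the ingestion and Spark apps actually read.

```python
from typing import Dict, Iterable, List

# Hypothetical key names, for illustration only; the README does not list the real ones.
REQUIRED_KEYS = ("AZURE_TENANT_ID", "AZURE_CLIENT_ID", "AZURE_CLIENT_SECRET")


def parse_env(text: str) -> Dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines, # comments and lines without '='."""
    values: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first '=' only, so values may themselves contain '='.
        key, _, value = line.partition("=")
        values[key.strip()] = value
    return values


def missing_keys(env: Dict[str, str], required: Iterable[str] = REQUIRED_KEYS) -> List[str]:
    """Return the required keys that are absent or have an empty value."""
    return [key for key in required if not env.get(key)]


if __name__ == "__main__":
    sample = "# service principal\nAZURE_TENANT_ID=abc\nAZURE_CLIENT_ID=def\n"
    print(missing_keys(parse_env(sample)))
```

Running the check against the real `.env` before the two `k create secret` commands gives a quicker failure than debugging a crashed pod.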

Install the Spark Operator

helm repo add spark-operator https://googlecloudplatform.github.io/spark-on-k8s-operator

helm repo update

helm install spark spark-operator/spark-operator --set image.tag=v1beta2-1.2.3-3.1.1 --namespace processing

Ingestion app

Ingestion Image

docker build ingestion -f ingestion/Dockerfile -t otaciliopsf/cde-bootcamp:desafio-mod4-ingestion --network=host

docker push otaciliopsf/cde-bootcamp:desafio-mod4-ingestion

Apply ingestion job

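# first, give the pod a new unique name (the last 8 characters are replaced with a fresh UUID prefix)
# write to a temporary file: redirecting straight back into the input file would truncate it before it is read
python3 -c "import sys,yaml,uuid;y=yaml.safe_load(sys.stdin);y['metadata']['name']=y['metadata']['name'][:-8]+str(uuid.uuid4())[:8];print(yaml.dump(y))" \
< k8s/ingestion-job.yaml > k8s/ingestion-job.yaml.tmp \
&& mv k8s/ingestion-job.yaml.tmp k8s/ingestion-job.yaml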

k apply -f k8s/ingestion-job.yaml
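The renaming step above is easier to follow when written out as a function. The sketch below mirrors the one-liner's logic on the name string alone, without touching the YAML file. The function name `refresh_suffix` and the sample name `ingestion-job-00000000` are illustrative assumptions, not taken from the manifests.

```python
import uuid


def refresh_suffix(name: str, length: int = 8) -> str:
    """Replace the last `length` characters of `name` with a fresh random hex suffix.

    With the default length of 8 this matches the one-liner, because the first
    eight characters of str(uuid.uuid4()) are the same as uuid.uuid4().hex[:8].
    """
    if not 0 < length <= 32:
        raise ValueError("length must be between 1 and 32")
    if len(name) <= length:
        raise ValueError("name is too short to carry a suffix of that length")
    return name[:-length] + uuid.uuid4().hex[:length]


if __name__ == "__main__":
    # Hypothetical resource name; the real one lives in k8s/ingestion-job.yaml.
    print(refresh_suffix("ingestion-job-00000000"))
```

The name needs to change between runs because Kubernetes rejects a new object whose name collides with an existing one in the same namespace.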

Logs

ING_POD_NAME=$(cat k8s/ingestion-job.yaml |\
python3 -c "import sys,yaml,uuid;y=yaml.safe_load(sys.stdin);print(y['metadata']['name'])")

k logs $ING_POD_NAME -n ingestion --follow

Spark

Build the job image

docker build spark -f spark/Dockerfile -t otaciliopsf/cde-bootcamp:desafio-mod4

docker push otaciliopsf/cde-bootcamp:desafio-mod4

Apply job

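# first, give the Spark Application a new unique name (the last 8 characters are replaced with a fresh UUID prefix)
# write to a temporary file: redirecting straight back into the input file would truncate it before it is read
python3 -c "import sys,yaml,uuid;y=yaml.safe_load(sys.stdin);y['metadata']['name']=y['metadata']['name'][:-8]+str(uuid.uuid4())[:8];print(yaml.dump(y))" \
< k8s/spark-job.yaml > k8s/spark-job.yaml.tmp \
&& mv k8s/spark-job.yaml.tmp k8s/spark-job.yaml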

k apply -f k8s/spark-job.yaml

Logs

SPARK_APP_NAME=$(cat k8s/spark-job.yaml |\
python3 -c "import sys,yaml,uuid;y=yaml.safe_load(sys.stdin);print(y['metadata']['name'])")'-driver'

k logs $SPARK_APP_NAME -n processing --follow
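The log command above relies on the driver pod being named after the Spark Application with a `-driver` suffix. A small sketch of that derivation follows. The helper names and the sample application name are illustrative assumptions; the `processing` namespace is the one used in this README.

```python
from typing import List


def driver_pod_name(app_name: str) -> str:
    """Driver pod name for a Spark Application: the application name plus '-driver'."""
    return f"{app_name}-driver"


def logs_command(app_name: str, namespace: str = "processing") -> List[str]:
    """Build the kubectl argument list that follows the driver pod's logs."""
    return ["kubectl", "logs", driver_pod_name(app_name), "-n", namespace, "--follow"]


if __name__ == "__main__":
    # Hypothetical application name; the real one lives in k8s/spark-job.yaml.
    print(" ".join(logs_command("spark-job-1a2b3c4d")))
```

Returning a list rather than a single string lets the result be passed to `subprocess.run` without shell quoting concerns.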

Azure Synapse Serverless SQL Pool

Access the Synapse workspace through the generated link

bash infra/04-get-workspace-url.sh

To get started, follow these steps

Run the contents of the create-synapse-view.sql script in the Synapse workspace to create the view over the table in the lake
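The contents of create-synapse-view.sql are not reproduced here. As a rough illustration of what a serverless SQL view over a Delta folder tends to look like, the sketch below renders a `CREATE VIEW` statement built on `OPENROWSET` with `FORMAT = 'DELTA'`. The view name, storage account, container and path are all hypothetical, and the actual script may differ.

```python
def delta_view_sql(view_name: str, delta_url: str) -> str:
    """Render a CREATE VIEW statement that reads a Delta folder through OPENROWSET."""
    if "'" in delta_url:
        # Keep the sketch simple: refuse values that would break the SQL string literal.
        raise ValueError("delta_url must not contain single quotes")
    return (
        f"CREATE VIEW {view_name} AS\n"
        "SELECT *\n"
        "FROM OPENROWSET(\n"
        f"    BULK '{delta_url}',\n"
        "    FORMAT = 'DELTA'\n"
        ") AS [result];"
    )


if __name__ == "__main__":
    # Hypothetical view name and gold-layer location, for illustration only.
    print(delta_view_sql(
        "enade_gold",
        "https://mystorageaccount.dfs.core.windows.net/datalake/gold/enade/",
    ))
```

The URL should point at the root folder of the Delta table (the one containing `_delta_log`), not at individual Parquet files.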

Done: Synapse is now ready to receive queries.

Cleaning up the resources

bash infra/99-delete-service-principal.sh

bash infra/99-delete-rg.sh

Conclusion

By following the steps above, you can run queries directly against the gold layer of the Delta Lake using Synapse.

Owner

Otacilio Filho
