AWS Glue PySpark - Apache Hudi Quick Start Guide

Overview

AWS Glue PySpark - Apache Hudi Quick Start Guide

Disclaimer:

This is a quick start guide for the Apache Hudi Python Spark connector, running on AWS Glue.

It's also specifically configured for the following Glue version:

  • AWS Glue 3.0
    • Spark 3.1.1
    • Python 3.7

Glue Configuration Reference: https://docs.aws.amazon.com/glue/latest/dg/add-job.html

Apache Hudi Reference: https://hudi.apache.org/docs/quick-start-guide/ for more information

Prerequisites:

- Python 3.6 or higher
- AWS CLI - Profile named 'dev' with Administrator Access (https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-profiles.html)

Folder Structure:

glue-hudi-hello
├── README.md
├── cloud-formation
│   ├── command.md
│   └── GlueJobPySparkHudi.yaml
├── jars
│   ├── command.md
│   ├── hudi-spark3-bundle_2.12-0.9.0.jar
│   └── spark-avro_2.12-3.0.1.jar
├── job
│   ├── command.md
│   └── job.py
│   └── upload_job.py
├── requirements.txt

Step 1: Create and activate a virtualenv:

Create a new virtual environment for the project in its root directory:

python3 -m venv venv

Activate it:

source venv/bin/activate

Run from the root directory the pip install to get boto3.

pip install -r requirements.txt

Step 2: Create the AWS Resources:

Now, with a aws configured profile named as dev, cd into the cloud-formation folder and run the command in command.md.

As a AWS Cloud Formation exercise, read the command Parameters and how they are used on the GlueJobPySparkHudi.yaml file to dynamically create the Glue Job and S3 Bucket.

Step 3: Upload the Job and Jars to S3:

cd into the job folder and run the command in command.md.

cd into the jars folder and run the commands in command.md. Note: There is one command for each jar.

Step 4: Check AWS Resources results:

Log into aws console and check the Glue Job and S3 Bucket.

On the AWS Glue console, you can run the Glue Job by clicking on the job name.

After the job is finished, you can check the Glue Data Catalog and query the new database from AWS Athena.

On AWS Athena check for the database: hudi_demo and for the table: hudi_trips.

Owner
Gabriel Amazonas Mesquita
Gabriel Amazonas Mesquita
Bot for mirroring one or multiple Twitter accounts in Pleroma/Mastodon.

Stork (pleroma-bot) Mirror one or multiple Twitter accounts in Pleroma/Mastodon. Introduction After using the pretty cool mastodon-bot for a while, I

73 Jan 08, 2023
Python written Rule34 API

Python written Rule34 API

1 Nov 11, 2021
Telegram bot which has truecaller and smsbomber features

Truecaller-telegram_bot Add your telegram bot api key in main.py and you are good to go To get a api key Goto telegram and search BotFather From the c

Rudranag 32 Dec 05, 2022
A powerful application to automatically deploy GitHub Release.

A powerful application to automatically deploy GitHub Release.

Fentaniao 43 Sep 17, 2022
A discord self-bot to automate shitposting for your everyday needs.

Shitpost Selfbot A discord self-bot to automate shitposting for your everyday needs. Caution: May be a little racist. I have no clue where we are taki

stormy 1 Mar 31, 2022
A simple test repo created following docker docs.

docker_sampleRepo A simple test repo created following docker docs. Link to docs: https://docs.docker.com/language/python/develop/ Other links: https:

Suraj Verma 2 Sep 16, 2022
Reddit bot that uses sentiment analysis

Reddit Bot Project 2: Neural Network Boogaloo Reddit bot that uses sentiment analysis from NLTK.VADER WIP_WIP_WIP_WIP_WIP_WIP Link to test subreddit:

TpK 1 Oct 24, 2021
Cities bot - A simple example of using aiogram and the wikipedia package

Cities game A simple example of using aiogram and the wikipedia package. The bot

Artem Meller 2 Jan 29, 2022
Python client for ETAPI of Trilium Note.

Python client for ETAPI of Trilium Note.

33 Dec 31, 2022
Azure free vpn for students only! (Self hosted/No sketchy services/Fast and free)

Azpn-Azure-Free-VPN Azure free vpn for students only! (Self hosted/No sketchy services/Fast and free) This is an alternative secure way of accessing f

Harishankar Kumar 6 Mar 19, 2022
A casino discord bot written in Python

Casino Bot Casino bot is a gambling discord bot I made for my friends. It is able to play blackjack, slots, flip a coin, and roll dice. It stores ever

Connor Swislow 27 Dec 30, 2022
Python bot for send videos of a Youtube channel to a telegram group , channel or chat

py_youtube_to_telegram Usage: If you want to install ytt and use it, run this command: sudo sh -c "$(curl -fsSL https://raw.githubusercontent.com/nima

Nima Fanniasl 8 Nov 22, 2022
ServiceX DID Finder Girder

ServiceX_DID_Finder_Girder Access datasets for ServiceX from yt Hub Finding datasets This DID finder is designed to take a collection id (https://gird

1 Dec 07, 2021
Instant messaging client in tkinter

Concord_client_tk Instant messaging client in tkinter Contributors : Ilade-s [https://github.com/Ilade-s] Doku [https://github.com/D0kuhebi] Descripti

Raphaël Merlet 2 Jun 15, 2022
Drover is a command-line utility for deploying Python packages to Lambda functions.

drover drover: a command-line utility for deploying Python packages to Lambda functions. Background This utility aims to provide a simple, repeatable,

Jeffrey Wilges 4 May 19, 2021
A working bypass for discord gc spamming

IllusionGcSpammer A working bypass for discord gc spamming Installation Run pip install pip install DiscordGcSpammer then your good to go. Usage You c

6 Sep 30, 2022
Build a better understanding of your data in PostgreSQL.

Data Fluent for PostgreSQL Build a better understanding of your data in PostgreSQL. The following shows an example report generated by this tool. It g

Mark Litwintschik 28 Aug 30, 2022
Fastest Pancakeswap Sniper BOT TORNADO CASH 2022-V1 (MAC WINDOWS ANDROID LINUX)

Fastest Pancakeswap Sniper BOT TORNADO CASH 2022-V1 (MAC WINDOWS ANDROID LINUX) ⭐️ AUTO BUY TOKEN ON LAUNCH AFTER ADD LIQUIDITY ⭐️ ⭐️ Support Uniswap

Crypto Trader 7 Jan 31, 2022
Lumi-Bot - Discord bot that fetches cryptocurrency prices utilizing CoinGeko API

Lumi-Bot Discord bot that fetches and monitors cryptocurrency prices utilizing C

Diego Castro 2 Oct 08, 2022
:snake: Python SDK to query Scaleway APIs.

Scaleway SDK Python SDK to query Scaleway's APIs. Stable release: Development: Installation The package is available on pip. To install it in a virtua

Scaleway 114 Dec 11, 2022