Spark-DeltaLake-Demo

Reliable, Scalable Machine Learning (2022)

I built this project to become better acquainted with the latest big data tools. Further details can be found on my blog here.

The world is producing an exponentially increasing amount of digital data, and the tools we use to derive insights from data are evolving just as rapidly.

In recent years, a new architecture called the Data Lakehouse has begun to gain prominence as an enterprise solution to storing and processing big data. This trend piqued my interest and led to my exploration of some of the key underlying technologies fueling the revolution.

Of particular focus are two open-source technologies: Delta Lake and Apache Spark. Delta Lake provides a metadata layer to data lakes, bringing ACID transaction guarantees and time travel to a heretofore messy approach to data science at scale. Apache Spark offers a distributed processing engine for a diverse set of workloads (e.g., SQL queries, machine learning, stream processing), which can be programmed in Python, R, Scala, etc.
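To make the idea of a metadata layer concrete, here is a minimal conceptual sketch in plain Python. It is not Delta Lake's API or implementation: the `ToyTableLog` class, its methods, and the file names are all hypothetical and exist only for illustration. The sketch assumes that a table is just a set of data files plus an append-only log of commits, so that any past version can be rebuilt by replaying the log up to that point (time travel), and a writer that started from a stale version is rejected rather than silently overwriting newer work (a simplified form of optimistic concurrency).

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set, Tuple


@dataclass(frozen=True)
class Commit:
    """One atomic entry in the log: files added and files removed."""
    version: int
    added: Tuple[str, ...]
    removed: Tuple[str, ...]


class ToyTableLog:
    """Hypothetical toy model of a table's commit log (not Delta Lake's real API)."""

    def __init__(self) -> None:
        self._commits: List[Commit] = []

    def commit(
        self,
        add: Iterable[str] = (),
        remove: Iterable[str] = (),
        expected_version: Optional[int] = None,
    ) -> int:
        """Append one commit; fail if the writer's view of the table is stale."""
        latest = len(self._commits) - 1
        if expected_version is not None and expected_version != latest:
            raise RuntimeError(
                f"conflict: writer read version {expected_version}, "
                f"but the table is at version {latest}"
            )
        version = latest + 1
        # The whole change becomes visible as a single log entry, or not at all.
        self._commits.append(Commit(version, tuple(add), tuple(remove)))
        return version

    def snapshot(self, version: Optional[int] = None) -> Set[str]:
        """Rebuild the set of live files at a given version by replaying the log."""
        latest = len(self._commits) - 1
        if version is None:
            version = latest
        elif not 0 <= version <= latest:
            raise ValueError(f"version {version} does not exist")
        files: Set[str] = set()
        for entry in self._commits[: version + 1]:
            files.difference_update(entry.removed)
            files.update(entry.added)
        return files


log = ToyTableLog()
v0 = log.commit(add=["part-0.parquet"])
v1 = log.commit(add=["part-1.parquet"])
# Rewrite part-0 into part-2, as a compaction or update might.
v2 = log.commit(add=["part-2.parquet"], remove=["part-0.parquet"])

current_files = sorted(log.snapshot())    # ['part-1.parquet', 'part-2.parquet']
original_files = sorted(log.snapshot(0))  # ['part-0.parquet']
```

The data files themselves are never edited in place in this model; only the log grows, which is why older versions stay readable. In the real projects the same ideas are exposed through Spark's Delta data source rather than a class like this one.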

I believe that these technologies, among several others further detailed on my blog, will play a major role in how businesses leverage the power of data going forward. As such, this research prepares me well to confront many emerging data engineering and data science challenges.

The demonstration linked below is deployed with the Binder service, which runs a Jupyter notebook in the cloud from a custom Docker image described by the supporting files in this repository.

Live Link:

Contained in this repository:

- Jupyter notebook demonstrating Apache Spark and Delta Lake
- Files to construct a custom Docker image deployed using Binder
  - Dockerfile
  - docker-compose.yml
  - requirements.txt

Containerized Demo of Apache Spark MLlib on a Data Lakehouse (2022)
