Spark-DeltaLake-Demo

Reliable, Scalable Machine Learning (2022)

I built this project to become better acquainted with the latest big data tools. Further details can be found on my blog.

The world is producing an exponentially increasing amount of digital data, and the tools we use to derive insights from data are evolving just as rapidly.

In recent years, a new architecture called the Data Lakehouse has begun to gain prominence as an enterprise solution to storing and processing big data. This trend piqued my interest and led to my exploration of some of the key underlying technologies fueling the revolution.

Of particular focus are two open-source technologies: Delta Lake and Apache Spark. Delta Lake provides a metadata layer to data lakes, bringing ACID transaction guarantees and time travel to a heretofore messy approach to data science at scale. Apache Spark offers a distributed processing engine for a diverse set of workloads (e.g., SQL queries, machine learning, stream processing), which can be programmed in Python, R, Scala, etc.
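To make the "metadata layer" idea concrete, here is a toy, pure-Python sketch of a table backed by an append-only commit log. It is not Delta Lake's implementation and does not use Spark. The class `ToyVersionedTable` and its methods are hypothetical names invented for illustration. The `version_as_of` argument is loosely modeled on the `versionAsOf` reader option that Delta Lake exposes for time travel. The sketch shows only two properties: a batch is committed all-or-nothing, and any earlier version can still be read.

```python
class ToyVersionedTable:
    """Toy stand-in for a table with a commit log (hypothetical, not Delta Lake)."""

    def __init__(self):
        # One entry per committed version; entries are never modified afterwards.
        self._log = []

    def commit(self, rows):
        """Append a batch atomically: validate every row before writing anything."""
        rows = [dict(row) for row in rows]
        for row in rows:
            if "id" not in row:
                raise ValueError(f"rejected batch, row without 'id': {row!r}")
        self._log.append(rows)
        return len(self._log) - 1  # version number assigned to this commit

    def read(self, version_as_of=None):
        """Replay the log up to and including a version (latest by default)."""
        if not self._log:
            return []
        latest = len(self._log) - 1
        version = latest if version_as_of is None else version_as_of
        if not 0 <= version <= latest:
            raise IndexError(f"version {version} does not exist (latest is {latest})")
        return [dict(row) for batch in self._log[: version + 1] for row in batch]


table = ToyVersionedTable()
v0 = table.commit([{"id": 1, "label": "a"}])
v1 = table.commit([{"id": 2, "label": "b"}])

# A batch containing one invalid row is rejected as a whole; no partial write occurs.
try:
    table.commit([{"id": 3, "label": "c"}, {"label": "missing id"}])
    batch_rejected = False
except ValueError:
    batch_rejected = True

latest_rows = table.read()                     # rows from versions 0 and 1
first_version_rows = table.read(version_as_of=0)  # "time travel" to version 0
```

The real system keeps its log as files alongside the data in the lake and handles concurrent writers, which this sketch ignores entirely.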

I believe that these technologies, among several others further detailed on my blog, will play a major role in how businesses leverage the power of data going forward. As such, this research prepares me well to confront many emerging data engineering and data science challenges.

The demonstration linked below is deployed using the Binder service, which runs a Jupyter notebook in the cloud from a custom Docker image described by the supporting files in this repository.

Live Link:

Contained in this repository:

- Jupyter notebook demonstrating Apache Spark and Delta Lake
- Files to construct a custom Docker image deployed using Binder
  - Dockerfile
  - docker-compose.yml
  - requirements.txt

Containerized Demo of Apache Spark MLlib on a Data Lakehouse (2022)
