ETL pipeline on movie data using Python and PostgreSQL

Last update: Jul 07, 2021

Movies-ETL

Overview

This project consisted of an automated Extract, Transform, and Load (ETL) pipeline. The pipeline extracted movie data from Wikipedia, Kaggle, and MovieLens, then cleaned, transformed, and merged it using Pandas. The product was a merged table of movies and ratings, loaded into PostgreSQL.
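
A minimal sketch of what the merge step might look like, assuming the cleaned tables share an IMDB ID column. The column names (`imdb_id`, `movieId`, `rating`) and all sample values are hypothetical stand-ins, not taken from the project's code.

```python
import pandas as pd

# Toy stand-ins for the cleaned source tables (all values are made up).
wiki = pd.DataFrame({
    "imdb_id": ["tt0000001", "tt0000002"],
    "director": ["Director A", "Director B"],
})
kaggle = pd.DataFrame({
    "imdb_id": ["tt0000001", "tt0000002"],
    "movieId": [1, 2],
    "title": ["Movie One", "Movie Two"],
})
ratings = pd.DataFrame({
    "movieId": [1, 1, 1],
    "rating": [4.0, 4.0, 5.0],
})

# Inner join keeps only movies present in both sources.
movies = wiki.merge(kaggle, on="imdb_id", how="inner")

# Count how often each movie received each rating value,
# then pivot so every rating level becomes its own column.
rating_counts = (
    ratings.groupby(["movieId", "rating"]).size()
    .unstack(fill_value=0)
    .add_prefix("rating_")
)

# Left join so movies without ratings survive; their counts become 0.
movies_with_ratings = movies.merge(
    rating_counts, left_on="movieId", right_index=True, how="left"
)
count_cols = list(rating_counts.columns)
movies_with_ratings[count_cols] = (
    movies_with_ratings[count_cols].fillna(0).astype(int)
)
```

The left join is a design choice for this sketch: it keeps movies that have no ratings at all and fills their rating counts with zero instead of dropping them.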

Resources

Data sources:
- movies_metadata.csv
- ratings.csv
- wikipedia_movies.json
Software:
- Python
- PostgreSQL
- Pandas
- SQLAlchemy
- Regular Expressions

Results

Final output table: FINAL_Merged_Movies_and_Ratings.csv
Datasets uploaded to PostgreSQL for other users to analyze movie data (hackathon).
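
A large CSV such as ratings.csv can be loaded in chunks so the whole file never has to sit in memory at once. The sketch below is one possible approach, not the project's confirmed code: `load_in_chunks`, the table name, and the sample rows are hypothetical, and an in-memory SQLite connection stands in for PostgreSQL so the example runs without a server.

```python
import io
import sqlite3

import pandas as pd


def load_in_chunks(csv_source, table_name, con, chunksize=1_000_000):
    """Append a CSV to a SQL table chunk by chunk; return rows written."""
    rows_written = 0
    for chunk in pd.read_csv(csv_source, chunksize=chunksize):
        # Append each chunk; the table is created on the first write.
        chunk.to_sql(table_name, con, if_exists="append", index=False)
        rows_written += len(chunk)
    return rows_written


# Demo with a tiny in-memory CSV and SQLite (made-up rows).
demo_csv = io.StringIO("userId,movieId,rating\n1,10,4.0\n1,20,3.5\n2,10,5.0\n")
demo_con = sqlite3.connect(":memory:")
total_rows = load_in_chunks(demo_csv, "ratings", demo_con, chunksize=2)
```

For PostgreSQL, the `con` argument would instead be an engine created with `sqlalchemy.create_engine` and a `postgresql://` connection string for your own server.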

Summary

The pipeline was created under the following assumptions:

- I was able to join the Wikipedia, Kaggle, and ratings movie data on the IMDB ID column.
- The Wikipedia dataset didn't have an IMDB ID column, so I had to extract the ID from the URL link provided.
- Each dataset had to be cleaned on its own because of overlapping columns (such as 'Director' and 'Directed By'), unnecessary columns, many null values, TV shows, outliers, duplicates, incorrect data types, formatting issues, and other errors.
- The Wikipedia movie data was in JSON format.
- Not every movie had a rating for each rating level.
- The ratings dataset had more than 26 million entries, which created a time constraint and a data processing challenge.
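
Extracting the IMDB ID from a URL can be done with a regular expression. This is a minimal sketch under stated assumptions: the helper name `extract_imdb_id` and the example links are hypothetical, and the pattern assumes IDs have the form "tt" followed by seven or more digits.

```python
import re

# Assumed ID shape: "tt" followed by at least seven digits.
IMDB_ID_PATTERN = re.compile(r"tt\d{7,}")


def extract_imdb_id(imdb_link):
    """Return the IMDB ID found in a link, or None if there is none."""
    if not isinstance(imdb_link, str):
        return None  # Missing values (None, NaN) have no ID to extract.
    match = IMDB_ID_PATTERN.search(imdb_link)
    return match.group(0) if match else None


example_id = extract_imdb_id("https://www.imdb.com/title/tt0111161/")
missing_id = extract_imdb_id("https://example.com/no-id-here")
```

In Pandas, the same pattern could be applied to a whole column with `Series.str.extract`, given a capture group around the pattern.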

Owner

Juan Nicolas Serrano
