songplays datamart provide details about the musical taste of our customers and can help us to improve our recomendation system

Last update: Jul 13, 2021

Related tags

Overview

Songplays User activity datamart

The following document describes the model used to build the songplays datamart table and the respective ETL process.

About
Getting Started
Data Model and Schema
Deployment
Built Using
Authors

About

The songplays datamart provide details about the musical taste of our customers and can help us to improve our recomendation system.

This document describes the model of songplays table datamart on sparkify_app schema inside a container sparkify_postgres, and the Python code to load new data. The production directory and data must be simmilar to those in mnt/data/log_data and mnt/data/song_data paths in this repository.

🏁 Getting Started

First you need to have the right permissions to access the source files and write them into sparkify_app tables that generates the songplays datamart table. Contact the owners or your team leader for more information.

Data Model and Schema

Source files and owners

File or table	Description	Directory	Owner
YYYY-MM-DD-events.json	User events.	mnt/data/log_data/YYYY/11	Person 1
.json	Song data.	mnt/data/song_data/a	Person 2
`songplays`	Datamart for recomendation system.	`sparkify_app.songplays`	Person 3
`artists`	Dimension table for artists.	`sparkify_app.artists`	Person 1
`songs`	Dimension table for songs.	`sparkify_app.songs`	Person 1
`time`	Dimension table for streaming start time for a given song.	`sparkify_app.time`	Person 2
`users`	Dimension table for users.	`sparkify_app.users`	Person 3

Prerequisites

To run this project first you need to install the Docker Engine for your operational system and Docker Compose.

After installing and configuring the Docker tools, download this repository and create a folder named postgres that will store all sparkify_postgres service data. To build the proper images and run the services, just execute the following command inside this repository:

docker-compose up

If the service runs successfully you should see something like this:

...
sparkify_python      | 28/30 files processed.
sparkify_python      | 29/30 files processed.
sparkify_python      | 30/30 files processed.
sparkify_python exited with code 0

You can also check the job by following these steps:

Open your browser and access localhost:16543:
- Enter with the following credentials to authenticate:
  - e-mail: [email protected]
  - password: sp4rk1fy
After you log in, click on the Servers option at the upper corner on the left:
- You will be asked to enter with the PostgreSQL credentials:
  - User: sparkifypsql
  - Password: p4ssw0rd
Select the Query Tools under the Tools menu:

Under the Query Editor, run the following query:

SELECT * FROM sparkify_app.songplays WHERE song_id is NOT NULL and artist_id is NOT NULL;

You should get only 5 rows.

Microservice architecture

The following image represents the microservice architecture for this project:

Where:

sparkify_python: runs all Python scripts and stores raw data.
sparkify_postgres: runs Postgre and stores the database.
sparkify_pgadmin: runs the pgAdmin tool to monitor the sparkify_postgres service.

⛏️ Built Using

Dbeaver - Database tool.
Docker Compose - Tool to run multi-container applications.
Docker Engine - Container engine.
pandas - Data analysis and data wrangling tool.
pgAdmin - PostgreSQL tool.
psycopg2 - Database adapter for Python.
PostgreSQL - Reletional database management system.

✍️ Authors

@lkellermann - Idea & Initial work

songplays datamart provide details about the musical taste of our customers and can help us to improve our recomendation system

Related tags

Overview

Songplays User activity datamart

Table of Contents

About

🏁 Getting Started

Data Model and Schema

Prerequisites

Microservice architecture

⛏️ Built Using

✍️ Authors

Owner

Leandro Kellermann de Oliveira

Produces a summary CSV report of an Amber Electric customer's energy consumption and cost data.

Zipline, a Pythonic Algorithmic Trading Library

BinTuner is a cost-efficient auto-tuning framework, which can deliver a near-optimal binary code that reveals much more differences than -Ox settings.

Binance Kline Data With Python

A crude Hy handle on Pandas library

Integrate bus data from a variety of sources (batch processing and real time processing).

Pypeln is a simple yet powerful Python library for creating concurrent data pipelines.

wikirepo is a Python package that provides a framework to easily source and leverage standardized Wikidata information

Fast, flexible and easy to use probabilistic modelling in Python.

Approximate Nearest Neighbor Search for Sparse Data in Python!

Statistical Rethinking course winter 2022

Incubator for useful bioinformatics code, primarily in Python and R

An ETL Pipeline of a large data set from a fictitious music streaming service named Sparkify.

This repo is dedicated to the data extraction and manipulation of the World Bank's database called STEP.

A simplified prototype for an as-built tracking database with API

A model checker for verifying properties in epistemic models

Data Science Environment Setup in single line

Fancy data functions that will make your life as a data scientist easier.

Nobel Data Analysis

Data exploration done quick.