A data preprocessing and feature engineering script for a machine learning pipeline is prepared.

Last update: Dec 18, 2021

FEATURE ENGINEERING

Business Problem: A data preprocessing and feature engineering script for a machine learning pipeline needs to be prepared. It is expected that the dataset will be ready for modelling when passed through this script.
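
As a rough illustration of what "ready for modelling" could involve, the sketch below fills missing ages with the median and binary-encodes Sex. This README does not list the script's actual steps, so the function name `preprocess`, the choice of median imputation, and the 0/1 encoding are all assumptions. It is written in plain Python so it runs without extra dependencies.

```python
from statistics import median


def preprocess(rows):
    """Return model-ready copies of the rows (hypothetical steps, not the author's script)."""
    # Median of the ages that are present; used to fill the gaps (assumed strategy).
    known_ages = [r["Age"] for r in rows if r["Age"] is not None]
    fill_age = median(known_ages)

    prepared = []
    for r in rows:
        prepared.append({
            "Age": r["Age"] if r["Age"] is not None else fill_age,
            # Binary-encode Sex so a model can consume it (assumed encoding).
            "Sex": 1 if r["Sex"] == "female" else 0,
            "Survived": r["Survived"],
        })
    return prepared


# Illustrative records, not rows from the real dataset.
sample = [
    {"Age": 22.0, "Sex": "male", "Survived": 0},
    {"Age": None, "Sex": "female", "Survived": 1},
    {"Age": 30.0, "Sex": "female", "Survived": 1},
]
ready = preprocess(sample)
```

The input rows are left unmodified, so the same raw data can be passed through a different set of steps later.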

Story of the Dataset:
The dataset describes the passengers who were aboard the Titanic when it sank. It consists of 891 observations and 12 variables. The target variable is "Survived":

0: the person did not survive.

1: the person survived.

ATTRIBUTES:

PassengerId: ID of the passenger

Survived: Survival status (0: not survived, 1: survived)

Pclass: Ticket class (1: 1st class (upper), 2: 2nd class (middle), 3: 3rd class (lower))

Name: Name of the passenger

Sex: Gender of the passenger (male, female)

Age: Age in years

SibSp: Number of siblings/spouses aboard the Titanic
Sibling = Brother, sister, stepbrother, stepsister
Spouse = Husband, wife (mistresses and fiancés were ignored)

Parch: Number of parents/children aboard the Titanic
Parent = Mother, father
Child = Daughter, son, stepdaughter, stepson
Some children travelled only with a nanny, so Parch = 0 for them.

Ticket: Ticket number

Fare: Passenger fare

Cabin: Cabin number

Embarked: Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
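
The attributes above lend themselves to a few common derived features. The sketch below combines SibSp and Parch into a family size and one-hot encodes Embarked. The column names `FamilySize`, `IsAlone`, and `Embarked_*`, and the function `engineer_features`, are hypothetical and are not taken from the author's script.

```python
PORTS = ("C", "Q", "S")  # Cherbourg, Queenstown, Southampton


def engineer_features(passenger):
    """Return a copy of the record with hypothetical derived columns added."""
    row = dict(passenger)
    # The passenger plus siblings/spouses and parents/children aboard.
    row["FamilySize"] = row["SibSp"] + row["Parch"] + 1
    # 1 when no relatives are recorded aboard, else 0.
    row["IsAlone"] = int(row["FamilySize"] == 1)
    # One-hot encode the port; a missing code leaves all three flags at 0.
    code = row.pop("Embarked")
    for port in PORTS:
        row[f"Embarked_{port}"] = int(code == port)
    return row


# Illustrative records, not rows from the real dataset.
child_with_nanny = {"SibSp": 0, "Parch": 0, "Embarked": "S"}
parent = {"SibSp": 1, "Parch": 2, "Embarked": None}

print(engineer_features(child_with_nanny))
print(engineer_features(parent))
```

Because Parch = 0 for a child travelling only with a nanny, such a passenger is flagged as alone here, which is a limitation of deriving the flag from SibSp and Parch alone.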

REFERENCE: Data Science and ML Boot Camp, 2021, Veri Bilimi Okulu (https://www.veribilimiokulu.com/)

Owner

Pinar Oner
