Semi-Automated Data Processing

Overview

Preparing data for model learning is one of the most important steps in any project and, traditionally, one of the most time-consuming. Data analysis plays a very important role in the data science workflow and takes up most of its time. There’s a nice quote (not sure who said it):

“In Data Science, 80% of time spent prepare data, 20% of time spent complain about need for prepare data.”

According to Wikipedia, in statistics, exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis testing task. Exploratory data analysis was promoted by John Tukey to encourage statisticians to explore the data, and possibly formulate hypotheses that could lead to new data collection and experiments.

This project handles the task with minimal user interaction by analyzing your data and identifying fixes, screening out fields that are problematic or not likely to be useful, deriving new attributes when appropriate, and improving performance through intelligent screening techniques. You can use the project in a semi-interactive fashion, previewing the changes before they are made and accepting or rejecting them as you want.

This project covers the three steps of a project workflow that come before model training:
1) Exploratory data analysis
2) Feature engineering
3) Feature selection


The user carries out all these steps by calling the following functions:

1) identify_feature(data):
This function identifies the categorical, continuous numerical and discrete numerical features in the dataset. It also identifies datetime features and extracts the relevant information from them.

Input:
data=Dataset

Output:
df=Dataset
data_cont_num_feature= List of feature names containing continuous numerical values
data_dis_num_feature=List of feature names containing discrete numerical values
data_cat_feature=List of feature names containing categorical values
dt_feature=List of feature names containing datetime values
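
As a rough illustration of the kind of split this step produces, the sketch below classifies columns held in a plain dict of lists. It is not the project's implementation: the helper name `split_feature_types`, the `max_discrete_levels` cutoff and the toy sample values are all assumptions made for the example.

```python
from datetime import datetime


def split_feature_types(columns, max_discrete_levels=10):
    """Group column names by the kind of values they hold.

    `columns` maps a column name to a list of values (None = missing).
    The cutoff separating discrete from continuous columns is an assumption.
    """
    continuous, discrete, categorical, dt = [], [], [], []
    for name, values in columns.items():
        present = [v for v in values if v is not None]
        if present and all(isinstance(v, datetime) for v in present):
            dt.append(name)
        elif present and all(
            isinstance(v, (int, float)) and not isinstance(v, bool)
            for v in present
        ):
            # Numeric column: few distinct values -> discrete, else continuous.
            if len(set(present)) <= max_discrete_levels:
                discrete.append(name)
            else:
                continuous.append(name)
        else:
            categorical.append(name)
    return continuous, discrete, categorical, dt


# Toy data, invented for illustration only.
sample = {
    "temp": [9.8, 10.1, 14.3, 15.0, 21.7, 22.4],
    "season": [1, 1, 2, 2, 3, 3],
    "weather": ["clear", "mist", "clear", "rain", "clear", "mist"],
    "day": [datetime(2011, 1, d) for d in range(1, 7)],
}
cont, disc, cat, dt_cols = split_feature_types(sample, max_discrete_levels=4)
print(cont, disc, cat, dt_cols)
```

The four returned lists play the same role as the four feature lists that the later functions take as input.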

2) plot_nan_feature(data, continuous_features, discrete_features, categorical_features, dependent_var):
This function identifies the missing values in the dataset and visualizes their impact on the dependent feature.

Input:
data=Dataset
continuous_features= List of feature names containing continuous numerical values
discrete_features=List of feature names containing discrete numerical values
categorical_features=List of feature names containing categorical values
dependent_var= Dependent feature name in string format

Output:
df= Dataset
nan_features= List of feature names containing NaN values

3) visualize_imputation_impact(data,continuous_features, discrete_features, categorical_features,nan_features,dependent_var):
The function visualizes the impact of different NaN imputation methods on the distribution of the feature's values.

Input:
data=Dataset
continuous_features= List of feature names containing continuous numerical values
discrete_features=List of feature names containing discrete numerical values
categorical_features=List of feature names containing categorical values
nan_features= List of feature names containing NaN values
dependent_var= Dependent feature name in string format

Output:
None

4) nan_imputation(data,mean_feature,median_feature,mode_feature,random_feature,new_category):
The function imputes the NaN values in the feature as per the user input.

Input:
data=Dataset
mean_feature= List of feature names to impute with the mean
median_feature=List of feature names to impute with the median
mode_feature=List of feature names to impute with the mode
random_feature=List of feature names to impute with random values
new_category=List of feature names in which a new category is created for the NaN values

Output:
None
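
To make the imputation options concrete, here is a minimal sketch of mean, median, mode and new-category imputation on plain Python lists. It is an illustration under assumptions, not the project's code: missing values are represented by `None`, the helper name `impute` and the `"Missing"` label are hypothetical, and random imputation is left out to keep the example deterministic.

```python
import statistics


def impute(values, strategy):
    """Return a copy of `values` with None replaced according to `strategy`."""
    present = [v for v in values if v is not None]
    if strategy == "mean":
        fill = statistics.mean(present)
    elif strategy == "median":
        fill = statistics.median(present)
    elif strategy == "mode":
        fill = statistics.mode(present)
    elif strategy == "new_category":
        fill = "Missing"  # assumed label for the extra category
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]


# Toy data, invented for illustration only.
humidity = [40.0, None, 60.0, 80.0]
weather = ["clear", None, "clear", "mist"]

print(impute(humidity, "mean"))         # [40.0, 60.0, 60.0, 80.0]
print(impute(weather, "mode"))          # ['clear', 'clear', 'clear', 'mist']
print(impute(weather, "new_category"))  # ['clear', 'Missing', 'clear', 'mist']
```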

5) cross_visualization(data,continuous_features,discrete_features, categorical_features,dt_features):
The function visualizes the relationships between the different independent features.

Input:
data=Dataset
continuous_features= List of feature names containing continuous numerical values
discrete_features=List of feature names containing discrete numerical values
categorical_features=List of feature names containing categorical values
dt_features=List of feature names containing datetime values

Output:
continuous_features2=List of feature names containing continuous numerical values, except the dependent feature

6) dependent_independent_visualization(data,continuous_features,discrete_features, categorical_features,dt_features,dependent_feature):
The function visualizes the relationship between the dependent feature and the independent features.

Input:
data=Dataset
continuous_features= List of feature names containing continuous numerical values
discrete_features=List of feature names containing discrete numerical values
categorical_features=List of feature names containing categorical values
dt_features=List of feature names containing datetime values
dependent_feature= Dependent feature name in string format

Output:
None

7) outlier_removal(data,continuous_features,discrete_features,dependent_var,dependent_var_type,action):
The function visualizes the outliers using a boxplot and removes them.

Input:
data=Dataset
continuous_features= List of feature names containing continuous numerical values
discrete_features=List of feature names containing discrete numerical values
dependent_var= Dependent feature name in string format
dependent_var_type= String indicating whether the problem is a regression problem (if so, use 'Regression') or not
action= Pass 'remove' to delete the rows associated with the outliers

Output:
df=Dataset
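
A boxplot conventionally flags points that lie more than 1.5 times the interquartile range (IQR) beyond the quartiles. The sketch below applies that rule to rows stored as dicts. Whether the project uses exactly this rule is an assumption, and the helper names `iqr_bounds` and `remove_outliers` are hypothetical.

```python
import statistics


def iqr_bounds(values, k=1.5):
    """Return the (low, high) fences of the k * IQR rule."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def remove_outliers(rows, feature, k=1.5):
    """Keep only the rows whose `feature` value lies inside the fences."""
    low, high = iqr_bounds([row[feature] for row in rows], k)
    return [row for row in rows if low <= row[feature] <= high]


# Toy data, invented for illustration only.
rows = [{"windspeed": w} for w in [8, 9, 10, 11, 12, 13, 14, 60]]
cleaned = remove_outliers(rows, "windspeed")
print(len(rows), "->", len(cleaned))  # 8 -> 7
```

Removing rows changes the size of the dataset, so it is worth checking how many rows are dropped before accepting the result.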

8) transformation_visualization(data,continuous_features,discrete_features,dependent_feature):
The function visualizes the features after applying various transformation techniques.

Input:
data=Dataset
continuous_features= List of feature names containing continuous numerical values
discrete_features=List of feature names containing discrete numerical values
dependent_feature= Dependent feature name in string format

Output:
None

9) feature_transformation(train_data,continuous_features,discrete_features,transformation,dependent_feature):
The function applies the feature transformation technique chosen by the user.

Input:
train_data=Training dataset
continuous_features= List of feature names containing continuous numerical values
discrete_features=List of feature names containing discrete numerical values
transformation=Type of transformation: none=No transformation, log=Log transformation, sqrt=Square root transformation, reciprocal=Reciprocal transformation, exp=Exponential transformation, boxcox=Box-Cox transformation
dependent_feature= Dependent feature name in string format

Output:
X_data=Training dataset
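
The sketch below shows what some of these transformations do to individual values. It is only an illustration: the exact formulas the project uses are not documented here, so the `exp` option is left out, and the Box-Cox helper takes a fixed lambda, whereas in practice lambda is usually estimated from the data. The names `TRANSFORMS`, `transform` and `box_cox` are hypothetical.

```python
import math


def box_cox(x, lam):
    """Box-Cox transform of a positive value for a fixed lambda."""
    return math.log(x) if lam == 0 else (x ** lam - 1) / lam


# Element-wise transformations keyed by the option names used above.
TRANSFORMS = {
    "none": lambda x: x,
    "log": math.log,                # requires x > 0
    "sqrt": math.sqrt,              # requires x >= 0
    "reciprocal": lambda x: 1 / x,  # requires x != 0
}


def transform(values, kind):
    """Apply the chosen transformation to every value in the list."""
    return [TRANSFORMS[kind](v) for v in values]


# Toy data, invented for illustration only.
counts = [1.0, 4.0, 16.0]
print(transform(counts, "sqrt"))        # [1.0, 2.0, 4.0]
print(transform(counts, "reciprocal"))  # [1.0, 0.25, 0.0625]
print(box_cox(4.0, 0.5))                # 2.0
```

Log, square root and Box-Cox only accept non-negative or strictly positive inputs, so features containing zeros or negative values need shifting or a different option.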

10) categorical_transformation(train_data,categorical_encoding):
This function transforms the categorical features into numerical ones using encoding techniques.

Input:
train_data=Training dataset
categorical_encoding={'one_hot_encoding':[],'frequency_encoding':[],'mean_encoding':[],'target_guided_ordinal_encoding':{}}

Output:
X_data=Training dataset
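
Two of the encodings named in the dictionary can be sketched in a few lines. This is an illustration under assumptions rather than the project's implementation: frequency encoding is taken to mean replacing a category by its count, mean encoding by the mean of the target within that category, and the helper names are hypothetical.

```python
from collections import Counter, defaultdict


def frequency_encode(values):
    """Replace each category by the number of times it occurs."""
    counts = Counter(values)
    return [counts[v] for v in values]


def mean_encode(values, target):
    """Replace each category by the mean target value of that category."""
    counts = Counter(values)
    totals = defaultdict(float)
    for v, t in zip(values, target):
        totals[v] += t
    means = {v: totals[v] / counts[v] for v in counts}
    return [means[v] for v in values]


# Toy data, invented for illustration only.
weather = ["clear", "mist", "clear", "rain"]
rentals = [120, 80, 100, 20]

print(frequency_encode(weather))      # [2, 1, 2, 1]
print(mean_encode(weather, rentals))  # [110.0, 80.0, 110.0, 20.0]
```

Because mean encoding uses the target, the category means should be computed on the training data only and then reused for the test data.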

11a) feature_selection(Xtrain,ytrain, threshold, data_type, filter_type):
This function performs feature selection based on the dependent and independent features in the training dataset.

Input:
Xtrain=Training dataset
ytrain=Dependent data in the training dataset
threshold= Threshold for the correlation
data_type= Whether the data depends linearly or non-linearly on the output label ('linear' or 'non-linear')
filter_type= Combination of input and output types, for example 'in_num_out_num' if the input data is numerical and the output is numerical. The available combinations and the tests they select are:
{'in_num_out_num':{'linear':['pearson'],'non-linear':['spearman']},
'in_num_out_cat':{'linear':['ANOVA'],'non-linear':['kendall']},
'in_cat_out_num':{'linear':['ANOVA'],'non-linear':['kendall']},
'in_cat_out_cat':{'chi_square_test':True,'mutual_info':True},}

Output:
Xtrain= Training dataset
feature_df= DataFrame containing the features with their p-values

11b) feature_selection(Xtrain,ytrain,Xtest,ytest, threshold, data_type, filter_type):
This function performs feature selection based on the dependent and independent features in the training dataset, and applies the same selection to the test dataset.

Input:
Xtrain=Training dataset
ytrain=Dependent data in the training dataset
Xtest=Test dataset
ytest=Dependent data in the test dataset
threshold= Threshold for the correlation
data_type= Whether the data depends linearly or non-linearly on the output label ('linear' or 'non-linear')
filter_type= Combination of input and output types, for example 'in_num_out_num' if the input data is numerical and the output is numerical. The available combinations and the tests they select are:
{'in_num_out_num':{'linear':['pearson'],'non-linear':['spearman']},
'in_num_out_cat':{'linear':['ANOVA'],'non-linear':['kendall']},
'in_cat_out_num':{'linear':['ANOVA'],'non-linear':['kendall']},
'in_cat_out_cat':{'chi_square_test':True,'mutual_info':True},}

Output:
Xtrain= Training dataset
Xtest= Test dataset
feature_df= DataFrame containing the features with their p-values
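
For the 'in_num_out_num' / 'linear' case, a correlation filter can be sketched as follows. This is a simplified illustration, not the project's implementation: it scores each feature by its Pearson correlation coefficient with the target and keeps features whose absolute correlation reaches the threshold, whereas the project reports p-values. The helper names `pearson` and `select_features` are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def select_features(X, y, threshold):
    """Return the kept feature names and every feature's correlation with y."""
    scores = {name: pearson(col, y) for name, col in X.items()}
    kept = [name for name, r in scores.items() if abs(r) >= threshold]
    return kept, scores


# Toy data, invented for illustration only.
Xtrain = {
    "temp": [1.0, 2.0, 3.0, 4.0, 5.0],
    "noise": [1.0, -1.0, 0.0, -1.0, 1.0],
}
ytrain = [10.0, 20.0, 30.0, 40.0, 50.0]
Xtest = {"temp": [6.0, 7.0], "noise": [0.0, 1.0]}

kept, scores = select_features(Xtrain, ytrain, threshold=0.5)
# The selection is decided on the training data and then applied to the test data.
Xtest_selected = {name: Xtest[name] for name in kept}
print(kept)
```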

12) convert_dtype(data,categorical_features):
This function converts categorical features that contain numeric values, but are stored as categorical, into int format.

Input:
data= Dataset
categorical_features=List of feature names containing categorical values

Output:
df=Dataset
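
The idea behind this conversion can be sketched on a single column: if every value parses as an integer, convert the column, otherwise leave it untouched. The helper name `convert_numeric_strings` is hypothetical, and the project's own rules for deciding which columns to convert may differ.

```python
def convert_numeric_strings(values):
    """Convert a column to int if every value parses as an integer."""
    try:
        return [int(v) for v in values]
    except (TypeError, ValueError):
        # At least one value is not integer-like: keep the column as it is.
        return list(values)


# Toy data, invented for illustration only.
print(convert_numeric_strings(["1", "2", "3"]))  # [1, 2, 3]
print(convert_numeric_strings(["clear", "3"]))   # ['clear', '3']
```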

Note:
Use the same parameters for both the train and test datasets for better accuracy


We have implemented a bike sharing project to show how the functions can be used for both classification and regression problem statements.

Owner
Arun Singh Babal
Engineer | Data Science Enthusiasts | Machine Learning | Deep Learning | Advanced Computer Vision.