visualize_ML is a Python package for visualizing some of the steps involved in a Machine Learning problem

Last update: Dec 12, 2022

Overview

visualize_ML

visualize_ML is a Python package for visualizing some of the steps involved in a Machine Learning problem. It is built on libraries such as matplotlib for visualization and sklearn and scipy for statistical computations.

Requirement

Python 2.x or Python 3.x

Install

Install dependencies needed for matplotlib

sudo apt-get build-dep python-matplotlib

Install it using pip

pip install visualize_ML

Let's Code

When dealing with a Machine Learning problem, some of the initial steps are data exploration and analysis, followed by feature selection. Below are the modules for these tasks.

1) Data Exploration

At this stage, we explore variables one by one using uni-variate analysis, which depends on whether the variable is categorical or continuous. The explore module handles this.

>>> explore module

visualize_ML.explore.plot(data_input,categorical_name=[],drop=[],PLOT_COLUMNS_SIZE=4,bin_size=20,
bar_width=0.2,wspace=0.5,hspace=0.8)

Continuous Variables: For continuous variables, it plots a histogram for every variable and gives descriptive statistics for them.

Categorical Variables: For categorical variables with 2 or more classes, it plots a bar chart for every variable and gives descriptive statistics for them.
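The kind of summary described above can be sketched with the standard library alone. This is only an illustration of the idea, not the package's implementation: the helper names `summarize_continuous` and `summarize_categorical` and the sample values are made up for this example, and the package itself draws the plots with matplotlib.

```python
import statistics
from collections import Counter


def summarize_continuous(values, bins=20):
    """Return descriptive statistics and equal-width histogram counts."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # avoid a zero width when all values are equal
    counts = [0] * bins
    for v in values:
        # The maximum value is placed in the last bin rather than a new one.
        index = min(int((v - lo) / width), bins - 1)
        counts[index] += 1
    stats = {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values) if len(values) > 1 else 0.0,
        "min": lo,
        "max": hi,
    }
    return stats, counts


def summarize_categorical(values):
    """Return the number of observations per class (the bar heights)."""
    return Counter(values)


ages = [22, 38, 26, 35, 35, 54, 2, 27, 14, 4]  # made-up continuous values
classes = [3, 1, 3, 1, 3, 1, 3, 3, 2, 3]       # made-up class labels

age_stats, age_hist = summarize_continuous(ages, bins=4)
class_counts = summarize_categorical(classes)
print(age_stats)
print(age_hist)
print(class_counts)
```

The bin counts correspond to the histogram bars for a continuous column, and the class counts correspond to the bars of the bar chart for a categorical column.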

Parameters	Type	Description
data_input	DataFrame	The input DataFrame containing all the data. (Currently the input can only be a DataFrame.)
categorical_name	list (default=[ ])	Names of all categorical variable columns with more than 2 classes, to distinguish them from the continuous variables. An empty list implies that there are no categorical features with more than 2 classes.
drop	list (default=[ ])	Names of columns to be dropped.
PLOT_COLUMNS_SIZE	int (default=4)	Number of plots to display vertically in the display window. The row size is adjusted accordingly.
bin_size	int (default=20)	Number of bins for the histograms of continuous variables.
wspace	float32 (default=0.5)	Horizontal padding between subplots in the display window.
hspace	float32 (default=0.8)	Vertical padding between subplots in the display window.

Code Snippet

# The data set is taken from the famous Titanic data set (Kaggle)

import pandas as pd
from visualize_ML import explore
df = pd.read_csv("dataset/train.csv")
explore.plot(df,["Survived","Pclass","Sex","SibSp","Ticket","Embarked"],drop=["PassengerId","Name"])

see the dataset

Note: While plotting, all rows with NaN values and all columns with character values are removed (except when the values are True and False); only numeric data is plotted.

2) Feature Selection

This is one of the more challenging parts of an ML task. Here we do bi-variate analysis to find the relationship between two variables, looking for association and disassociation between them at a pre-defined significance level.

The relation module helps visualize the analysis done on various combinations of variables and the relations between them.

>>> relation module

visualize_ML.relation.plot(data_input,target_name="",categorical_name=[],drop=[],bin_size=10)

Continuous vs Continuous variables: For bi-variate analysis, scatter plots are made, since their pattern indicates the relationship between the variables. The strength of the relationship is indicated by the correlation between them.

The graph displays the correlation coefficient along with other information.

Correlation = Covariance(X,Y) / SQRT( Var(X)*Var(Y))

-1: perfect negative linear correlation
+1: perfect positive linear correlation
0: no correlation
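As a minimal sketch, the correlation formula above can be evaluated directly with the standard library. The function name `pearson_correlation` and the sample data are made up for this illustration; how the package computes the coefficient internally is not shown here.

```python
import math


def pearson_correlation(x, y):
    """Correlation = Covariance(X, Y) / SQRT(Var(X) * Var(Y))."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # The 1/n (or 1/(n-1)) factors cancel between numerator and denominator,
    # so plain sums of deviations are enough.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


x = [1.0, 2.0, 3.0, 4.0, 5.0]
r_pos = pearson_correlation(x, [2.0, 4.0, 6.0, 8.0, 10.0])  # rising line
r_neg = pearson_correlation(x, [10.0, 8.0, 6.0, 4.0, 2.0])  # falling line
r_zero = pearson_correlation(x, [1.0, 2.0, 3.0, 2.0, 1.0])  # symmetric hump
print(r_pos, r_neg, r_zero)
```

The three calls illustrate the three reference values: a rising line, a falling line, and a symmetric pattern with no linear trend. Note that a correlation of 0 only rules out a linear relationship; the third data set is clearly structured but not linear.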

Categorical vs Categorical variables: Stacked column charts are made to visualize the relation. The chi-square test is used to derive the statistical significance of the relationship between the variables. It returns the probability for the computed chi-square distribution with the given degrees of freedom. For more information on Chi Test see this

Probability of 0: indicates that the two categorical variables are dependent.

Probability of 1: indicates that the two variables are independent.

The graph displays the p_value along with other information. If it is less than 0.05, the variables are considered dependent.
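The chi-square test of independence can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the package's code: the function names and the sample labels are made up, no continuity correction is applied, and whether the package applies one is not stated in this document.

```python
import math
from collections import Counter


def chi_square_sf(x, df):
    """P(X > x) for a chi-square variable with integer df >= 1."""
    y = x / 2.0
    if df % 2 == 0:
        # Even df: closed-form finite sum.
        terms = sum(y ** k / math.factorial(k) for k in range(df // 2))
        return math.exp(-y) * terms
    # Odd df: start from the df = 1 tail and add the remaining terms.
    tail = math.erfc(math.sqrt(y))
    extra = sum(y ** (k - 0.5) / math.gamma(k + 0.5)
                for k in range(1, (df - 1) // 2 + 1))
    return tail + math.exp(-y) * extra


def chi_square_independence(a, b):
    """Pearson chi-square test of independence for two label sequences."""
    observed = Counter(zip(a, b))  # contingency table cells
    row_totals = Counter(a)
    col_totals = Counter(b)
    n = len(a)
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    if df < 1:
        raise ValueError("each variable needs at least two classes")
    chi2 = 0.0
    for r, r_total in row_totals.items():
        for c, c_total in col_totals.items():
            expected = r_total * c_total / n
            chi2 += (observed[(r, c)] - expected) ** 2 / expected
    return chi2, df, chi_square_sf(chi2, df)


# Made-up labels: 30 "m" and 30 "f" rows with different outcome proportions.
sex = ["m"] * 30 + ["f"] * 30
outcome = [0] * 10 + [1] * 20 + [0] * 20 + [1] * 10

chi2, df, p_value = chi_square_independence(sex, outcome)
print(chi2, df, p_value)
print("dependent" if p_value < 0.05 else "no evidence of dependence")
```

The p-value is the upper tail of the chi-square distribution at the computed statistic, which is the quantity compared against 0.05 above.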

Categorical vs Continuous variables: To explore the relation between categorical and continuous variables, box plots are drawn at each level of the categorical variable. If the levels are few in number, the plot will not show the statistical significance. The ANOVA test is used to derive the statistical significance of the relationship between the variables.

The graph displays the p_value along with other information. If it is less than 0.05, the variables are considered dependent.

For more information on ANOVA test see this
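A one-way ANOVA compares the variation between the group means with the variation inside the groups. The sketch below computes only the F statistic and its degrees of freedom; the function name `one_way_anova_f` and the sample groups are made up for this illustration. Turning F into a p-value needs the upper tail of the F distribution, which `scipy.stats.f_oneway` returns together with the statistic; whether the package calls that function is an assumption.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of numeric samples."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    ss_between = 0.0
    ss_within = 0.0
    for g in groups:
        mean_g = sum(g) / len(g)
        # Spread of the group mean around the overall mean.
        ss_between += len(g) * (mean_g - grand_mean) ** 2
        # Spread of the observations around their own group mean.
        ss_within += sum((v - mean_g) ** 2 for v in g)
    df_between = k - 1
    df_within = n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Made-up continuous values, one list per level of a categorical variable.
values_by_level = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_stat, df_b, df_w = one_way_anova_f(values_by_level)
print(f_stat, df_b, df_w)
```

A large F means the group means differ by more than the within-group scatter would explain, which is what a small p_value on the graph reflects.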

Parameters	Type	Description
data_input	DataFrame	The input DataFrame containing all the data. (Currently the input can only be a DataFrame.)
target_name	String	The name of the target column.
categorical_name	list (default=[ ])	Names of all categorical variable columns with more than 2 classes, to distinguish them from the continuous variables. An empty list implies that there are no categorical features with more than 2 classes.
drop	list (default=[ ])	Names of columns to be dropped.
PLOT_COLUMNS_SIZE	int (default=4)	Number of plots to display vertically in the display window. The row size is adjusted accordingly.
bin_size	int (default=10)	Number of bins for the histogram displayed in the categorical vs categorical category.
wspace	float32 (default=0.5)	Horizontal padding between subplots in the display window.
hspace	float32 (default=0.8)	Vertical padding between subplots in the display window.

Code Snippet

# The data set is taken from the famous Titanic data set (Kaggle)
import pandas as pd
from visualize_ML import relation
df = pd.read_csv("dataset/train.csv")
relation.plot(df,"Survived",["Survived","Pclass","Sex","SibSp","Ticket","Embarked"],drop=["PassengerId","Name"],bin_size=10)

see the dataset

Note: While plotting, all rows with NaN values and all columns with non-numeric values are removed; only numeric data is plotted. Only categorical target variables with string values are allowed.

Contribute

If you want to contribute and add a new feature, feel free to send a pull request here

This project is still under development, so to report bugs or request new features, head over to the Issues page

Tasks To Do

Make input compatible with other formats like NumPy.
Visualize best-fit lines and decision boundaries for various models to make parameter tuning easier.

and many others!

Licence

Licensed under The MIT License (MIT).

Copyright

Comments

Can't get graphs to space right

Not sure what is going on tried looking at the code.. I'm using Jupyter notebook if that is messing stuff up? data:

	state	region	age	gender	race	marital_status	ptype	status-grp
0	IA	3	73	M	W	M	Patient	NaN
1	IL	2	57	M	W	S	Patient	NaN
2	WI	2	32	F	W	U	Patient	NaN
3	WI	2	54	F	W	U	Patient	NaN
4	IL	2	56	F	W	M	Patient	NaN
5	WI	2	31	F	W	S	Patient

input line: explore.plot(df2,['state','region','age','gender','race','marital_status','ptype','status-grp'],PLOT_COLUMNS_SIZE=2,bin_size=20, bar_width=0.2,wspace=.75,hspace=.75) result:

opened by dartdog 6
Just installed but it required and executed a downgrade of MPL

The pip install downgraded MPL from 1.5.1 to 1.4.2 and also required the installation of "sudo apt-get install blt-dev" for freetype to build. I had not run into that before. Any advice on how to preserve Matplotlib at 1.5.1? And of course MPL 2.0 is about to drop soon as well. The package looks quite useful with some nice ideas!

opened by dartdog 2

Releases(0.2.2)

0.2.2(Aug 4, 2016)

0.1.2(Jul 31, 2016)

Owner

Ayush Singh
