visualize_ML is a python package made to visualize some of the steps involved while dealing with a Machine Learning problem

Overview

visualize_ML

visualize_ML is a Python package made to visualize some of the steps involved in dealing with a Machine Learning problem. It is built on libraries such as matplotlib for visualization and sklearn and scipy for statistical computations.

Requirement

  • python 2.x or python 3.x

Install

Install dependencies needed for matplotlib

sudo apt-get build-dep python-matplotlib

Install it using pip

pip install visualize_ML

Let's Code

When dealing with a Machine Learning problem, some of the initial steps are data exploration and analysis, followed by feature selection. Below are the modules for these tasks.

1) Data Exploration

At this stage, we explore variables one by one using univariate analysis, which depends on whether the variable type is categorical or continuous. The explore module handles this.

>>> explore module

visualize_ML.explore.plot(data_input,categorical_name=[],drop=[],PLOT_COLUMNS_SIZE=4,bin_size=20,
bar_width=0.2,wspace=0.5,hspace=0.8)

Continuous Variables: For continuous variables, it plots a histogram for every variable and gives descriptive statistics for them.

Categorical Variables: For categorical variables with 2 or more classes, it plots a bar chart for every variable and gives descriptive statistics for them.
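To make the two cases concrete, here is a minimal sketch of the kind of summary a univariate analysis produces. It is not the package's implementation: the helper names `summarize_continuous` and `summarize_categorical`, the choice of statistics, and the sample values are assumptions made for illustration, and only the standard library is used.

```python
from collections import Counter
import statistics


def summarize_continuous(values, bins=4):
    """Descriptive statistics plus equal-width histogram counts."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 when all values are equal
    counts = [0] * bins
    for v in values:
        # The maximum value would land one past the end, so clamp it
        # into the last bin.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
        "min": lo,
        "max": hi,
        "hist": counts,
    }


def summarize_categorical(values):
    """Class counts (the heights of a bar chart) and the most common class."""
    counts = Counter(values)
    return {
        "classes": len(counts),
        "mode": counts.most_common(1)[0][0],
        "counts": dict(counts),
    }


# Hypothetical sample values, loosely modelled on Titanic-style columns.
ages = [22, 38, 26, 35, 35, 54, 2, 27]
embarked = ["S", "C", "S", "S", "Q", "S", "C", "S"]

cont = summarize_continuous(ages)
cat = summarize_categorical(embarked)
print(cont)
print(cat)
```

The histogram counts correspond to the bar heights a histogram plot would show for the continuous column, and the class counts to the bars of a bar chart for the categorical column.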

| Parameters | Type | Description |
| --- | --- | --- |
| data_input | DataFrame | The input DataFrame with all data. (Currently, the input can only be a DataFrame.) |
| categorical_name | list (default=[ ]) | Names of all categorical variable columns with more than 2 classes, to distinguish them from the continuous variables. An empty list implies that there are no categorical features with more than 2 classes. |
| drop | list (default=[ ]) | Names of columns to be dropped. |
| PLOT_COLUMNS_SIZE | int (default=4) | Number of plots to display vertically in the display window. The row size is adjusted accordingly. |
| bin_size | int (default="auto") | Number of bins for the histogram displayed in the categorical vs categorical category. |
| wspace | float32 (default=0.5) | Horizontal padding between subplots in the display window. |
| hspace | float32 (default=0.8) | Vertical padding between subplots in the display window. |

Code Snippet

# The data set is taken from the famous Titanic data set (Kaggle)

import pandas as pd
from visualize_ML import explore
df = pd.read_csv("dataset/train.csv")
explore.plot(df,["Survived","Pclass","Sex","SibSp","Ticket","Embarked"],drop=["PassengerId","Name"])

Note: While plotting, all rows with NaN values and all columns with character values are removed (except when the values are True and False); only numeric data is plotted.

2) Feature Selection

This is one of the more challenging parts of an ML task. Here we do bivariate analysis to find the relationship between two variables, looking for association and disassociation between variables at a pre-defined significance level.

The relation module helps visualize the analysis done on various combinations of variables and see the relationships between them.

>>> relation module

visualize_ML.relation.plot(data_input,target_name="",categorical_name=[],drop=[],bin_size=10)

Continuous vs Continuous variables: For the bivariate analysis, scatter plots are drawn, since their pattern indicates the relationship between the variables. The strength of the relationship is measured by the correlation between them.

The graph displays the correlation coefficient along with other information.

Correlation = Covariance(X,Y) / SQRT(Var(X) * Var(Y))
  • -1: perfect negative linear correlation
  • +1: perfect positive linear correlation
  • 0: no linear correlation
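The formula above can be sketched directly in plain Python. This is an illustrative sketch rather than the package's own code; the function name `pearson_correlation` and the sample data are assumptions.

```python
import math


def pearson_correlation(x, y):
    """Correlation = Covariance(X, Y) / sqrt(Var(X) * Var(Y))."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Population covariance and variances; the 1/n factors cancel in the
    # ratio, so using sample (n - 1) versions would give the same result.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    return cov / math.sqrt(var_x * var_y)


x = [1, 2, 3, 4, 5]
r_pos = pearson_correlation(x, [2, 4, 6, 8, 10])   # y rises linearly with x
r_neg = pearson_correlation(x, [10, 8, 6, 4, 2])   # y falls linearly with x
r_zero = pearson_correlation(x, [1, 2, 3, 2, 1])   # symmetric, no linear trend
print(r_pos, r_neg, r_zero)
```

The three calls illustrate the three reference points in the list: a value close to +1, a value close to -1, and a value close to 0.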

Categorical vs Categorical variables: Stacked column charts are drawn to visualize the relation. The chi-square test is used to derive the statistical significance of the relationship between the variables. It returns the probability for the computed chi-square statistic given the degrees of freedom.

Probability of 0: indicates that the two categorical variables are dependent.

Probability of 1: indicates that the two variables are independent.

The graph displays the p_value along with other information. If it is less than 0.05, the variables are considered dependent.
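As a rough illustration of how the chi-square probability arises, the sketch below computes the statistic from a contingency table of observed counts. It is an assumption-laden example, not the package's code: the function names and the tables are hypothetical, and the p-value helper is valid only for 1 degree of freedom (a 2x2 table), where it reduces to the complementary error function. In practice a library routine such as `scipy.stats.chi2_contingency` would normally be used; note that it applies a continuity correction to 2x2 tables by default, so its numbers can differ from this uncorrected sketch.

```python
import math


def chi_square_statistic(table):
    """Return (statistic, degrees of freedom) for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


def p_value_one_dof(stat):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2))


# Hypothetical 2x2 tables: rows are classes of one variable, columns of the other.
stat_dep, dof_dep = chi_square_statistic([[90, 10], [10, 90]])
stat_ind, dof_ind = chi_square_statistic([[50, 50], [50, 50]])

p_dep = p_value_one_dof(stat_dep)  # far below 0.05: variables look dependent
p_ind = p_value_one_dof(stat_ind)  # equal to 1: no evidence of dependence
print(stat_dep, p_dep)
print(stat_ind, p_ind)
```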

Categorical vs Continuous variables: To explore the relation between categorical and continuous variables, box plots are drawn at each level of the categorical variable. If the levels are few in number, the plot will not show the statistical significance. The ANOVA test is used to derive the statistical significance of the relationship between the variables.

The graph displays the p_value along with other information. If it is less than 0.05, the variables are considered dependent.
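The quantity behind the ANOVA p_value is the F statistic, which compares the variation between the group means with the variation inside each group. The sketch below computes it in plain Python under illustrative assumptions: the function name `anova_f_statistic` and the sample groups are hypothetical, and this is not the package's own code. Turning F into a p-value needs the F distribution, for which a routine such as `scipy.stats.f_oneway` is typically used.

```python
def anova_f_statistic(groups):
    """One-way ANOVA: return (F, df_between, df_within) for a list of groups."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(g) / len(g) for g in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Variation of the observations around their own group mean.
    ss_within = sum(
        (v - m) ** 2 for g, m in zip(groups, group_means) for v in g
    )

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical continuous values, one list per level of a categorical variable.
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_stat, df_between, df_within = anova_f_statistic(groups)
print(f_stat, df_between, df_within)
```

A large F relative to the degrees of freedom means the group means differ by more than the within-group spread would explain, which is what a small p_value reports.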

| Parameters | Type | Description |
| --- | --- | --- |
| data_input | DataFrame | The input DataFrame with all data. (Currently, the input can only be a DataFrame.) |
| target_name | String | The name of the target column. |
| categorical_name | list (default=[ ]) | Names of all categorical variable columns with more than 2 classes, to distinguish them from the continuous variables. An empty list implies that there are no categorical features with more than 2 classes. |
| drop | list (default=[ ]) | Names of columns to be dropped. |
| PLOT_COLUMNS_SIZE | int (default=4) | Number of plots to display vertically in the display window. The row size is adjusted accordingly. |
| bin_size | int (default="auto") | Number of bins for the histogram displayed in the categorical vs categorical category. |
| wspace | float32 (default=0.5) | Horizontal padding between subplots in the display window. |
| hspace | float32 (default=0.8) | Vertical padding between subplots in the display window. |

Code Snippet

# The data set is taken from the famous Titanic data set (Kaggle)
import pandas as pd
from visualize_ML import relation
df = pd.read_csv("dataset/train.csv")
relation.plot(df,"Survived",["Survived","Pclass","Sex","SibSp","Ticket","Embarked"],drop=["PassengerId","Name"],bin_size=10)

Note: While plotting, all rows with NaN values and all columns with non-numeric values are removed; only numeric data is plotted. String values are allowed only for a categorical target variable.

Contribute

If you want to contribute or add a new feature, feel free to send a pull request.

This project is still under development. To report bugs or request new features, head over to the Issues page.

Tasks To Do

  • Make input compatible with other formats like Numpy.

  • Visualize best fit lines and decision boundaries for various models to make the parameter tuning task easier.

    and many others!

Licence

Licensed under The MIT License (MIT).

Copyright

ayush1997(c) 2016

Releases(0.2.2)
Owner
Ayush Singh