A Guide for Feature Engineering and Feature Selection, with implementations and examples in Python.

Overview

Feature Engineering & Feature Selection

A comprehensive guide [pdf] [markdown] for Feature Engineering and Feature Selection, with implementations and examples in Python.

Motivation

Feature Engineering & Selection is the most essential part of building a useable machine learning project, even though hundreds of cutting-edge machine learning algorithms coming in these days like deep learning and transfer learning. Indeed, like what Prof Domingos, the author of  'The Master Algorithm' says:

“At the end of the day, some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used.”

— Prof. Pedro Domingos

001

Data and feature has the most impact on a ML project and sets the limit of how well we can do, while models and algorithms are just approaching that limit. However, few materials could be found that systematically introduce the art of feature engineering, and even fewer could explain the rationale behind. This repo is my personal notes from learning ML and serves as a reference for Feature Engineering & Selection.

Download

Download the PDF here:

Same, but in markdown:

PDF has a much readable format, while Markdown has auto-generated anchor link to navigate from outer source. GitHub sucks at displaying markdown with complex grammar, so I would suggest read the PDF or download the repo and read markdown with Typora.

What You'll Learn

Not only a collection of hands-on functions, but also explanation on Why, How and When to adopt Which techniques of feature engineering in data mining.

  • the nature and risk of data problem we often encounter
  • explanation of the various feature engineering & selection techniques
  • rationale to use it
  • pros & cons of each method
  • code & example

Getting Started

This repo is mainly used as a reference for anyone who are doing feature engineering, and most of the modules are implemented through scikit-learn or its communities.

To run the demos or use the customized function, please download the ZIP file from the repo or just copy-paste any part of the code you find helpful. They should all be very easy to understand.

Required Dependencies:

  • Python 3.5, 3.6 or 3.7
  • numpy>=1.15
  • pandas>=0.23
  • scipy>=1.1.0
  • scikit_learn>=0.20.1
  • seaborn>=0.9.0

Table of Contents and Code Examples

Below is a list of methods currently implemented in the repo.

1. Data Exploration

2. Feature Cleaning

3. Feature Engineering

4. Feature Selection

Key Links and Resources

  • Udemy's Feature Engineering online course

https://www.udemy.com/feature-engineering-for-machine-learning/

  • Udemy's Feature Selection online course

https://www.udemy.com/feature-selection-for-machine-learning

  • JMLR Special Issue on Variable and Feature Selection

http://jmlr.org/papers/special/feature03.html

  • Data Analysis Using Regression and Multilevel/Hierarchical Models, Chapter 25: Missing data

http://www.stat.columbia.edu/~gelman/arm/missing.pdf

  • Data mining and the impact of missing data

http://core.ecu.edu/omgt/krosj/IMDSDataMining2003.pdf

  • PyOD: A Python Toolkit for Scalable Outlier Detection

https://github.com/yzhao062/pyod

  • Weight of Evidence (WoE) Introductory Overview

http://documentation.statsoft.com/StatisticaHelp.aspx?path=WeightofEvidence/WeightofEvidenceWoEIntroductoryOverview

  • About Feature Scaling and Normalization

http://sebastianraschka.com/Articles/2014_about_feature_scaling.html

  • Feature Generation with RF, GBDT and Xgboost

https://blog.csdn.net/anshuai_aw1/article/details/82983997

  • A review of feature selection methods with applications

https://ieeexplore.ieee.org/iel7/7153596/7160221/07160458.pdf

Owner
Yimeng.Zhang
I'm a lovely machine learning learner~
Yimeng.Zhang
Wordler - A program to support you to solve the wordle puzzles

solve wordle (https://www.powerlanguage.co.uk/wordle) A program to support you t

Viktor Martinović 2 Jan 17, 2022
A lighweight screen color picker tool

tkpick A lighweigt screen color picker tool Availability Only GNU/Linux 🐧 Installing Install via pip (No auto-update): [sudo] pip install tkpick Usa

Adil Gürbüz 7 Aug 30, 2021
🍕 A small app with capabilities ordering food and listing them with pub/sub pattern

food-ordering A small app with capabilities ordering food and listing them. Prerequisites Docker Run Tests docker-compose run --rm web ./manage.py tes

Muhammet Mücahit 1 Jan 14, 2022
This project intends to take the user's CEP (brazilian adress code) and return the local in which the CEP is placed.

This project aims to simply return the CEP's (the brazilian resident adress code) User of the application. The project uses a request and passes on to

Daniel Soares Saldanha 4 Nov 17, 2021
Xbps-install wrapper written in Python that doesn't care about case sensitiveness and package versions

xbi Xbps-install wrapper written in Python that doesn't care about case sensitiveness and package versions. Description This Python script can be easi

Emanuele Sabato 5 Apr 11, 2022
Hospitality app for ERPNext to manage hotels & restaurants.

Hospitality ERPNext Hospitality module is designed to handle workflows for Hotels and Restaurants. Manage Restaurants The Restaurant module in ERPNext

Frappe 19 Dec 26, 2022
Plock : A stack based programming language

Plock : A stack based programming language

1 Oct 25, 2021
Script de monitoramento de telemetria para missões espaciais, cansat e foguetemodelismo.

Aeroespace_GroundStation Script de monitoramento de telemetria para missões espaciais, cansat e foguetemodelismo. Imagem 1 - Dashboard realizando moni

Vinícius Azevedo 5 Nov 27, 2022
General Purpose Python Library by Techman

General Purpose Python Library by Techman

Jack Hubbard 0 Feb 09, 2022
Slientruss3d : Python for stable truss analysis tool

slientruss3d : Python for stable truss analysis tool Desciption slientruss3d is a python package which can solve the resistances, internal forces and

3 Dec 26, 2022
Pomodoro timer by the Algodrip team!

PomoDrip 🍅 Pomodoro timer by the Algo Drip team! To-do: Create the script for the pomodoro timer Design the front-end of the program (Flask or Javasc

Algodrip 3 Sep 12, 2021
ChainJacking is a tool to find which of your Go lang direct GitHub dependencies is susceptible to ChainJacking attack.

ChainJacking is a tool to find which of your Go lang direct GitHub dependencies is susceptible to ChainJacking attack.

Checkmarx 36 Nov 02, 2022
Blender Addon for Snapping a UV to a specific part of a Tilemap

UVGridSnapper A simple Blender Addon for easier texturing. A menu in the UV editor allows a square UV to be snapped to an Atlas texture, or Tilemap. P

2 Jul 17, 2022
This is a Fava extension to display a grouped portfolio view in Fava for a set of Beancount accounts.

Fava Portfolio Summary This is a Fava extension to display a grouped portfolio view in Fava for a set of Beancount accounts. It can also calculate MWR

18 Dec 26, 2022
addon for blender to import mocap data from tools like easymocap, frankmocap and Vibe

b3d_mocap_import addon for blender to import mocap data from tools like easymocap, frankmocap and Vibe ==================VIBE================== To use

Carlos Barreto 97 Dec 07, 2022
Basic Hspice runner with Python

HSpicePy Bilgisayarınıza PATH değişkenlerine eklediğiniz HSPICE programını python ile çalıştırmanızı sağlayan basit bir araç. A simple tool that allow

1 Nov 16, 2021
A module comment generator for python

Module Comment Generator The comment style is as a tribute to the comment from the RA . The comment generator can parse the ast tree from the python s

飘尘 1 Oct 21, 2021
Add all JuliaLang unicode abbreviations to AutoKey.

Autokey Unicode characters Usage This script adds all the unicode character abbreviations supported by Julia to autokey. However, instead of [TAB], th

Randolf Scholz 49 Dec 02, 2022
A student information management system in Python

Student-information-management-system 本项目是一个学生信息管理系统,这个项目是用Python语言实现的,也实现了图形化界面的显示,同时也实现了管理员端,学生端两个登陆入口,同时底层使用的是Redis做的数据持久化。 This project is a stude

liuyunfei 7 Nov 15, 2022
Autogenerador tonto de paquetes para ROSCPP

Autogenerador tonto de paquetes para ROSCPP Autogenerador de paquetes que usan C++ en ROS. Por ahora tiene las siguientes capacidades: Permite crear p

1 Nov 26, 2021