Code to compute permutation and drop-column importances in Python scikit-learn models

Overview

Feature importances for scikit-learn machine learning models

By Terence Parr and Kerem Turgutlu. See Explained.ai for more stuff.

The scikit-learn Random Forest feature importance strategy is the mean decrease in impurity (or gini importance) mechanism, which is unreliable. To get reliable results, use permutation importance, provided in the rfpimp package in the src dir. Install with:

pip install rfpimp

We include permutation and drop-column importance measures that work with any sklearn model. Yes, rfpimp is an increasingly-ill-suited name, but we still like it.
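Drop-column importance is only named in this README, so the following is a minimal sketch assuming the usual definition: retrain the model without one column and report how much the validation score falls. The helper name `dropcol_importances_sketch` and the synthetic data are hypothetical and exist only for illustration; this is not the package's own code.

```python
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.linear_model import LinearRegression


def dropcol_importances_sketch(model, X_train, y_train, X_valid, y_valid):
    """Hypothetical helper: baseline score minus score after dropping each column."""
    baseline = clone(model).fit(X_train, y_train).score(X_valid, y_valid)
    imp = {}
    for col in X_train.columns:
        # Retrain from scratch without this column, then score on the same validation rows
        reduced = clone(model).fit(X_train.drop(columns=col), y_train)
        imp[col] = baseline - reduced.score(X_valid.drop(columns=col), y_valid)
    return pd.Series(imp, name="Importance").sort_values(ascending=False)


# Synthetic data (an assumption for the demo): 'signal' drives y, 'noise' does not
rng = np.random.default_rng(0)
X = pd.DataFrame({"signal": rng.normal(size=400), "noise": rng.normal(size=400)})
y = 3.0 * X["signal"] + rng.normal(scale=0.1, size=400)
X_tr, X_va = X.iloc[:300], X.iloc[300:]
y_tr, y_va = y.iloc[:300], y.iloc[300:]

dropcol_imp = dropcol_importances_sketch(LinearRegression(), X_tr, y_tr, X_va, y_va)
print(dropcol_imp)
```

Because every column requires a full retrain, this approach costs far more than permuting a column of an already-fitted model.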

Description

See Beware Default Random Forest Importances for a deeper discussion of the issues surrounding feature importances in random forests (authored by Terence Parr, Kerem Turgutlu, Christopher Csiszar, and Jeremy Howard).

The mean-decrease-in-impurity importance of a feature is computed by measuring how effective the feature is at reducing uncertainty (classifiers) or variance (regressors) when creating decision trees within random forests. The problem is that this mechanism, while fast, does not always give an accurate picture of importance. Strobl et al. pointed out in "Bias in random forest variable importance measures: Illustrations, sources and a solution" that “the variable importance measures of Breiman's original random forest method ... are not reliable in situations where potential predictor variables vary in their scale of measurement or their number of categories.”

A more reliable method is permutation importance, which measures the importance of a feature as follows. Record a baseline accuracy (classifier) or R^2 score (regressor) by passing a validation set or the out-of-bag (OOB) samples through the random forest. Permute the column values of a single predictor feature, then pass the same samples back through the random forest and recompute the accuracy or R^2. The importance of that feature is the difference between the baseline and the score after permutation, that is, the drop in overall accuracy or R^2 caused by permuting the column. The permutation mechanism is much more computationally expensive than the mean decrease in impurity mechanism, but the results are more reliable.
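The steps above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the rfpimp implementation: the helper name `permutation_importances_sketch` and the synthetic data are hypothetical, and the model's own `score` method stands in for the metric (R^2 for regressors, accuracy for classifiers).

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor


def permutation_importances_sketch(model, X_valid, y_valid, rng):
    """Hypothetical helper: baseline score minus score with one column shuffled."""
    baseline = model.score(X_valid, y_valid)
    imp = {}
    for col in X_valid.columns:
        X_perm = X_valid.copy()
        # Shuffle a single column; the fitted model is reused, never retrained
        X_perm[col] = rng.permutation(X_perm[col].to_numpy())
        imp[col] = baseline - model.score(X_perm, y_valid)
    return pd.Series(imp, name="Importance").sort_values(ascending=False)


# Synthetic data (an assumption for the demo): 'signal' drives y, 'noise' does not
rng = np.random.default_rng(0)
X = pd.DataFrame({"signal": rng.normal(size=500), "noise": rng.normal(size=500)})
y = 3.0 * X["signal"] + rng.normal(scale=0.1, size=500)
X_train, X_valid = X.iloc[:400], X.iloc[400:]
y_train, y_valid = y.iloc[:400], y.iloc[400:]

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_train, y_train)
perm_imp = permutation_importances_sketch(model, X_valid, y_valid, rng)
print(perm_imp)
```

A feature the model ignores should land near zero, and small negative values can appear purely from the randomness of the shuffle.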

Sample code

See the notebooks directory for things like Collinear features and Plotting feature importances.

Here's some sample Python code that uses the rfpimp package contained in the src directory. The data can be found in rent.csv, which is a subset of the data from Kaggle's Two Sigma Connect: Rental Listing Inquiries competition.

from rfpimp import *
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import train_test_split

df_orig = pd.read_csv("/Users/parrt/github/random-forest-importances/notebooks/data/rent.csv")

df = df_orig.copy()

# attenuate effect of outliers in price
df['price'] = np.log(df['price'])

df_train, df_test = train_test_split(df, test_size=0.20)

features = ['bathrooms','bedrooms','longitude','latitude',
            'price']
df_train = df_train[features]
df_test = df_test[features]

X_train, y_train = df_train.drop('price',axis=1), df_train['price']
X_test, y_test = df_test.drop('price',axis=1), df_test['price']
X_train['random'] = np.random.random(size=len(X_train))
X_test['random'] = np.random.random(size=len(X_test))

rf = RandomForestRegressor(n_estimators=100, n_jobs=-1)
rf.fit(X_train, y_train)

imp = importances(rf, X_test, y_test) # permutation
viz = plot_importances(imp)
viz.view()


df_train, df_test = train_test_split(df_orig, test_size=0.20)
features = ['bathrooms','bedrooms','price','longitude','latitude',
            'interest_level']
df_train = df_train[features]
df_test = df_test[features]

X_train, y_train = df_train.drop('interest_level',axis=1), df_train['interest_level']
X_test, y_test = df_test.drop('interest_level',axis=1), df_test['interest_level']
# Add column of random numbers
X_train['random'] = np.random.random(size=len(X_train))
X_test['random'] = np.random.random(size=len(X_test))

rf = RandomForestClassifier(n_estimators=100,
                            min_samples_leaf=5,
                            n_jobs=-1,
                            oob_score=True)
rf.fit(X_train, y_train)

imp = importances(rf, X_test, y_test, n_samples=-1)
viz = plot_importances(imp)
viz.view()

Feature correlation

See Feature collinearity heatmap. We can compute Spearman's correlation matrix:
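The package's own helper for this is not shown here, so as a stand-in the sketch below computes a Spearman rank correlation matrix with plain pandas. The `toy` frame is synthetic and its column names are borrowed from the rent example purely for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (an assumption): bathrooms tracks bedrooms, 'random' is noise
rng = np.random.default_rng(0)
bedrooms = rng.integers(0, 5, size=200)
toy = pd.DataFrame({
    "bedrooms": bedrooms,
    "bathrooms": bedrooms + rng.integers(0, 2, size=200),
    "random": rng.random(200),
})

# Rank-based correlation: symmetric, with ones on the diagonal
spearman = toy.corr(method="spearman")
print(spearman.round(2))
```

Spearman's coefficient works on ranks, so it picks up monotonic relationships even when they are not linear.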

Feature dependencies

The features we use in machine learning are rarely completely independent, which makes interpreting feature importance tricky. We could compute correlation coefficients, but that only identifies linear relationships. A way to at least identify if a feature, x, is dependent on other features is to train a model using x as a dependent variable and all other features as independent variables. Because random forests give us an easy out-of-bag (OOB) estimate, the feature dependence functions rely on random forest models. The OOB R^2 score from the model indicates how easy it is to predict feature x using the other features. The higher the score, the more dependent feature x is.

You can also get a feature dependence matrix / heatmap: a non-symmetric data frame in which each row gives the importance of every other variable when the row's variable is used as the model target. Example:
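As a rough sketch of the dependence idea (not the package's own code), the snippet below treats each column in turn as the target, fits a random forest on the remaining columns, and records the OOB R^2. The helper name `feature_dependence_sketch` and the synthetic columns `a`, `b`, `c` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor


def feature_dependence_sketch(X, n_estimators=50, random_state=0):
    """Hypothetical helper: OOB R^2 when predicting each column from the others."""
    scores = {}
    for target in X.columns:
        rf = RandomForestRegressor(n_estimators=n_estimators,
                                   oob_score=True,
                                   random_state=random_state)
        rf.fit(X.drop(columns=target), X[target])
        scores[target] = rf.oob_score_  # high score: target is predictable from the rest
    return pd.Series(scores, name="Dependence").sort_values(ascending=False)


# Synthetic data (an assumption): b is almost a multiple of a, c is independent
rng = np.random.default_rng(0)
a = rng.normal(size=300)
X_dep = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.1, size=300),
    "c": rng.normal(size=300),
})

dep = feature_dependence_sketch(X_dep)
print(dep)
```

Fixing `random_state` on every forest, as done here, makes the per-feature scores repeatable from run to run.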

Comments
  • SyntaxError: invalid syntax

    When I import rfpimp, I get the error below:

      File "/Users/yan/anaconda/lib/python3.5/site-packages/rfpimp.py", line 518
        ax.xaxis.set_major_formatter(FormatStrFormatter(f'%.{xtick_precision}f'))
        ^
    SyntaxError: invalid syntax

    opened by Yanjiayork 10
  • Incorrect references to sklearn?

    Hello,

    I have rfpimp ver 1.3.6 installed as well as sklearn 0.24.1. When I ran a script that used them, I got this error:

      File "C:...\anaconda3\envs...\lib\site-packages\rfpimp.py", line 16, in <module>
        from sklearn.ensemble.forest import _generate_unsampled_indices
    ModuleNotFoundError: No module named 'sklearn.ensemble.forest'

    I dug into it and found that sklearn.ensemble.forest is, in my version, sklearn.ensemble._forest and _generate_unsampled_indices does reside there. While it's possible that something is wrong on my end, my guess is that sklearn has changed? I may change rfpimp.py on my own to match sklearn. I hope it doesn't break my computer. Thanks!

    compatibility 
    opened by mgandaman 8
  • TypeError: barh() missing 1 required positional argument: 'bottom'

    Installed the package and am working through your classifier example on rents: https://github.com/parrt/random-forest-importances/blob/master/notebooks/permutation-importances-classifier.ipynb

    I immediately get the missing-argument error in block 3 of the notebook. I have searched the rfpimp functions and do not see where it could be missing.

    lack of activity 
    opened by TNFCFA 7
  • 'deep' is an invalid keyword argument for this function

    I am trying to get the feature importance of my random forest model, but I keep getting the following error:

    'deep' is an invalid keyword argument for this function

    Below is the entire error output:

    TypeError                                 Traceback (most recent call last)
    <ipython-input-...> in <module>()
          1 #getting imortance for features using permutation importance
          2
    ----> 3 perm_imp_rfpimp_rf10 = permutation_importances(rf_10, train_features_x, train_labels_y, rdt10)
          4 perm_imp_rfpimp_rf100 = permutation_importances(rf_100, train_features_x, train_labels_y, rdt100)
          5 perm_imp_rfpimp_rf1000 = permutation_importances(rf_1000, train_features_x, train_labels_y, rdt1000)

    ~/anaconda2/envs/py36/lib/python3.6/site-packages/rfpimp.py in permutation_importances(rf, X_train, y_train, metric, n_samples)
        286
        287 def permutation_importances(rf, X_train, y_train, metric, n_samples=5000):
    --> 288     imp = permutation_importances_raw(rf, X_train, y_train, metric, n_samples)
        289     I = pd.DataFrame(data={'Feature':X_train.columns, 'Importance':imp})
        290     I = I.set_index('Feature')

    ~/anaconda2/envs/py36/lib/python3.6/site-packages/rfpimp.py in permutation_importances_raw(rf, X_train, y_train, metric, n_samples)
        403
        404     baseline = metric(rf, X_sample, y_sample)
    --> 405     X_train = X_sample.copy(deep=False,axes=True) # shallow copy
        406     y_train = y_sample
        407     imp = []

    TypeError: 'deep' is an invalid keyword argument for this function

    My inputs involve providing a function based metric as below:

    def rdt10(rf_10, train_features_x, train_labels_y):
        return r2_score(train_labels_y, rf_10.predict(train_features_x))

    def rdt100(rf_100, train_features_x, train_labels_y):
        return r2_score(train_labels_y, rf_100.predict(train_features_x))

    def rdt1000(rf_1000, train_features_x, train_labels_y):
        return r2_score(train_labels_y, rf_1000.predict(train_features_x))

    and then calling it in the permutation importance function below (this is what gives the error output from above):

    perm_imp_rfpimp_rf10 = permutation_importances(rf_10, train_features_x, train_labels_y, rdt10)
    perm_imp_rfpimp_rf100 = permutation_importances(rf_100, train_features_x, train_labels_y, rdt100)
    perm_imp_rfpimp_rf1000 = permutation_importances(rf_1000, train_features_x, train_labels_y, rdt1000)

    rf_10, rf_100, rf_1000 are my random forest models using 10, 100, and 1000 estimators.

    Please help me figure out how to address this error:

    can't reproduce 
    opened by ebuka-nweke 6
  • Add missing arg to _generate_unsampled_indices

    Fixes #27

    In sklearn 0.22 sklearn._forest._generate_unsampled_indices(random_state, n_samples) changed signature to sklearn._forest._generate_unsampled_indices(random_state, n_samples, n_samples_bootstrap).

    I used sklearn._forest._get_n_samples_bootstrap(n_samples, n_samples) with the same number of samples and passed the result to the new arg just to avoid raising the exception.

    compatibility 
    opened by matheusccouto 6
  • Incompatible with latest version of sklearn

    In https://github.com/scikit-learn/scikit-learn/pull/14964, modules in ensemble have been made private, breaking this line of code (forest has become _forest).

    compatibility 
    opened by yuchaoran2011 5
  • what's the difference between rfpimp.importances and rfpimp.permutation_importances?

    I noticed in the README that rfpimp.importances was used, whereas in this blog they used from rfpimp import permutation_importances.

    On an unrelated note, I tried both this repo's implementation of permutation importance and also eli5's implementation and got very different results. If anyone has tried both before I would like to hear your experience.

    question 
    opened by hmanz 5
  • SyntaxError in python2.7

    Syntax error in python2.7 (it does work in python3). If rfpimp is not supposed to work in 2.7, you might want to consider mentioning it in the README.

    import rfpimp

      File "/Users/diegomazon/anaconda/lib/python2.7/site-packages/rfpimp.py", line 40
        self.svgfilename = f"{tmp}/PimpViz_{getpid()}.svg"
        ^
    SyntaxError: invalid syntax

    portability 
    opened by diego-mazon 5
  • Error with oob_importances with scikit-learn 0.22.1

    oob_importances internally uses _generate_unsampled_indices, which is a private function within scikit-learn. In scikit-learn 0.22.1 the function signature of _generate_unsampled_indices has changed from _generate_unsampled_indices(random_state, n_samples) to _generate_unsampled_indices(random_state, n_samples, n_samples_bootstrap). This signature change can be seen here

    compatibility 
    opened by mkhan037 4
  • Feature correlation p-values and correction methods

    Wanted to get the conversation open on feature correlation, right now it just does a naive spearmanr, with no insight into the resulting p-values. Would be great to do a few things, listed below in order of importance:

    1. Introduce p-values and maybe apply the appropriate cutoffs
    2. Introduce permutation based correlation, starting off with lagged correlations for example (context is time series analysis)
    3. Introduce a probability correction method for 1 and/or 2 such as bonferroni, to account for the number of correlation estimates we're doing between features and between number of lags if we end up implementing #2.

    Happy to get the conversation going and see where we end up. Right now the feature correlation estimation is not quite stable in the context of very noisy time series data.

    enhancement 
    opened by feribg 4
  • ModuleNotFoundError: No module named 'sklearn.ensemble.forest'

    Hello, I am trying to import rfpimp however I am met by the error:

    ---------------------------------------------------------------------------
    ModuleNotFoundError                       Traceback (most recent call last)
    <ipython-input-133-c95d15dec9fe> in <module>
         24 import matplotlib.patheffects as PathEffects
         25 from pandas.plotting import lag_plot
    ---> 26 from rfpimp import *
         27 
         28 # Machine Learning libraries
    
    ~/opt/anaconda3/lib/python3.8/site-packages/rfpimp.py in <module>
         13 from sklearn.ensemble import RandomForestClassifier
         14 from sklearn.ensemble import RandomForestRegressor
    ---> 15 from sklearn.ensemble.forest import _generate_unsampled_indices
         16 from sklearn.ensemble import forest
         17 from sklearn.model_selection import cross_val_score
    
    ModuleNotFoundError: No module named 'sklearn.ensemble.forest'
    

    It seems that sklearn.ensemble.forest was renamed to sklearn.ensemble._forest (see here)

    I'd have to install an older version for sklearn however that would break other dependencies I have. Is there a fix around this? Thanks

    opened by ziadzee 3
  • What's the meaning of <0 values?

    Just tested the code from index page, some of my features have negative values, does it mean reverse-related to target feature or something else? Thank you.

    opened by fisherss 0
  • Questions Regarding Alternative Feature Importance

    How many other forms of feature importance are there, and how are they different from one another?

    • Shapley-based
      • https://github.com/slundberg/shap
      • https://github.com/iancovert/sage
    • LOFO https://github.com/aerdem4/lofo-importance
    • LIME https://github.com/marcotcr/lime
    • Gini and Split https://github.com/shionhonda/feature-importance
    • Permutation https://github.com/nestordemeure/permutationImportance
    • "Unbiased" https://github.com/ZhengzeZhou/unbiased-feature-importance
    • Morris, and Partial Dependence https://github.com/interpretml/interpret#supported-techniques

    P.S. This repo's design is absurd https://github.com/ModelOriented/DALEX

    opened by BrandonKMLee 0
  • An error occurred when the test file was run

    I got an error running "permutation-importances-classifier", “forest” seems to be updated to “_forest” in sklearn. I changed "from sklearn.ensemble.forest import _generate_unsampled_indices" to "from sklearn.ensemble._forest import _generate_unsampled_indices" and it worked fine.

    In the same code, "unsampled_indices = _generate_unsampled_indices(tree.random_state, n_samples)" raises "TypeError: _generate_unsampled_indices() missing 1 required positional argument: 'n_samples_bootstrap'" when running. The function _generate_unsampled_indices is defined as: "def _generate_unsampled_indices(random_state, n_samples, n_samples_bootstrap):".

    opened by LilWei-DU 1
  • Varying Dependency Value

    When I use the "feature_dependence_matrix" function to get the dependency of each independent variable, the values change every time I run the code. Specifying random_state only gives me a constant overall dependency regardless of how many times I run the code; the individual dependencies still change.

    Is there any way I could obtain fixed individual dependency values every time?

    Thanks!

    opened by Joprou 0
  • AttributeError: 'numpy.ndarray' object has no attribute 'columns'

      File "D:\python\lib\site-packages\rfpimp.py", line 143, in importances
        features = X_valid.columns.values
    AttributeError: 'numpy.ndarray' object has no attribute 'columns'

    opened by VincentOld 1
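Several of the reports above trace back to two scikit-learn changes described in the comments: the forest module became private (`sklearn.ensemble._forest`), and `_generate_unsampled_indices` gained a third argument, `n_samples_bootstrap`. A minimal compatibility sketch, assuming those descriptions are accurate, might look like the following. The wrapper name `oob_indices_compat` is hypothetical, it is not the fix that shipped in rfpimp, and it relies on private scikit-learn functions that may change again.

```python
import inspect

try:
    # Newer scikit-learn keeps the helpers in a private module
    from sklearn.ensemble import _forest as forest_mod
except ImportError:
    # Older releases exposed them as sklearn.ensemble.forest
    from sklearn.ensemble import forest as forest_mod


def oob_indices_compat(random_state, n_samples):
    """Hypothetical wrapper: call the private helper with whichever signature it has."""
    fn = forest_mod._generate_unsampled_indices
    if "n_samples_bootstrap" in inspect.signature(fn).parameters:
        # Assumption: the bootstrap sample is as large as the data set itself
        return fn(random_state, n_samples, n_samples)
    return fn(random_state, n_samples)


idx = oob_indices_compat(42, 100)
print(len(idx), "of 100 rows were left out of this bootstrap sample")
```

Checking the signature at call time avoids pinning the code to one scikit-learn version, at the cost of still depending on an undocumented function.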
Releases (1.3.7)
  • 1.3 (Oct 22, 2018)

    • Added plot_dependence_heatmap() to plot feature dependence heat maps
    • Improved feature importance plots so that the bars are always the same. You can specify a title and there is better scaling support.
    • The plotting routines return PimpViz objects that by default render the current matplotlib image via SVG, getting a much sharper image than the default PNG.
    • dropcol importance was relying on OOB scores instead of the more general model scoring/metric.
    • Added a stemplot version that mimics the bar chart for feature importance.
    • Added precision argument to the correlation heat map function.
    • Rebuilt the notebook examples and the ones that generate images for the paper.
    • Added a section to the paper that shows the feature dependence heat map applied to the breast-cancer data set.
Owner
Terence Parr
Creator of the ANTLR parser generator. Professor at Univ of San Francisco, computer science and data science. Working mostly on machine learning stuff now.