
3 Feature Binning Methods!

2022-08-09 18:45:00 【Junhong's data analysis road】

When building a classification model, feature engineering often requires discretizing continuous variables, that is, converting a continuous field into a discrete one.

Discretization re-encodes the continuous variable. After discretization the model tends to be more stable and the risk of overfitting is reduced. This article introduces 3 common feature binning methods: equal-width, equal-frequency, and clustering-based binning.

Characteristics of feature binning

Binning a continuous variable presents the information in the data more concisely
It removes the effect of the feature's scale (units), because after binning the values are category numbers such as 0, 1, 2, ...
It reduces the influence of outliers to some extent, which makes the feature more robust to abnormal values
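
The robustness claim can be illustrated with a minimal sketch. It uses the INCOME values from this article's simulated data and a simple two-bin split at the median; the helper `median_split` and the extreme value 150000 are made up for illustration and are not part of the original article.

```python
import numpy as np

def median_split(values):
    """Label each value 0 (at or below the median) or 1 (above the median)."""
    values = np.asarray(values, dtype=float)
    return (values > np.median(values)).astype(int)

income = [0, 10, 20, 150, 35, 78, 50, 49, 88, 14]
# Hypothetical variant: the largest value is replaced by an extreme outlier
income_extreme = [0, 10, 20, 150000, 35, 78, 50, 49, 88, 14]

labels = median_split(income)
labels_extreme = median_split(income_extreme)

print(labels)
print(labels_extreme)
print(np.array_equal(labels, labels_extreme))  # True: the outlier does not change any label
```

The raw value changes by three orders of magnitude, but the binned feature is identical, because only the rank of the value relative to the split point matters.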

Simulated data

We simulate a small dataset with an income field, INCOME:

In [1]:

import pandas as pd
import numpy as np

In [2]:

df = pd.DataFrame({"ID":range(10),
                  "INCOME":[0,10,20,150,35,78,50,49,88,14]})
df

The KBinsDiscretizer class in sklearn

All 3 methods in this article are based on the KBinsDiscretizer class in sklearn. Official documentation:

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.KBinsDiscretizer.html

from sklearn.preprocessing import KBinsDiscretizer

sklearn.preprocessing.KBinsDiscretizer(n_bins=5, 
                                       encode='onehot', 
                                       strategy='quantile', 
                                       dtype=None, 
                                       subsample='warn', 
                                       random_state=None)

The official documentation linked above explains all parameters and attributes.

This article focuses on 3 parameters:

n_bins

The n_bins parameter specifies the number of bins to produce; the default is 5.

strategy

The strategy parameter selects the binning strategy. KBinsDiscretizer implements several strategies:

Equal width: the uniform strategy uses bins of fixed, identical width
Equal frequency: the quantile strategy uses the quantiles of each feature so that every bin holds roughly the same number of samples
Clustering: the kmeans strategy defines the bins by running a k-means clustering independently on each feature

encode

The encode parameter controls how the binned (discrete) result is encoded: one-hot encoding ('onehot', the default, or 'onehot-dense') or plain integer codes ('ordinal').
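
As a rough illustration of the encode options, the sketch below bins the same INCOME values three times with the uniform strategy and compares the shape of the results. The variable names are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

income = np.array([0, 10, 20, 150, 35, 78, 50, 49, 88, 14]).reshape(-1, 1)

# One integer bin index per sample
ordinal = KBinsDiscretizer(n_bins=3, encode="ordinal",
                           strategy="uniform").fit_transform(income)
# One 0/1 indicator column per bin, returned as a dense array
dense = KBinsDiscretizer(n_bins=3, encode="onehot-dense",
                         strategy="uniform").fit_transform(income)
# Same indicator columns, returned as a SciPy sparse matrix
sparse = KBinsDiscretizer(n_bins=3, encode="onehot",
                          strategy="uniform").fit_transform(income)

print(ordinal.shape)
print(dense.shape)
print(sparse.shape)
```

The ordinal result keeps a single column, which is why the rest of this article uses encode="ordinal": the labels can be attached to the DataFrame directly.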

KBinsDiscretizer expects 2-D input (one column per feature), so the DataFrame column has to be reshaped first:

In [3]:

income = np.array(df["INCOME"].tolist()).reshape(-1,1)
income

Out[3]:

array([[  0],
       [ 10],
       [ 20],
       [150],
       [ 35],
       [ 78],
       [ 50],
       [ 49],
       [ 88],
       [ 14]])
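
An alternative way to get the same 2-D array, shown here only as a sketch, is to select the column with a list of names so that pandas returns a one-column DataFrame. The names `income_a` and `income_b` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"ID": range(10),
                   "INCOME": [0, 10, 20, 150, 35, 78, 50, 49, 88, 14]})

# Approach used in this article: list -> 1-D array -> reshape to a column
income_a = np.array(df["INCOME"].tolist()).reshape(-1, 1)

# Alternative: double brackets keep the result 2-D
income_b = df[["INCOME"]].to_numpy()

print(income_b.shape)  # (10, 1)
```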

Import the class before using it:

In [4]:

from sklearn.preprocessing import KBinsDiscretizer

Equal-width binning

Equal-width binning divides the range of the data into several intervals of the same width. In the simulated data, INCOME ranges from 0 to 150. Dividing this range into 3 equal parts gives the intervals [0,50), [50,100), [100,150].
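
The interval arithmetic can be checked by hand with NumPy. This is only a sketch of the idea, not the way KBinsDiscretizer is implemented internally; the variable names are illustrative.

```python
import numpy as np

income = np.array([0, 10, 20, 150, 35, 78, 50, 49, 88, 14], dtype=float)
n_bins = 3

# n_bins + 1 evenly spaced edges between the minimum and the maximum
edges = np.linspace(income.min(), income.max(), n_bins + 1)

# Assign each value to a bin using only the interior edges;
# bins are closed on the left, and the maximum falls in the last bin
labels = np.digitize(income, edges[1:-1])

print(edges)   # [  0.  50. 100. 150.]
print(labels)  # [0 0 0 2 0 1 1 0 1 0]
```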

In [5]:

from sklearn.preprocessing import KBinsDiscretizer

dis = KBinsDiscretizer(n_bins=3,
                       encode="ordinal",
                       strategy="uniform"
                      )
dis

Out[5]:

KBinsDiscretizer(encode='ordinal', n_bins=3, strategy='uniform')

In [6]:

label_uniform = dis.fit_transform(income)  # fit the discretizer and transform the data
label_uniform

Out[6]:

array([[0.],
       [0.],
       [0.],
       [2.],
       [0.],
       [1.],
       [1.],
       [0.],
       [1.],
       [0.]])

Check the bin edges of the equal-width bins:

In [7]:

dis.bin_edges_

Out[7]:

array([array([  0.,  50., 100., 150.])], dtype=object)

In [8]:

dis.n_bins

Out[8]:

3

Equal-frequency binning

Equal-frequency binning means that every bin contains (roughly) the same number of values. The difference from equal-width binning:

Equal-frequency binning: every interval contains the same number of values, pd.qcut
Equal-width binning: every interval has the same width, pd.cut
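
The two pandas functions mentioned in the list can be compared on the simulated INCOME values. The sketch below counts how many samples land in each bin; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

income = pd.Series([0, 10, 20, 150, 35, 78, 50, 49, 88, 14], name="INCOME")

# Equal width: 3 intervals of the same length over the range 0-150
width_codes = pd.cut(income, bins=3, labels=False)

# Equal frequency: 2 intervals split at the median
freq_codes = pd.qcut(income, q=2, labels=False)

print(np.bincount(width_codes))  # [7 2 1]
print(np.bincount(freq_codes))   # [5 5]
```

Note that pd.cut and pd.qcut close intervals on the right by default, so a value that sits exactly on an edge (50 here) goes to the lower bin, whereas the KBinsDiscretizer results in this article place it in the upper bin.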

To see how equal-frequency binning works, first sort the data in ascending order and then take the median as the split point:

In [9]:

# 1. Sort the values first
sort_df = sorted(df["INCOME"])
sort_df

Out[9]:

[0, 10, 14, 20, 35, 49, 50, 78, 88, 150]

Splitting into 2 bins

In [10]:

# 2. Median: the mean of the two middle values, 35 and 49
(35 + 49) / 2

Out[10]:

42.0

Equal-frequency binning with 2 bins therefore uses 42 as the boundary:

In [11]:

dis = KBinsDiscretizer(n_bins=2,
                       encode="ordinal",
                       strategy="quantile"
                      )

dis.fit_transform(income)  # fit the discretizer and transform the data

Out[11]:

array([[0.],
       [0.],
       [0.],
       [1.],
       [0.],
       [1.],
       [1.],
       [1.],
       [1.],
       [0.]])

In [12]:

dis.bin_edges_

Out[12]:

array([array([  0.,  42., 150.])], dtype=object)

Splitting into 3 bins

There are 10 elements in total to split into 3 bins: 10 / 3 = 3 remainder 1. Here the first two bins hold 3 elements each and the last bin holds 4, so the last bin absorbs the remainder:

In [13]:

dis = KBinsDiscretizer(n_bins=3,
                       encode="ordinal",
                       strategy="quantile"
                      )

label_quantile = dis.fit_transform(income)  # fit the discretizer and transform the data
label_quantile

Out[13]:

array([[0.],
       [0.],
       [1.],
       [2.],
       [1.],
       [2.],
       [2.],
       [1.],
       [2.],
       [0.]])

In [14]:

dis.bin_edges_  # bin edges

Out[14]:

array([array([  0.,  20.,  50., 150.])], dtype=object)

In [15]:

sort_df  # the sorted data

Out[15]:

[0, 10, 14, 20, 35, 49, 50, 78, 88, 150]
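
The boundaries in Out[14] can be cross-checked against the sorted data with NumPy. The sketch below assumes the default linear interpolation of np.percentile; the exact quantile method used by KBinsDiscretizer may differ between sklearn versions, so treat this as an approximation of what the class does rather than its implementation.

```python
import numpy as np

income = np.array([0, 10, 20, 150, 35, 78, 50, 49, 88, 14], dtype=float)
n_bins = 3

# Evenly spaced percentiles: 0, 33.3..., 66.6..., 100
percentiles = np.linspace(0, 100, n_bins + 1)

# Bin edges are the values of the data at those percentiles
edges = np.percentile(income, percentiles)

print(edges)
```

With 10 sorted values, the 1/3 and 2/3 positions fall on the 4th and 7th elements (20 and 50), which agrees with the edges reported in Out[14].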

Clustering-based binning

Clustering-based binning first clusters the continuous variable and then replaces each original value with the label of the cluster the sample belongs to.

In [16]:

from sklearn import cluster

In [17]:

kmeans = cluster.KMeans(n_clusters=3)

kmeans.fit(income)

Out[17]:

KMeans(n_clusters=3)

After clustering, the cluster each sample belongs to:

In [18]:

kmeans.labels_

Out[18]:

array([1, 1, 1, 2, 1, 0, 0, 0, 0, 1], dtype=int32)

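Using KBinsDiscretizer to implement clustering-based binning. Note that the cluster ids returned by KMeans are arbitrary, whereas KBinsDiscretizer returns bin indices ordered by value: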

In [19]:

dis = KBinsDiscretizer(n_bins=3,
                       encode="ordinal",
                       strategy="kmeans"
                      )

label_kmeans = dis.fit_transform(income)  # fit the discretizer and transform the data
label_kmeans

Out[19]:

array([[0.],
       [0.],
       [0.],
       [2.],
       [0.],
       [1.],
       [0.],
       [0.],
       [1.],
       [0.]])

In [20]:

dis.bin_edges_  # bin edges

Out[20]:

array([array([  0.        ,  54.21428571, 116.5       , 150.        ])],
      dtype=object)
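
The edges in Out[20] are consistent with taking the midpoints between neighbouring cluster centers. The sketch below checks this with plain NumPy, starting from the bin labels in Out[19]. That the kmeans strategy derives its edges this way is an assumption about sklearn's behaviour, not something stated in this article.

```python
import numpy as np

income = np.array([0, 10, 20, 150, 35, 78, 50, 49, 88, 14], dtype=float)
# Bin labels copied from Out[19]
labels = np.array([0, 0, 0, 2, 0, 1, 0, 0, 1, 0])

# Center of each bin = mean of the values assigned to it, in ascending order
centers = np.sort([income[labels == k].mean() for k in range(3)])

# Interior edges = midpoints between neighbouring centers;
# the outer edges are the minimum and maximum of the data
inner = (centers[:-1] + centers[1:]) / 2
edges = np.concatenate(([income.min()], inner, [income.max()]))

print(centers)
print(edges)
```

The centers come out at about 25.43, 83 and 150, and the midpoints at about 54.21 and 116.5, matching the reported bin edges.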

Comparison of the 3 methods

In [21]:

df["label_uniform"] = label_uniform
df["label_quantile"] = label_quantile
df["label_kmeans"] = label_kmeans

df

References

Overview of feature discretization (binning): https://zhuanlan.zhihu.com/p/68865422
Book: 《特征工程入门与实践》
sklearn official website
