[Machine Learning] Note 4: KNN + Cross Validation
2022-04-23 13:41:00 【Ruo Xiaoyu】
KNN Classification model
- Concept: in short, the k-nearest neighbor algorithm (k-Nearest Neighbor, KNN) classifies a sample by measuring the distances between its feature values and those of known samples.
- The role of the k value: k is the number of nearest neighbors consulted; the sample is assigned the majority class among them, so different k values can lead directly to different classification results.
- Euclidean distance: the distance between two feature vectors x and y is d(x, y) = sqrt((x1 - y1)^2 + ... + (xn - yn)^2).
- The k-nearest neighbor algorithm in the scikit-learn library:
# Iris classification implementation
import sklearn.datasets as ds
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

if __name__ == '__main__':
    # 1. Load the iris dataset
    iris = ds.load_iris()
    # 2. Extract the sample data
    feature = iris['data']   # feature data
    target = iris['target']  # label data
    # 3. Split the dataset
    x_train, x_test, y_train, y_test = train_test_split(feature, target, test_size=0.2, random_state=2021)
    # 4. Inspect the data: decide whether any feature engineering is needed
    x_train.shape
    # 5. Instantiate the model object (different k values in KNN directly lead to different classification results)
    # Hyperparameter: a model parameter whose value directly affects the model's classification or prediction results
    knn = KNeighborsClassifier(n_neighbors=3)  # n_neighbors == k
    # 6. Train the model on the training set
    # X: training-set feature data; must be two-dimensional
    # y: training-set label data
    knnModel = knn.fit(x_train, y_train)
    print(knnModel)
    # 7. Test the model with the test data
    y_pred = knnModel.predict(x_test)  # classes predicted by the model for the test data
    y_true = y_test                    # true classes of the test set
    # 7.1 Compare the model's predictions with the true classes
    # Predicted classes: [0 0 1 0 0 0 0 0 0 0 0 1 2 2 1 2 1 1 0 1 1 2 2 0 2 1 1 2 0 0]
    # True classes:      [0 0 1 0 0 0 0 0 0 0 0 1 2 2 1 2 1 1 0 1 1 2 2 0 2 1 1 1 0 0]
    print('Predicted classes:', y_pred)
    print('True classes:', y_true)
    # 7.2 Compute the model's prediction accuracy
    # Argument X: test-set feature data
    # Argument y: test-set label data
    score = knnModel.score(x_test, y_test)
    # Model accuracy: score = 0.9666666666666667
    print('score =', score)
    # 8. Use the model to predict new data
    targetResult = knnModel.predict([[6.1, 5.1, 4.5, 3.6], [2.1, 3.1, 4.5, 5.6]])
    # Predicted result: [2 2]
    print('Predicted result:', targetResult)
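For intuition, here is a minimal from-scratch sketch of what the classifier does for a single sample, assuming plain Euclidean distance and an unweighted majority vote (knn_predict and its arguments are illustrative names, not a library API):
# From-scratch sketch of a single KNN prediction
import numpy as np
from collections import Counter

def knn_predict(x_train, y_train, x_new, k=3):
    # Euclidean distance from the new point to every training sample
    dists = np.sqrt(((x_train - x_new) ** 2).sum(axis=1))
    # Indices of the k closest training samples
    nearest = np.argsort(dists)[:k]
    # Unweighted majority vote over their labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

# e.g. knn_predict(x_train, y_train, np.array([6.1, 5.1, 4.5, 3.6]), k=3)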
- How to choose the best k value
- Use a learning curve to find the best k value:
# Find the best k value through a learning curve
# (reuses x_train, x_test, y_train, y_test from the block above)
import numpy as np
scores = []
ks = []
# Try k values from 1 to 49
for i in range(1, 50):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(x_train, y_train)          # train the model
    score = knn.score(x_test, y_test)  # compute the model's accuracy
    ks.append(i)
    scores.append(score)
scores_arr = np.array(scores)
ks_arr = np.array(ks)
# %matplotlib inline
import matplotlib.pyplot as plt
plt.plot(ks_arr, scores_arr)
plt.xlabel('k_value')
plt.ylabel('score')
plt.show()
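Note that this learning curve scores every k on the same held-out test set, so the chosen k is tuned to that particular split; the cross validation introduced next gives a more reliable estimate.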

K-fold cross validation
- Purpose:
  - Select the most suitable value for a model hyperparameter, then apply that value when creating the model.
- Idea:
  - Repeatedly split the training data into different training and validation sets, test the model's accuracy on each split, and take the average of those accuracies as the cross-validation result. Run cross validation with different hyperparameter values and select the value with the highest accuracy as the model's hyperparameter.
- Implementation steps (a minimal manual sketch follows this list):
  - Divide the dataset into k equal parts
  - Use one part as the test data and the rest as the training data
  - Compute the test accuracy
  - Repeat steps 2 and 3 with a different part as the test set each time
  - Average the accuracies as an estimate of the model's accuracy on unseen data
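To make the procedure concrete, a minimal manual sketch, assuming 5 folds over the iris data (k_folds and fold_ids are illustrative names, not a library API):
# Manual k-fold cross validation on iris
import numpy as np
import sklearn.datasets as ds
from sklearn.neighbors import KNeighborsClassifier

iris = ds.load_iris()
X, y = iris.data, iris.target
k_folds = 5
fold_ids = np.arange(len(X)) % k_folds  # assign each sample to a fold, round-robin

scores = []
for fold in range(k_folds):
    train_mask = fold_ids != fold  # every fold except the current one
    knn = KNeighborsClassifier(n_neighbors=5)
    knn.fit(X[train_mask], y[train_mask])                     # train on k-1 folds
    scores.append(knn.score(X[~train_mask], y[~train_mask]))  # test on the held-out fold
# The cross-validation result is the average accuracy
print(np.mean(scores))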

- API
  - from sklearn.model_selection import cross_val_score
  - cross_val_score(estimator, X, y, cv)
    - estimator: the model object
    - X, y: the training-set data
    - cv: the number of folds
# Cross validation
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
import sklearn.datasets as ds

if __name__ == '__main__':
    # 1. Load the iris dataset
    iris = ds.load_iris()
    # 2. Extract the sample data
    feature = iris['data']   # feature data
    target = iris['target']  # label data
    # 3. Split the dataset
    x_train, x_test, y_train, y_test = train_test_split(feature, target, test_size=0.2, random_state=2021)
    # 4. Inspect the data: decide whether any feature engineering is needed
    x_train.shape
    # 5. Instantiate the model object (different k values in KNN directly lead to different classification results)
    knn = KNeighborsClassifier(n_neighbors=5)  # n_neighbors == k
    # Cross-validate on the training set: cross_val_score returns one accuracy
    # per fold, and .mean() averages them into a single score
    crossScore = cross_val_score(knn, x_train, y_train, cv=5).mean()
    print(crossScore)
- Use cross validation & a learning curve to find the optimal hyperparameter:
# Learning curve & cross validation
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
import sklearn.datasets as ds
import matplotlib.pyplot as plt
import numpy as np

if __name__ == '__main__':
    # 1. Load the iris dataset
    iris = ds.load_iris()
    # 2. Extract the sample data
    feature = iris['data']   # feature data
    target = iris['target']  # label data
    # 3. Split the dataset
    x_train, x_test, y_train, y_test = train_test_split(feature, target, test_size=0.2, random_state=2020)
    scores = []
    ks = []
    for k in range(3, 20):
        knn = KNeighborsClassifier(n_neighbors=k)
        cross_score = cross_val_score(knn, x_train, y_train, cv=6).mean()
        scores.append(cross_score)
        ks.append(k)
    ks_arr = np.array(ks)
    scores_arr = np.array(scores)
    plt.plot(ks_arr, scores_arr)
    plt.xlabel('k_value')
    plt.ylabel('score')
    plt.show()
    # Index of the maximum score
    max_idx = scores_arr.argmax()
    # The k value corresponding to the maximum score
    max_k = ks[max_idx]
    # Index of the maximum: 4
    print('Index of the maximum:', max_idx)
    # k value at the maximum: 7
    print('k value at the maximum:', max_k)
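A step the block above leaves implicit, sketched here as a continuation: once max_k has been selected by cross validation, refit on the whole training set and score once against the held-out test set.
    # Continuation of the block above (reuses x_train, y_train, x_test, y_test, max_k)
    best_knn = KNeighborsClassifier(n_neighbors=max_k)
    best_knn.fit(x_train, y_train)
    print('test accuracy with k = %d:' % max_k, best_knn.score(x_test, y_test))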

- Cross validation can also help us with model selection:
# Model selection with cross validation
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
import sklearn.datasets as ds

if __name__ == '__main__':
    # 1. Load the iris dataset
    iris = ds.load_iris()
    # 2. Extract the sample data
    feature = iris['data']   # feature data
    target = iris['target']  # label data
    # 3. Split the dataset
    x_train, x_test, y_train, y_test = train_test_split(feature, target, test_size=0.2, random_state=2020)
    # 4. Compare candidate models using cross validation
    # KNN model
    knn = KNeighborsClassifier(n_neighbors=7)
    # KNN model accuracy: 0.9916666666666666
    print('KNN model accuracy', cross_val_score(knn, x_train, y_train, cv=10).mean())
    # LR model
    lr = LogisticRegression()
    # LR model accuracy: 0.9833333333333332
    print('LR model accuracy', cross_val_score(lr, x_train, y_train, cv=10).mean())
K-Fold (for reference)
- The KFold API provided by scikit-learn:
  - n_splits: the number of folds
  - shuffle: whether to shuffle the data before splitting
  - random_state: the random seed, which fixes the randomness
from numpy import array
from sklearn.model_selection import KFold

if __name__ == '__main__':
    data = array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6])
    kfold = KFold(n_splits=3, shuffle=True, random_state=1)
    for train, test in kfold.split(data):
        # train: [0.1 0.4 0.5 0.6], test: [0.2 0.3]
        # train: [0.2 0.3 0.4 0.6], test: [0.1 0.5]
        # train: [0.1 0.2 0.3 0.5], test: [0.4 0.6]
        print('train: %s, test: %s' % (data[train], data[test]))
- scikit-learn's cross-validation interface is sklearn.model_selection.cross_val_score, but the interface does not shuffle the data itself, so it is generally combined with KFold. If the training data has already been shuffled before grouping, for example by train_test_split, then cross_val_score can be used directly.
# Cross validation combined with KFold
from sklearn.model_selection import cross_val_score, KFold
import sklearn.datasets as ds
from sklearn.neighbors import KNeighborsClassifier

iris = ds.load_iris()
X, y = iris.data, iris.target
knn = KNeighborsClassifier(n_neighbors=5)
n_folds = 5
# Randomly split the data: pass the KFold object itself as cv.
# (Calling .get_n_splits(X) here would return only the integer 5,
# so the shuffling and random_state would be lost.)
kf = KFold(n_splits=n_folds, shuffle=True, random_state=42)
# Cross validation
scores = cross_val_score(knn, X, y, cv=kf)
print(scores.mean())
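Passing the KFold object as cv is what makes the shuffle and random_state settings take effect; with a plain integer cv, scikit-learn builds its own unshuffled folds.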
Copyright notice
This article was created by [Ruo Xiaoyu]. When reposting, please include a link to the original. Thanks.
https://yzsam.com/2022/04/202204230602187134.html