[machine learning] Note 4. KNN + cross validation
2022-04-23 13:41:00 【Ruo Xiaoyu】
KNN Classification model
- Concept: In short, the k-nearest neighbor algorithm (k-Nearest Neighbor, KNN) classifies a sample by measuring the distance between its feature values and those of known samples
- The role of the k value
- Euclidean distance (see the sketch below)
- The k-nearest neighbor algorithm in the scikit-learn library
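To make the distance measure concrete, here is a minimal sketch (not from the original notes; the two sample vectors are made up) of the Euclidean distance between two feature vectors:

# Minimal sketch: Euclidean distance, the metric KNN uses to rank neighbors
# (illustrative only; a and b are made-up iris-like feature vectors)
import numpy as np

a = np.array([6.1, 2.8, 4.7, 1.2])
b = np.array([5.7, 3.8, 1.7, 0.3])
# d(a, b) = sqrt(sum_i (a_i - b_i)^2)
distance = np.sqrt(np.sum((a - b) ** 2))
print(distance)  # KNN classifies by a majority vote among the k nearest samples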
# Iris classification implementation
import sklearn.datasets as ds
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

if __name__ == '__main__':
    # 1. Load the iris dataset
    iris = ds.load_iris()
    # 2. Extract the sample data
    feature = iris['data']    # feature data
    target = iris['target']   # label data
    # 3. Split the dataset
    x_train,x_test,y_train,y_test = train_test_split(feature,target,test_size=0.2,random_state=2021)
    # 4. Inspect the dataset: check whether feature engineering is needed
    x_train.shape
    # 5. Instantiate the model object (different k values in KNN directly lead to different classification results)
    # Hyperparameter: a model parameter whose value directly affects the model's classification or prediction results
    knn = KNeighborsClassifier(n_neighbors=3)  # n_neighbors == k
    # 6. Train the model on the training set
    # X: feature data of the training set; must be two-dimensional
    # y: label data of the training set
    knnModel = knn.fit(x_train,y_train)
    print(knnModel)
    # 7. Test the model using the test data
    y_pred = knnModel.predict(x_test)  # classification results the model predicts from the test data
    y_true = y_test                    # true classification results of the test set
    # 7.1 Compare the model's predictions with the true labels
    # Model predictions: [0 0 1 0 0 0 0 0 0 0 0 1 2 2 1 2 1 1 0 1 1 2 2 0 2 1 1 2 0 0]
    # True labels:       [0 0 1 0 0 0 0 0 0 0 0 1 2 2 1 2 1 1 0 1 1 2 2 0 2 1 1 1 0 0]
    print('Model predictions:',y_pred)
    print('True labels:',y_true)
    # 7.2 Compute the model's prediction accuracy
    # Argument X: feature data of the test set
    # Argument y: label data of the test set
    score = knnModel.score(x_test,y_test)
    # Model prediction accuracy
    # score = 0.9666666666666667
    print('score=',score)
    # 8. Use the model to predict new samples
    targetResult = knnModel.predict([[6.1,5.1,4.5,3.6],[2.1,3.1,4.5,5.6]])
    # Predicted results: [2 2]
    print('Predicted results:',targetResult)
- How to choose the best k value
- Use a learning curve to find the best k value
# Find the best k value through a learning curve
import numpy as np
import matplotlib.pyplot as plt
# %matplotlib inline

scores = []
ks = []
# For example, try k values from 1 to 49
for i in range(1,50):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(x_train,y_train)            # train the model
    score = knn.score(x_test,y_test)    # compute the model's prediction accuracy
    ks.append(i)
    scores.append(score)
scores_arr = np.array(scores)
ks_arr = np.array(ks)
plt.plot(ks_arr,scores_arr)
plt.xlabel('k_value')
plt.ylabel('score')
plt.show()
K-fold cross validation
- Purpose:
    - Select the most suitable value of a model hyperparameter, then use that value when building the model.
- Idea:
    - Repeatedly split the sample's training data into different training and validation sets, measure the model's accuracy on each split, and take the average of those accuracies as the cross-validation result. Run this cross validation for each candidate hyperparameter value, and choose the value with the highest accuracy as the model's hyperparameter.
- Implementation idea (a concrete sketch follows this list)
    - Divide the training data evenly into K equal parts
    - Use one part as validation data and the rest as training data
    - Compute the validation accuracy
    - Repeat steps 2 and 3, using a different part as the validation set each time
    - Average the accuracies as an estimate of the model's prediction accuracy on unseen data
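As a concrete illustration of these steps, here is a minimal manual sketch (not part of the original notes; scikit-learn's cross_val_score automates all of this), splitting the iris data into K parts with numpy:

# Minimal manual k-fold sketch (illustrative; cross_val_score automates this)
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
K = 5
# Shuffle the indices and split them into K roughly equal parts
indices = np.array_split(np.random.permutation(len(X)), K)

scores = []
for i in range(K):
    val_idx = indices[i]                 # one part as validation data
    train_idx = np.concatenate([indices[j] for j in range(K) if j != i])  # the rest as training data
    knn = KNeighborsClassifier(n_neighbors=5)
    knn.fit(X[train_idx], y[train_idx])
    scores.append(knn.score(X[val_idx], y[val_idx]))  # validation accuracy
print(np.mean(scores))  # the average accuracy is the cross-validation result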
- API
    - from sklearn.model_selection import cross_val_score
    - cross_val_score(estimator,X,y,cv)
        - estimator: the model object
        - X, y: training set data
        - cv: the number of folds
- Basic use of cross validation with KNN
# Cross validation
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
import sklearn.datasets as ds

if __name__ == '__main__':
    # 1. Load the iris dataset
    iris = ds.load_iris()
    # 2. Extract the sample data
    feature = iris['data']    # feature data
    target = iris['target']   # label data
    # 3. Split the dataset
    x_train,x_test,y_train,y_test = train_test_split(feature,target,test_size=0.2,random_state=2021)
    # 4. Inspect the dataset: check whether feature engineering is needed
    x_train.shape
    # 5. Instantiate the model object (different k values directly lead to different classification results)
    knn = KNeighborsClassifier(n_neighbors=5)  # n_neighbors == k
    # Cross-validate on the training set
    crossScore = cross_val_score(knn,x_train,y_train,cv=5).mean()
    print(crossScore)
- Use cross validation & the learning curve to find the optimal hyperparameter
# Learning curve & cross validation
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
import sklearn.datasets as ds
import matplotlib.pyplot as plt
import numpy as np

if __name__ == '__main__':
    # 1. Load the iris dataset
    iris = ds.load_iris()
    # 2. Extract the sample data
    feature = iris['data']    # feature data
    target = iris['target']   # label data
    # 3. Split the dataset
    x_train, x_test, y_train, y_test = train_test_split(feature, target, test_size=0.2, random_state=2020)
    scores = []
    ks = []
    for k in range(3,20):
        knn = KNeighborsClassifier(n_neighbors=k)
        cross_score = cross_val_score(knn,x_train,y_train,cv=6).mean()
        scores.append(cross_score)
        ks.append(k)
    ks_arr = np.array(ks)
    scores_arr = np.array(scores)
    plt.plot(ks_arr, scores_arr)
    plt.xlabel('k_value')
    plt.ylabel('score')
    plt.show()
    # Index of the maximum score
    max_idx = scores_arr.argmax()
    # k value corresponding to the maximum score
    max_k = ks[max_idx]
    # Index of the maximum: 4
    print('Index of the maximum:',max_idx)
    # k value of the maximum: 7
    print('k value of the maximum:',max_k)
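A natural follow-up (not shown in the original notes) is to refit the model with the best k found above and evaluate it once on the held-out test set:

# Sketch: retrain with the best k from cross validation, then check the test set
# (assumes x_train, y_train, x_test, y_test and max_k from the script above)
best_knn = KNeighborsClassifier(n_neighbors=max_k)
best_knn.fit(x_train, y_train)
print('test accuracy:', best_knn.score(x_test, y_test))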
- Cross validation can also help us with model selection
# Model selection with cross validation
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
import sklearn.datasets as ds

if __name__ == '__main__':
    # 1. Load the iris dataset
    iris = ds.load_iris()
    # 2. Extract the sample data
    feature = iris['data']    # feature data
    target = iris['target']   # label data
    # 3. Split the dataset
    x_train, x_test, y_train, y_test = train_test_split(feature, target, test_size=0.2, random_state=2020)
    # Model selection using cross validation
    # KNN model
    knn = KNeighborsClassifier(n_neighbors=7)
    # KNN model accuracy: 0.9916666666666666
    print('KNN model accuracy',cross_val_score(knn,x_train,y_train,cv=10).mean())
    # LR model
    lr = LogisticRegression()
    # LR model accuracy: 0.9833333333333332
    print('LR model accuracy',cross_val_score(lr,x_train,y_train,cv=10).mean())
K-Fold (for reference)
- The KFold API provided by scikit-learn
    - n_splits is the number of folds
    - shuffle specifies whether to shuffle the data
    - random_state is the random seed, which fixes the randomness
from numpy import array
from sklearn.model_selection import KFold

if __name__ == '__main__':
    data = array([0.1,0.2,0.3,0.4,0.5,0.6])
    kfold = KFold(n_splits=3,shuffle=True,random_state=1)
    for train,test in kfold.split(data):
        # train: [0.1 0.4 0.5 0.6],test:[0.2 0.3]
        # train: [0.2 0.3 0.4 0.6],test:[0.1 0.5]
        # train: [0.1 0.2 0.3 0.5],test:[0.4 0.6]
        print('train: %s,test:%s' % (data[train],data[test]))
- scikit-learn's cross-validation interface, sklearn.model_selection.cross_val_score, does not shuffle the data itself, so it is generally combined with KFold. If the training data was already shuffled before splitting, for example by train_test_split, then cross_val_score can be used directly.
# Cross validation combined with KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
import sklearn.datasets as ds
from sklearn.neighbors import KNeighborsClassifier

iris = ds.load_iris()
X,y = iris.data,iris.target
knn = KNeighborsClassifier(n_neighbors=5)
n_folds = 5
# Randomly shuffle and split the training data.
# Pass the KFold object itself as cv; calling get_n_splits(X) would only
# return the integer 5 and discard the shuffling.
kf = KFold(n_splits=n_folds,shuffle=True,random_state=42)
# Cross validation
scores = cross_val_score(knn,X,y,cv=kf)
print(scores.mean())
Copyright notice
This article was written by [Ruo Xiaoyu]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/04/202204230602187134.html