[Machine Learning] Note 4: KNN + Cross Validation
2022-04-23 13:41:00 [Ruo Xiaoyu]
KNN classification model
- Concept: in short, the k-nearest neighbor algorithm (k-Nearest Neighbor, KNN) classifies a sample by measuring the distances between its feature values and those of labeled samples
- The role of the k value
- Euclidean distance (a quick sketch follows this list)
- The k-nearest neighbor algorithm in scikit-learn
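To make the distance measure concrete, here is a minimal sketch (my addition, not from the original post) computing the Euclidean distance between two 4-dimensional feature vectors with NumPy; the sample values are made up for illustration:

# Euclidean distance between two feature vectors (illustrative sketch)
import numpy as np

a = np.array([5.1, 3.5, 1.4, 0.2])  # e.g. one iris-like sample
b = np.array([6.1, 2.8, 4.7, 1.2])  # another sample
dist = np.sqrt(np.sum((a - b) ** 2))  # definition of Euclidean distance
dist2 = np.linalg.norm(a - b)         # equivalent one-liner
print(dist, dist2)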
# Iris classification with KNN
import sklearn.datasets as ds
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

if __name__ == '__main__':
    # 1. Load the iris dataset
    iris = ds.load_iris()
    # 2. Extract the sample data
    feature = iris['data']    # feature data
    target = iris['target']   # label data
    # 3. Split the dataset
    x_train, x_test, y_train, y_test = train_test_split(feature, target, test_size=0.2, random_state=2021)
    # 4. Inspect the data: decide whether feature engineering is needed
    print(x_train.shape)
    # 5. Instantiate the model object (different k values in KNN lead directly to different classification results)
    # A hyperparameter is a model parameter whose value directly affects the model's classification or prediction results
    knn = KNeighborsClassifier(n_neighbors=3)  # n_neighbors == k
    # 6. Train the model on the training set
    # X: training-set feature data; it must be two-dimensional
    # y: training-set label data
    knnModel = knn.fit(x_train, y_train)
    print(knnModel)
    # 7. Test the model on the test data
    y_pred = knnModel.predict(x_test)  # labels the model predicts from the test data
    y_true = y_test                    # true labels of the test set
    # 7.1 Compare the model's predictions with the true labels
    # Predicted: [0 0 1 0 0 0 0 0 0 0 0 1 2 2 1 2 1 1 0 1 1 2 2 0 2 1 1 2 0 0]
    # True:      [0 0 1 0 0 0 0 0 0 0 0 1 2 2 1 2 1 1 0 1 1 2 2 0 2 1 1 1 0 0]
    print('Predicted labels:', y_pred)
    print('True labels:', y_true)
    # 7.2 Compute the model's prediction accuracy
    # X: test-set feature data
    # y: test-set label data
    score = knnModel.score(x_test, y_test)
    # score = 0.9666666666666667
    print('score =', score)
    # 8. Use the model to predict new samples
    targetResult = knnModel.predict([[6.1, 5.1, 4.5, 3.6], [2.1, 3.1, 4.5, 5.6]])
    # Predicted: [2 2]
    print('Predicted labels:', targetResult)
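Since KNN is distance-based, the "feature engineering" check in step 4 often comes down to scaling: features with larger numeric ranges dominate the distance. A minimal sketch (my addition, not in the original) standardizing the features with scikit-learn's StandardScaler, assuming the split from the block above:

# Optional: standardize features before KNN (illustrative sketch)
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
x_train_std = scaler.fit_transform(x_train)  # fit scaling statistics on training data only
x_test_std = scaler.transform(x_test)        # reuse the same statistics on the test set
knn_std = KNeighborsClassifier(n_neighbors=3).fit(x_train_std, y_train)
print('score (scaled) =', knn_std.score(x_test_std, y_test))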
- How to choose the best k value
- Use a learning curve to find the best k value
# Find the best k value with a learning curve
import numpy as np
scores = []
ks = []
# Try k values from 1 to 49
for i in range(1, 50):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(x_train, y_train)           # train the model
    score = knn.score(x_test, y_test)   # compute the model's prediction accuracy
    ks.append(i)
    scores.append(score)
scores_arr = np.array(scores)
ks_arr = np.array(ks)
# %matplotlib inline
import matplotlib.pyplot as plt
plt.plot(ks_arr, scores_arr)
plt.xlabel('k_value')
plt.ylabel('score')
plt.show()
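Rather than reading the peak off the plot by eye, the best k can be pulled out of the arrays directly; a small follow-up sketch (my addition, using the arrays built above):

# Pick the k with the highest accuracy from the learning curve
best_k = ks_arr[scores_arr.argmax()]
print('best k =', best_k, 'score =', scores_arr.max())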
K-fold cross validation
- Purpose:
	- Select the most suitable value for a model hyperparameter, then use that value when creating the model.
- Idea:
	- Repeatedly split the sample's training data into different training and validation sets, measure the model's accuracy on each split, and average those accuracies to obtain the cross-validation result. Run this cross validation for each candidate hyperparameter value and pick the value with the highest accuracy as the model's hyperparameter.
- Implementation steps (a minimal manual sketch follows this list)
	- Split the dataset evenly into k folds
	- Use one fold as the test data and the remaining folds as training data
	- Compute the test accuracy
	- Repeat steps 2 and 3 with a different fold as the test set each time
	- Average the accuracies as an estimate of the model's accuracy on unseen data
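To make the procedure concrete, here is a minimal manual k-fold sketch (my own illustration, not the library API) that follows the steps above on the iris data:

# Manual k-fold cross validation following the steps above (illustrative)
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
k_folds = 5
indices = np.arange(len(X))
np.random.default_rng(0).shuffle(indices)   # shuffle once up front
folds = np.array_split(indices, k_folds)    # split evenly into k folds

scores = []
for i in range(k_folds):
    test_idx = folds[i]                     # one fold as validation data
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    model = KNeighborsClassifier(n_neighbors=5).fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))  # accuracy on this fold

print('mean accuracy:', np.mean(scores))    # average as the final estimate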
- API
	- from sklearn.model_selection import cross_val_score
	- cross_val_score(estimator, X, y, cv)
		- estimator: the model object
		- X, y: training-set data
		- cv: the number of folds
- Basic use of cross validation with KNN
# Cross validation
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
import sklearn.datasets as ds

if __name__ == '__main__':
    # 1. Load the iris dataset
    iris = ds.load_iris()
    # 2. Extract the sample data
    feature = iris['data']    # feature data
    target = iris['target']   # label data
    # 3. Split the dataset
    x_train, x_test, y_train, y_test = train_test_split(feature, target, test_size=0.2, random_state=2021)
    # 4. Inspect the data: decide whether feature engineering is needed
    print(x_train.shape)
    # 5. Instantiate the model object (different k values in KNN lead directly to different results)
    knn = KNeighborsClassifier(n_neighbors=5)  # n_neighbors == k
    # Cross-validate on the training set
    crossScore = cross_val_score(knn, x_train, y_train, cv=5).mean()
    print(crossScore)
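Note that cross_val_score returns one score per fold and .mean() collapses them; to see the spread across folds, a small follow-up sketch (my addition, continuing the block above):

# Inspect the per-fold scores instead of only the mean
fold_scores = cross_val_score(knn, x_train, y_train, cv=5)
print('per-fold:', fold_scores, 'mean:', fold_scores.mean(), 'std:', fold_scores.std())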
- Use cross validation and a learning curve to find the best hyperparameter
# Learning curve & cross validation
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
import sklearn.datasets as ds
import matplotlib.pyplot as plt
import numpy as np

if __name__ == '__main__':
    # 1. Load the iris dataset
    iris = ds.load_iris()
    # 2. Extract the sample data
    feature = iris['data']    # feature data
    target = iris['target']   # label data
    # 3. Split the dataset
    x_train, x_test, y_train, y_test = train_test_split(feature, target, test_size=0.2, random_state=2020)
    scores = []
    ks = []
    for k in range(3, 20):
        knn = KNeighborsClassifier(n_neighbors=k)
        cross_score = cross_val_score(knn, x_train, y_train, cv=6).mean()
        scores.append(cross_score)
        ks.append(k)
    ks_arr = np.array(ks)
    scores_arr = np.array(scores)
    plt.plot(ks_arr, scores_arr)
    plt.xlabel('k_value')
    plt.ylabel('score')
    plt.show()
    # Index of the maximum score
    max_idx = scores_arr.argmax()
    # k value corresponding to the maximum score
    max_k = ks[max_idx]
    # Index of the maximum: 4
    print('Index of the maximum:', max_idx)
    # k with the maximum score: 7
    print('k with the maximum score:', max_k)
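Once the best k has been selected on the training folds, the usual final step (not shown in the original) is to refit on the full training set with that k and evaluate once against the held-out test set; a short sketch continuing the block above:

# Refit with the selected k and evaluate on the held-out test set (illustrative)
final_knn = KNeighborsClassifier(n_neighbors=max_k).fit(x_train, y_train)
print('test-set score:', final_knn.score(x_test, y_test))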
- Cross validation can also help with model selection
# Model selection with cross validation
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
import sklearn.datasets as ds

if __name__ == '__main__':
    # 1. Load the iris dataset
    iris = ds.load_iris()
    # 2. Extract the sample data
    feature = iris['data']    # feature data
    target = iris['target']   # label data
    # 3. Split the dataset
    x_train, x_test, y_train, y_test = train_test_split(feature, target, test_size=0.2, random_state=2020)
    # Model selection using cross validation
    from sklearn.linear_model import LogisticRegression
    # KNN model
    knn = KNeighborsClassifier(n_neighbors=7)
    # KNN model accuracy: 0.9916666666666666
    print('KNN model accuracy:', cross_val_score(knn, x_train, y_train, cv=10).mean())
    # LR model
    lr = LogisticRegression()
    # LR model accuracy: 0.9833333333333332
    print('LR model accuracy:', cross_val_score(lr, x_train, y_train, cv=10).mean())
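When comparing models this way, it can be worth pinning the folds so both models are scored on exactly the same splits; one way (my suggestion, assuming the variables above) is to pass a shared KFold object instead of an integer cv:

# Compare KNN and LR on identical folds (sketch)
from sklearn.model_selection import KFold
shared_cv = KFold(n_splits=10, shuffle=True, random_state=0)
print('KNN:', cross_val_score(knn, x_train, y_train, cv=shared_cv).mean())
print('LR :', cross_val_score(lr, x_train, y_train, cv=shared_cv).mean())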
K-Fold (for reference)
- Scikit-learn provides a K-Fold API
	- n_splits is the number of folds
	- shuffle controls whether the data are shuffled
	- random_state is the random seed, which fixes the randomness
from numpy import array
from sklearn.model_selection import KFold

if __name__ == '__main__':
    data = array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6])
    kfold = KFold(n_splits=3, shuffle=True, random_state=1)
    for train, test in kfold.split(data):
        # train: [0.1 0.4 0.5 0.6], test: [0.2 0.3]
        # train: [0.2 0.3 0.4 0.6], test: [0.1 0.5]
        # train: [0.1 0.2 0.3 0.5], test: [0.4 0.6]
        print('train: %s, test: %s' % (data[train], data[test]))
- The cross-validation interface in scikit-learn is sklearn.model_selection.cross_val_score, but it does not shuffle the data itself, so it is generally combined with KFold. If the training data have already been shuffled before splitting, for example by train_test_split, cross_val_score can be used directly.
# Cross validation combined with KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
import sklearn.datasets as ds
from sklearn.neighbors import KNeighborsClassifier

iris = ds.load_iris()
X, y = iris.data, iris.target
knn = KNeighborsClassifier(n_neighbors=5)
n_folds = 5
# Randomly split the training data; pass the KFold object itself to cv
# (calling get_n_splits would pass a plain integer and lose the shuffling)
kf = KFold(n_splits=n_folds, shuffle=True, random_state=42)
# Cross validation
scores = cross_val_score(knn, X, y, cv=kf)
print(scores.mean())
Copyright notice
This article was written by [Ruo Xiaoyu]. Please include a link to the original when reposting. Thanks.
https://yzsam.com/2022/04/202204230602187134.html