[Machine Learning] Note 4: KNN + Cross Validation
2022-04-23 13:41:00 [Ruo Xiaoyu]
KNN classification model
- Concept: in short, the k-nearest neighbor algorithm (k-Nearest Neighbor, KNN) classifies a sample by measuring the distances between its feature values and those of labeled samples
- The role of the k value
- Euclidean distance (a quick sketch follows this list)
- The k-nearest neighbor algorithm in scikit-learn
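To make the distance measure concrete, here is a minimal sketch (my addition, not from the original post) computing the Euclidean distance between two 4-dimensional feature vectors with NumPy; the sample values are made up for illustration:

# Euclidean distance between two feature vectors (illustrative sketch)
import numpy as np

a = np.array([5.1, 3.5, 1.4, 0.2])  # e.g. one iris-like sample
b = np.array([6.1, 2.8, 4.7, 1.2])  # another sample
dist = np.sqrt(np.sum((a - b) ** 2))  # definition of Euclidean distance
dist2 = np.linalg.norm(a - b)         # equivalent one-liner
print(dist, dist2)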
# Iris classification with KNN
import sklearn.datasets as ds
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

if __name__ == '__main__':
    # 1. Load the iris dataset
    iris = ds.load_iris()
    # 2. Extract the sample data
    feature = iris['data']    # feature data
    target = iris['target']   # label data
    # 3. Split the dataset
    x_train, x_test, y_train, y_test = train_test_split(feature, target, test_size=0.2, random_state=2021)
    # 4. Inspect the data: decide whether feature engineering is needed
    print(x_train.shape)
    # 5. Instantiate the model object (different k values in KNN lead directly to different classification results)
    # A hyperparameter is a model parameter whose value directly affects the model's classification or prediction results
    knn = KNeighborsClassifier(n_neighbors=3)  # n_neighbors == k
    # 6. Train the model on the training set
    # X: training-set feature data; it must be two-dimensional
    # y: training-set label data
    knnModel = knn.fit(x_train, y_train)
    print(knnModel)
    # 7. Test the model on the test data
    y_pred = knnModel.predict(x_test)  # labels the model predicts from the test data
    y_true = y_test                    # true labels of the test set
    # 7.1 Compare the model's predictions with the true labels
    # Predicted: [0 0 1 0 0 0 0 0 0 0 0 1 2 2 1 2 1 1 0 1 1 2 2 0 2 1 1 2 0 0]
    # True:      [0 0 1 0 0 0 0 0 0 0 0 1 2 2 1 2 1 1 0 1 1 2 2 0 2 1 1 1 0 0]
    print('Predicted labels:', y_pred)
    print('True labels:', y_true)
    # 7.2 Compute the model's prediction accuracy
    # X: test-set feature data
    # y: test-set label data
    score = knnModel.score(x_test, y_test)
    # score = 0.9666666666666667
    print('score =', score)
    # 8. Use the model to predict new samples
    targetResult = knnModel.predict([[6.1, 5.1, 4.5, 3.6], [2.1, 3.1, 4.5, 5.6]])
    # Predicted: [2 2]
    print('Predicted labels:', targetResult)
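Since KNN is distance-based, the "feature engineering" check in step 4 often comes down to scaling: features with larger numeric ranges dominate the distance. A minimal sketch (my addition, not in the original) standardizing the features with scikit-learn's StandardScaler, assuming the split from the block above:

# Optional: standardize features before KNN (illustrative sketch)
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
x_train_std = scaler.fit_transform(x_train)  # fit scaling statistics on training data only
x_test_std = scaler.transform(x_test)        # reuse the same statistics on the test set
knn_std = KNeighborsClassifier(n_neighbors=3).fit(x_train_std, y_train)
print('score (scaled) =', knn_std.score(x_test_std, y_test))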
- How to choose the best k value
- Use a learning curve to find the best k value
# Find the best k value with a learning curve
import numpy as np
scores = []
ks = []
# Try k values from 1 to 49
for i in range(1, 50):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(x_train, y_train)           # train the model
    score = knn.score(x_test, y_test)   # compute the model's prediction accuracy
    ks.append(i)
    scores.append(score)
scores_arr = np.array(scores)
ks_arr = np.array(ks)
# %matplotlib inline
import matplotlib.pyplot as plt
plt.plot(ks_arr, scores_arr)
plt.xlabel('k_value')
plt.ylabel('score')
plt.show()
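Rather than reading the peak off the plot by eye, the best k can be pulled out of the arrays directly; a small follow-up sketch (my addition, using the arrays built above):

# Pick the k with the highest accuracy from the learning curve
best_k = ks_arr[scores_arr.argmax()]
print('best k =', best_k, 'score =', scores_arr.max())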
K-fold cross validation
- Purpose:
	- Select the most suitable value for a model hyperparameter, then use that value when creating the model.
- Idea:
	- Repeatedly split the sample's training data into different training and validation sets, measure the model's accuracy on each split, and average those accuracies to obtain the cross-validation result. Run this cross validation for each candidate hyperparameter value and pick the value with the highest accuracy as the model's hyperparameter.
- Implementation steps (a minimal manual sketch follows this list)
	- Split the dataset evenly into k folds
	- Use one fold as the test data and the remaining folds as training data
	- Compute the test accuracy
	- Repeat steps 2 and 3 with a different fold as the test set each time
	- Average the accuracies as an estimate of the model's accuracy on unseen data
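To make the procedure concrete, here is a minimal manual k-fold sketch (my own illustration, not the library API) that follows the steps above on the iris data:

# Manual k-fold cross validation following the steps above (illustrative)
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
k_folds = 5
indices = np.arange(len(X))
np.random.default_rng(0).shuffle(indices)   # shuffle once up front
folds = np.array_split(indices, k_folds)    # split evenly into k folds

scores = []
for i in range(k_folds):
    test_idx = folds[i]                     # one fold as validation data
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    model = KNeighborsClassifier(n_neighbors=5).fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))  # accuracy on this fold

print('mean accuracy:', np.mean(scores))    # average as the final estimate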
- API
	- from sklearn.model_selection import cross_val_score
	- cross_val_score(estimator, X, y, cv)
		- estimator: the model object
		- X, y: training-set data
		- cv: the number of folds
- Basic use of cross validation with KNN
# Cross validation
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
import sklearn.datasets as ds

if __name__ == '__main__':
    # 1. Load the iris dataset
    iris = ds.load_iris()
    # 2. Extract the sample data
    feature = iris['data']    # feature data
    target = iris['target']   # label data
    # 3. Split the dataset
    x_train, x_test, y_train, y_test = train_test_split(feature, target, test_size=0.2, random_state=2021)
    # 4. Inspect the data: decide whether feature engineering is needed
    print(x_train.shape)
    # 5. Instantiate the model object (different k values in KNN lead directly to different results)
    knn = KNeighborsClassifier(n_neighbors=5)  # n_neighbors == k
    # Cross-validate on the training set
    crossScore = cross_val_score(knn, x_train, y_train, cv=5).mean()
    print(crossScore)
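Note that cross_val_score returns one score per fold and .mean() collapses them; to see the spread across folds, a small follow-up sketch (my addition, continuing the block above):

# Inspect the per-fold scores instead of only the mean
fold_scores = cross_val_score(knn, x_train, y_train, cv=5)
print('per-fold:', fold_scores, 'mean:', fold_scores.mean(), 'std:', fold_scores.std())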
- Use cross validation and a learning curve to find the best hyperparameter
# Learning curve & cross validation
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
import sklearn.datasets as ds
import matplotlib.pyplot as plt
import numpy as np

if __name__ == '__main__':
    # 1. Load the iris dataset
    iris = ds.load_iris()
    # 2. Extract the sample data
    feature = iris['data']    # feature data
    target = iris['target']   # label data
    # 3. Split the dataset
    x_train, x_test, y_train, y_test = train_test_split(feature, target, test_size=0.2, random_state=2020)
    scores = []
    ks = []
    for k in range(3, 20):
        knn = KNeighborsClassifier(n_neighbors=k)
        cross_score = cross_val_score(knn, x_train, y_train, cv=6).mean()
        scores.append(cross_score)
        ks.append(k)
    ks_arr = np.array(ks)
    scores_arr = np.array(scores)
    plt.plot(ks_arr, scores_arr)
    plt.xlabel('k_value')
    plt.ylabel('score')
    plt.show()
    # Index of the maximum score
    max_idx = scores_arr.argmax()
    # k value corresponding to the maximum score
    max_k = ks[max_idx]
    # Index of the maximum: 4
    print('Index of the maximum:', max_idx)
    # k with the maximum score: 7
    print('k with the maximum score:', max_k)
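Once the best k has been selected on the training folds, the usual final step (not shown in the original) is to refit on the full training set with that k and evaluate once against the held-out test set; a short sketch continuing the block above:

# Refit with the selected k and evaluate on the held-out test set (illustrative)
final_knn = KNeighborsClassifier(n_neighbors=max_k).fit(x_train, y_train)
print('test-set score:', final_knn.score(x_test, y_test))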
- Cross validation can also help with model selection
# Model selection with cross validation
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
import sklearn.datasets as ds

if __name__ == '__main__':
    # 1. Load the iris dataset
    iris = ds.load_iris()
    # 2. Extract the sample data
    feature = iris['data']    # feature data
    target = iris['target']   # label data
    # 3. Split the dataset
    x_train, x_test, y_train, y_test = train_test_split(feature, target, test_size=0.2, random_state=2020)
    # Model selection using cross validation
    from sklearn.linear_model import LogisticRegression
    # KNN model
    knn = KNeighborsClassifier(n_neighbors=7)
    # KNN model accuracy: 0.9916666666666666
    print('KNN model accuracy:', cross_val_score(knn, x_train, y_train, cv=10).mean())
    # LR model
    lr = LogisticRegression()
    # LR model accuracy: 0.9833333333333332
    print('LR model accuracy:', cross_val_score(lr, x_train, y_train, cv=10).mean())
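When comparing models this way, it can be worth pinning the folds so both models are scored on exactly the same splits; one way (my suggestion, assuming the variables above) is to pass a shared KFold object instead of an integer cv:

# Compare KNN and LR on identical folds (sketch)
from sklearn.model_selection import KFold
shared_cv = KFold(n_splits=10, shuffle=True, random_state=0)
print('KNN:', cross_val_score(knn, x_train, y_train, cv=shared_cv).mean())
print('LR :', cross_val_score(lr, x_train, y_train, cv=shared_cv).mean())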
K-Fold (for reference)
- Scikit-learn provides a K-Fold API
	- n_splits is the number of folds
	- shuffle controls whether the data are shuffled
	- random_state is the random seed, which fixes the randomness
from numpy import array
from sklearn.model_selection import KFold

if __name__ == '__main__':
    data = array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6])
    kfold = KFold(n_splits=3, shuffle=True, random_state=1)
    for train, test in kfold.split(data):
        # train: [0.1 0.4 0.5 0.6], test: [0.2 0.3]
        # train: [0.2 0.3 0.4 0.6], test: [0.1 0.5]
        # train: [0.1 0.2 0.3 0.5], test: [0.4 0.6]
        print('train: %s, test: %s' % (data[train], data[test]))
- The cross-validation interface in scikit-learn is sklearn.model_selection.cross_val_score, but it does not shuffle the data itself, so it is generally combined with KFold. If the training data have already been shuffled before splitting, for example by train_test_split, cross_val_score can be used directly.
# Cross validation combined with KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
import sklearn.datasets as ds
from sklearn.neighbors import KNeighborsClassifier

iris = ds.load_iris()
X, y = iris.data, iris.target
knn = KNeighborsClassifier(n_neighbors=5)
n_folds = 5
# Randomly split the training data; pass the KFold object itself to cv
# (calling get_n_splits would pass a plain integer and lose the shuffling)
kf = KFold(n_splits=n_folds, shuffle=True, random_state=42)
# Cross validation
scores = cross_val_score(knn, X, y, cv=kf)
print(scores.mean())
Copyright notice
This article was written by [Ruo Xiaoyu]. Please include a link to the original when reposting. Thanks.
https://yzsam.com/2022/04/202204230602187134.html