Fundamentals of machine learning theory: some machine learning terms

2022-04-23 18:39:00 【Capture bamboo shoots 123】

Contents

Cost function (error)
Model accuracy
Cross-validation dataset
Learning curve
Overfitting
Underfitting

Reference for this post: the book scikit-learn Machine Learning: Common Algorithm Principles and Programming Practice

Cost function (error)

The cost measures how well the model agrees with the training samples.
The cost is the average error, over all training samples, between the values fitted by the model and the true values of the samples.
The cost function is the functional relationship between the cost and the model parameters.

$J_{train}(\theta)=\frac{1}{2m}\sum_{i=1}^{m}(h_{\theta}(x^i)-t^i)^2$
where $h_{\theta}(x^i)$ is the model's predicted label for the $i$-th sample, $t^i$ is the true label of that sample, and $m$ is the number of training samples.

Training the model means finding the model parameters that minimize the value of the cost function.
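
As a minimal sketch of the formula above, the cost can be computed directly with NumPy. The linear hypothesis $h_{\theta}(x)=\theta^{T}x$, the function name `cost`, and the toy data are assumptions made for illustration; they are not taken from the reference book.

```python
import numpy as np

def cost(theta, X, t):
    """J(theta) = 1/(2m) * sum((h_theta(x^i) - t^i)^2) for a linear model."""
    m = len(t)
    predictions = X @ theta      # h_theta(x) for every sample (linear model is an assumption)
    residuals = predictions - t  # difference between prediction and true label
    return float(residuals @ residuals) / (2 * m)

# Toy data generated by t = 1 + 2*x; the first column of X is the bias term.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
t = np.array([1.0, 3.0, 5.0])

cost_exact = cost(np.array([1.0, 2.0]), X, t)  # parameters that fit the data perfectly
cost_wrong = cost(np.array([0.0, 0.0]), X, t)  # parameters that always predict zero
print(cost_exact, cost_wrong)
```

The parameters that reproduce the data give a cost of zero, while the all-zero parameters give a larger cost, which is the quantity training tries to reduce.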

Model accuracy

Several models may be used to fit the same dataset (for example, first-order polynomials, second-order polynomials, ..., higher-order polynomials), and we want to choose the one that performs best. So how do we evaluate a model's performance?

We usually use the value of the cost function on the test set as the metric. The smaller $J_{test}(\theta)$ is, the smaller the error between the model's predictions and the true values of the samples, that is, the better the prediction accuracy on new data. Here the sum runs over the $m$ samples of the test set.
$J_{test}(\theta)=\frac{1}{2m}\sum_{i=1}^{m}(h_{\theta}(x^i)-t^i)^2$

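In sklearn, the score(x,y) method is commonly used to evaluate a model's performance. Unlike the cost, a higher score is better: for regressors it returns the coefficient of determination $R^2$, and for classifiers the mean accuracy.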
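
The following sketch shows one way to compute both the test-set cost and the sklearn score for the same model. The synthetic data, the choice of `LinearRegression`, and the 80/20 split are assumptions made for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption): y = 2x + 1 plus Gaussian noise.
rng = np.random.RandomState(0)
x = rng.uniform(-3, 3, size=(200, 1))
y = 2.0 * x[:, 0] + 1.0 + rng.normal(scale=0.5, size=200)

# Hold out 20% of the samples as the test set.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
model = LinearRegression().fit(x_train, y_train)

# J_test as defined above: half the mean squared error on the test set.
j_test = np.mean((model.predict(x_test) - y_test) ** 2) / 2

# score() returns R^2 for regressors, so higher is better.
train_score = model.score(x_train, y_train)
test_score = model.score(x_test, y_test)
print(f"J_test: {j_test:.3f}, train R^2: {train_score:.3f}, test R^2: {test_score:.3f}")
```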

Cross-validation dataset

Suppose we have a dataset and want to extract some information from it, and there are several models to choose from. Then we need to do the following three things:
1. Train the parameters of each candidate model
2. Select the best model among them
3. Evaluate the prediction accuracy of the selected model

The main purpose of the test dataset is to measure the accuracy of the model, and this requires data that the model has not "seen". If step 2 uses the test data, that data has already been "seen". To solve this, we can split the dataset into 3 parts: a training set, a test set, and an additional one, the cross-validation dataset, which is used for step 2.
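
The three steps and the three-way split can be sketched as follows. The 60/20/20 proportions, the synthetic quadratic data, and the candidate polynomial degrees are assumptions chosen for illustration; the source does not prescribe them.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data (an assumption): a quadratic relationship plus noise.
rng = np.random.RandomState(0)
x = rng.uniform(-3, 3, size=(300, 1))
y = 0.5 * x[:, 0] ** 2 + x[:, 0] + rng.normal(scale=0.3, size=300)

# Split into 60% training, 20% cross-validation, 20% test.
x_train, x_rest, y_train, y_rest = train_test_split(x, y, test_size=0.4, random_state=0)
x_cv, x_test, y_cv, y_test = train_test_split(x_rest, y_rest, test_size=0.5, random_state=0)

# Step 1: train each candidate model on the training set only.
models = {
    degree: make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(x_train, y_train)
    for degree in (1, 2, 3)
}

# Step 2: select the best model using the cross-validation set.
cv_scores = {degree: model.score(x_cv, y_cv) for degree, model in models.items()}
best_degree = max(cv_scores, key=cv_scores.get)

# Step 3: estimate accuracy on the test set, which was not used in steps 1 or 2.
test_score = models[best_degree].score(x_test, y_test)
print(best_degree, cv_scores, test_score)
```

Because the test set plays no part in choosing `best_degree`, the final score is an estimate on data the selected model has not "seen".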

In many cases we do not use a cross-validation dataset, because for a given dataset we often already know which model to use.

Learning curve

Plot the cost function values on the training dataset and the test dataset on the vertical axis against the size of the training dataset on the horizontal axis.
The learning_curve interface provided by sklearn can be used to draw the learning curve. Note that it reports the estimator's scores (higher is better) rather than cost values:

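import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve,ShuffleSplit

def plot_learning_curve(estimator,x,y,cv=None,n_jobs=1,train_size=np.linspace(.1,1.0,5)):
    # Score the estimator on training subsets of increasing size
    train_size,train_score,test_score=learning_curve(estimator,x,y,cv=cv,n_jobs=n_jobs,train_sizes=train_size)
    # Mean and standard deviation of the scores across the cv splits
    train_score_mean=np.mean(train_score,axis=1)
    train_score_std=np.std(train_score,axis=1)
    test_score_mean=np.mean(test_score,axis=1)
    test_score_std=np.std(test_score,axis=1)
    # Shade one standard deviation around each mean curve
    plt.fill_between(train_size,train_score_mean-train_score_std,train_score_mean+train_score_std,alpha=0.1,color='r')
    plt.fill_between(train_size,test_score_mean-test_score_std,test_score_mean+test_score_std,alpha=0.1,color='g')
    # Red: training score, green: cross-validation score
    plt.plot(train_size,train_score_mean,'o-',c='r',label='Training score')
    plt.plot(train_size,test_score_mean,'o-',c='g',label='Cross-validation score')
    plt.xlabel('Training examples')
    plt.ylabel('Score')
    plt.legend(loc='best')
    return plt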

What the learning curve shows: as the training dataset (the amount of training data) grows, how the model's accuracy on the training data and its prediction accuracy on the cross-validation dataset change.
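
To see the numbers behind such a curve without plotting, `learning_curve` can be called directly, as in the sketch below. The synthetic data, the `LinearRegression` estimator, and the `ShuffleSplit` settings are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import ShuffleSplit, learning_curve

# Synthetic data (an assumption): y = 2x + 1 plus Gaussian noise.
rng = np.random.RandomState(0)
x = rng.uniform(-3, 3, size=(200, 1))
y = 2.0 * x[:, 0] + 1.0 + rng.normal(scale=0.5, size=200)

# 10 random splits, each holding out 20% of the data for cross-validation.
cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)

# Each returned score array has one row per training size and one column per split.
train_sizes, train_scores, cv_scores = learning_curve(
    LinearRegression(), x, y, cv=cv, train_sizes=np.linspace(0.1, 1.0, 5)
)

train_mean = train_scores.mean(axis=1)
cv_mean = cv_scores.mean(axis=1)
for n, tr, va in zip(train_sizes, train_mean, cv_mean):
    print(f"{n:4d} training samples: train score {tr:.3f}, cv score {va:.3f}")
```

Each printed row corresponds to one point on the two curves: the training score and the cross-validation score at that training set size.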

Overfitting

The model fits the training samples very well, but its prediction accuracy on the cross-validation dataset (new data) is low.
Solutions
Get more training data
When overfitting has occurred, increasing the amount of data can effectively improve the model's performance.
Reduce the number of input features
Overfitting indicates that the model is somewhat too complex. In this case we can try reducing the number of input features, which reduces both the model's computational cost and its complexity.
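
The effect of more training data on an overfitted model can be illustrated with a small experiment. This is only a sketch: the helper name `train_and_test_scores`, the degree-15 polynomial, the sine-shaped synthetic data, and the sample counts are all assumptions chosen to make overfitting visible.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

def train_and_test_scores(n_samples, degree=15, seed=0):
    """Fit a high-degree polynomial and return (train score, test score)."""
    rng = np.random.RandomState(seed)
    x = rng.uniform(-1, 1, size=(n_samples, 1))
    y = np.sin(np.pi * x[:, 0]) + rng.normal(scale=0.3, size=n_samples)
    x_train, x_test, y_train, y_test = train_test_split(
        x, y, test_size=0.25, random_state=seed
    )
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_train, y_train)
    return model.score(x_train, y_train), model.score(x_test, y_test)

# Few samples: the flexible model can fit the noise in the training data.
small_train, small_test = train_and_test_scores(n_samples=20)
# Many samples: the same model has far less room to fit noise.
large_train, large_test = train_and_test_scores(n_samples=2000)

print(f"20 samples:   train {small_train:.3f}, test {small_test:.3f}")
print(f"2000 samples: train {large_train:.3f}, test {large_test:.3f}")
```

A large gap between the training score and the test score is the signature of overfitting; with the larger dataset the gap should shrink.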

Underfitting

The model cannot fit the training samples well, and its prediction accuracy on the cross-validation dataset (new data) is also low.

Solutions
Add valuable features
Underfitting indicates that the model is somewhat too simple. The cause may be that there are too few input features, so we can mine more new features from the original data.

Add polynomial features
Sometimes it is not easy to mine new features from the original data. In that case we can multiply some of the original features together, or square them, and use the results as new features. This is equivalent to increasing the order of the model.

$x_1,x_2\rightarrow x_1^2,x_2^2,x_1x_2$