Logistic regression: principle and code implementation

2022-04-23 17:53:00 【Stephen_ Tao】

Contents

The principle of logistic regression
Logistic regression code implementation
Summary

The principle of logistic regression

Logistic regression is mainly used for binary classification: given an input sample $x$, it outputs the predicted probability that the sample belongs to class 1, $\hat{y} = P(y=1|x)$.
Compared with linear regression, logistic regression passes the linear output through a nonlinear function, such as the Sigmoid function, so that the output value lies in the interval (0, 1); a threshold is then set for classification.

The main parameters of logistic regression

Input feature vector: $x \in R^n$ (the input sample has $n$ features), with label $y \in \{0,1\}$
Weights and bias of logistic regression: $w \in R^n$, $b \in R$
Predicted output: $\hat{y} = \sigma(w^Tx+b)=\sigma(w_1x_1+w_2x_2+...+w_nx_n+b)$
($\sigma$ is usually the Sigmoid function)
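
As a minimal sketch of the prediction formula above, the snippet below evaluates $\hat{y} = \sigma(w^Tx+b)$ for one sample. The helper name `sigmoid` and the values of `x`, `w` and `b` are assumptions chosen only for illustration.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical values, chosen only for illustration.
x = np.array([1.0, 2.0, 3.0])    # one sample with n = 3 features
w = np.array([0.5, -0.25, 0.0])  # weights
b = 0.0                          # bias

z = np.dot(w, x) + b             # w^T x + b = 0.5 - 0.5 + 0.0 = 0.0
y_hat = sigmoid(z)               # sigma(0) = 0.5
print(z, y_hat)
```

With these values the linear score is exactly 0, so the predicted probability sits on the usual 0.5 decision threshold.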

The process of logistic regression

Prepare the training data $X \in R^{n \times m}$ ($m$ is the number of samples, $n$ is the number of features per sample) and the training labels $Y \in R^m$
Initialize the weights and bias $w \in R^n$, $b \in R$
Forward propagation: compute $\hat{y} = \sigma(w^Tx+b)$, then compute the loss with the maximum likelihood (cross-entropy) loss function
Use gradient descent to update the weights $w$ and bias $b$ so as to minimize the loss function
Use the trained weights and bias for forward propagation, set a classification threshold, and perform the binary classification task

Loss function of logistic regression

Linear regression uses the squared error as its loss function. Logistic regression generally uses the maximum likelihood loss (the negative log-likelihood, also called cross-entropy) to measure the error between the predicted result and the true value.
The loss of a single sample is calculated as follows:
$L(\hat{y},y)=-y\log\hat{y}-(1-y)\log(1-\hat{y})$

If the sample has label 1, then $L(\hat{y},y)=-\log\hat{y}$: the closer the prediction is to 1, the smaller the value of $L(\hat{y},y)$
If the sample has label 0, then $L(\hat{y},y)=-\log(1-\hat{y})$: the closer the prediction is to 0, the smaller the value of $L(\hat{y},y)$

The loss over all training samples is the average of the per-sample losses:
$J(w,b)=\frac{1}{m}\sum^m_{i=1}L(\hat{y}^{(i)},y^{(i)})$
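
The two formulas above can be sketched in a few lines of NumPy. The function name `sample_loss` and the labels and predictions used here are hypothetical; the point is only that confident wrong predictions are penalized far more than confident correct ones.

```python
import numpy as np

def sample_loss(y_hat, y):
    """Per-sample loss L(y_hat, y) = -y*log(y_hat) - (1 - y)*log(1 - y_hat)."""
    return -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)

# Hypothetical labels and predicted probabilities.
y = np.array([1, 1, 0, 0])
y_hat = np.array([0.9, 0.1, 0.1, 0.9])

losses = sample_loss(y_hat, y)   # small for samples 1 and 3, large for 2 and 4
J = losses.mean()                # J(w, b): average loss over the m samples
print(losses, J)
```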

Gradient descent algorithm

The purpose of gradient descent is to minimize the loss function. The gradient of a function points in the direction in which the function increases fastest, so each step moves the parameters in the opposite direction.

[Figure: gradient descent on the loss function]

As shown in the figure above, suppose $J(w, b)$ is a function of $w$ and $b$, with $w, b \in R$.
Compute the gradient at the initial point, set the learning rate, and iteratively update the weight and bias; the parameters eventually reach the minimum of the function.
The update formulas for the weight and bias are:
$w=w-\alpha\frac{\partial{J(w,b)}}{\partial{w}}$

$b=b-\alpha\frac{\partial{J(w,b)}}{\partial{b}}$

Note: $\alpha$ is the learning rate, i.e. the step size of each update of $w$ and $b$
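
To make the update rule concrete, here is a small sketch that applies it to a toy objective. The function $J(w,b)=(w-3)^2+(b+1)^2$, the helper `grad_J`, the starting point and the learning rate are all assumptions made for illustration; this is not the logistic regression loss.

```python
# Toy objective (an assumption for illustration): J(w, b) = (w - 3)^2 + (b + 1)^2,
# whose minimum is at w = 3, b = -1.
def grad_J(w, b):
    """Partial derivatives of the toy objective with respect to w and b."""
    return 2 * (w - 3), 2 * (b + 1)

w, b = 0.0, 0.0   # initial point
alpha = 0.1       # learning rate (step size)

for _ in range(200):
    dw, db = grad_J(w, b)
    w = w - alpha * dw   # w = w - alpha * dJ/dw
    b = b - alpha * db   # b = b - alpha * dJ/db

print(w, b)  # approaches the minimizer (3, -1)
```

On this objective each step shrinks the distance to the minimum by a constant factor, which is why a fixed learning rate is enough; a learning rate that is too large would make the iterates overshoot instead.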

Gradient calculation based on chain rule

The parameter update formulas introduced in the gradient descent section require the gradient of the loss. This section describes the gradient calculation in detail.
For simplicity, make the following assumptions:

input data $x \in R^{3 \times 1}$ (1 sample, 3 features)
label $y \in R$
weights $w \in R^{3 \times 1}$
bias $b \in R$

For ease of understanding, the data flow of logistic regression is shown as a flow chart, where $z=w^Tx+b$, $\hat{y}=\sigma(z)$ and $J=L(\hat{y},y)$:

[Figure: flow chart of the logistic regression data flow]

1. Compute the derivative of $J$ with respect to $z$

$\frac{\partial{J}}{\partial{z}}=\frac{\partial{J}}{\partial{\hat{y}}}\frac{\partial{\hat{y}}}{\partial{z}}$
$\frac{\partial{J}}{\partial{\hat{y}}}=\frac{-y}{\hat{y}}+\frac{1-y}{1-\hat{y}}$ $\quad$ $\quad$ $\frac{\partial{\hat{y}}}{\partial{z}}=\hat{y}(1-\hat{y})$
$\frac{\partial{J}}{\partial{z}}=\frac{\partial{J}}{\partial{\hat{y}}}\frac{\partial{\hat{y}}}{\partial{z}}=-y(1-\hat{y})+(1-y)\hat{y}=\hat{y}-y$

2. Compute the derivatives of $z$ with respect to $w$ and $b$

$\frac{\partial{z}}{\partial{w_1}}=x_1$ $\quad$ $\quad$ $\frac{\partial{z}}{\partial{w_2}}=x_2$ $\quad$ $\quad$ $\frac{\partial{z}}{\partial{w_3}}=x_3$
$\frac{\partial{z}}{\partial{b}}=1$

3. Compute the derivatives of $J$ with respect to $w$ and $b$

$\frac{\partial{J}}{\partial{w_1}}=\frac{\partial{J}}{\partial{z}}\frac{\partial{z}}{\partial{w_1}}=(\hat{y}-y)x_1$
$\frac{\partial{J}}{\partial{w_2}}=\frac{\partial{J}}{\partial{z}}\frac{\partial{z}}{\partial{w_2}}=(\hat{y}-y)x_2$
$\frac{\partial{J}}{\partial{w_3}}=\frac{\partial{J}}{\partial{z}}\frac{\partial{z}}{\partial{w_3}}=(\hat{y}-y)x_3$
$\frac{\partial{J}}{\partial{b}}=\frac{\partial{J}}{\partial{z}}\frac{\partial{z}}{\partial{b}}=(\hat{y}-y)$
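
One way to sanity-check the chain-rule result is to compare it with a finite-difference approximation. The sketch below does this for a single sample with 3 features; the helper names and all numeric values are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, b, x, y):
    """Single-sample loss J = L(y_hat, y) with y_hat = sigmoid(w^T x + b)."""
    y_hat = sigmoid(np.dot(w, x) + b)
    return -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)

# Hypothetical sample and parameters.
x = np.array([0.5, -1.2, 2.0])
y = 1.0
w = np.array([0.1, 0.2, -0.3])
b = 0.05

# Analytic gradients from the derivation: dJ/dw_j = (y_hat - y) * x_j, dJ/db = y_hat - y
y_hat = sigmoid(np.dot(w, x) + b)
dw_analytic = (y_hat - y) * x
db_analytic = y_hat - y

# Central finite differences as an independent check
eps = 1e-6
dw_numeric = np.zeros_like(w)
for j in range(w.size):
    step = np.zeros_like(w)
    step[j] = eps
    dw_numeric[j] = (loss(w + step, b, x, y) - loss(w - step, b, x, y)) / (2 * eps)
db_numeric = (loss(w, b + eps, x, y) - loss(w, b - eps, x, y)) / (2 * eps)

print(dw_analytic, dw_numeric)
print(db_analytic, db_numeric)
```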

Vectorized gradient calculation

The previous section showed how to calculate the gradient for a single data sample. In practice there is never only one sample, so the gradient must be calculated from the loss over all data samples. This section computes the gradient for multiple samples with vectorized operations; compared with a for loop, vectorization saves a lot of time and improves efficiency.
First, declare the structure of the data:

input data $X \in R^{n \times m}$ ($m$ samples, $n$ features)
label $Y \in R^{1 \times m}$
weights $W \in R^{n \times 1}$
bias $b \in R$

The gradient descent process is as follows:

$Z=W^TX+b$ $\quad$ $Z\in R^{1\times m}$
$\hat{Y}=\sigma(Z)$ $\quad$ $\hat{Y}\in R^{1\times m}$
$dZ=\hat{Y}-Y$ $\quad$ $dZ\in R^{1\times m}$ (entry $i$ is $\frac{\partial{L(\hat{y}^{(i)},y^{(i)})}}{\partial{z^{(i)}}}$, the single-sample result from the previous section)
$\frac{\partial{J}}{\partial{W}}=\frac{1}{m}X(\hat{Y}-Y)^T$ $\quad$ $\frac{\partial{J}}{\partial{W}}\in R^{n\times 1}$
$\frac{\partial{J}}{\partial{b}}=\frac{1}{m}\sum^m_{i=1}(\hat{y}^{(i)}-y^{(i)})$ $\quad$ $\frac{\partial{J}}{\partial{b}}\in R$
$W=W-\alpha \frac{\partial{J}}{\partial{W}}$ $\quad$ $b=b-\alpha \frac{\partial{J}}{\partial{b}}$
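
The sketch below checks that the vectorized formulas give the same gradients as an explicit loop over samples. The random data, the seed and the sizes `n` and `m` are assumptions made only for this comparison.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical random data: n features, m samples, stored as (n, m).
rng = np.random.default_rng(0)
n, m = 4, 50
X = rng.normal(size=(n, m))
Y = rng.integers(0, 2, size=(1, m)).astype(float)
W = rng.normal(size=(n, 1))
b = 0.1

# Loop version: accumulate the single-sample gradients, then average.
dW_loop = np.zeros((n, 1))
db_loop = 0.0
for i in range(m):
    x_i = X[:, i:i + 1]                          # column i, shape (n, 1)
    y_hat_i = sigmoid((W.T @ x_i).item() + b)    # scalar prediction
    dW_loop += (y_hat_i - Y[0, i]) * x_i
    db_loop += y_hat_i - Y[0, i]
dW_loop /= m
db_loop /= m

# Vectorized version: the formulas from this section.
Y_hat = sigmoid(W.T @ X + b)    # (1, m)
dZ = Y_hat - Y                  # (1, m)
dW_vec = X @ dZ.T / m           # (n, 1)
db_vec = dZ.sum() / m           # scalar

print(np.allclose(dW_loop, dW_vec), np.isclose(db_loop, db_vec))
```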

Logistic regression code implementation

Generate binary classification data

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
import numpy as np

# Generate 500 sample points with 2 classes and 4 features per sample
X,Y=make_classification(n_samples=500,n_features=4,n_classes=2)

# Use 30% of the data as the test set and 70% as the training set
x_train,x_test,y_train,y_test = train_test_split(X,Y,test_size=0.3)
print("X:",X.shape)
print("Y:",Y.shape)
print("x_train:",x_train.shape)
print("x_test:",x_test.shape)
print("y_train:",y_train.shape)
print("y_test:",y_test.shape)

The output is ：

X: (500, 4)
Y: (500,)
x_train: (350, 4)
x_test: (150, 4)
y_train: (350,)
y_test: (150,)

Define initialization module

def initialize_with_zeros(shape):
    """Create a zero weight vector w of shape (shape, 1) and a bias b = 0. Returns: w, b"""
    w = np.zeros((shape, 1))
    b = 0
    return w, b

Define loss function and gradient

def basic_sigmoid(x):
    """Compute the sigmoid function."""
    s = 1 / (1 + np.exp(-x))
    return s

def propagate(w, b, X, Y):
    """Forward pass and gradient computation.

    Parameters: w (n, 1) weights, b bias, X (n, m) features, Y (1, m) labels.
    Returns: grads, a dict with the gradients dw and db, and the loss cost.
    """
    m = X.shape[1]

    # Forward pass: w is (n, 1) and X is (n, m), so A is (1, m)
    A = basic_sigmoid(np.dot(w.T, X) + b)
    # Cross-entropy loss averaged over the m samples
    cost = -1 / m * np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A))
    # Gradients with respect to z, w and b
    dz = A - Y
    dw = 1 / m * np.dot(X, dz.T)
    db = 1 / m * np.sum(dz)

    cost = np.squeeze(cost)

    grads = {"dw": dw, "db": db}
    return grads, cost
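
The loss in `propagate` takes `np.log(A)` and `np.log(1 - A)`. If the sigmoid output saturates to exactly 0 or 1 in floating point, the logarithm becomes `-inf` and the product with 0 becomes `nan`. A common safeguard, sketched here as an optional variant rather than as part of the implementation above, is to clip the probabilities first. The name `safe_cost` and the value of `eps` are assumptions.

```python
import numpy as np

def safe_cost(A, Y, eps=1e-12):
    """Cross-entropy loss with probabilities clipped to [eps, 1 - eps]."""
    A = np.clip(A, eps, 1 - eps)
    return float(-np.mean(Y * np.log(A) + (1 - Y) * np.log(1 - A)))

# Hypothetical saturated predictions: the first two are exactly 1 and 0.
A = np.array([[1.0, 0.0, 0.5]])
Y = np.array([[1.0, 0.0, 1.0]])

cost = safe_cost(A, Y)
print(cost)  # finite; dominated by the third sample, -log(0.5) / 3
```

Clipping slightly biases the loss for saturated predictions, which is usually acceptable for monitoring training.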

Define gradient descent algorithm

def optimize(w, b, X, Y, num_iterations, learning_rate):
    """Run gradient descent.

    Parameters: w weights, b bias, X features, Y labels,
    num_iterations total number of iterations, learning_rate learning rate.
    Returns: params, a dict with the updated parameters; grads, the last
    gradients; costs, the loss recorded every 100 iterations.
    """

    costs = []

    for i in range(num_iterations):

        # Compute the loss and the gradients
        grads, cost = propagate(w, b, X, Y)

        # Take out the gradients of the two parameters
        dw = grads['dw']
        db = grads['db']

        # Update the parameters with the gradient descent formula
        w = w - learning_rate * dw
        b = b - learning_rate * db

        if i % 100 == 0:
            costs.append(cost)
            print(" Loss results  %i: %f" %(i, cost))


    params = {"w": w, "b": b}

    grads = {"dw": dw, "db": db}
    return params, grads, costs

Define prediction module

def predict(w, b, X):
    '''Predict labels with the trained parameters. Returns: the predictions, shape (1, m).'''

    m = X.shape[1]
    y_prediction = np.zeros((1, m))
    w = w.reshape(X.shape[0], 1)

    # Compute the predicted probabilities
    A = basic_sigmoid(np.dot(w.T, X) + b)

    for i in range(A.shape[1]):

        if A[0, i] <= 0.5:
            y_prediction[0, i] = 0
        else:
            y_prediction[0, i] = 1

    return y_prediction
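
The thresholding loop in `predict` can also be written as a single array comparison. The following is a sketch of an equivalent vectorized variant; the name `predict_vectorized` and the toy inputs are assumptions. It keeps the same rule as the loop: probabilities less than or equal to 0.5 map to class 0.

```python
import numpy as np

def basic_sigmoid(x):
    return 1 / (1 + np.exp(-x))

def predict_vectorized(w, b, X):
    """Return class predictions of shape (1, m) without an explicit loop."""
    A = basic_sigmoid(np.dot(w.T, X) + b)   # probabilities, shape (1, m)
    return (A > 0.5).astype(float)          # A <= 0.5 -> 0, otherwise 1

# Hypothetical parameters and three samples stored as columns.
w = np.array([[1.0], [-1.0]])
b = 0.0
X = np.array([[2.0, 0.0, 1.0],
              [0.0, 2.0, 1.0]])

# Scores are z = [2, -2, 0]; the last sample has probability exactly 0.5.
print(predict_vectorized(w, b, X))
```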

Define a logistic regression model

def model(x_train, y_train, x_test, y_test, num_iterations=2000, learning_rate=0.0001):
    """Train logistic regression on the training set and print the accuracy on both sets."""

    # Convert the data to (n_features, m) layout. A transpose is needed here:
    # reshape(-1, m) gives the same shape but keeps the flat element order,
    # which mixes feature values across samples.
    x_train = x_train.T
    x_test = x_test.T
    y_train = y_train.reshape(1, y_train.shape[0])
    y_test = y_test.reshape(1, y_test.shape[0])
    print(x_train.shape)
    print(x_test.shape)
    print(y_train.shape)
    print(y_test.shape)

    # 1. Initialize the parameters
    w, b = initialize_with_zeros(x_train.shape[0])

    # 2. Gradient descent
    # params: updated parameters
    # grads: gradients from the last iteration
    # costs: losses recorded every 100 iterations
    params, grads, costs = optimize(w, b, x_train, y_train, num_iterations, learning_rate)

    # Get the trained parameters and predict on both sets
    w = params['w']
    b = params['b']
    y_prediction_train = predict(w, b, x_train)
    y_prediction_test = predict(w, b, x_test)

    # Print the accuracy (in percent)
    print(" Training set accuracy : {} ".format(100 - np.mean(np.abs(y_prediction_train - y_train)) * 100))
    print(" Test set accuracy : {} ".format(100 - np.mean(np.abs(y_prediction_test - y_test)) * 100))

    return None
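
The model works on data in (n_features, m) layout, while scikit-learn returns arrays of shape (m, n_features). The sketch below, on a hypothetical toy array, shows why that conversion should be a transpose: reshaping to the same target shape keeps the flat element order, so the columns no longer correspond to samples.

```python
import numpy as np

# Hypothetical data: 2 samples (rows), 3 features (columns).
X = np.array([[1, 2, 3],
              [4, 5, 6]])

X_t = X.T                          # transpose: column 0 is sample 0 -> [1, 2, 3]
X_r = X.reshape(-1, X.shape[0])    # same shape (3, 2), but column 0 is [1, 3, 5]

print(X_t)
print(X_r)
```

Both results have shape (3, 2), so downstream code runs without errors either way; only the transposed array keeps each sample's features together.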

Run the model

model(x_train, y_train, x_test, y_test, num_iterations=3000, learning_rate=0.01)

With the number of iterations set to 3000 and the learning rate set to 0.01, the recorded run printed the following results:

(4, 350)
(4, 150)
(1, 350)
(1, 150)
 Loss results  0: 0.693147
 Loss results  100: 0.685711
 Loss results  200: 0.681650
 Loss results  300: 0.679411
 Loss results  400: 0.678165
 Loss results  500: 0.677465
 Loss results  600: 0.677069
 Loss results  700: 0.676843
 Loss results  800: 0.676713
 Loss results  900: 0.676639
 Loss results  1000: 0.676595
 Loss results  1100: 0.676570
 Loss results  1200: 0.676556
 Loss results  1300: 0.676547
 Loss results  1400: 0.676542
 Loss results  1500: 0.676539
 Loss results  1600: 0.676538
 Loss results  1700: 0.676537
 Loss results  1800: 0.676536
 Loss results  1900: 0.676536
 Loss results  2000: 0.676535
 Loss results  2100: 0.676535
 Loss results  2200: 0.676535
 Loss results  2300: 0.676535
 Loss results  2400: 0.676535
 Loss results  2500: 0.676535
 Loss results  2600: 0.676535
 Loss results  2700: 0.676535
 Loss results  2800: 0.676535
 Loss results  2900: 0.676535
 Training set accuracy : 60.57142857142857 
 Test set accuracy : 56.0

The change of the loss value over the iterations is shown in the figure below:
[Figure: loss value versus number of iterations]

Summary

This article has introduced the principle of logistic regression in detail and implemented an example in Python. In the recorded run, the loss does not approach 0 as the number of iterations increases; it levels off at about 0.68, barely below the initial value of 0.693. The prediction accuracy is 60.57% on the training set and 56% on the test set. Logistic regression is simple, easy to understand and highly interpretable, but its decision boundary is linear, so it cannot fit data whose classes are not linearly separable, and its accuracy on such data is limited. These particular numbers should be read with caution, however: they were recorded with the inputs converted by `reshape(-1, m)`, which rearranges elements instead of transposing the matrix and therefore mixes feature values across samples, so the near-chance accuracy here reflects that data layout more than the model itself.

Copyright notice
This article was created by [Stephen_ Tao]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/04/202204230548468772.html