Logistic regression: principle and code implementation
2022-04-23 17:53:00 【Stephen_ Tao】
The principle of logistic regression
Logistic regression is mainly used to solve binary classification problems. Given an input sample $x$, it outputs the predicted probability that the sample belongs to class 1: $\hat{y} = P(y=1|x)$.
Compared with linear regression, logistic regression adds a nonlinear function, typically the Sigmoid function, so that the output lies in the interval $[0,1]$; a threshold is then applied to this output to perform classification.
The main parameters of logistic regression
- Input feature vector: $x \in R^n$ (each input sample has $n$ feature values); label $y \in \{0,1\}$
- Weights and bias of logistic regression: $w \in R^n$, $b \in R$
- Predicted output: $\hat{y} = \sigma(w^Tx+b)=\sigma(w_1x_1+w_2x_2+...+w_nx_n+b)$, where $\sigma$ is usually the Sigmoid function
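As a quick illustration of this forward computation, here is a minimal NumPy sketch (the sample values, weights, and bias below are made up purely for the example):
import numpy as np

def sigmoid(z):
    # Sigmoid squashes any real number into the interval (0, 1)
    return 1 / (1 + np.exp(-z))

# Hypothetical sample with n = 3 features, plus example weights and bias
x = np.array([0.5, -1.2, 3.0])
w = np.array([0.8, 0.1, -0.4])
b = 0.2

y_hat = sigmoid(np.dot(w, x) + b)   # predicted probability P(y = 1 | x)
label = 1 if y_hat > 0.5 else 0     # classify with a 0.5 threshold
print(y_hat, label)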
The process of logistic regression
- Prepare the training data $x \in R^{n \times m}$ ($m$ is the number of samples, $n$ is the number of features per sample) and the training labels $y \in R^m$
- Initialize the weights and bias $w \in R^n$, $b \in R$
- Forward propagation: $\hat{y} = \sigma(w^Tx+b)$; the loss is computed with the maximum-likelihood loss function
- Use gradient descent to update the weights $w$ and bias $b$ so as to minimize the loss function
- Use the learned weights and bias for forward propagation on new data, set a classification threshold, and carry out the binary classification task
Loss function of logistic regression
Linear regression uses the squared error as its loss function; in logistic regression, the maximum-likelihood (cross-entropy) loss function is generally used to measure the error between the prediction and the true value.
The loss of a single sample is computed as follows:
$L(\hat{y},y)=-y\log\hat{y}-(1-y)\log(1-\hat{y})$
- If the sample has label 1, then $L(\hat{y},y)=-\log\hat{y}$; the closer the prediction is to 1, the smaller $L(\hat{y},y)$ becomes
- If the sample has label 0, then $L(\hat{y},y)=-\log(1-\hat{y})$; the closer the prediction is to 0, the smaller $L(\hat{y},y)$ becomes
The loss over all training samples is calculated as follows:
$J(w,b)=\frac{1}{m}\sum^m_{i=1}L(\hat{y}^{(i)},y^{(i)})$
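As a small worked example of this loss, the following NumPy sketch (with made-up predicted probabilities y_hat and true labels y) computes the per-sample loss and the average J:
import numpy as np

# Hypothetical predictions and true 0/1 labels for m = 4 samples
y_hat = np.array([0.9, 0.2, 0.7, 0.4])
y = np.array([1, 0, 1, 0])

# Per-sample cross-entropy loss L(y_hat, y)
losses = -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)

# Averaging over all samples gives J(w, b)
J = np.mean(losses)
print(losses, J)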
Gradient descent algorithm
The purpose of gradient descent is to minimize the loss function; the gradient of a function points in the direction in which the function changes fastest.
As shown in the figure above, suppose $J(w,b)$ is a function of $w$ and $b$, with $w,b \in R$.
Compute the gradient at the initial point, choose a learning rate, and iteratively update the weight and bias parameters; eventually the function approaches its minimum.
The update formulas for the weight and bias are:
$w=w-\alpha\frac{\partial{J(w,b)}}{\partial{w}}$
$b=b-\alpha\frac{\partial{J(w,b)}}{\partial{b}}$
Note: $\alpha$ is the learning rate, i.e. the step size of each update of $w$ and $b$.
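A minimal sketch of this update rule on a toy one-dimensional function (not the logistic-regression loss; the function and starting point are made up purely for illustration):
# Minimize f(w) = (w - 3)^2 with plain gradient descent
def grad(w):
    return 2 * (w - 3)  # derivative of f with respect to w

w = 0.0        # initial parameter value
alpha = 0.1    # learning rate, i.e. the step size
for _ in range(100):
    w = w - alpha * grad(w)  # w = w - alpha * dJ/dw
print(w)  # approaches the minimizer w = 3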
Gradient calculation based on chain rule
The parameter update formulas were introduced with the gradient descent method; as can be seen, updating the parameters requires computing the gradients. This section describes the gradient computation in detail.
For simplicity, make the following assumptions:
- Input data: $x \in R^{3 \times 1}$ (1 sample, 3 feature values)
- Label: $y \in R$
- Weights: $w \in R^{3 \times 1}$
- Bias: $b \in R$
For ease of understanding, the data flow of logistic regression can be written as a flow chart: $x, w, b \rightarrow z = w^Tx+b \rightarrow \hat{y}=\sigma(z) \rightarrow J=L(\hat{y},y)$.
1. Compute the derivative of $J$ with respect to $z$
- $\frac{\partial{J}}{\partial{z}}=\frac{\partial{J}}{\partial{\hat{y}}}\frac{\partial{\hat{y}}}{\partial{z}}$
- $\frac{\partial{J}}{\partial{\hat{y}}}=\frac{-y}{\hat{y}}+\frac{1-y}{1-\hat{y}}$, $\quad \frac{\partial{\hat{y}}}{\partial{z}}=\hat{y}(1-\hat{y})$
- $\frac{\partial{J}}{\partial{z}}=\frac{\partial{J}}{\partial{\hat{y}}}\frac{\partial{\hat{y}}}{\partial{z}}=-y(1-\hat{y})+(1-y)\hat{y}=\hat{y}-y$
2. Compute the derivatives of $z$ with respect to $w$ and $b$
- $\frac{\partial{z}}{\partial{w_1}}=x_1 \quad \frac{\partial{z}}{\partial{w_2}}=x_2 \quad \frac{\partial{z}}{\partial{w_3}}=x_3$
- $\frac{\partial{z}}{\partial{b}}=1$
3. Compute the derivatives of $J$ with respect to $w$ and $b$
- $\frac{\partial{J}}{\partial{w_1}}=\frac{\partial{J}}{\partial{z}}\frac{\partial{z}}{\partial{w_1}}=(\hat{y}-y)x_1$
- $\frac{\partial{J}}{\partial{w_2}}=\frac{\partial{J}}{\partial{z}}\frac{\partial{z}}{\partial{w_2}}=(\hat{y}-y)x_2$
- $\frac{\partial{J}}{\partial{w_3}}=\frac{\partial{J}}{\partial{z}}\frac{\partial{z}}{\partial{w_3}}=(\hat{y}-y)x_3$
- $\frac{\partial{J}}{\partial{b}}=\frac{\partial{J}}{\partial{z}}\frac{\partial{z}}{\partial{b}}=\hat{y}-y$
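As a sanity check on these chain-rule results, the following NumPy sketch (with made-up values for x, w, b and y) compares the analytic gradient $(\hat{y}-y)x_1$ against a numerical finite-difference estimate for $w_1$:
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def loss(w, b, x, y):
    y_hat = sigmoid(np.dot(w, x) + b)
    return -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)

x = np.array([0.5, -1.0, 2.0])  # one sample with 3 features
w = np.array([0.1, 0.2, -0.3])
b = 0.05
y = 1.0

# Analytic gradient from the chain rule: dJ/dw = (y_hat - y) * x
y_hat = sigmoid(np.dot(w, x) + b)
analytic = (y_hat - y) * x

# Numerical estimate of the gradient for the first weight w_1
eps = 1e-6
w_plus = w.copy()
w_plus[0] += eps
numeric_w1 = (loss(w_plus, b, x, y) - loss(w, b, x, y)) / eps

print(analytic[0], numeric_w1)  # the two values should nearly match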
Vectorized gradient calculation
The previous section showed how to compute the gradient for a single data sample. In practice there is never just one sample, so the gradient must be computed from the loss over all data samples. This section implements the gradient computation for multiple samples with vectorization; compared with a for loop, vectorization saves a great deal of time and improves computational efficiency.
First, declare the shapes of the data:
- Input data: $X \in R^{n \times m}$ ($m$ samples, $n$ features)
- Labels: $Y \in R^{1 \times m}$
- Weights: $W \in R^{n \times 1}$
- Bias: $b \in R$
The gradient descent process is as follows :
- $Z=W^TX+b$, $\quad Z\in R^{1\times m}$
- $\hat{Y}=\sigma(Z)$, $\quad \hat{Y}\in R^{1\times m}$
- $\frac{\partial{J}}{\partial{Z}}=\hat{Y}-Y$, $\quad \frac{\partial{J}}{\partial{Z}}\in R^{1\times m}$
- $\frac{\partial{J}}{\partial{W}}=\frac{1}{m}X(\hat{Y}-Y)^T$, $\quad \frac{\partial{J}}{\partial{W}}\in R^{n\times 1}$
- $\frac{\partial{J}}{\partial{b}}=\frac{1}{m}\,\mathrm{np.sum}(\hat{Y}-Y)$, $\quad \frac{\partial{J}}{\partial{b}}\in R$
- $W = W-\alpha \frac{\partial{J}}{\partial{W}}$, $\quad b= b-\alpha \frac{\partial{J}}{\partial{b}}$
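The steps above map directly to NumPy. A minimal sketch of one vectorized gradient-descent step, using randomly generated stand-in data (not the dataset used in the code below), might look like this:
import numpy as np

n, m = 4, 500                          # n features, m samples
X = np.random.randn(n, m)              # input data, shape (n, m)
Y = np.random.randint(0, 2, (1, m))    # labels, shape (1, m)
W = np.zeros((n, 1))
b = 0.0
alpha = 0.01

Z = np.dot(W.T, X) + b                 # shape (1, m)
Y_hat = 1 / (1 + np.exp(-Z))           # shape (1, m)
dZ = Y_hat - Y                         # shape (1, m)
dW = np.dot(X, dZ.T) / m               # shape (n, 1)
db = np.sum(dZ) / m                    # scalar

W = W - alpha * dW                     # parameter updates
b = b - alpha * db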
Logistic regression code implementation
Obtain binary classification data
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
import numpy as np

# Generate 500 sample points with 2 classes and 4 features per sample
X, Y = make_classification(n_samples=500, n_features=4, n_classes=2)
# Use 30% of the data as the test set and 70% as the training set
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.3)
print("X:", X.shape)
print("Y:", Y.shape)
print("x_train:", x_train.shape)
print("x_test:", x_test.shape)
print("y_train:", y_train.shape)
print("y_test:", y_test.shape)
The output is :
X: (500, 4)
Y: (500,)
x_train: (350, 4)
x_test: (150, 4)
y_train: (350,)
y_test: (150,)
Define initialization module
def initialize_with_zeros(shape):
    """Create a weight vector w of shape (shape, 1) initialized to zeros and a bias b = 0. Returns: w, b"""
    w = np.zeros((shape, 1))
    b = 0
    return w, b
Define loss function and gradient
def basic_sigmoid(x):
    """Compute the sigmoid function"""
    s = 1 / (1 + np.exp(-x))
    return s

def propagate(w, b, X, Y):
    """
    Parameters: w, b, X, Y: network parameters and data
    Returns: the loss cost, the gradient dw of the parameter w, and the gradient db of the parameter b
    """
    m = X.shape[1]
    # w has shape (n, 1), X has shape (n, m)
    A = basic_sigmoid(np.dot(w.T, X) + b)
    # Compute the loss
    cost = -1 / m * np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A))
    dz = A - Y
    dw = 1 / m * np.dot(X, dz.T)
    db = 1 / m * np.sum(dz)
    cost = np.squeeze(cost)
    grads = {"dw": dw, "db": db}
    return grads, cost
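As a quick, illustrative sanity check (not part of the original article's pipeline), propagate can be called on small made-up arrays to verify the returned shapes:
w, b = initialize_with_zeros(4)            # 4 features
X_demo = np.random.randn(4, 10)            # 10 samples, shape (n, m)
Y_demo = np.random.randint(0, 2, (1, 10))  # 0/1 labels, shape (1, m)
grads, cost = propagate(w, b, X_demo, Y_demo)
print(grads["dw"].shape, grads["db"], cost)  # (4, 1), a scalar, a scalar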
Define gradient descent algorithm
def optimize(w, b, X, Y, num_iterations, learning_rate):
    """
    Parameters: w: weights, b: bias, X: features, Y: targets,
        num_iterations: total number of iterations, learning_rate: learning rate
    Returns:
        params: dictionary of updated parameters
        grads: gradients
        costs: list of recorded loss values
    """
    costs = []
    for i in range(num_iterations):
        # Compute the gradients and the loss
        grads, cost = propagate(w, b, X, Y)
        # Extract the gradients of the two parameters
        dw = grads['dw']
        db = grads['db']
        # Update according to the gradient descent formula
        w = w - learning_rate * dw
        b = b - learning_rate * db
        if i % 100 == 0:
            costs.append(cost)
            print("Loss results %i: %f" % (i, cost))
    params = {"w": w, "b": b}
    grads = {"dw": dw, "db": db}
    return params, grads, costs
Define prediction module
def predict(w, b, X):
    """Use the trained parameters to make predictions. Returns: the predicted labels"""
    m = X.shape[1]
    y_prediction = np.zeros((1, m))
    w = w.reshape(X.shape[0], 1)
    # Compute the predicted probabilities
    A = basic_sigmoid(np.dot(w.T, X) + b)
    for i in range(A.shape[1]):
        # Apply the 0.5 classification threshold
        if A[0, i] <= 0.5:
            y_prediction[0, i] = 0
        else:
            y_prediction[0, i] = 1
    return y_prediction
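As a quick illustration (again with made-up arrays, just to show the expected shapes), predict returns a (1, m) array of 0/1 labels:
w_demo = np.random.randn(4, 1)
b_demo = 0.0
X_demo = np.random.randn(4, 5)           # 5 samples
print(predict(w_demo, b_demo, X_demo))   # a (1, 5) array of 0s and 1s; values depend on the random data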
Define a logistic regression model
def model(x_train, y_train, x_test, y_test, num_iterations=2000, learning_rate=0.0001):
    """Train the logistic regression model and report the training and test accuracy."""
    # Reshape the data to (n_features, n_samples) / (1, n_samples)
    x_train = x_train.T
    x_test = x_test.T
    y_train = y_train.reshape(1, y_train.shape[0])
    y_test = y_test.reshape(1, y_test.shape[0])
    print(x_train.shape)
    print(x_test.shape)
    print(y_train.shape)
    print(y_test.shape)
    # 1. Initialize the parameters
    w, b = initialize_with_zeros(x_train.shape[0])
    # 2. Gradient descent
    # params: updated network parameters
    # grads: gradients from the last iteration
    # costs: loss recorded every 100 iterations
    params, grads, costs = optimize(w, b, x_train, y_train, num_iterations, learning_rate)
    # Retrieve the trained parameters and compute predictions
    w = params['w']
    b = params['b']
    y_prediction_train = predict(w, b, x_train)
    y_prediction_test = predict(w, b, x_test)
    # Print the accuracy
    print("Training set accuracy: {} ".format(100 - np.mean(np.abs(y_prediction_train - y_train)) * 100))
    print("Test set accuracy: {} ".format(100 - np.mean(np.abs(y_prediction_test - y_test)) * 100))
    return None
Run the model
model(x_train, y_train, x_test, y_test, num_iterations=3000, learning_rate=0.01)
With the number of iterations set to 3000 and the learning rate set to 0.01, running the model produces the following results:
(4, 350)
(4, 150)
(1, 350)
(1, 150)
Loss results 0: 0.693147
Loss results 100: 0.685711
Loss results 200: 0.681650
Loss results 300: 0.679411
Loss results 400: 0.678165
Loss results 500: 0.677465
Loss results 600: 0.677069
Loss results 700: 0.676843
Loss results 800: 0.676713
Loss results 900: 0.676639
Loss results 1000: 0.676595
Loss results 1100: 0.676570
Loss results 1200: 0.676556
Loss results 1300: 0.676547
Loss results 1400: 0.676542
Loss results 1500: 0.676539
Loss results 1600: 0.676538
Loss results 1700: 0.676537
Loss results 1800: 0.676536
Loss results 1900: 0.676536
Loss results 2000: 0.676535
Loss results 2100: 0.676535
Loss results 2200: 0.676535
Loss results 2300: 0.676535
Loss results 2400: 0.676535
Loss results 2500: 0.676535
Loss results 2600: 0.676535
Loss results 2700: 0.676535
Loss results 2800: 0.676535
Loss results 2900: 0.676535
Training set accuracy : 60.57142857142857
Test set accuracy : 56.0
The change of the loss function value over the iterations is shown in the figure below:
Summary
This article has introduced the principle of logistic regression in detail and implemented a logistic regression example in Python. From the running results we can see that, as the number of iterations increases, the value of the loss function does not keep dropping toward 0 but instead stabilizes around 0.6. Meanwhile, the prediction accuracy on the training set is 60.57% and on the test set 56%. So although logistic regression is simple, easy to understand, and highly interpretable as a model, its relatively simple form cannot fit the true distribution of the data well, and the accuracy is therefore often not very high.
Copyright notice
This article was written by [Stephen_ Tao]; please include a link to the original when reposting:
https://yzsam.com/2022/04/202204230548468772.html