Deep Learning Notes (II): Principles and Implementation of Activation Functions
2022-04-23 03:28:00 【Two liang crispy pork】
Small talk
Yesterday I worked through the derivation of cross entropy in detail and it went well. Let's keep the momentum going today.
ReLU
principle
Definition: ReLU is the rectified linear unit; it takes the maximum of 0 and x.
Why it appeared: the gradients of sigmoid and tanh tend to vanish. To train deep neural networks we need an activation function that looks and behaves like a linear function but is actually nonlinear, so that complex relationships in the data can be learned. The function must also stay sensitive to its input and avoid saturating. ReLU is a non-saturating activation function, so vanishing gradients are much less likely to occur.
def ReLU(input):
    if input > 0:
        return input
    else:
        return 0
The ReLU function is:
ReLU(x) = 0, when x <= 0
ReLU(x) = x, when x > 0
The derivative of ReLU is:
0, when x <= 0
1, when x > 0
advantage
Simple calculation
ReLU only needs a max() operation, while tanh and sigmoid require exponentials, so ReLU is very cheap to compute.
Sparse representation
ReLU outputs a true zero for negative inputs, which allows hidden layers to contain one or more genuinely zero activations. This is called a sparse representation, and it can speed up learning and simplify the model. It also helps alleviate overfitting: because ReLU can set the output of some units to zero (effectively "killing" those units), the complexity of the network is reduced.
Linear behavior
For positive inputs ReLU behaves like a linear function, and a network whose behavior is close to linear is easier to optimize. When x > 0 the gradient of ReLU is a constant 1, so the gradient neither shrinks nor grows as the network gets deeper, which avoids vanishing and exploding gradients and makes ReLU well suited to training deep networks. ReLU is now the default activation function in deep learning.
shortcoming
Not well suited to RNN-style networks
Use ReLU for MLPs and CNNs, but not for RNNs. ReLU can be used in most types of neural networks and is commonly the activation function of multilayer perceptrons and convolutional networks. Traditionally, LSTMs use tanh to activate the cell state and sigmoid for the gate outputs, and ReLU is usually not appropriate for RNN-type networks.
Code
import torch
import torch.nn as nn
relu = nn.ReLU(inplace=True)
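A minimal usage sketch (the input values are just made-up samples) showing that negative entries are zeroed while positive entries pass through:

import torch
import torch.nn as nn

relu = nn.ReLU(inplace=False)              # inplace=False keeps the input tensor untouched
x = torch.tensor([-2.0, -0.5, 0.0, 1.5, 3.0])
print(relu(x))                             # negative entries become 0, positive entries are unchanged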
Sigmoid
principle
Sigmoid is a common nonlinear activation function that maps any real number into the interval (0, 1), normalizing the data in a nonlinear way. Sigmoid is usually used in the output layer of regression and binary-classification models (classifying according to whether the output is greater than 0.5).
Function formula: sigmoid(x) = 1 / (1 + e^(-x))
import numpy as np

def Sigmoid(x):
    return 1. / (1 + np.exp(-x))
[Figure: plot of the sigmoid function]
Derivative: sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x))
[Figure: plot of the sigmoid derivative]
advantage
Easy to differentiate
The gradient is smooth and the derivative is easy to compute.
Optimization and stability
The output of sigmoid is mapped into (0, 1), it is monotonic and continuous, and the output range is bounded, which makes optimization stable; it can be used as an output layer.
shortcoming
Large amount of computation
Both forward and backward propagation involve exponentiation and division, so the computational cost is relatively high.
The gradient disappears
Vanishing gradients: for inputs with large magnitude (the two ends of the curve), the derivative of sigmoid is close to zero. During backpropagation this local gradient is multiplied into the gradient of the cost function with respect to the unit's output, so the result is also close to 0 and the parameters are barely updated.
This can also be seen from the derivative plot. If the network weights are initialized with random values in [0, 1], then, as the backpropagation derivation shows, the gradient shrinks by a factor of at most 0.25 (the maximum of the sigmoid derivative) at each layer it passes through. With many hidden layers the gradient becomes vanishingly small and close to 0, i.e. the gradient disappears.
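A minimal sketch, assuming only numpy, illustrating that the sigmoid derivative peaks at 0.25, which is where the per-layer shrinkage factor above comes from:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-10, 10, 2001)
dsig = sigmoid(x) * (1 - sigmoid(x))   # derivative of sigmoid
print(dsig.max())                      # 0.25, attained at x = 0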
Code
import torch
import torch.nn as nn
sigmoid=nn.Sigmoid()
Softmax
principle
Sigmoid can only separate two classes, so it does not apply to multi-class problems; Softmax solves this effectively. Softmax is usually used in the last layer of a neural network, so that the probability of each class lies in (0, 1). Softmax = multi-class classification = exactly one correct answer = mutually exclusive outputs (for example handwritten digits or iris species). When building a classifier for a problem with a single correct answer, use Softmax to process the raw output values. The denominator of Softmax aggregates all of the raw outputs, which means the probabilities it produces are related to each other: they sum to 1. See the code section for an example.
import numpy as np

def softmax(x):
    return np.exp(x) / sum(np.exp(x))
advantage
It has a wide range of applications
Compared with sigmoid, softmax can handle multiple classes, so it applies to a wider range of problems.
Widens the gaps between values
Softmax first widens the differences between the elements of the input vector (through the exponential) and then normalizes them into a probability distribution. Applied to classification, this makes the probability gap between classes more pronounced: the largest probability is pushed closer to 1, so the output distribution is closer to the target distribution. Widening the gap also makes the loss easier to drive down and tends to yield a stronger learning signal.
Code
An example:
Suppose a three-class dataset of chickens, ducks and geese. One batch contains three inputs, and after passing through the network the raw outputs are ([10, 8, 6], [7, 9, 5], [5, 4, 10]).
The ground-truth labels are [0, 1, 2], i.e. chicken, duck and goose.
For the first row, the network estimates the probability of chicken as 86.68%, duck as 11.73% and goose as 1.59%; the other two rows are read the same way.
import torch
import torch.nn as nn

# ground-truth labels
input_1 = torch.Tensor([0, 1, 2])
# raw outputs of the network (one row per sample)
input_2 = torch.Tensor([
    [10, 8, 6],
    [7, 9, 5],
    [5, 4, 10]
])
# apply softmax along dim=1, i.e. normalize each row into a probability distribution
softmax = nn.Softmax(dim=1)
output = softmax(input_2)
print(output)
# ------------#
# tensor([[0.8668, 0.1173, 0.0159],
#         [0.1173, 0.8668, 0.0159],
#         [0.0067, 0.0025, 0.9909]])
# ------------#
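To tie this back to the earlier point that the Softmax probabilities sum to 1, a small follow-up check (run after the snippet above):

# each row of the softmax output sums to 1
print(output.sum(dim=1))   # all three sums equal 1 (up to floating-point rounding)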
Tanh
principle
The formula: tanh(x) = sinh(x) / cosh(x) = (e^x - e^(-x)) / (e^x + e^(-x))
import numpy as np

def tanh(x):
    return np.sinh(x) / np.cosh(x)
[Figure: plot of the tanh function]
First derivative: tanh'(x) = 1 - tanh(x)^2
[Figure: plot of the tanh derivative]
advantage
Fast convergence
It converges faster than the sigmoid function.
Milder vanishing-gradient problem
The vanishing-gradient problem of tanh(x) is milder than sigmoid's. Near 0, sigmoid's derivative peaks at 0.25 and shrinks the gradient by that factor at every layer, whereas the derivative of tanh can reach 1, so in deep networks the vanishing-gradient problem is alleviated (see the sketch after the code below).
Zero-centered output
Compared with sigmoid, the output is zero-centered.
shortcoming
Higher computational cost
As the formula shows, tanh involves several exponential operations, so it is relatively expensive to compute.
Vanishing gradients still occur
As the derivative plot shows, the derivative of tanh closely resembles sigmoid's, so the biggest problem of sigmoid remains: gradients still vanish when the function saturates.
Code
import torch
import torch.nn as nn
tanh = nn.Tanh()
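The sketch referenced above: a quick autograd check of the derivative of each function at 0 (the well-known values 1 and 0.25); the code itself is only an illustration.

import torch

x = torch.tensor(0.0, requires_grad=True)
torch.tanh(x).backward()
print(x.grad)        # tanh'(0) = 1

x2 = torch.tensor(0.0, requires_grad=True)
torch.sigmoid(x2).backward()
print(x2.grad)       # sigmoid'(0) = 0.25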
LeakyReLU
principle
Leaky ReLU is an attempt to solve the "dying ReLU" problem. The main idea is to avoid dead units: when x is less than 0, the derivative is a small number a instead of 0.
a is usually 0.01.
def LeakyReLU(input):
    if input > 0:
        return input
    else:
        return 0.01 * input
advantage
Low computing cost
Fast to compute: no exponential operations are involved.
Can produce negative outputs
Unlike ReLU, it can produce negative outputs (as shown in the usage sketch after the code below).
Has the other benefits of ReLU.
shortcoming
a is a hyperparameter and has to be set manually.
Both halves of the function are linear.
Code
import torch
import torch.nn as nn
LR=nn.LeakyReLU(inplace=True)
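A minimal usage sketch (the input values are made-up samples) showing the small negative slope:

import torch
import torch.nn as nn

leaky = nn.LeakyReLU(negative_slope=0.01)
x = torch.tensor([-3.0, -1.0, 0.0, 2.0])
print(leaky(x))   # negative entries are scaled by 0.01; positive entries pass through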
ReLU6
principle
It is mainly aimed at low-precision float16 inference on mobile devices, where it still keeps good numerical resolution. If the output of ReLU is unbounded, its range is 0 to infinity, and low-precision float16 cannot represent such values accurately, which causes a loss of precision.
characteristic
Essentially identical to ReLU, except that the output value is capped at 6.
def ReLU6(input):
    if input > 0:
        return min(6, input)
    else:
        return 0
Code
"""pytorch neural network """
import torch.nn as nn
Re=nn.ReLU6(inplace=True)
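A minimal usage sketch (the input values are made-up samples) showing the cap at 6:

import torch
import torch.nn as nn

relu6 = nn.ReLU6()
x = torch.tensor([-2.0, 3.0, 6.0, 10.0])
print(relu6(x))   # negative entries become 0; values above 6 are clipped to 6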
ELU
principle
ELU also addresses the dying-ReLU problem. Compared with ReLU, ELU has negative values, which pushes the mean activation closer to zero. Mean activations close to zero speed up learning because they bring the gradient closer to the natural gradient.
The formula: ELU(x) = x for x > 0, and alpha * (e^x - 1) for x <= 0. [Figure: plot of the ELU function]
advantage
Solves the dying-ReLU problem
There is no dead-ReLU problem; the mean output is close to 0, i.e. zero-centered.
Speeds up learning
By reducing the bias-shift effect, ELU brings the normal gradient closer to the unit natural gradient, so learning towards a zero mean is faster.
Reduces forward-propagated variation
ELU saturates to a negative value for strongly negative inputs, which decreases the variation and information propagated forward (see the small check after the code below).
shortcoming
Higher computational cost
ELU is more expensive to compute. As with Leaky ReLU, although it is theoretically better than ReLU, in practice there is currently no solid evidence that ELU is always better than ReLU.
Code
import torch
import torch.nn as nn

elu_layer = nn.ELU()

# ELU implemented in numpy
import numpy as np
import matplotlib.pyplot as plt

def elu(x, a):
    y = x.copy()
    for i in range(y.shape[0]):
        if y[i] < 0:
            y[i] = a * (np.exp(y[i]) - 1)
    return y

if __name__ == '__main__':
    x = np.linspace(-50, 50)
    a = 0.5
    y = elu(x, a)
    print(y)
    plt.plot(x, y)
    plt.title('elu')
    plt.axhline(ls='--', color='r')
    plt.axvline(ls='--', color='r')
    # plt.xticks([-60, 60]), plt.yticks([-10, 50])
    plt.show()
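The small check mentioned above (a sketch with made-up inputs): for strongly negative inputs, nn.ELU saturates towards -alpha.

import torch
import torch.nn as nn

elu_check = nn.ELU(alpha=1.0)
x = torch.tensor([-10.0, -1.0, 0.0, 2.0])
print(elu_check(x))   # the first entry is close to -1.0 (i.e. -alpha); positive entries pass through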
SELU
principle
SELU comes from the paper Self-Normalizing Neural Networks, written by Sepp Hochreiter's group; ELU also came from the same group.
SELU is simply ELU multiplied by lambda, and the key point is that lambda is greater than 1. The paper gives the values of lambda and alpha:
- lambda = 1.0507
- alpha = 1.67326
[Figure: plot of the SELU function]
advantage
Self normalization
The SELU activation makes the neural network self-normalizing.
No vanishing or exploding gradients
Gradients cannot vanish or explode; Theorems 2 and 3 in the paper's appendix provide the proof.
shortcoming
Limited adoption
It is not yet widely used and needs more validation.
Requires specific initialization
lecun_normal and AlphaDropout: the weights must be initialized with lecun_normal, and if dropout is used it has to be the special AlphaDropout variant.
Code
import torch
import torch.nn as nn
SELU=nn.SELU()
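A minimal sketch of the self-normalizing behaviour; the layer width, depth and the manual LeCun-normal initialization below are illustrative assumptions, not taken from the paper's experiments. Pushing a standardized input through many SELU layers keeps the activations close to zero mean and unit variance:

import torch
import torch.nn as nn

torch.manual_seed(0)
selu = nn.SELU()

h = torch.randn(1024, 256)                 # standardized input: zero mean, unit variance
for _ in range(20):
    w = torch.empty(256, 256)
    # LeCun-normal initialization: std = 1 / sqrt(fan_in)
    nn.init.normal_(w, mean=0.0, std=(1.0 / 256) ** 0.5)
    h = selu(h @ w)

print(h.mean().item(), h.std().item())     # stays close to 0 and 1 respectively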
Parametric ReLU (PRELU)
principle
Formally it is similar to Leaky ReLU; the difference is that the parameter alpha of PReLU is learnable and is updated from gradients.
- alpha = 0: degenerates to ReLU
- alpha fixed (not updated): degenerates to Leaky ReLU
advantage
Same as ReLU.
shortcoming
Its performance varies from problem to problem.
Code
import torch
import torch.nn as nn
PReLU= nn.PReLU()
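A brief sketch (made-up inputs) showing that alpha is a real learnable parameter that receives gradients like any other weight:

import torch
import torch.nn as nn

prelu = nn.PReLU()                     # one shared alpha, initialized to 0.25
x = torch.tensor([-2.0, -1.0, 3.0])
prelu(x).sum().backward()
print(prelu.weight)                    # the learnable alpha parameter
print(prelu.weight.grad)               # gradient w.r.t. alpha (here the sum of the negative inputs)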
Gaussian Error Linear Unit(GELU)
principle
The Gaussian error linear unit activation function has recently been used in Transformer models (Google's BERT and OpenAI's GPT-2). The GELU paper dates back to 2016, but it has only recently attracted attention.
advantage
Strong empirical performance
It currently seems to be the best choice in NLP, and in particular performs best in Transformer models.
Avoids vanishing gradients
It can avoid the vanishing-gradient problem.
shortcoming
This activation function, proposed in 2016, still lacks extensive testing in practical applications.
Code
import torch
import torch.nn as nn
gelu_f = nn.GELU()
import torch
import torch.nn as nn
import numpy as np
from matplotlib import pyplot as plt

class GELU(nn.Module):
    def __init__(self):
        super(GELU, self).__init__()

    def forward(self, x):
        # tanh approximation of GELU
        return 0.5 * x * (1 + torch.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * torch.pow(x, 3))))

def gelu(x):
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * np.power(x, 3))))

x = np.linspace(-4, 4, 10000)
y = gelu(x)
plt.plot(x, y)
plt.show()
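A small follow-up sketch (run after the block above): PyTorch's built-in nn.GELU uses the exact erf-based form by default, so comparing it with the tanh approximation above should show only a tiny difference.

x_t = torch.linspace(-4, 4, 1000)
max_diff = (GELU()(x_t) - nn.GELU()(x_t)).abs().max()
print(max_diff)   # a small value, well below 1e-2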
Swish
principle
The Swish activation function comes from Google Brain's 2017 paper Searching for Activation Functions and is defined as Swish(x) = x * sigmoid(beta * x),
where beta is a constant or a trainable parameter.
Swish works better than ReLU on deep models. For example, simply replacing ReLU with Swish units improves top-1 classification accuracy on ImageNet by 0.9% for Mobile NASNet-A and by 0.6% for Inception-ResNet-v2.
characteristic
Swish is unbounded above, bounded below, smooth, and non-monotonic.
Code
import torch
import torch.nn as nn

class Swish(nn.Module):
    def __init__(self):
        super(Swish, self).__init__()

    def forward(self, x):
        # Swish with beta = 1: x * sigmoid(x)
        return x * torch.sigmoid(x)

Swi = Swish()
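As a side note, recent PyTorch versions expose this beta = 1 form directly as nn.SiLU; a quick check (run after the snippet above, with made-up sample values) that the custom module matches it:

x = torch.tensor([-2.0, 0.0, 1.0, 3.0])
print(Swi(x))
print(nn.SiLU()(x))   # same values as the line above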
Code for plotting a function and its derivative
import sympy as sy                 # symbolic math: derivatives, integrals, etc.
import matplotlib.pyplot as plt    # plotting
import numpy as np

x = sy.symbols('x')                # define the symbolic variable x

# sigmoid function
f = 1. / (1 + sy.exp(-x))
df = sy.diff(f, x)                 # first derivative
# ddf = sy.diff(df, x)
ddf = sy.diff(f, x, 2)             # second derivative; the argument 2 is the order of differentiation

# empty lists to collect the data
x_value = []                       # values of the independent variable x
f_value = []                       # values of the function f
df_value = []                      # values of the first derivative df

for i in np.arange(-10, 10, 0.01):
    x_value.append(i)                  # sample x
    f_value.append(f.subs('x', i))     # substitute i into the function
    df_value.append(df.subs('x', i))   # substitute i into the derivative

# these two lines are only needed if the plot labels contain Chinese characters
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False

plt.title('sigmoid function')
plt.xlabel('x')
plt.ylabel('f')
plt.plot(x_value, f_value)
plt.show()

plt.title('sigmoid first derivative')
plt.xlabel('x')
plt.ylabel('df')
plt.plot(x_value, df_value)
plt.show()
Copyright notice
This article was written by [Two liang crispy pork]. Please include a link to the original when reposting. Thanks.
https://yzsam.com/2022/04/202204230326525434.html