
On cliff-like growth of the loss function during training

2022-04-23 14:13:00 All the names I thought of were used

Reasons for cliff-like growth of the loss function during training

( One ) The loss surface is non-convex, so a learning rate that is too large can make an update overshoot the neighborhood of a good minimum and land in a much worse region, which shows up as a sudden jump in the loss. Choosing an optimizer that adapts the learning rate dynamically, such as Adam, helps; see the sketch after point ( Two ).

( Two ) A gradient explosion during training also makes the loss shoot up like a cliff.
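A minimal sketch of point ( One ), assuming PyTorch; the model, layer sizes, and learning rates below are placeholders, not taken from the original article. It swaps a fixed-step optimizer for Adam, whose per-parameter adaptive step sizes make a single over-large global learning rate less likely to throw the loss off a cliff.

```python
import torch
import torch.nn as nn

# Toy model and loss; sizes are placeholders.
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
criterion = nn.CrossEntropyLoss()

# optimizer = torch.optim.SGD(model.parameters(), lr=1e-1)  # a large fixed step can overshoot
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)   # per-parameter adaptive step sizes

def train_step(x, y):
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
    return loss.item()

# Example call with random data, just to show the expected shapes.
print(train_step(torch.randn(32, 128), torch.randint(0, 10, (32,))))
```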

Causes of exploding or vanishing gradients

The root cause: with an improper training setup, the gradients in the earlier layers vanish; the model then compensates by making large adjustments to the parameters of the later layers, their gradients grow too large, and the result is a gradient explosion.

Note: vanishing gradients appear in the first few layers, while exploding gradients appear in the later layers.
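A quick way to observe this pattern, assuming PyTorch (the toy tanh stack and random data below are illustrative, not from the original article): run one backward pass and print the gradient norm of each layer.

```python
import torch
import torch.nn as nn

# A deep-ish tanh stack so the effect is visible; sizes are placeholders.
model = nn.Sequential(
    nn.Linear(128, 64), nn.Tanh(),
    nn.Linear(64, 64), nn.Tanh(),
    nn.Linear(64, 64), nn.Tanh(),
    nn.Linear(64, 10),
)
x = torch.randn(32, 128)
y = torch.randint(0, 10, (32,))

nn.CrossEntropyLoss()(model(x), y).backward()

# Per-layer gradient norms: very small values in the first layers together with
# very large values in the last layers match the pattern described above.
for name, p in model.named_parameters():
    print(f"{name:12s} grad norm = {p.grad.norm().item():.3e}")
```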

Solutions

Note: gradient clipping (truncating the gradient norm) is also an important way to prevent gradient explosion; a sketch of it follows the list below.
1. Initialize the parameters from an appropriate distribution. Weights w that are too large easily lead to exploding or vanishing gradients: with a tanh activation, a large w makes the pre-activation z large, where the derivative of tanh tends to 0. (Items 1-4 are combined in a second sketch after this list.)

2. Use Batch Normalization (BN) so that the inputs and outputs of each layer keep roughly the same distribution; this delays the onset of vanishing gradients and also helps avoid exploding gradients (very practical).
3. By the chain rule, each layer's contribution to the backpropagated gradient scales with its weights w: smaller weights give smaller per-layer factors, and therefore smaller gradients for the earlier layers' weights. L1 or L2 regularization keeps the weights small and so mitigates gradient explosion.
4. Choose an appropriate activation function; ReLU is the most commonly used.
5. When the results are similar, a simpler network is less prone to gradient explosion and vanishing gradients.
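A minimal sketch of the gradient clipping note above, assuming PyTorch; the model, data, and the max_norm threshold of 1.0 are placeholder assumptions. The key step is calling torch.nn.utils.clip_grad_norm_ between backward() and step(), so the global gradient norm can never exceed the chosen threshold.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()
x, y = torch.randn(32, 128), torch.randint(0, 10, (32,))

optimizer.zero_grad()
loss = criterion(model(x), y)
loss.backward()
# Rescale the gradients so their global L2 norm is at most max_norm
# (the 1.0 threshold is an assumed value; tune it for your model).
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```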
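And a combined sketch of items 1-4, again assuming PyTorch with placeholder layer sizes and hyperparameters: Xavier initialization for item 1, BatchNorm for item 2, weight decay (an L2 penalty) for item 3, and ReLU for item 4.

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(128, 64),
    nn.BatchNorm1d(64),   # item 2: keep each layer's inputs in a stable distribution
    nn.ReLU(),            # item 4: ReLU as the default activation
    nn.Linear(64, 10),
)

# Item 1: initialize weights from an appropriate distribution (Xavier here).
def init_weights(m):
    if isinstance(m, nn.Linear):
        nn.init.xavier_uniform_(m.weight)
        nn.init.zeros_(m.bias)

model.apply(init_weights)

# Item 3: weight_decay is an L2 penalty that keeps the weights small.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
```

The weight_decay argument applies the L2 penalty inside the optimizer update, so no extra regularization term has to be added to the loss by hand.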

Copyright notice
This article was written by [All the names I thought of were used]; please include a link to the original when reposting. Thanks.
https://yzsam.com/2022/04/202204231404419479.html