
2.1 - Gradient Descent

2022-08-11 07:51:00 A big boa constrictor 6666

Returning to the previous chapter: we discussed how to find the best model, that is, how to find the set of parameters $\theta$ that makes the loss function as small as possible:

$\theta^{*} = \arg\min_{\theta} L(\theta)$

When $\theta$ has two parameters $\{\theta_{1}, \theta_{2}\}$, a starting point $\theta^{0} = \begin{bmatrix} \theta_{1}^{0} \\ \theta_{2}^{0} \end{bmatrix}$ is chosen at random, where the superscript 0 denotes the initial set of parameters and the subscripts 1 and 2 denote the first and second parameters in that set.

Next, compute the partial derivatives with respect to $\{\theta_{1}, \theta_{2}\}$ and update the parameters iteratively:

  • $\begin{bmatrix} \theta_{1}^{1} \\ \theta_{2}^{1} \end{bmatrix} = \begin{bmatrix} \theta_{1}^{0} \\ \theta_{2}^{0} \end{bmatrix} - \eta \begin{bmatrix} \frac{\partial L(\theta_{1}^{0})}{\partial \theta_{1}} \\ \frac{\partial L(\theta_{2}^{0})}{\partial \theta_{2}} \end{bmatrix}$

  • $\begin{bmatrix} \theta_{1}^{2} \\ \theta_{2}^{2} \end{bmatrix} = \begin{bmatrix} \theta_{1}^{1} \\ \theta_{2}^{1} \end{bmatrix} - \eta \begin{bmatrix} \frac{\partial L(\theta_{1}^{1})}{\partial \theta_{1}} \\ \frac{\partial L(\theta_{2}^{1})}{\partial \theta_{2}} \end{bmatrix}$

There is another way to write the partial derivatives of $\{\theta_{1}, \theta_{2}\}$: $\nabla L(\theta)$, called the gradient, is the vector of all partial derivatives:

$\nabla L(\theta) = \begin{bmatrix} \frac{\partial L(\theta_{1})}{\partial \theta_{1}} \\ \frac{\partial L(\theta_{2})}{\partial \theta_{2}} \end{bmatrix}$

  • $\theta^{1} = \theta^{0} - \eta \nabla L(\theta^{0})$
  • $\theta^{2} = \theta^{1} - \eta \nabla L(\theta^{1})$
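The two update steps above can be sketched in a few lines of Python; the quadratic loss and its gradient below are a hypothetical example, not from the text.

```python
import numpy as np

def gradient_descent(grad, theta0, eta=0.1, steps=100):
    """Iterate theta^{t+1} = theta^t - eta * grad(theta^t)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(steps):
        theta = theta - eta * grad(theta)
    return theta

# Hypothetical loss L(theta) = (theta_1 - 3)^2 + (theta_2 + 1)^2,
# whose gradient is [2(theta_1 - 3), 2(theta_2 + 1)].
grad = lambda th: np.array([2 * (th[0] - 3), 2 * (th[1] + 1)])
print(gradient_descent(grad, [0.0, 0.0]))  # converges toward [3, -1]
```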

The figure below visualizes gradient descent: the red arrows represent the direction of the gradient, while the blue arrows represent the direction of the parameter update; the two point in opposite directions.

(figure: visualization of gradient descent)

1. Tuning the Learning Rate

  • When there are 3 or more parameters, there is no way to visualize the gradient descent process itself. However, you can still visualize the curve of the loss value under different learning rates ($\eta$).
(figure: loss curves under different learning rates)

1.1 Adaptive Learning Rates

  • At the beginning, since we are far from the optimal solution, a larger learning rate can be used to increase the pace of gradient descent.
  • After a few rounds of training, we are closer to the optimal solution, so the learning rate is reduced to decrease the pace and avoid oscillating around the optimum.
  • The simplest strategy is to let the learning rate decay over time, e.g. $\eta^{t} = \frac{\eta}{\sqrt{t+1}}$.
  • However, not all parameters are well served by the same adjustment schedule.
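The decay schedule above is easy to sketch (plain Python, no assumptions beyond the formula):

```python
def decayed_lr(eta0, t):
    """Learning rate at update t: eta^t = eta0 / sqrt(t + 1)."""
    return eta0 / (t + 1) ** 0.5

# The rate shrinks as training proceeds:
print([round(decayed_lr(1.0, t), 3) for t in (0, 3, 99)])  # [1.0, 0.5, 0.1]
```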

1.2 Adagrad

  • Adagrad divides each parameter's learning rate by the root mean square of that parameter's previous derivative values.

Let's look at the difference between vanilla gradient descent and Adagrad:

  • Vanilla gradient descent:
    • $w^{t+1} \leftarrow w^{t} - \eta^{t} g^{t}$, where $g^{t} = \frac{\partial L(\theta^{t})}{\partial w}$
  • Adagrad: $\sigma^{t}$ denotes the root mean square of all previous derivative values of the parameter $w$
    • $w^{t+1} \leftarrow w^{t} - \frac{\eta^{t}}{\sigma^{t}} g^{t}$, where $\sigma^{t} = \sqrt{\frac{1}{t+1}\sum_{i=0}^{t}(g^{i})^{2}}$
  • Below is the detailed derivation of Adagrad and its final simplified form, $w^{t+1} \leftarrow w^{t} - \frac{\eta}{\sqrt{\sum_{i=0}^{t}(g^{i})^{2}}}\, g^{t}$:

(figures: Adagrad derivation and its simplified form)
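A minimal sketch of the simplified Adagrad update for a single parameter; the quadratic loss here is a hypothetical example.

```python
import math

def adagrad(grad, w0, eta=1.0, steps=200):
    """w^{t+1} = w^t - eta / sqrt(sum_{i<=t} (g^i)^2) * g^t."""
    w, sq_sum = w0, 0.0
    for _ in range(steps):
        g = grad(w)
        sq_sum += g ** 2          # accumulate squared past derivatives
        w -= eta / math.sqrt(sq_sum) * g
    return w

# Hypothetical loss L(w) = (w - 2)^2 with derivative 2(w - 2)
print(adagrad(lambda w: 2 * (w - 2), w0=0.0))  # approaches 2.0
```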

  • The final simplified form contains an apparent contradiction: when the gradient $g$ is larger, we expect a larger step, yet the denominator of the formula works against this.
(figure: the apparent contradiction in the Adagrad update)
  • There are two explanations for this:

    • Intuitive reason: the denominator term is there to emphasize the contrast of a gradient that is unusually large or unusually small relative to the earlier ones.
    (figure: intuitive explanation)
    • A more formal explanation:

      • For a single parameter: the best step size is proportional to the magnitude of the first derivative at the current point.
      (figure: best step size for a single parameter)
      • Comparison between different parameters: to truly reflect the distance between the current position and the lowest point, the best step should be not only proportional to the first derivative but also inversely proportional to the second derivative.
      (figure: comparison across different parameters)
  • The figure below explains how the denominator term estimates the second derivative: with enough sampled points, the square root of the sum of the squared first derivatives $g$ is approximately proportional to the second derivative.

(figure: estimating the second derivative from sampled first derivatives)
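The "best step" claim can be checked exactly on a quadratic (the coefficients below are hypothetical): a single step of size $|L'(w_0)| / L''(w_0)$ lands directly on the minimum.

```python
# L(w) = a*w^2 + b*w + c has its minimum at w = -b / (2a)
a, b = 1.5, -6.0               # hypothetical coefficients (a > 0)
w0 = 5.0                       # arbitrary starting point
first = 2 * a * w0 + b         # L'(w0)
second = 2 * a                 # L''(w0), what Adagrad's denominator estimates
w1 = w0 - first / second       # step proportional to |L'| / L''
print(w1, -b / (2 * a))        # both print 2.0
```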

2. Stochastic Gradient Descent

  • Stochastic gradient descent is much faster than vanilla gradient descent:
    • This is because stochastic gradient descent computes the loss of each individual example and updates immediately, so after seeing 20 examples it has already taken 20 steps, whereas vanilla gradient descent has taken only one.
(figures: vanilla gradient descent vs. stochastic gradient descent)
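A sketch of the per-example update on a toy linear-regression problem (the dataset of 20 examples is hypothetical): one pass over 20 examples makes 20 separate steps.

```python
import numpy as np

rng = np.random.default_rng(0)
xs = rng.uniform(-1.0, 1.0, 20)   # 20 hypothetical training inputs
ys = 3.0 * xs                     # targets generated by y = 3x

def sgd_epoch(w, eta=0.1):
    """One pass over the data: 20 updates, one per example."""
    for x, y in zip(xs, ys):
        g = -2 * x * (y - w * x)  # gradient of the single-example loss
        w -= eta * g
    return w

w = 0.0
for _ in range(50):
    w = sgd_epoch(w)
print(w)  # close to the true slope 3.0
```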

3. Feature Scaling

  • Scale two features $x_1$ and $x_2$ with different distributions to the same scale:
    • As the right figure shows, this makes gradient descent easier and more efficient, because after feature scaling, no matter where the starting point is, the direction of gradient descent always points toward the lowest point.

(figures: loss contours before and after feature scaling)

  • Z-score normalization
    • Z-score normalization is a common way to achieve feature scaling. It works by computing the mean $m_i$ and standard deviation $\sigma_i$ of each feature dimension (row), then subtracting the mean $m_i$ from each value $x_i^r$ of the feature matrix and dividing by the standard deviation $\sigma_i$; after the update, every feature dimension has mean 0 and standard deviation 1.
(figure: z-score normalization)
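Z-score normalization per feature dimension can be sketched with NumPy (the feature matrix is a hypothetical example):

```python
import numpy as np

# Hypothetical feature matrix: each row is one feature dimension,
# each column is one example.
X = np.array([[1.0, 2.0, 3.0, 4.0],
              [100.0, 200.0, 300.0, 400.0]])

m = X.mean(axis=1, keepdims=True)      # mean m_i of each dimension
sigma = X.std(axis=1, keepdims=True)   # standard deviation sigma_i
X_norm = (X - m) / sigma

print(X_norm.mean(axis=1))  # ~[0, 0]: each dimension now has mean 0
print(X_norm.std(axis=1))   # [1, 1]: and standard deviation 1
```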

4. Mathematical Derivation

  • Taylor series:
    • Suppose $h(x)$ is any function that is infinitely differentiable near $x = x_0$; then $h(x)$ can be written as $h(x) = \sum_{k=0}^{\infty} \frac{h^{(k)}(x_0)}{k!}(x - x_0)^{k}$.
(figure: Taylor series of a single-variable function)
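A quick numerical check of the first-order approximation $h(x) \approx h(x_0) + h'(x_0)(x - x_0)$, using $\sin$ as a hypothetical $h$:

```python
import math

x0 = 0.5
h = math.sin
dh = math.cos(x0)   # h'(x0) for h = sin

errs = []
for dx in (0.1, 0.01, 0.001):
    approx = h(x0) + dh * dx          # first-order Taylor approximation
    errs.append(abs(h(x0 + dx) - approx))
print(errs)  # the error shrinks roughly like dx^2
```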
  • Multivariable Taylor series:
    • According to the definition of the Taylor series, if the red circle in the right figure is small enough, the loss function inside the red circle can be simplified with a Taylor series.
    • After this simplification, the task becomes finding the $\theta_1, \theta_2$ inside the red circle that minimize the loss.
    • Carrying the derivation to the last step recovers the gradient descent formula from before. But the formula only holds when the radius $r$ of the red circle is small enough; since the learning rate $\eta$ is proportional to $r$, the learning rate cannot be too large. In theory $\eta$ must be infinitesimally small for the formula to hold exactly, but in practice it only needs to be small enough.
    • The derivation above uses only the first-order terms of the Taylor series. If the second-, third-, or higher-order terms were also kept, the red circle would not need to be so small, and in theory the learning rate could be set higher. But this is rarely done in deep learning, because the extra computation it brings is unbearable.

(figures: Taylor-series derivation of the gradient descent update)

5. Limitations of Gradient Descent

  • Gradient descent is really searching for a point where the derivative of the loss function is 0. However, a point with zero derivative is not necessarily a local optimum; it could also be a saddle point, as in the figure.
  • Moreover, in practice we do not look for points where the derivative is exactly 0, but for points where it falls below some threshold (e.g. $10^{-6}$). Such a point may in fact still sit in a relatively high place, far from the local optimum we want.
  • We will continue to discuss this issue in the next chapter.
(figure: plateau, saddle point, and local minimum)
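The saddle-point caveat is easy to illustrate with the hypothetical function $f(x, y) = x^2 - y^2$: the gradient vanishes at the origin, yet the origin is not a minimum.

```python
def f(x, y):
    return x ** 2 - y ** 2      # a classic saddle-shaped surface

def grad(x, y):
    return (2 * x, -2 * y)      # gradient of f

print(grad(0.0, 0.0))             # both components are zero at the origin
print(f(0.0, 0.1) < f(0.0, 0.0))  # True: a neighbor has strictly lower value
```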

Copyright notice
This article was created by [A big boa constrictor 6666]. Please include the original link when reposting. Thanks.
https://yzsam.com/2022/223/202208110650015245.html