2.1 - Gradient Descent
Recall the previous chapter: we discussed how to find the best model, that is, how to find a set of parameters θ that makes the loss function as small as possible:
$\theta^{*}=\arg\underset{\theta}{\min}\, L(\theta)$
When θ has two parameters $\{\theta_{1},\theta_{2}\}$, a starting point $\theta^{0}=\begin{bmatrix} \theta_{1}^{0}\\ \theta_{2}^{0} \end{bmatrix}$ is chosen at random. The superscript 0 denotes the initial set of parameters, and the subscripts 1 and 2 denote the first and second parameters in that set.
Next, compute the partial derivatives with respect to $\theta_{1}$ and $\theta_{2}$ and update:
$\begin{bmatrix} \theta_{1}^{1}\\ \theta_{2}^{1} \end{bmatrix}=\begin{bmatrix} \theta_{1}^{0}\\ \theta_{2}^{0} \end{bmatrix}-\eta \begin{bmatrix} \frac{\partial L(\theta_{1}^{0})}{\partial \theta_{1}}\\ \frac{\partial L(\theta_{2}^{0})}{\partial \theta_{2}} \end{bmatrix}$

$\begin{bmatrix} \theta_{1}^{2}\\ \theta_{2}^{2} \end{bmatrix}=\begin{bmatrix} \theta_{1}^{1}\\ \theta_{2}^{1} \end{bmatrix}-\eta \begin{bmatrix} \frac{\partial L(\theta_{1}^{1})}{\partial \theta_{1}}\\ \frac{\partial L(\theta_{2}^{1})}{\partial \theta_{2}} \end{bmatrix}$
There is a more compact notation for the partial derivatives with respect to $\{\theta_{1},\theta_{2}\}$: $\nabla L(\theta)$, called the gradient, which is a vector:
$\nabla L(\theta)=\begin{bmatrix} \frac{\partial L(\theta_{1})}{\partial \theta_{1}}\\ \frac{\partial L(\theta_{2})}{\partial \theta_{2}} \end{bmatrix}$
- $\theta^{1}=\theta^{0}-\eta \nabla L(\theta^{0})$
- $\theta^{2}=\theta^{1}-\eta \nabla L(\theta^{1})$
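To make the update rule concrete, here is a minimal runnable sketch of gradient descent on two parameters; the quadratic loss, learning rate, and starting point are made-up values chosen only for illustration.

```python
import numpy as np

# Made-up two-parameter loss L(theta) = (theta1 - 3)^2 + (theta2 + 1)^2,
# used only to illustrate the update theta <- theta - eta * grad L(theta).
def loss(theta):
    return (theta[0] - 3.0) ** 2 + (theta[1] + 1.0) ** 2

def grad(theta):
    # Partial derivatives of the loss with respect to theta1 and theta2.
    return np.array([2.0 * (theta[0] - 3.0), 2.0 * (theta[1] + 1.0)])

eta = 0.1                      # learning rate (assumed value)
theta = np.array([0.0, 0.0])   # starting point theta^0 (assumed value)

for t in range(100):
    theta = theta - eta * grad(theta)   # theta^{t+1} = theta^t - eta * grad L(theta^t)

print(theta, loss(theta))      # theta approaches (3, -1), loss approaches 0
```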
The figure below visualizes gradient descent: the red arrows indicate the direction of the gradient, the blue arrows indicate the direction of the parameter update, and the two point in opposite directions.
1. Tuning the Learning Rate
- When you have three or more parameters, there is no way to visualize the gradient descent process itself. However, you can still visualize how the loss value evolves under different learning rates $\eta$.
1.1 Adaptive Learning Rates
- At the start, since we are far from the optimal solution, a larger learning rate can be used to take bigger gradient descent steps.
- After a few epochs, we are closer to the optimal solution, so we reduce the learning rate and take smaller steps to avoid oscillating around the optimum.
- The simplest strategy is to let the learning rate decay over time, e.g. $\eta^{t}=\frac{\eta}{\sqrt{t+1}}$ (see the sketch after this list).
- However, a single schedule like this is not suitable for every parameter.
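A minimal sketch of the $\eta/\sqrt{t+1}$ decay above; the base learning rate is an assumed value.

```python
import numpy as np

eta_0 = 0.5   # base learning rate (assumed value)

def decayed_lr(t):
    # eta^t = eta / sqrt(t + 1): larger steps early on, smaller steps later
    return eta_0 / np.sqrt(t + 1)

for t in range(5):
    print(t, decayed_lr(t))   # 0.5, 0.354, 0.289, 0.25, 0.224
```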
1.2 Adagrad
- Adagrad divides each parameter's learning rate by the root mean square of that parameter's previous derivative values.
Let's look at the difference between vanilla gradient descent and Adagrad:
- Vanilla gradient descent: $w^{t+1} \leftarrow w^{t}-\eta^{t} g^{t}$, where $g^{t}=\frac{\partial L(w^{t})}{\partial w}$ and $\eta^{t}=\frac{\eta}{\sqrt{t+1}}$.
- Adagrad: $w^{t+1} \leftarrow w^{t}-\frac{\eta^{t}}{\sigma^{t}} g^{t}$, where $\sigma^{t}$ is the root mean square of all previous derivative values of the parameter $w$ (a runnable sketch follows below).
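Below is a minimal sketch of the Adagrad update for a single parameter, following the formulas above; the loss function, base learning rate, and starting value are assumptions made only for illustration.

```python
import numpy as np

# Made-up one-parameter loss L(w) = (w - 2)^2, so dL/dw = 2(w - 2).
def grad(w):
    return 2.0 * (w - 2.0)

eta = 0.5          # base learning rate (assumed value)
w = 0.0            # starting point (assumed value)
sum_sq_grad = 0.0  # running sum of squared derivatives

for t in range(100):
    g = grad(w)
    sum_sq_grad += g ** 2
    eta_t = eta / np.sqrt(t + 1)              # decayed learning rate eta^t
    sigma_t = np.sqrt(sum_sq_grad / (t + 1))  # RMS of all past derivatives
    w = w - (eta_t / sigma_t) * g             # Adagrad update
    # Note: eta_t / sigma_t == eta / sqrt(sum_sq_grad), the usual simplified form.

print(w)   # w gradually approaches 2
```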
- Below is the detailed derivation of Adagrad and its final simplified form:
- There is a seeming contradiction in the final simplified form: when the gradient $g$ is larger, we expect a larger step, yet the denominator of the formula works against this.
There are two explanations for this:
- Intuitive reason: the denominator term is added to emphasize how a particular gradient contrasts with the others (whether it is especially large or especially small).
A more formal explanation:
- For a single parameter, a step proportional to the magnitude of the first derivative is likely to be the best step size.
- When comparing across different parameters, to truly reflect the distance from the current position to the lowest point, the best step is not only proportional to the first derivative $g$ but also inversely proportional to the second derivative.
The figure below explains how the denominator term estimates the second derivative: when enough points are sampled, summing the squared gradients $g$ and then taking the square root is approximately proportional to the second derivative (the single-parameter calculation behind this is sketched below).
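To make the "best step" argument concrete, here is the standard single-parameter calculation, written out as a sketch with a generic quadratic rather than copied from the figures: for a quadratic loss $L(w)=aw^{2}+bw+c$ with $a>0$, the minimum lies at $w^{*}=-\frac{b}{2a}$. Starting from a point $w_{0}$, the distance to the minimum is $\left|w_{0}+\frac{b}{2a}\right|=\frac{\left|2aw_{0}+b\right|}{2a}=\frac{\left|L'(w_{0})\right|}{L''(w_{0})}$, i.e. the best step is proportional to the first derivative and inversely proportional to the second derivative. Adagrad's denominator $\sigma^{t}$ serves as a cheap, sampling-based estimate of this second-derivative term.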
2. Stochastic Gradient Descent
- Stochastic gradient descent is much faster than ordinary gradient descent:
- This is because stochastic gradient descent computes the loss of a single example and updates the parameters immediately; after seeing 20 examples it has already taken 20 steps, whereas ordinary gradient descent takes only one step per pass over the same examples (a sketch follows below).
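A minimal sketch contrasting the two update schemes on a made-up one-parameter linear regression problem; the data, model, and learning rate are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up data: y = 2x + noise, model y_hat = w * x, squared-error loss.
x = rng.uniform(-1, 1, size=20)
y = 2.0 * x + 0.1 * rng.normal(size=20)
eta = 0.1

# Ordinary (batch) gradient descent: one update after looking at all 20 examples.
w = 0.0
grad_all = np.mean(2.0 * (w * x - y) * x)   # gradient of the average loss
w = w - eta * grad_all                      # a single step

# Stochastic gradient descent: one update per example, so 20 steps for the same pass.
w_sgd = 0.0
for xi, yi in zip(x, y):
    grad_i = 2.0 * (w_sgd * xi - yi) * xi   # gradient of this example's loss only
    w_sgd = w_sgd - eta * grad_i

print(w, w_sgd)   # after one pass, w_sgd typically moves much further toward 2
```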
3. Feature Scaling
- Scale the two differently distributed features $x_1$ and $x_2$ to the same range:
- As shown in the figure on the right, the point of this is to make gradient descent easier and more efficient: after feature scaling, no matter where the starting point is, the direction of gradient descent always points toward the lowest point.
- Z-Score Normalization
- Z-score normalization is a common way to perform feature scaling. For each dimension (row) of the feature matrix, compute the mean $m_i$ and the standard deviation $\sigma_i$, then replace each value $x_i^r$ with $(x_i^r-m_i)/\sigma_i$; after this update every dimension has mean 0 and variance 1 (a sketch follows below).
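A minimal sketch of z-score normalization; following the description above, the matrix is laid out with feature dimensions as rows and examples as columns, and the sample numbers are made up.

```python
import numpy as np

# Made-up feature matrix: each row is one feature dimension,
# each column is one example x^r (3 dimensions, 5 examples).
X = np.array([
    [100.0, 200.0, 150.0, 120.0, 180.0],
    [  1.0,   2.0,   1.5,   1.2,   1.8],
    [ 10.0,  40.0,  25.0,  15.0,  35.0],
])

m = X.mean(axis=1, keepdims=True)      # mean m_i of each dimension
sigma = X.std(axis=1, keepdims=True)   # standard deviation sigma_i of each dimension

X_scaled = (X - m) / sigma             # x_i^r <- (x_i^r - m_i) / sigma_i

print(X_scaled.mean(axis=1))           # ~0 for every dimension
print(X_scaled.std(axis=1))            # ~1 for every dimension
```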
4. Mathematical Derivation
- Taylor series:
- Let $h(x)$ be any function that is differentiable around the point $x=x_{0}$; then $h(x)$ can be written as follows:
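For reference, the standard expansion referred to above is $h(x)=\sum_{k=0}^{\infty}\frac{h^{(k)}(x_{0})}{k!}(x-x_{0})^{k}=h(x_{0})+h'(x_{0})(x-x_{0})+\frac{h''(x_{0})}{2!}(x-x_{0})^{2}+\cdots$; when $x$ is close to $x_{0}$, keeping only the first-order term gives $h(x)\approx h(x_{0})+h'(x_{0})(x-x_{0})$.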
- Multivariable Taylor series:
- According to the Taylor series, if the red circle in the figure on the right is small enough, the loss function inside it can be simplified using a first-order Taylor expansion.
- After the simplification, the problem becomes finding the $\theta_{1},\theta_{2}$ within the red circle that minimize the loss.
- The last step of the derivation recovers the gradient descent formula we used before (written out below). The formula only holds if the radius $r$ of the red circle is small enough; since the learning rate $\eta$ is proportional to $r$, the learning rate cannot be too large. In theory $r$ must be infinitesimally small for the approximation to hold exactly, but in practice it only needs to be small enough.
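Written out as formulas, the derivation described above runs roughly as follows (a summary of the standard argument, not a copy of the figures): around a point $(a,b)$, the first-order expansion is $L(\theta)\approx L(a,b)+u(\theta_{1}-a)+v(\theta_{2}-b)$ with $u=\frac{\partial L(a,b)}{\partial \theta_{1}}$ and $v=\frac{\partial L(a,b)}{\partial \theta_{2}}$. Minimizing this over the circle $(\theta_{1}-a)^{2}+(\theta_{2}-b)^{2}\le r^{2}$ means choosing the vector $(\theta_{1}-a,\theta_{2}-b)$ to point opposite to $(u,v)$ with length $r$, which gives $\begin{bmatrix} \theta_{1}\\ \theta_{2} \end{bmatrix}=\begin{bmatrix} a\\ b \end{bmatrix}-\eta \begin{bmatrix} u\\ v \end{bmatrix}$, i.e. exactly the gradient descent update, with $\eta$ proportional to $r$.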
- The derivation above uses only the first-order Taylor expansion. If second-order, third-order, or even higher-order terms are kept, the requirement on the size of the red circle becomes less strict, and in theory the learning rate could be set higher. However, this is rarely done in deep learning, because the extra computation it requires is prohibitive.
5. Limitations of Gradient Descent
- Gradient descent is really searching for a point where the derivative of the loss function is 0. However, a point with zero derivative is not necessarily a local minimum; it could also be a saddle point, as shown in the figure.
- Moreover, in practice we do not look for a point where the derivative is exactly 0, but stop once the derivative is smaller than some threshold (e.g. $10^{-6}$). Such a point may still sit on a relatively high plateau, far from the local minimum we are looking for (see the sketch below).
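A minimal sketch of that stopping criterion, reusing a made-up quadratic loss; the threshold, loss, and learning rate are assumptions for illustration.

```python
import numpy as np

# Made-up quadratic loss, used only to show the gradient-norm stopping rule.
def grad(theta):
    return np.array([2.0 * (theta[0] - 3.0), 2.0 * (theta[1] + 1.0)])

eta, eps = 0.1, 1e-6
theta = np.array([0.0, 0.0])

for t in range(10_000):
    g = grad(theta)
    if np.linalg.norm(g) < eps:   # stop when the gradient is "small enough", not exactly 0
        break
    theta = theta - eta * g

print(t, theta)   # step count at which we stopped, and the final parameters
```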
- We will continue to discuss this issue in the next chapter