Deep Learning [Chapter 2]
2022-08-11 01:41:00 【sweetheart7-7】
General Guide to Machine Learning Tasks

Note: when the loss on the training data is large and increasing model complexity does not reduce it, the problem is most likely optimization rather than model capacity.
Common ways to deal with overfitting:
- Reduce model complexity: choose a simpler, smoother model
- Collect more training data
- Reduce the number of parameters, or share parameters
- Use fewer features
- Early stopping
- Regularization
- Dropout
To pick the model most likely to perform well on unseen testing data, hold out a validation set and use it to compare models; N-fold cross-validation is a common way to split the dataset for this.
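As a sketch of how N-fold cross-validation splits the data (a minimal NumPy version not tied to any framework; the function name `k_fold_split` is just illustrative):

```python
import numpy as np

def k_fold_split(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) index pairs for k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)          # shuffle once, then cut into k folds
    folds = np.array_split(idx, k)
    for i in range(k):
        val_idx = folds[i]                    # fold i is the held-out validation set
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

# Example: 3-fold split of 9 samples.
splits = list(k_fold_split(9, 3))
```

Each candidate model is trained on k−1 folds and validated on the remaining one; the average validation loss over the k runs is then used to pick the model.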
What to Do When a Neural Network Cannot Be Trained
Optimization fails because…
Local minima and saddle points
At both kinds of point, the gradient is 0.

How do we judge the shape of the loss function around $\theta = \theta'$? It can be described by a Taylor series expansion.

At a critical point, the gradient is 0.
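Concretely, the second-order Taylor expansion of the loss around $\theta'$ is:

$$L(\theta) \approx L(\theta') + (\theta - \theta')^{T} g + \frac{1}{2}(\theta - \theta')^{T} H (\theta - \theta')$$

where $g$ is the gradient and $H$ the Hessian at $\theta'$. At a critical point $g = 0$, so the quadratic $H$ term determines the local shape.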

If $L(\theta) > L(\theta')$ for every other $\theta$ near $\theta'$, then $\theta'$ is a local minimum. But we cannot plug in every possible $v = \theta - \theta'$, so the condition is rephrased:
a Hessian matrix $H$ satisfying $v^T H v > 0$ for all $v$ is called positive definite.
A property of positive definite matrices: all eigenvalues are positive.

Example:

When the critical point is a saddle point, the Hessian can tell us which direction to update in.
Find an eigenvector whose eigenvalue is negative and move along that eigenvector's direction; going that way will reduce the loss.
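A toy NumPy sketch of this check: the loss $L(w) = w_1^2 - w_2^2$ and all names here are illustrative, not from the lecture.

```python
import numpy as np

def loss(w):
    return w[0]**2 - w[1]**2       # the origin is a critical point (gradient = 0)

# Hessian of the toy loss at the origin
H = np.array([[2.0, 0.0],
              [0.0, -2.0]])

eigvals, eigvecs = np.linalg.eigh(H)
if np.all(eigvals > 0):
    kind = "local minimum"          # H positive definite
elif np.all(eigvals < 0):
    kind = "local maximum"
else:
    kind = "saddle point"           # eigenvalues of mixed sign

# Escape the saddle point: step along an eigenvector with a negative eigenvalue.
u = eigvecs[:, np.argmin(eigvals)]
theta = np.zeros(2)                 # stuck at the critical point
theta_new = theta + 0.1 * u         # moving along u lowers the loss
```

In practice the full Hessian is rarely computed for large networks, but the toy example shows why a negative eigenvalue always offers a descent direction.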




Each point represents one trained network.
The vertical axis is the loss when training stops.
The horizontal axis is the fraction of the Hessian's eigenvalues that are positive when training stops (the "minimum ratio").
In practice this ratio never gets close to 1, so in high-dimensional spaces most critical points are saddle points rather than local minima.
Batch and Momentum
Batch

Why use batches: each batch yields one parameter update, so the parameters are updated many times per epoch.


With parallel computation (e.g. on a GPU), a large batch size may actually finish one epoch faster.

But the noise that comes with a small batch size may give better optimization results.

A possible explanation of why small batches are better for training:
each batch corresponds to a slightly different loss function, so their gradients differ; when one batch's gradient vanishes, another batch's gradient can still move the parameters.

A small batch size is also better for testing:

Local minima also differ in quality: a local minimum on a flat "plain" is better, one in a sharp "canyon" is worse, and a large batch size tends to fall into the canyon-type minima.
Because a small batch size updates in a noisier, more random direction, it escapes sharp minima more easily.

Momentum

Plain gradient descent only moves in the direction opposite to the gradient at each update.

With momentum, each update moves along the sum of the current negative gradient and the momentum term (the direction of the previous step).

The momentum term is effectively a weighted sum of all previous movement directions.
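A minimal sketch of gradient descent with momentum on a toy quadratic (the hyperparameter values are arbitrary assumptions, not from the lecture):

```python
import numpy as np

def gd_momentum(grad, theta, lr=0.1, lam=0.9, steps=100):
    """Gradient descent with a momentum term m."""
    m = np.zeros_like(theta)
    for _ in range(steps):
        g = grad(theta)
        m = lam * m - lr * g   # m: decaying weighted sum of all past (negative) gradients
        theta = theta + m      # move by negative gradient plus previous movement
    return theta

# Toy loss L(w) = w^2 with gradient 2w; the minimum is at 0.
theta = gd_momentum(lambda w: 2.0 * w, np.array([5.0]))
```

Unrolling the loop shows that m at step t is lam^t-weighted over all earlier gradients, which is exactly the "sum of all previous directions" described above.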


Adapting the Learning Rate Automatically
When the loss stops decreasing, training is not necessarily stuck at a critical point — in practice it is hard to even get close to one.


When the learning rate is fixed, the two problems shown in the figure above can appear: oscillation, or progress that is normal at first and then extremely slow.
We therefore modify the gradient descent update so that the learning rate is small where the surface is steep and large where it is flat.
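The modified update gives each parameter its own denominator (writing $g_i^t$ for the gradient of parameter $\theta_i$ at step $t$):

$$\theta_i^{t+1} = \theta_i^t - \frac{\eta}{\sigma_i^t}\, g_i^t$$

where $\sigma_i^t$ summarizes that parameter's past gradients; Adagrad and RMSProp differ only in how $\sigma_i^t$ is computed.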


Adagrad
Intuitively: if the past gradients are large, σ is large, so the effective learning rate η/σ becomes small (and vice versa).
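In Adagrad, $\sigma$ is the root mean square of all past gradients:

$$\sigma_i^t = \sqrt{\frac{1}{t+1} \sum_{k=0}^{t} \left(g_i^k\right)^2}$$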

RMSProp
RMSProp introduces a weight α that controls how much the newly computed gradient contributes to σ relative to the past.
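One common form, with $0 < \alpha < 1$ trading off the old σ against the new gradient:

$$\sigma_i^t = \sqrt{\alpha \left(\sigma_i^{t-1}\right)^2 + (1-\alpha)\left(g_i^t\right)^2}$$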


Adam: RMSProp + Momentum

Learning rate scheduling — decay: make η a function of time, so that the learning rate η shrinks as training goes on.

Warm up (a "black-magic" trick): the learning rate first increases and then decreases over training.

Momentum adds inertia from the history of movement, while the RMS term σ moderates the step size and makes updates smoother.
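A minimal NumPy sketch of an Adam-style step combining both ideas (hyperparameters are the usual defaults; this is an illustration, not a reference implementation):

```python
import numpy as np

def adam_step(theta, g, m, v, t, lr=0.01, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update: momentum (m) plus RMSProp-style scaling (v)."""
    m = b1 * m + (1 - b1) * g        # momentum: moving average of gradients
    v = b2 * v + (1 - b2) * g**2     # RMS: moving average of squared gradients
    m_hat = m / (1 - b1**t)          # bias correction for the zero initialization
    v_hat = v / (1 - b2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Minimize the toy loss L(w) = w^2 starting from w = 3.
theta, m, v = np.array([3.0]), np.zeros(1), np.zeros(1)
for t in range(1, 5001):
    theta, m, v = adam_step(theta, 2.0 * theta, m, v, t)
```

Note how the numerator plays the momentum role and the denominator plays the σ role from the update rule above.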

Loss Functions




When there are only two classes, sigmoid is commonly used (in that case sigmoid and softmax are equivalent); with three or more classes, use softmax.
Minimizing cross-entropy is equivalent to maximizing likelihood.
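A quick NumPy check of the two-class equivalence mentioned above (the logit values are arbitrary examples):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())            # subtract max for numerical stability
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# With two classes, softmax over logits (z1, z2) gives the same class-1
# probability as sigmoid applied to the logit difference z1 - z2.
z = np.array([2.0, -1.0])
p_softmax = softmax(z)[0]
p_sigmoid = sigmoid(z[0] - z[1])

# Cross-entropy loss when the true class is the first one
ce = -np.log(p_softmax)
```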

Using mean square error for classification may get stuck near a critical point, because its gradient can be very small even when the loss is still large; cross-entropy avoids this.
