1.1-Regression
1. Model
A collection of candidate functions:
- where $w_i$ is the weight and $b$ is the bias
- $x_i$ can be any of the input attributes, e.g. $x_{cp}$, $x_{hp}$, $x_w$, $x_h$, …
$$y = b + \sum w_i x_i$$
We take $x_{cp}$ as the single unknown input and look for an optimal linear model:
$$y = b + w \cdot x_{cp}$$
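As a minimal sketch (the function name is illustrative, not from the original post), this model is just:

```python
def linear_model(x_cp, w, b):
    """Predict y = b + w * x_cp for a given CP value."""
    return b + w * x_cp
```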
2. Better Function
Loss function $L$:
The input of $L$ is a function $f$; the output is a scalar value that evaluates how bad the input function $f$ is.
$\widehat{y}^n$ denotes the true value and $f(x^n_{cp})$ the predicted value; $L(f)$ represents the total error between the true values and the predicted values:
$$L(f)=\sum_{n=1}^{10}\left(\widehat{y}^n-f(x^n_{cp})\right)^2$$
Replacing the function $f$ with $w$ and $b$, this can be written as:
$$L(w,b)=\sum_{n=1}^{10}\left(\widehat{y}^n-(b+w \cdot x^n_{cp})\right)^2$$
The smaller $L$ is, the better the function $f$, and hence the better the model. Each point in the figure below represents one function $f$.
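A minimal sketch of this loss in Python (the NumPy arrays below are illustrative placeholders, not the lecture's actual Pokémon data):

```python
import numpy as np

# Illustrative stand-ins for the 10 training examples (x_cp, y_hat).
x_cp = np.array([10., 20., 25., 30., 40., 50., 60., 70., 80., 90.])
y_hat = np.array([15., 25., 28., 35., 48., 55., 68., 74., 85., 98.])

def loss(w, b):
    """L(w, b) = sum over n of (y_hat^n - (b + w * x_cp^n))^2."""
    return np.sum((y_hat - (b + w * x_cp)) ** 2)

print(loss(1.0, 0.0))  # evaluate how bad one candidate function is
```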
3. Best Function
Gradient Descent: the procedure for finding the best function.
$f^{*}$ denotes the best function, and $w^{*}, b^{*}$ denote the best weight and bias:
$$f^{*} =\arg \underset{f}{\min}\, L(f)$$
$$w^{*},b^{*}=\arg \underset{w,b}{\min}\, L(w,b)=\arg \underset{w,b}{\min}\sum_{n=1}^{10}\left(\widehat{y}^n-(b+w \cdot x^n_{cp})\right)^2$$
3.1 One-Dimensional Case
The figure below illustrates gradient descent on the loss function: first randomly pick an initial $w^{0}$, then compute the derivative of $L$ at that point; if it is negative, increase $w^{0}$, and if it is positive, decrease $w^{0}$ (a code sketch of this loop follows the list below).
- $w^{*}=\arg\underset{w}{\min}\,L(w)$
- The step taken at each move is $-\eta\frac{dL}{dw}\big|_{w=w^{0}}$, where $\eta$ is the learning rate, i.e. the step size of each move
- $w^{1}\leftarrow w^{0}-\eta\frac{dL}{dw}\big|_{w=w^{0}}$, where $w^{1}$ is the next point reached from the initial point $w^{0}$; iterating this way eventually finds a local optimum
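A minimal sketch of this one-dimensional update loop, assuming the derivative $\frac{dL}{dw}$ is available as a Python function (all names and hyperparameters here are illustrative):

```python
def gradient_descent_1d(dL_dw, w0, eta=0.1, iterations=1000):
    """Repeat w <- w - eta * dL/dw(w), starting from the initial point w0."""
    w = w0
    for _ in range(iterations):
        w = w - eta * dL_dw(w)
    return w

# Toy example: L(w) = (w - 3)^2 has derivative 2 * (w - 3),
# so gradient descent should converge to the minimum at w = 3.
w_star = gradient_descent_1d(lambda w: 2 * (w - 3), w0=0.0)
print(w_star)
```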
3.2 Two-Dimensional Case
- For the two-dimensional loss $L(w,b)$, gradient descent uses the gradient $\begin{bmatrix} \frac{\partial L}{\partial w}\\ \frac{\partial L}{\partial b} \end{bmatrix}$:
- $w^{*},b^{*}=\arg \underset{w,b}{\min}\, L(w,b)$
- Randomly initialize $w^{0},b^{0}$, then compute $\frac{\partial L}{\partial w}\big|_{w=w^{0},b=b^{0}}$ and $\frac{\partial L}{\partial b}\big|_{w=w^{0},b=b^{0}}$:
- $w^{1}\leftarrow w^{0}-\eta\frac{\partial L}{\partial w}\big|_{w=w^{0},b=b^{0}}$
- $b^{1}\leftarrow b^{0}-\eta\frac{\partial L}{\partial b}\big|_{w=w^{0},b=b^{0}}$
3.3 Local and Global Optima
Formulating $\frac{\partial L}{\partial w}$ and $\frac{\partial L}{\partial b}$:
- $L(w,b)=\sum_{n=1}^{10}\left(\widehat{y}^n-(b+w \cdot x^n_{cp})\right)^2$
- $\frac{\partial L}{\partial w}=2\sum_{n=1}^{10}\left(\widehat{y}^n-(b+w \cdot x^n_{cp})\right)\left(-x^{n}_{cp}\right)$
- $\frac{\partial L}{\partial b}=2\sum_{n=1}^{10}\left(\widehat{y}^n-(b+w \cdot x^n_{cp})\right)(-1)$

These partial derivatives plug directly into the update rules of Section 3.2, as in the sketch below.
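Putting Sections 3.2 and 3.3 together, here is a minimal sketch of gradient descent on $L(w,b)$ using these analytic partial derivatives (the data arrays, learning rate, and iteration count are illustrative assumptions, reusing the stand-in data from the loss sketch above):

```python
import numpy as np

# Same illustrative stand-in data as in the earlier loss sketch.
x_cp = np.array([10., 20., 25., 30., 40., 50., 60., 70., 80., 90.])
y_hat = np.array([15., 25., 28., 35., 48., 55., 68., 74., 85., 98.])

w, b = 0.0, 0.0   # initial point (w0, b0)
eta = 1e-5        # learning rate (assumed)

for _ in range(200_000):
    residual = y_hat - (b + w * x_cp)
    dL_dw = 2 * np.sum(residual * (-x_cp))  # partial L / partial w
    dL_db = 2 * np.sum(residual * (-1.0))   # partial L / partial b
    w -= eta * dL_dw
    b -= eta * dL_db

print(w, b)  # approaches w*, b* = argmin L(w, b)
```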
In a nonlinear system there may be multiple local optima.
3.4 Generalization
Take the best model found by minimizing the loss function and compute its average error (Average Error) separately on the training data and the testing data; of course, what we really care about is how well the model performs on the test set.
- $y = b + w \cdot x_{cp}$, Average Error = 35.0
Because the average error of the original model is still fairly large, we increase the model's complexity to do better, for example by introducing a quadratic term $(x_{cp})^2$:
- $y = b + w_1 \cdot x_{cp} + w_2 \cdot (x_{cp})^2$, Average Error = 18.4
Continuing to increase the model's complexity, introduce a cubic term $(x_{cp})^3$:
- $y = b + w_1 \cdot x_{cp} + w_2 \cdot (x_{cp})^2 + w_3 \cdot (x_{cp})^3$, Average Error = 18.1
Increasing the model's complexity once more with a fourth-order term $(x_{cp})^4$, the average error on the training set becomes smaller, but on the test set it becomes larger. This phenomenon is called over-fitting:
- $y = b + w_1 \cdot x_{cp} + w_2 \cdot (x_{cp})^2 + w_3 \cdot (x_{cp})^3 + w_4 \cdot (x_{cp})^4$, Average Error = 28.8
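One way to reproduce this kind of train/test comparison is with `numpy.polyfit`, which performs the least-squares fit for each polynomial degree. This sketch uses synthetic data rather than the lecture's Pokémon dataset, so the exact error values will differ, but the pattern is the same: training error falls with degree while test error can rise:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data from a noisy linear relation (illustrative only).
x_train = rng.uniform(0, 100, 10)
y_train = 0.5 * x_train + 10 + rng.normal(0, 5, 10)
x_test = rng.uniform(0, 100, 10)
y_test = 0.5 * x_test + 10 + rng.normal(0, 5, 10)

for degree in (1, 2, 3, 4):
    coeffs = np.polyfit(x_train, y_train, degree)  # fit b, w1, ..., w_degree
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train error {train_err:.1f}, test error {test_err:.1f}")
```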
3.5 Hidden Factors
- When we consider not just a Pokémon's CP value but also its species, the average error on the test set drops to 14.3.
- When we go on to consider other factors, such as each Pokémon's height, weight, and HP, the model becomes more complex; checking its performance on the test set, unfortunately the model overfits again.

3.6 Regularization
To address over-fitting, we redesign the loss function $L$. The original loss function only measured the squared error and did not account for the influence of noisy inputs on the model. We therefore append a term to $L$: $\lambda \sum (w_i)^2$. This improves the model's generalization, makes the model smoother, and reduces its sensitivity to the input.
- Redesigned loss function: $L(f)=\underset{n}{\sum}\left(\widehat{y}^n-(b+\sum w_ix_i)\right)^2+\lambda \sum (w_i)^2$
As the experiments show, this yields better performance: with $\lambda=100$, Test Error = 11.1.
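A minimal sketch of this redesigned loss for the single-feature model, where $\sum (w_i)^2$ reduces to $w^2$ (the data arrays are the illustrative ones used earlier; the default $\lambda=100$ follows the experiment above):

```python
import numpy as np

# Same illustrative stand-in data as in the earlier sketches.
x_cp = np.array([10., 20., 25., 30., 40., 50., 60., 70., 80., 90.])
y_hat = np.array([15., 25., 28., 35., 48., 55., 68., 74., 85., 98.])

def regularized_loss(w, b, lam=100.0):
    """Squared error plus lam * w^2; the bias b is not penalized,
    since it only shifts the function and does not affect smoothness."""
    return np.sum((y_hat - (b + w * x_cp)) ** 2) + lam * w ** 2

print(regularized_loss(1.0, 0.0))
```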