
1.1-Regression


1. Model

  • A set of functions (the model):

    • $w_i$ is a weight and $b$ is the bias.
    • $x_i$ can be any of the input's attributes (features), e.g. $x_{cp}$, $x_{hp}$, $x_w$, $x_h$.

    $y = b + \sum w_i x_i$

  • We single out $x_{cp}$ as the input variable and look for the best linear model:
    $y = b + w \cdot x_{cp}$
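A minimal sketch of this one-feature linear model in Python. The weight, bias, and input value below are hypothetical, chosen only for illustration:

```python
def linear_model(x_cp, w, b):
    """Predict with the one-feature linear model y = b + w * x_cp."""
    return b + w * x_cp

# Hypothetical weight and bias, not values from the lecture
print(linear_model(100.0, w=2.7, b=10.0))  # -> 280.0
```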

2. Goodness of a Function

  • Loss function $L$:

    • The input of $L$ is a function $f$, and its output is a scalar value that measures how bad the input function $f$ is.

    • $\widehat{y}^n$ denotes the true value and $f(x^n_{cp})$ the predicted value; $L(f)$ is the total error between the true values and the predictions:
      $L(f)=\sum_{n=1}^{10}\big(\widehat{y}^n-f(x^n_{cp})\big)^2$

    • Replacing the function $f$ by its parameters $w$ and $b$, the loss can be written as:
      $L(w,b)=\sum_{n=1}^{10}\big(\widehat{y}^n-(b+w \cdot x^n_{cp})\big)^2$

    • The smaller $L$ is, the better the function $f$, i.e. the better the model. Each point in the figure below represents one function $f$ (a small code sketch of this loss follows the figure).

      [Figure: each point in the $(w, b)$ plane corresponds to one function $f$]
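A short sketch of the squared-error loss defined above, using made-up training pairs rather than the actual Pokémon data:

```python
def loss(w, b, xs, ys):
    """L(w, b): sum of squared errors between the true values and b + w * x."""
    return sum((y_hat - (b + w * x)) ** 2 for x, y_hat in zip(xs, ys))

# Hypothetical (x_cp, y_hat) training pairs
xs = [10.0, 20.0, 30.0, 40.0]
ys = [40.0, 65.0, 95.0, 120.0]

print(loss(2.0, 10.0, xs, ys))  # a smaller value means a better (w, b)
```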

3. Best Function

  • Gradient Descent: the procedure for finding the best function.

  • $f^{*}$ denotes the best function, and $w^{*}, b^{*}$ denote the best weight and bias (a brute-force sketch of this search follows the equations below):

    $f^{*} = \arg \underset{f}{\min}\, L(f)$

    $w^{*}, b^{*} = \arg \underset{w,b}{\min}\, L(w,b)$

    $= \arg \underset{w,b}{\min} \sum_{n=1}^{10}\big(\widehat{y}^n-(b+w \cdot x^n_{cp})\big)^2$
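To make the arg min concrete: a naive (and inefficient) way to approximate $w^{*}, b^{*}$ is to scan a grid of candidate values and keep the pair with the smallest loss. Gradient descent, introduced next, reaches the same goal far more efficiently. The data and grid ranges below are assumptions for illustration only:

```python
import itertools

def loss(w, b, xs, ys):
    """Sum of squared errors for the model y = b + w * x."""
    return sum((y_hat - (b + w * x)) ** 2 for x, y_hat in zip(xs, ys))

xs = [10.0, 20.0, 30.0, 40.0]   # hypothetical x_cp values
ys = [40.0, 65.0, 95.0, 120.0]  # hypothetical true values y_hat

candidates_w = [i * 0.1 for i in range(0, 50)]    # w in [0.0, 5.0)
candidates_b = [float(i) for i in range(0, 30)]   # b in [0, 30)
w_best, b_best = min(itertools.product(candidates_w, candidates_b),
                     key=lambda wb: loss(wb[0], wb[1], xs, ys))
print(w_best, b_best, loss(w_best, b_best, xs, ys))
```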

3.1 One-Dimensional Function

The figure below shows gradient descent (Gradient Descent) on the loss function. First, randomly pick a starting point $w^{0}$ and compute the derivative of $L$ with respect to $w$ at that point: if it is negative, we increase $w$; if it is positive, we decrease $w$. A minimal code sketch of this loop follows the figures.

  • $w^{*}=\arg\underset{w}{\min}\, L(w)$
  • The amount moved at each step is $-\eta\frac{dL}{dw}\big|_{w=w^{0}}$, where $\eta$ is the learning rate (Learning rate), i.e. the step size of each move (step).
  • $w^{1}\leftarrow w^{0}-\eta\frac{dL}{dw}\big|_{w=w^{0}}$, where $w^{1}$ is the point reached after moving from the starting point $w^{0}$. Iterating (Iteration) this update eventually reaches a local optimum (local optimal solution).

[Figures: one-dimensional gradient descent on $L(w)$]
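A minimal sketch of the one-parameter update rule above. The toy loss $L(w) = (w-3)^2$ and the learning rate are assumptions made only to show the loop:

```python
def dL_dw(w):
    """Derivative of the toy loss L(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)

eta = 0.1   # learning rate: step size of each move
w = 0.0     # randomly chosen starting point w^0

for _ in range(100):          # repeat the update w <- w - eta * dL/dw
    w = w - eta * dL_dw(w)

print(w)  # approaches the minimiser w = 3
```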

3.2 Two-Dimensional Function

  • For the two-dimensional loss $L(w, b)$, gradient descent uses the gradient $\begin{bmatrix} \frac{\partial L}{\partial w}\\ \frac{\partial L}{\partial b} \end{bmatrix}$.
  • $w^{*},b^{*}=\arg \underset{w,b}{\min}\, L(w,b)$
  • Randomly initialize $w^{0}, b^{0}$, then compute $\frac{\partial L}{\partial w}\big|_{w=w^{0},b=b^{0}}$ and $\frac{\partial L}{\partial b}\big|_{w=w^{0},b=b^{0}}$:
    • $w^{1}\leftarrow w^{0}-\eta\frac{\partial L}{\partial w}\big|_{w=w^{0},b=b^{0}}$
    • $b^{1}\leftarrow b^{0}-\eta\frac{\partial L}{\partial b}\big|_{w=w^{0},b=b^{0}}$
[Figure: gradient descent steps on the $(w, b)$ plane]

3.3 Local and Global Optima

  • Writing out (Formulation) $\frac{\partial L}{\partial w}$ and $\frac{\partial L}{\partial b}$ (these derivatives are used in the sketch after this list):

    • $L(w,b)=\sum_{n=1}^{10}\big(\widehat{y}^n-(b+w \cdot x^n_{cp})\big)^2$
    • $\frac{\partial L}{\partial w}=2\sum_{n=1}^{10}\big(\widehat{y}^n-(b+w \cdot x^n_{cp})\big)(-x^{n}_{cp})$
    • $\frac{\partial L}{\partial b}=2\sum_{n=1}^{10}\big(\widehat{y}^n-(b+w \cdot x^n_{cp})\big)(-1)$
  • In a nonlinear system there may be multiple local optima:

    [Figure: loss surface with multiple local optima]
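Combining the update rule from section 3.2 with the partial derivatives above gives a complete gradient-descent loop for the linear model. The data, learning rate, and iteration count are hypothetical placeholders; a learning rate this small is needed here only because the toy inputs are not normalized:

```python
# Hypothetical (x_cp, y_hat) training pairs
xs = [10.0, 20.0, 30.0, 40.0]
ys = [40.0, 65.0, 95.0, 120.0]

eta = 1e-4         # learning rate (assumed; too large a value makes the loop diverge)
w, b = 0.0, 0.0    # initial point (w^0, b^0)

for _ in range(100_000):
    # dL/dw = 2 * sum (y_hat - (b + w*x)) * (-x);  dL/db = 2 * sum (y_hat - (b + w*x)) * (-1)
    grad_w = sum(2.0 * (y_hat - (b + w * x)) * (-x) for x, y_hat in zip(xs, ys))
    grad_b = sum(2.0 * (y_hat - (b + w * x)) * (-1.0) for x, y_hat in zip(xs, ys))
    w, b = w - eta * grad_w, b - eta * grad_b   # update w and b simultaneously

print(w, b)
```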

3.4 Generalization Ability of the Model

  • Take the best model found with the loss function and compute its average error (Average Error) separately on the training set (Training Data) and the testing set (Testing Data). What we ultimately care about is how well the model performs on the testing set.

    • $y = b + w \cdot x_{cp}$   Average Error = 35.0
  • Since the average error of this model is still fairly large, we increase the model's complexity to do better, e.g. by introducing a quadratic term $(x_{cp})^2$:

    • $y = b + w_1 \cdot x_{cp} + w_2 \cdot (x_{cp})^2$   Average Error = 18.4
  • Increasing the complexity further by introducing a cubic term $(x_{cp})^3$:

    • $y = b + w_1 \cdot x_{cp} + w_2 \cdot (x_{cp})^2 + w_3 \cdot (x_{cp})^3$   Average Error = 18.1
  • Increasing the complexity further with a fourth-order term $(x_{cp})^4$, the average error on the training set becomes smaller still, but on the testing set it becomes larger. This phenomenon is called overfitting (Over-fitting); the sketch after the figure below reproduces the same pattern on synthetic data.

    • $y = b + w_1 \cdot x_{cp} + w_2 \cdot (x_{cp})^2 + w_3 \cdot (x_{cp})^3 + w_4 \cdot (x_{cp})^4$   Average Error = 28.8
[Figure: training vs. testing error as model complexity grows]
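A sketch of the complexity experiment above on synthetic data (the real Pokémon dataset is not included in the post), with `numpy.polyfit` standing in for the gradient-descent fit. Training error should shrink as the degree grows, while the test error may start to rise, mirroring the overfitting pattern described above:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in data, not the Pokémon (cp, evolved cp) pairs
x = rng.uniform(0.0, 10.0, size=30)
y = 5.0 + 2.0 * x + 0.3 * x**2 + rng.normal(0.0, 2.0, size=30)
x_train, y_train = x[:10], y[:10]
x_test,  y_test  = x[10:], y[10:]

for degree in (1, 2, 3, 4):
    coeffs = np.polyfit(x_train, y_train, degree)        # least-squares fit of this degree
    train_err = np.mean(np.abs(np.polyval(coeffs, x_train) - y_train))
    test_err  = np.mean(np.abs(np.polyval(coeffs, x_test) - y_test))
    print(f"degree {degree}: train error {train_err:.2f}, test error {test_err:.2f}")
```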

3.5 Hidden Factors

  • When we consider not only a Pokémon's cp value but also its species, the average error on the testing set drops to 14.3.

[Figures: model that takes the Pokémon's species into account]

  • When we go on to consider other factors, such as each Pokémon's height (Height), weight (weight), and HP, the model becomes even more complex. Looking at its performance on the testing set, unfortunately the model overfits again.
[Figure: more complex model with additional features overfits]

3.6 Regularization

  • To address overfitting we redesign the loss function $L$. The original loss function only measured the squared error and did not account for the effect of noisy inputs on the model. We therefore add a term to $L$: $\lambda \sum (w_i)^2$. This improves the model's generalization ability by making the model smoother, i.e. less sensitive (Sensitive) to its inputs.

    • The redesigned loss function $L$: $L(f)=\underset{n}{\sum}\big(\widehat{y}^n-(b+\sum w_ix_i)\big)^2+\lambda \sum (w_i)^2$
  • According to the following experiment, we indeed obtain better performance: with $\lambda=100$, Test Error = 11.1 (a small sketch of the regularized loss follows the figure below).

[Figure: training and test error for different values of $\lambda$]
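A small sketch of the regularized loss above for the one-feature model. The data and $\lambda$ values are hypothetical; the point is only that the extra $\lambda \sum (w_i)^2$ term penalizes large weights (the bias $b$ is not regularized, matching the formula above, which sums over the weights only):

```python
def regularized_loss(w, b, xs, ys, lam):
    """Squared error plus lambda * w**2 for the one-feature model y = b + w * x."""
    squared_error = sum((y_hat - (b + w * x)) ** 2 for x, y_hat in zip(xs, ys))
    return squared_error + lam * w ** 2   # the bias b is not penalized

xs = [10.0, 20.0, 30.0, 40.0]   # hypothetical training inputs
ys = [40.0, 65.0, 95.0, 120.0]  # hypothetical true values

for lam in (0.0, 1.0, 100.0):
    print(lam, regularized_loss(2.7, 10.0, xs, ys, lam))
```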