Simple analysis of regularization principle (L1 / L2 regularization)
2022-08-09 16:17:00 【pomelo33】
In machine learning and deep learning, there are usually two ways to prevent a model from overfitting:
The first is to manually judge the importance of the data and keep only the more important features, provided there is sufficient prior knowledge. At the same time, this amounts to discarding part of the information in the data.
The second is regularization, which uses certain constraints to automatically select the important feature variables and discard the unnecessary ones.
Commonly used regularization methods are:
L1/L2 regularization: A "penalty term" is added directly to the original loss function.
Dropout: the most common in deep learning; some neurons are randomly discarded during training (see the sketch after this list).
Data augmentation: for example, flipping, translating, and stretching the original images to enlarge the model's training set.
Early stopping: terminate training early once the model's results are reasonably good; this requires human supervision and prior knowledge (also shown in the sketch below).
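As a concrete illustration of the last two items, here is a minimal PyTorch-style sketch combining dropout with validation-based early stopping. The synthetic data, network shape, and patience threshold are all hypothetical choices made for this example, not from the original article:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Synthetic regression data (hypothetical, purely for illustration).
X = torch.randn(256, 20)
y = X[:, :3].sum(dim=1, keepdim=True) + 0.1 * torch.randn(256, 1)
X_train, y_train, X_val, y_val = X[:200], y[:200], X[200:], y[200:]

# Dropout: each hidden activation is zeroed with probability 0.5
# during training; in eval mode nn.Dropout is a no-op.
model = nn.Sequential(
    nn.Linear(20, 64), nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(64, 1),
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

# Early stopping: quit once validation loss has not improved
# for `patience` consecutive epochs.
best_val, patience, bad_epochs = float("inf"), 5, 0
for epoch in range(200):
    model.train()
    optimizer.zero_grad()
    loss_fn(model(X_train), y_train).backward()
    optimizer.step()

    model.eval()
    with torch.no_grad():
        val_loss = loss_fn(model(X_val), y_val).item()
    if val_loss < best_val - 1e-4:
        best_val, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break
```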
L2 regularization:
L2 regularization means adding the sum of the squares of the weight parameters to the original loss function:

E_aug(w) = E_in(w) + λ Σ_j w_j²

Here E_in is the training sample error without regularization, and λ is the regularization parameter.
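A minimal NumPy sketch of what this formula computes, assuming a linear model with mean-squared training error (the function and variable names are my own, for illustration only):

```python
import numpy as np

def l2_regularized_loss(w, X, y, lam):
    """E_aug(w) = E_in(w) + lam * sum(w_j^2) for a linear model."""
    e_in = np.mean((X @ w - y) ** 2)  # training error E_in (here: MSE)
    penalty = lam * np.sum(w ** 2)    # L2 penalty: sum of squared weights
    return e_in + penalty
```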
Why add the sum of squared weights? Consider fitting a set of data points: in general, a higher-order polynomial fits the data more easily, but it also makes the model overly complex and prone to overfitting, i.e. poor generalization. One remedy is to constrain the weights of the high-order terms to 0, which turns the high-order problem into a low-order one. This hard constraint is difficult to handle in practice, however, so a looser condition is defined instead:
Σ_j w_j² ≤ C

The meaning of this condition is simple: the sum of the squares of all the weights must be no greater than C.
So why does this sum-of-squares constraint turn into the penalty term above? Here is a brief geometric explanation:
In the figure, the black ellipses are the level sets of E_in, with a blue point at their center marking the unconstrained minimum of E_in. The red circle is the restricted region. The solution point tends to move opposite to the gradient, along -∇E_in, but the constraint confines it to the red region, so it can only slide along the tangent of the red circle. It stops when -∇E_in is parallel to w, the direction from the circle's center out to the solution point: at that point ∇E_in has no component along the tangent, so the point moves no further, and the constrained loss function is minimized.
When -∇E_in is parallel to w, there is some λ > 0 such that:

∇E_in(w) + 2λw = 0    (the proportionality constant is absorbed into λ)

Treating the left-hand side as the gradient of a new function gives the new loss function:

E_aug(w) = E_in(w) + λ‖w‖² = E_in(w) + λ Σ_j w_j²
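One way to read this result: a gradient-descent step on E_aug shrinks the weights toward zero on every iteration, which is why L2 regularization is also known as "weight decay". A small sketch in the notation above (my own illustrative code, not from the article):

```python
import numpy as np

def step_on_e_aug(w, grad_e_in, lam, lr):
    """One gradient-descent step on E_aug(w) = E_in(w) + lam * ||w||^2."""
    grad_e_aug = grad_e_in + 2 * lam * w  # ∇E_aug = ∇E_in + 2λw
    return w - lr * grad_e_aug
    # Rearranged: w * (1 - 2*lr*lam) - lr * grad_e_in,
    # i.e. the weights are scaled down ("decay") before the usual step.
```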
This is L2 regularization. Similarly, L1 regularization adds the sum of the absolute values of the weight parameters to the original loss function:

E_aug(w) = E_in(w) + λ Σ_j |w_j|
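The practical difference between the two penalties is easy to demonstrate. In the scikit-learn sketch below (alpha plays the role of λ; the synthetic data is a hypothetical example of mine), the L1 penalty (Lasso) drives the irrelevant coefficients exactly to zero, the automatic feature selection mentioned at the start, while the L2 penalty (Ridge) only shrinks them:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
# Only the first two features actually matter.
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.1 * rng.normal(size=100)

lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty
ridge = Ridge(alpha=0.1).fit(X, y)   # L2 penalty

print("L1 coefficients:", np.round(lasso.coef_, 3))  # many exact zeros
print("L2 coefficients:", np.round(ridge.coef_, 3))  # small but nonzero
```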
The loss function therefore contains two parts: the training sample error E_in and the regularization term, with the parameter λ balancing the two. If λ is too large, the corresponding constraint C is very small, so the restricted region is tiny; the optimized result then lands far from the true minimum and the model underfits. Conversely, if λ is too small, C is very large and the restricted region is huge; the optimized result sits very close to the true minimum, regularization has little effect, and the model can overfit. Choosing λ well is therefore important.
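A common way to choose λ in practice is a simple sweep over candidate values, keeping the one with the lowest error on a held-out validation set. A sketch under that approach (the grid of values and the data are arbitrary illustrations of mine):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))
y = X[:, 0] - 2 * X[:, 1] + 0.5 * rng.normal(size=200)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

best_lam, best_mse = None, float("inf")
for lam in [1e-3, 1e-2, 1e-1, 1.0, 10.0, 100.0]:
    model = Ridge(alpha=lam).fit(X_tr, y_tr)            # alpha plays the role of λ
    mse = np.mean((model.predict(X_val) - y_val) ** 2)  # validation error
    if mse < best_mse:
        best_lam, best_mse = lam, mse
print(f"best lambda = {best_lam}, validation MSE = {best_mse:.4f}")
```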