R-drop: a more powerful dropout regularization method
2022-04-23 11:11:00 【Graduate students are not late】
Preface: this article draws on and quotes the Microsoft Research AI Headlines article and the blog post "R-Drop: More powerful Dropout" by "Mr meatball".
1 Background introduction
1.1 Dropout technology
Deep neural networks (DNNs) have recently achieved remarkable success in many fields. When training these large-scale DNN models, regularization techniques such as L2 Normalization, Batch Normalization, and Dropout are indispensable modules for preventing over-fitting and improving the model's generalization ability. Among them, Dropout only needs to discard a portion of the neurons during training, and it has become the most widely used regularization technique.
- Regularization techniques: L2 Normalization, Batch Normalization, Dropout, and so on.
- However, because of how Dropout operates, the trained model is to some extent a constrained combination of multiple sub-models. R-Drop was proposed for this reason.
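To make "discarding a part of the neurons during training" concrete, here is a minimal sketch in plain Python. The function name `dropout`, the sample activations, and the `1/(1 - rate)` rescaling (the common "inverted dropout" convention) are assumptions made for illustration; the post itself does not specify an implementation.

```python
import random

def dropout(activations, rate, training, rng):
    """Inverted dropout on a list of activations (illustrative sketch).

    During training each activation is zeroed with probability `rate`,
    and the survivors are scaled by 1 / (1 - rate). At inference time
    the activations pass through unchanged.
    """
    if not training or rate == 0.0:
        return list(activations)
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(42)                      # hypothetical seed
hidden = [0.5, -1.2, 0.8, 2.0, -0.3, 1.1]    # hypothetical layer output

# Two training passes draw two different masks, i.e. two different sub-models.
print(dropout(hidden, 0.5, True, rng))
print(dropout(hidden, 0.5, True, rng))
# Inference uses the full network.
print(dropout(hidden, 0.5, False, rng))
```

Each training call samples a fresh mask, which is why the same input can produce different outputs from one pass to the next.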
1.2 Regularized Dropout (R-Drop) technology
Microsoft Research Asia and Soochow University proposed a further regularization method built on Dropout: Regularized Dropout, or R-Drop for short.
Unlike traditional constraint methods that act on neurons (Dropout) or on model parameters (DropConnect), R-Drop acts on the output layer of the model and compensates for Dropout's inconsistency between training and testing.
Put simply, in every mini-batch each data sample passes twice through the same model with Dropout enabled, and R-Drop then uses KL-divergence to constrain the two outputs to be consistent. In other words, R-Drop constrains the outputs of the two sub-models that Dropout randomly produces to agree. [This is somewhat obscure for now; we return to it later.]
Compared with the traditional training method, R-Drop simply adds a KL-divergence loss term and changes nothing else. Although the method looks simple, experiments show that it performs very well on 5 common NLP and CV tasks (18 datasets in total), and it achieved the then state-of-the-art results on tasks such as machine translation and text summarization.
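The procedure described above (two Dropout passes of the same sample, plus a KL-divergence term on top of the usual loss) can be sketched in plain Python as follows. This is only an illustration on a toy one-layer classifier: the helper names, the toy weights, and the weighting factor `alpha` are assumptions, not something given in this post, and a real implementation would use a deep learning framework.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with positive entries."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def forward(x, weights, rate, rng):
    """Toy classifier: inverted dropout on the inputs, linear layer, softmax."""
    keep = 1.0 - rate
    h = [v / keep if rng.random() < keep else 0.0 for v in x]
    logits = [sum(w * v for w, v in zip(row, h)) for row in weights]
    return softmax(logits)

def r_drop_loss(x, label, weights, rate=0.3, alpha=1.0, rng=random):
    """Return (total, nll, kl) for one sample.

    The sample goes through the same weights twice; each pass samples its
    own dropout mask, so p1 and p2 come from two different sub-models.
    `alpha` (assumed hyperparameter) weights the consistency term.
    """
    p1 = forward(x, weights, rate, rng)
    p2 = forward(x, weights, rate, rng)
    nll = -math.log(p1[label]) - math.log(p2[label])            # usual loss, both passes
    kl = 0.5 * (kl_divergence(p1, p2) + kl_divergence(p2, p1))  # bidirectional KL
    return nll + alpha * kl, nll, kl

rng = random.Random(0)  # hypothetical seed and toy data below
weights = [[0.5, -0.2, 0.1, 0.3], [-0.4, 0.6, 0.2, -0.1], [0.1, 0.1, -0.5, 0.7]]
x = [1.0, 2.0, -1.0, 0.5]
total, nll, kl = r_drop_loss(x, label=1, weights=weights, rng=rng)
print(f"total={total:.4f}  nll={nll:.4f}  kl={kl:.4f}")
```

With `rate=0.0` the two passes are identical and the KL term vanishes, so the extra term only has an effect when Dropout actually makes the two sub-models disagree.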
2 Introduction to the principle of R-Drop
Because DNNs over-fit very easily, we use Dropout, which randomly discards some of the neurons in each layer, to avoid over-fitting during training.
Because the discarding is random, each pass yields a different sub-model, so to some extent the model trained with Dropout is a constrained combination of multiple sub-models.
Building on the randomness that Dropout brings to the network, the researchers proposed R-Drop to further regularize the output predictions of the (sub-model) networks.
R-Drop adds one extra KL-divergence loss term.
- The overall framework is as follows:
2.1 Model explanation
- The model proposed in the paper is the one on the right of the figure above. As you can see, the same data goes through two computations with random dropout, which yields two different sub-models. $P_1(y|x)$ and $P_2(y|x)$ in the figure are the output distributions of the two sub-models. (There are two sub-models because Dropout discards different neurons on each pass.)
- Therefore, for the same input data, the distributions $P_1^w(y|x)$ and $P_2^w(y|x)$ are different.
- So in each training step, R-Drop minimizes the bidirectional Kullback-Leibler (KL) divergence between these two output distributions of the same sample in order to regularize the model's predictions.
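Written out for a single training pair $(x, y)$, the objective described in the bullets above takes roughly the following form. This follows the formulation in the R-Drop paper as best recalled; the weighting coefficient $\alpha$ and the names $\mathcal{L}_{KL}$ and $\mathcal{L}_{NLL}$ are not mentioned in this post and should be read as assumptions.

$$
\mathcal{L}_{KL} = \frac{1}{2}\Big( D_{KL}\big(P_1^w(y|x)\,\|\,P_2^w(y|x)\big) + D_{KL}\big(P_2^w(y|x)\,\|\,P_1^w(y|x)\big) \Big)
$$

$$
\mathcal{L} = \underbrace{-\log P_1^w(y|x) - \log P_2^w(y|x)}_{\mathcal{L}_{NLL}} + \alpha\,\mathcal{L}_{KL}
$$

The first term is the ordinary negative log-likelihood applied to both passes, and the second is the bidirectional KL term that pulls the two sub-model distributions together.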
3 Summary
- Dropout has an obvious problem: inconsistency between training and prediction. This is very intuitive.
- By adding a regularization term, R-Drop strengthens the model's robustness to Dropout so that the model's outputs under different Dropout masks are basically the same. This reduces the inconsistency and increases the similarity between "model averaging" and "weight averaging", so that simply turning off Dropout at prediction time behaves like an ensemble of many Dropout sub-models, which improves the model's final performance.
- In general, R-Drop is simple in form, gives excellent results, and is a very innovative idea. However, why R-Drop works so well, and how to guide the model to find a suitable R-Drop, remain worth exploring.
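The distinction between "model averaging" and "weight averaging" in the summary can be illustrated with a small numerical sketch. Here "model averaging" is taken to mean averaging the predictions of many randomly sampled Dropout sub-models, and "weight averaging" to mean a single pass with Dropout switched off. That reading, the toy classifier, and all names below are assumptions for illustration only.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict(x, weights, mask):
    """Toy classifier: masked inputs, linear layer, softmax."""
    h = [v * m for v, m in zip(x, mask)]
    logits = [sum(w * v for w, v in zip(row, h)) for row in weights]
    return softmax(logits)

def sample_mask(n, rate, rng):
    """Inverted-dropout mask: 0 with probability `rate`, else 1 / (1 - rate)."""
    keep = 1.0 - rate
    return [1.0 / keep if rng.random() < keep else 0.0 for _ in range(n)]

def model_average(x, weights, rate, rng, n_samples=2000):
    """Average the predictions of many randomly sampled sub-models."""
    total = [0.0] * len(weights)
    for _ in range(n_samples):
        p = predict(x, weights, sample_mask(len(x), rate, rng))
        total = [t + pi for t, pi in zip(total, p)]
    return [t / n_samples for t in total]

def weight_average(x, weights):
    """Single pass with Dropout switched off (all-ones mask)."""
    return predict(x, weights, [1.0] * len(x))

rng = random.Random(0)  # hypothetical seed and toy data below
weights = [[0.5, -0.2, 0.1, 0.3], [-0.4, 0.6, 0.2, -0.1], [0.1, 0.1, -0.5, 0.7]]
x = [1.0, 2.0, -1.0, 0.5]

ensemble = model_average(x, weights, rate=0.3, rng=rng)
single = weight_average(x, weights)
gap = max(abs(a - b) for a, b in zip(ensemble, single))
print("model average :", [round(p, 4) for p in ensemble])
print("dropout off   :", [round(p, 4) for p in single])
print("largest gap   :", round(gap, 4))
```

Because of the softmax nonlinearity the two distributions generally differ. On this reading, the R-Drop consistency term aims to shrink that gap by making the sub-model outputs agree during training.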
Copyright notice
This article was written by [Graduate students are not late]. Please include the original link when reposting. Thank you.
https://yzsam.com/2022/04/202204231106345777.html