R-Drop: a more powerful Dropout regularization method
2022-04-23 11:11:00, by "Graduate students are not late"
A note up front: this article studies and quotes the Microsoft Research AI Headlines article and "Mr meatball"'s blog post "R-Drop: a more powerful Dropout".
1 Background
1.1 The Dropout technique
Deep neural networks (DNNs) have recently achieved remarkable success in a wide range of fields. When training these large-scale DNN models, regularization techniques such as L2 normalization, Batch Normalization, and Dropout are indispensable for preventing over-fitting and improving the model's generalization ability. Among them, Dropout, which simply discards a fraction of neurons during training, has become the most widely used regularization technique.
Regularization techniques: L2 normalization, Batch Normalization, Dropout, etc.
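As a quick illustration of Dropout's behavior (a minimal PyTorch sketch of my own, not code from the original post): in training mode random activations are zeroed and the survivors are rescaled, while in evaluation mode the layer is the identity. This is exactly the training/inference mismatch discussed later.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

drop = nn.Dropout(p=0.5)   # each activation is zeroed with probability 0.5
x = torch.ones(1, 8)

drop.train()               # training mode: random mask, survivors scaled by 1/(1-p)
print(drop(x))             # e.g. tensor([[2., 0., 2., 2., 0., 2., 0., 0.]])

drop.eval()                # evaluation mode: Dropout becomes the identity
print(drop(x))             # tensor([[1., 1., 1., 1., 1., 1., 1., 1.]])
```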
- However, because of the Dropout operation, the trained model becomes, to some extent, a combination of multiple sub-models whose outputs should be mutually constrained. This is the motivation behind R-Drop.

1.2 Regularized Dropout (R-Drop)
Microsoft Research Asia and Soochow University proposed a further regularization method built on top of Dropout: Regularized Dropout, abbreviated R-Drop.
Unlike traditional constraint methods that act on neurons (Dropout) or on model parameters (DropConnect), R-Drop acts on the model's output layer, making up for the inconsistency between training and inference that Dropout introduces.
Simply put, within each mini-batch every data sample goes through the same Dropout-enabled model twice, and R-Drop then uses the KL-divergence to constrain the two outputs to be consistent. R-Drop therefore constrains the output consistency of the two sub-models randomly sampled by Dropout. [This may still sound obscure; we return to it below.]
Compared with conventional training, R-Drop simply adds a KL-divergence loss term, with no other changes. Although the method looks simple, experiments show that on 5 common NLP and CV tasks (18 datasets in total) R-Drop achieves very good results, including the current state of the art on tasks such as machine translation and text summarization. A minimal sketch of the training step is given below.
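To make this concrete, here is a minimal PyTorch sketch of an R-Drop training loss (my illustration under assumed names, not code from the paper: `model`, `x`, `y`, and the weight `alpha` are placeholders). The same batch is forwarded twice with Dropout active, and a symmetric KL term is added to the usual cross-entropy loss.

```python
import torch.nn.functional as F

def r_drop_loss(model, x, y, alpha=1.0):
    """One R-Drop step: two stochastic forward passes + bidirectional KL.

    model must be in train() mode so Dropout is active and the two
    passes use different random masks.
    """
    logits1 = model(x)   # first pass: one random Dropout mask
    logits2 = model(x)   # second pass: a different Dropout mask

    # Ordinary cross-entropy on both sub-models' outputs
    ce = F.cross_entropy(logits1, y) + F.cross_entropy(logits2, y)

    # Bidirectional KL divergence between the two predicted distributions
    logp1 = F.log_softmax(logits1, dim=-1)
    logp2 = F.log_softmax(logits2, dim=-1)
    kl = 0.5 * (F.kl_div(logp1, logp2, reduction="batchmean", log_target=True)
                + F.kl_div(logp2, logp1, reduction="batchmean", log_target=True))

    return ce + alpha * kl
```

In practice the two passes are often implemented by concatenating the batch with itself and splitting the logits afterwards, which costs one forward pass over a doubled batch rather than two separate passes.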
2 The principle of R-Drop
Because DNNs over-fit very easily, the Dropout method randomly discards some neurons in each layer to avoid over-fitting during training.
Because this discarding is random, the sub-model generated after each discard is different; to some extent, the model trained with Dropout is therefore a combination of many sub-models.
Building on the randomness that Dropout injects into the network in this special way, the researchers proposed R-Drop to impose a further regularizing constraint on the output predictions of these sub-models.
Concretely, one extra KL-divergence loss term is added.
- The overall framework is shown in the figure below:

[Figure (image not preserved): the overall R-Drop framework. The proposed method is on the right: the same input x is fed through the model twice with different random Dropout masks, producing two distributions P_1(y|x) and P_2(y|x) that are pulled together by a bidirectional KL-divergence loss.]
2.1 Model explanation
- The model proposed in the paper is the one on the right of the figure above. As shown, the same data goes through the network twice; since random Dropout is applied, two different sub-models are obtained, and in the figure $P_1(y|x)$ and $P_2(y|x)$ are the output distributions of these two sub-models. (They are two different sub-models because Dropout drops different neurons in each pass.)
- Therefore, for the same input data, the distributions $P_1^w(y|x)$ and $P_2^w(y|x)$ differ (the superscript $w$ denotes the model parameters).
- At each training step, the R-Drop method therefore tries to minimize the bidirectional Kullback-Leibler (KL) divergence between these two output distributions of the same sample, regularizing the model's predictions; the full objective is written out below.
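Putting the pieces together, the overall training objective can be written as follows (a sketch from the description above, in line with the R-Drop paper; $\alpha$ is a hyperparameter weighting the regularizer):

$$
\mathcal{L} \;=\; \underbrace{-\log P_1^{w}(y \mid x) \;-\; \log P_2^{w}(y \mid x)}_{\text{cross-entropy of the two sub-models}}
\;+\; \frac{\alpha}{2}\Big[\mathcal{D}_{\mathrm{KL}}\big(P_1^{w}(y \mid x)\,\|\,P_2^{w}(y \mid x)\big) \;+\; \mathcal{D}_{\mathrm{KL}}\big(P_2^{w}(y \mid x)\,\|\,P_1^{w}(y \mid x)\big)\Big]
$$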
3 Summary
- Dropout's drawback is obvious: the model behaves inconsistently between training and inference, and this is intuitively easy to see.
- R-Drop, by adding a regularization term, strengthens the model's robustness to Dropout: it forces the outputs under different Dropout masks to be essentially the same, which reduces this inconsistency and pushes the "model average" and the "weight average" closer together. As a result, simply disabling Dropout at inference time approximates the effect of fusing many Dropout sub-models, improving the model's final performance.
- Overall, R-Drop is simple in form and excellent in results, a very innovative idea. But why R-Drop achieves such excellent results, and how best to guide a model with it, are still worth exploring.
Copyright notice: this article was created by "Graduate students are not late". Please include the original link when reposting:
https://yzsam.com/2022/04/202204231106345777.html
边栏推荐
- MySQL Router重装后重新连接集群进行引导出现的——此主机中之前已配置过的问题
- 数据库管理软件SQLPro for SQLite for Mac 2022.30
- Visual solutions to common problems (VIII) mathematical formulas
- Detailed introduction to paging exploration of MySQL index optimization
- Facing the global market, platefarm today logs in to four major global platforms such as Huobi
- Xdotool key Wizard
- vm设置静态虚拟机
- R-Drop:更强大的Dropout正则方法
- Pycharm
- Using El popconfirm and El backtop does not take effect
猜你喜欢
随机推荐
Mysql8.0安装指南
Software testers, how to mention bugs?
MBA-day6 逻辑学-假言推理练习题
Constraintlayout layout
How to Ping Baidu development board
得物技术网络优化-CDN资源请求优化实践
JDBC – PreparedStatement – 如何设置 Null 值?
mysql插入datetime类型字段不加单引号插入不成功
Chapter 1 of technical Xiaobai (express yourself)
Mysql中一千万条数据怎么快速查询
MySQL partition table can be classified by month
Upgrade the functions available for cpolar intranet penetration
MBA-day5数学-应用题-工程问题
Xdotool key Wizard
vm设置静态虚拟机
Cumcm 2021 - B: préparation d'oléfines C4 par couplage éthanol (2)
Promise详解
Detailed introduction to paging exploration of MySQL index optimization
Learning website materials
详解MySQL中timestamp和datetime时区问题导致做DTS遇到的坑









