R-Drop: a more powerful Dropout regularization method
2022-04-23 11:11:00, by "Graduate students are not late"
A note up front: this article studies and quotes the Microsoft Research AI Headlines article and "Mr meatball"'s blog post "R-Drop: a more powerful Dropout".
1 Background
1.1 The Dropout technique
Deep neural networks (DNNs) have recently achieved remarkable success in a wide range of fields. When training these large-scale DNN models, regularization techniques such as L2 normalization, Batch Normalization, and Dropout are indispensable for preventing over-fitting and improving the model's generalization ability. Among them, Dropout, which simply discards a fraction of neurons during training, has become the most widely used regularization technique.
Regularization techniques: L2 normalization, Batch Normalization, Dropout, etc.
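As a quick illustration of Dropout's behavior (a minimal PyTorch sketch of my own, not code from the original post): in training mode random activations are zeroed and the survivors are rescaled, while in evaluation mode the layer is the identity. This is exactly the training/inference mismatch discussed later.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

drop = nn.Dropout(p=0.5)   # each activation is zeroed with probability 0.5
x = torch.ones(1, 8)

drop.train()               # training mode: random mask, survivors scaled by 1/(1-p)
print(drop(x))             # e.g. tensor([[2., 0., 2., 2., 0., 2., 0., 0.]])

drop.eval()                # evaluation mode: Dropout becomes the identity
print(drop(x))             # tensor([[1., 1., 1., 1., 1., 1., 1., 1.]])
```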
- However, because of the Dropout operation, the trained model becomes, to some extent, a combination of multiple sub-models whose outputs should be mutually constrained. This is the motivation behind R-Drop.

1.2 Regularized Dropout (R-Drop)
Microsoft Research Asia and Soochow University proposed a further regularization method built on top of Dropout: Regularized Dropout, abbreviated R-Drop.
Unlike traditional constraint methods that act on neurons (Dropout) or on model parameters (DropConnect), R-Drop acts on the model's output layer, making up for the inconsistency between training and inference that Dropout introduces.
Simply put, within each mini-batch every data sample goes through the same Dropout-enabled model twice, and R-Drop then uses the KL-divergence to constrain the two outputs to be consistent. R-Drop therefore constrains the output consistency of the two sub-models randomly sampled by Dropout. [This may still sound obscure; we return to it below.]
Compared with conventional training, R-Drop simply adds a KL-divergence loss term, with no other changes. Although the method looks simple, experiments show that on 5 common NLP and CV tasks (18 datasets in total) R-Drop achieves very good results, including the current state of the art on tasks such as machine translation and text summarization. A minimal sketch of the training step is given below.
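To make this concrete, here is a minimal PyTorch sketch of an R-Drop training loss (my illustration under assumed names, not code from the paper: `model`, `x`, `y`, and the weight `alpha` are placeholders). The same batch is forwarded twice with Dropout active, and a symmetric KL term is added to the usual cross-entropy loss.

```python
import torch.nn.functional as F

def r_drop_loss(model, x, y, alpha=1.0):
    """One R-Drop step: two stochastic forward passes + bidirectional KL.

    model must be in train() mode so Dropout is active and the two
    passes use different random masks.
    """
    logits1 = model(x)   # first pass: one random Dropout mask
    logits2 = model(x)   # second pass: a different Dropout mask

    # Ordinary cross-entropy on both sub-models' outputs
    ce = F.cross_entropy(logits1, y) + F.cross_entropy(logits2, y)

    # Bidirectional KL divergence between the two predicted distributions
    logp1 = F.log_softmax(logits1, dim=-1)
    logp2 = F.log_softmax(logits2, dim=-1)
    kl = 0.5 * (F.kl_div(logp1, logp2, reduction="batchmean", log_target=True)
                + F.kl_div(logp2, logp1, reduction="batchmean", log_target=True))

    return ce + alpha * kl
```

In practice the two passes are often implemented by concatenating the batch with itself and splitting the logits afterwards, which costs one forward pass over a doubled batch rather than two separate passes.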
2 The principle of R-Drop
Because DNNs over-fit very easily, the Dropout method randomly discards some neurons in each layer to avoid over-fitting during training.
Because this discarding is random, the sub-model generated after each discard is different; to some extent, the model trained with Dropout is therefore a combination of many sub-models.
Building on the randomness that Dropout injects into the network in this special way, the researchers proposed R-Drop to impose a further regularizing constraint on the output predictions of these sub-models.
Concretely, one extra KL-divergence loss term is added.
- The overall framework is shown in the figure below:

[Figure (image not preserved): the overall R-Drop framework. The proposed method is on the right: the same input x is fed through the model twice with different random Dropout masks, producing two distributions P_1(y|x) and P_2(y|x) that are pulled together by a bidirectional KL-divergence loss.]
2.1 Model explanation
- The model proposed in the paper is the one on the right of the figure above. As shown, the same data goes through the network twice; since random Dropout is applied, two different sub-models are obtained, and in the figure $P_1(y|x)$ and $P_2(y|x)$ are the output distributions of these two sub-models. (They are two different sub-models because Dropout drops different neurons in each pass.)
- Therefore, for the same input data, the distributions $P_1^w(y|x)$ and $P_2^w(y|x)$ differ (the superscript $w$ denotes the model parameters).
- At each training step, the R-Drop method therefore tries to minimize the bidirectional Kullback-Leibler (KL) divergence between these two output distributions of the same sample, regularizing the model's predictions; the full objective is written out below.
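Putting the pieces together, the overall training objective can be written as follows (a sketch from the description above, in line with the R-Drop paper; $\alpha$ is a hyperparameter weighting the regularizer):

$$
\mathcal{L} \;=\; \underbrace{-\log P_1^{w}(y \mid x) \;-\; \log P_2^{w}(y \mid x)}_{\text{cross-entropy of the two sub-models}}
\;+\; \frac{\alpha}{2}\Big[\mathcal{D}_{\mathrm{KL}}\big(P_1^{w}(y \mid x)\,\|\,P_2^{w}(y \mid x)\big) \;+\; \mathcal{D}_{\mathrm{KL}}\big(P_2^{w}(y \mid x)\,\|\,P_1^{w}(y \mid x)\big)\Big]
$$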
3 Summary
- Dropout's drawback is obvious: the model behaves inconsistently between training and inference, and this is intuitively easy to see.
- R-Drop, by adding a regularization term, strengthens the model's robustness to Dropout: it forces the outputs under different Dropout masks to be essentially the same, which reduces this inconsistency and pushes the "model average" and the "weight average" closer together. As a result, simply disabling Dropout at inference time approximates the effect of fusing many Dropout sub-models, improving the model's final performance.
- Overall, R-Drop is simple in form and excellent in results, a very innovative idea. But why R-Drop achieves such excellent results, and how best to guide a model with it, are still worth exploring.
Copyright notice: this article was created by "Graduate students are not late". Please include the original link when reposting:
https://yzsam.com/2022/04/202204231106345777.html
边栏推荐
- MySQL Router重装后重新连接集群进行引导出现的——此主机中之前已配置过的问题
- 数据库管理软件SQLPro for SQLite for Mac 2022.30
- Visual solutions to common problems (VIII) mathematical formulas
- Detailed introduction to paging exploration of MySQL index optimization
- Facing the global market, platefarm today logs in to four major global platforms such as Huobi
- Xdotool key Wizard
- vm设置静态虚拟机
- R-Drop:更强大的Dropout正则方法
- Pycharm
- Using El popconfirm and El backtop does not take effect
猜你喜欢
随机推荐
Mysql8.0安装指南
Software testers, how to mention bugs?
MBA-day6 逻辑学-假言推理练习题
Constraintlayout layout
How to Ping Baidu development board
得物技术网络优化-CDN资源请求优化实践
JDBC – PreparedStatement – 如何设置 Null 值?
mysql插入datetime类型字段不加单引号插入不成功
Chapter 1 of technical Xiaobai (express yourself)
Mysql中一千万条数据怎么快速查询
MySQL partition table can be classified by month
Upgrade the functions available for cpolar intranet penetration
MBA-day5数学-应用题-工程问题
Xdotool key Wizard
vm设置静态虚拟机
Cumcm 2021 - B: préparation d'oléfines C4 par couplage éthanol (2)
Promise详解
Detailed introduction to paging exploration of MySQL index optimization
Learning website materials
详解MySQL中timestamp和datetime时区问题导致做DTS遇到的坑









