
What makes training multi-modal classification networks hard?

2022-08-10 03:32:00 Rainylt

Paper: What Makes Training Multi-modal Classification Networks Hard? (CVPR 2020)

One-sentence summary: in multi-modal training, an overfitting indicator computed from the validation and training sets is used to modulate the loss weights of the different modalities, solving the imbalance problem of multi-modal training.

Even after the one-sentence summary there are still many open questions, so this paper is worth a full note.


Multi-modal fusion methods

(1) Early Fusion
Concat (or otherwise fuse) the raw inputs before they enter the network
(2) Mid Fusion
Extract features per modality, concat, pass through a fusion module, then through the classification head
(3) Late Fusion
Extract features per modality, concat, and pass directly through the classification head
(A minimal sketch of all three appears after this list.)
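The three styles can be summarized in a minimal PyTorch-style sketch (my own illustration; the encoder, fusion, and head modules are placeholders, not the paper's code):

```python
import torch

def early_fusion(x_a, x_b, encoder, head):
    # Fuse the raw inputs first (e.g., channel-wise concat), then one shared network
    return head(encoder(torch.cat([x_a, x_b], dim=1)))

def mid_fusion(x_a, x_b, enc_a, enc_b, fusion, head):
    # Per-modality encoders, concat, a dedicated fusion module, then the classifier
    z = torch.cat([enc_a(x_a), enc_b(x_b)], dim=1)
    return head(fusion(z))

def late_fusion(x_a, x_b, enc_a, enc_b, head):
    # Per-modality encoders, concat, straight into the classifier
    return head(torch.cat([enc_a(x_a), enc_b(x_b)], dim=1))
```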

What is the multimodal training imbalance problem?

The motivation: the authors found that on video classification tasks, the multi-modal model is often worse than the best single-modal model.
[Figure: validation accuracy of single-modal models (A, RGB, OF) vs. their multi-modal combinations]
As shown above, A stands for Audio and OF for optical flow.
The architectures are kept comparable. For example, A+RGB adds an audio encoder on top of the single-RGB model, concatenates the two features, and classifies them with one classifier head; single RGB simply passes RGB through its encoder and then the classifier.

Note that there seems to be no fusion module (e.g., a transformer) after the concat. Perhaps a proper fusion module could mitigate this problem to some extent?

Why?

Two findings:
(1) The multi-modal model reaches higher training accuracy but lower validation accuracy
(2) The Late Fusion model has almost twice as many parameters as the single-modal model

=> The suspected problem is overfitting

How to solve it?

First, the conventional remedies for overfitting were tried:
1. Dropout
2. Pre-training
3. Early stopping
Then several Mid-Fusion variants (an SE-gating sketch follows this list):
1. Mid-concat (conv extracts the features, more convs follow the concat, then the classification head)
2. SE-style gating
3. Non-Local-style gating (i.e., transformer-style attention)
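For reference, here is a minimal sketch of SE-style gating on a fused feature vector (my own illustration of the general idea; the paper's exact gating modules may differ):

```python
import torch
import torch.nn as nn

class SEGate(nn.Module):
    """Squeeze-and-Excitation-style channel gating on a concatenated feature."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):          # x: (batch, channels) fused feature
        gate = self.fc(x)          # per-channel weights in (0, 1)
        return x * gate            # re-weight the channels
```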

Comments:
1. Early stopping fails completely; maybe it stops too early?
2. Pre-training beats late-concat without pre-training, but still can't catch up with single-modal RGB; or maybe the pre-training itself just isn't good enough?
3. Mid-concat brings some improvement and even beats Non-Local, which I didn't expect. Is it because Non-Local is only inserted at one layer, while mid-concat is followed by more convs?
4. Dropout has some effect, which makes sense since there really is overfitting; was dropout simply not used before?
5. SE is channel-level attention only, with no spatial, temporal, or frequency-level attention, so its poor result is excusable; adding it is about the same as not adding it.
This part deserves a closer look at the details.

In short, mid-concat beats late-concat, and dropout also works, though the improvement is modest.
The fact that dropout helps confirms that the network really is overfitting. With dropout added, the model finally surpasses the single-modal baseline, and mid-concat can also surpass it, so fusing earlier is indeed worthwhile.

The approach of this paper

To tackle overfitting, the paper first proposes an indicator, the overfitting-to-generalization ratio (OGR), to measure the degree of overfitting between checkpoints $N$ and $N+n$:

$$O_N = L^*(\Theta_N) - L^T(\Theta_N), \qquad \mathrm{OGR} = \left|\frac{\Delta O_{N,n}}{\Delta G_{N,n}}\right| = \left|\frac{O_{N+n} - O_N}{L^*(\Theta_N) - L^*(\Theta_{N+n})}\right|$$

Here $L^*$ is the loss in the real scene (approximated on the validation set) and $L^T$ is the training loss.
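As a minimal sketch (my own illustration, not the paper's code), the OGR between two checkpoints can be computed from logged losses like this:

```python
def ogr(train_loss, val_loss, n0, n1):
    """Overfitting-to-generalization ratio between checkpoints n0 and n1.

    train_loss, val_loss: per-checkpoint loss lists; the validation
    loss stands in for the true loss L*.
    """
    # Overfitting O_N = L*(N) - L^T(N): the validation-training gap
    delta_o = (val_loss[n1] - train_loss[n1]) - (val_loss[n0] - train_loss[n0])
    # Generalization gain: how much the (approximate) true loss improved
    delta_g = val_loss[n0] - val_loss[n1]
    return abs(delta_o / delta_g)
```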
At the same time, on top of Late Fusion, the model gets 3 branches, each with its own classification head:
[Figure: two uni-modal branches plus a middle branch that classifies the concatenated feature]
The middle branch is the concatenated feature followed by a classification head, and the two sides are the per-modality classification heads. Each branch's loss carries a weight, and the weights are chosen by optimizing the indicator above (minimizing the OGR of the blended loss):

$$L_{\text{blend}} = \sum_{k} w_k L_k$$

where $k$ runs over the two uni-modal heads and the fused head.

Finally, this yields the weight formula:

$$w_k = \frac{1}{Z}\cdot\frac{G_k}{O_k^{2}}, \qquad Z = \sum_k \frac{G_k}{O_k^{2}}$$

where $G_k$ is the measured generalization gain (drop in validation loss) and $O_k$ the measured growth in overfitting for branch $k$.
I don't fully follow the derivation, but in short every loss is assigned a weight. So how is the final test performed?
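To make the objective concrete, here is a minimal PyTorch-style sketch of the weighted multi-head loss (my own illustration based on the formulas above; module names, dimensions, and weight values are placeholders, not the paper's code):

```python
import torch
import torch.nn as nn

class GBlendHeads(nn.Module):
    """Late-fusion model with one classification head per modality plus a fused head."""

    def __init__(self, dim_a, dim_v, num_classes):
        super().__init__()
        self.head_a = nn.Linear(dim_a, num_classes)            # audio head
        self.head_v = nn.Linear(dim_v, num_classes)            # RGB head
        self.head_av = nn.Linear(dim_a + dim_v, num_classes)   # fused (concat) head

    def forward(self, feat_a, feat_v):
        fused = torch.cat([feat_a, feat_v], dim=1)
        return self.head_a(feat_a), self.head_v(feat_v), self.head_av(fused)

def gblend_loss(logits, target, weights, criterion=nn.CrossEntropyLoss()):
    # Weighted sum of per-head losses; the weights follow w_k = G_k / (Z * O_k^2)
    return sum(w * criterion(l, target) for w, l in zip(weights, logits))
```

If I read the paper correctly, only the fused head's prediction is used at test time; the uni-modal heads exist purely to shape the gradients during training, which would answer the question above.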


Copyright notice
This article was written by [Rainylt]; please include a link to the original when reposting. Thanks.
https://yzsam.com/2022/222/202208100201260294.html