
Interpretation of the paper: Cross-Modality Fusion Transformer for Multispectral Object Detection


[Figure: a paired visible image (left) and thermal image (right)]

The thermal image on the right captures sharper pedestrian outlines in low-light conditions, and it also reveals pedestrians occluded by the pillars. In bright daylight, visible images carry more detail than thermal images, such as edges, textures, and colors. With these details we can easily spot the driver inside the motor tricycle, who is hard to find in the thermal image.

1. Bottleneck problem:

The real-world environment changes constantly (rain, fog, sunshine, clouds, etc.), so it is difficult for algorithms that rely only on visible-light sensor data (such as camera images) to detect targets under such dynamic conditions. Multispectral imaging is therefore increasingly adopted, since it combines information from multiple spectral bands, such as visible light and thermal infrared. By fusing the complementary information of different modalities, the perceptibility, reliability, and robustness of a detection algorithm can be further improved.

However, the introduction of multispectral data raises new questions:

a. How to integrate representations to take advantage of the inherent complementarity between different modalities?

b. How to design an efficient cross-modal fusion mechanism that achieves the maximum performance gain?

Extending convolutional networks to cross-modal fusion or modality interaction so as to fully exploit this inherent complementarity is not trivial: the convolution operator has a local rather than global receptive field, so information is only fused within local regions.

2. Contribution of this article:

(1) We introduce a new, powerful dual-stream backbone that, based on the Transformer scheme, augments each modality with features from the other.

(2) We propose a simple and effective CFT module and conduct a theoretical analysis on it, showing that the CFT module fuses both intra-modal and inter-modal features.

(3) Our method achieves state-of-the-art results on three public datasets, confirming its effectiveness.

3. Solution:

  • The method is built on the basic YOLOv5 framework.

  • The input is an image pair from two different modalities; the backbone is changed to a two-stream network with the CFT (Cross-Modality Fusion Transformer) module embedded in it.

  • The convolutional feature maps of the two modal images are flattened and concatenated (denoted I, of shape 2HW × C), and positional encoding is added to form the input of the CFT (see the sketch after this passage).

I is multiplied by three corresponding weight matrices W (each C × C) to obtain Q, K, and V.
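In symbols, this is standard scaled dot-product attention over the concatenated sequence $I \in \mathbb{R}^{2HW \times C}$:

$$Q = I W^Q,\quad K = I W^K,\quad V = I W^V,\qquad \mathrm{Attention}(Q,K,V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{C}}\right)V$$

Writing $I$ as the stack of the RGB part $I_R$ and the thermal part $I_T$, the product $QK^{\top}$ contains four blocks, $Q_R K_R^{\top}$, $Q_R K_T^{\top}$, $Q_T K_R^{\top}$, and $Q_T K_T^{\top}$: two intra-modal terms and two inter-modal terms. This is why a single self-attention pass over the joint sequence fuses information both within and across modalities, which is the point of the paper's theoretical analysis.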

The Transformer mechanism enables features of different modalities to interact.

Therefore, the modality-fusion module does not need to be carefully hand-designed: we simply concatenate the multi-modal features into one sequence, and the Transformer automatically performs intra- and inter-modal information fusion simultaneously, robustly capturing the latent interactions between RGB and thermal features.
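As a concrete illustration, below is a minimal single-head PyTorch sketch of the fusion step just described. It is written for this post, not taken from the authors' code; the class name `CrossModalityFusion` and the learnable positional embedding are assumptions based on the description above (each stream contributes an H×W feature map with C channels).

```python
import torch
import torch.nn as nn

class CrossModalityFusion(nn.Module):
    """Minimal single-head sketch of the fusion described above (not the official CFT code)."""

    def __init__(self, c, h, w):
        super().__init__()
        self.scale = c ** -0.5
        # Learnable positional encoding for the concatenated sequence of length 2*H*W.
        self.pos_embed = nn.Parameter(torch.zeros(1, 2 * h * w, c))
        # The three C x C projections that produce Q, K, V from I.
        self.w_q = nn.Linear(c, c, bias=False)
        self.w_k = nn.Linear(c, c, bias=False)
        self.w_v = nn.Linear(c, c, bias=False)

    def forward(self, x_rgb, x_thermal):
        b, c, h, w = x_rgb.shape
        # Flatten each modality to (B, H*W, C) and concatenate: I has shape (B, 2*H*W, C).
        i = torch.cat([x_rgb.flatten(2).transpose(1, 2),
                       x_thermal.flatten(2).transpose(1, 2)], dim=1)
        i = i + self.pos_embed
        q, k, v = self.w_q(i), self.w_k(i), self.w_v(i)
        # Self-attention over the joint sequence fuses intra- and inter-modal information at once.
        attn = torch.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)
        out = attn @ v
        # Split the sequence back into the two modalities and restore the (B, C, H, W) layout.
        out_rgb, out_thermal = out.chunk(2, dim=1)
        out_rgb = out_rgb.transpose(1, 2).reshape(b, c, h, w)
        out_thermal = out_thermal.transpose(1, 2).reshape(b, c, h, w)
        return out_rgb, out_thermal

# Usage example with arbitrary shapes:
rgb = torch.randn(1, 256, 20, 20)
thermal = torch.randn(1, 256, 20, 20)
fused_rgb, fused_thermal = CrossModalityFusion(256, 20, 20)(rgb, thermal)
```

In the paper's design, such a block is embedded at stages of the two-stream backbone, with the fused outputs fed back into the respective streams.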

4. Experiment:

Experiments are conducted on three public datasets: FLIR, LLVIP, and VEDAI.

Training parameters:

We use SGD optimizer with an initial learning rate of 1e-2, a momentum of 0.937, and a weight decay of 0.0005. As for data augmentation, we use the Mosaic method which mixes four training images in one image.
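For illustration, the optimizer setup described above might look like this in PyTorch. This is a sketch using the hyperparameter values quoted from the paper; the `model` here is only a stand-in.

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 8)  # stand-in; the real model is the two-stream YOLOv5 detector

# Hyperparameter values as quoted above.
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=1e-2,            # initial learning rate
    momentum=0.937,
    weight_decay=0.0005,
)
```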


Ablation experiments were conducted on YOLOv5:

Comparison of experimental results on the FLIR dataset (red arrows indicate missed detections):

Row 1: ground-truth annotations

Row 2: detection network without CFT

Row 3: detection network with CFT

Comparison of experimental results on the LLVIP dataset (red arrows indicate missed detections):

Row 1: ground-truth annotations

Row 2: detection network without CFT

Row 3: detection network with CFT

Comparison of experimental results on the VEDAI dataset (red arrows indicate missed detections; blue arrows indicate false positives):

Row 1: ground-truth annotations

Row 2: detection network without CFT

Row 3: detection network with CFT

The metrics on the three datasets show that the proposed method achieves the best performance:
