Interpretation of the paper: Cross-Modality Fusion Transformer for Multispectral Object Detection
2022-08-11 06:32:00 【pontoon】
[Figure: an example image pair from visible (left) and thermal (right) cameras]
The thermal image on the right captures sharper pedestrian outlines in low-light conditions, and it also reveals pedestrians occluded by the pillars. In bright daylight, visible images carry more detail than thermal images, such as edges, textures, and colors. With these details we can easily spot the driver inside the motor tricycle, who is hard to find in the thermal image.
1. Bottleneck problem:
The real-world environment changes constantly: rainy days, foggy days, sunny days, cloudy days, and so on. It is difficult for an algorithm to detect targets reliably using only visible-light sensor data (e.g., images captured by an RGB camera). Multispectral imaging techniques are therefore gradually being adopted, because they combine information from multiple sensors, such as visible-light and thermal cameras. By fusing the complementary strengths of the different modalities, the perceptibility, reliability, and robustness of the detection algorithm can be further improved.
However, the introduction of multispectral data will create new problems:
a. How to fuse the representations so as to exploit the inherent complementarity between the different modalities?
b. How to design an efficient cross-modal fusion mechanism that yields the maximum performance gain?
Extending convolutional networks to cross-modal fusion or modal interaction in a way that fully exploits this complementarity is not trivial: the convolution operator has a local (non-global) receptive field, so information is only fused within local regions.
2. Contribution of this article:
(1) A new, powerful dual-stream backbone that augments one modality with information from the other, based on the Transformer scheme.
(2) We propose a simple and effective CFT module and conduct a theoretical analysis on it, showing that the CFT module fuses both intra-modal and inter-modal features.
(3) Experimental results achieve state-of-the-art performance on three public datasets, which confirms the effectiveness of the proposed method.
3. Solution:
Modifications are made within the basic YOLOv5 framework: the input is an image pair from the two modalities, the backbone is replaced with a two-stream network, and CFT (Cross-Modality Fusion Transformer) modules are embedded in it.
Flatten the convolved feature maps of the two modalities and concatenate them into a single sequence I (shape 2HW × C), then add positional encodings to form the CFT input.
Multiply three copies of I by the corresponding weight matrices W (each C × C) to obtain Q, K, and V.
The Transformer mechanism enables features of different modalities to interact.
Therefore, the modal fusion module does not need to be carefully hand-designed. We simply concatenate the multi-modal features into one sequence, and the Transformer automatically performs intra- and inter-modal information fusion simultaneously, robustly capturing the latent interactions between the RGB and thermal features.
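The steps above (flatten, concatenate into I, project to Q/K/V, attend globally over all 2HW tokens) can be sketched in plain Python. This is a minimal illustration, not the authors' implementation (which is PyTorch-based); the function and variable names are hypothetical, and positional encoding and multi-head splitting are omitted for brevity.

```python
import math

def matmul(A, B):
    """Naive matrix multiply for small illustrative matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def softmax(row):
    m = max(row)
    e = [math.exp(x - m) for x in row]
    s = sum(e)
    return [x / s for x in e]

def cft_attention(rgb_tokens, thermal_tokens, Wq, Wk, Wv):
    """Self-attention over the concatenated RGB+thermal token sequence.

    rgb_tokens, thermal_tokens: lists of HW tokens, each a length-C vector
    Wq, Wk, Wv: hypothetical learned C x C projection matrices
    """
    I = rgb_tokens + thermal_tokens              # concat -> (2HW) x C
    Q, K, V = matmul(I, Wq), matmul(I, Wk), matmul(I, Wv)
    d = len(Wq)                                  # channel dimension C
    # scores[i][j] couples EVERY token pair, so intra-modal (RGB-RGB,
    # thermal-thermal) and inter-modal (RGB-thermal) interactions are
    # fused in one global attention step -- unlike a local convolution.
    scores = [[sum(q * k for q, k in zip(Q[i], K[j])) / math.sqrt(d)
               for j in range(len(K))] for i in range(len(Q))]
    A = [softmax(row) for row in scores]         # (2HW) x (2HW) attention map
    return matmul(A, V)                          # (2HW) x C fused features
```

Because each output token is a convex combination of value vectors drawn from both modalities, the fused sequence already mixes intra- and inter-modal information, which is exactly why no hand-crafted fusion rule is needed.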
Experiment:
Experiments are conducted on three public datasets: FLIR, LLVIP, and VEDAI.
Training parameters:
We use SGD optimizer with an initial learning rate of 1e-2, a momentum of 0.937, and a weight decay of 0.0005. As for data augmentation, we use the Mosaic method which mixes four training images in one image.
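The hyperparameters above correspond to the standard SGD-with-momentum update (with L2 weight decay folded into the gradient, as PyTorch's `SGD` does). A pure-Python sketch of a single update step, with illustrative names, assuming that formulation:

```python
def sgd_step(params, grads, velocities,
             lr=1e-2, momentum=0.937, weight_decay=5e-4):
    """One SGD update step with momentum and L2 weight decay.

    params, grads, velocities: parallel lists of scalar parameters,
    their gradients, and their momentum buffers.
    """
    for i, (w, g) in enumerate(zip(params, grads)):
        g = g + weight_decay * w                  # L2 penalty added to gradient
        velocities[i] = momentum * velocities[i] + g
        params[i] = w - lr * velocities[i]
    return params, velocities
```

For example, starting from w = 1.0 with gradient 0.5 and zero velocity, the effective gradient is 0.5 + 0.0005·1.0 = 0.5005, so the new weight is 1.0 − 0.01·0.5005 = 0.994995.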
An ablation study was conducted on YOLOv5:
Comparison of detection results on the FLIR dataset (red arrows indicate missed detections):
Row 1: ground-truth labels
Row 2: detection network without CFT
Row 3: detection network with CFT
Comparison of detection results on the LLVIP dataset (red arrows indicate missed detections):
Row 1: ground-truth labels
Row 2: detection network without CFT
Row 3: detection network with CFT
Comparison of detection results on the VEDAI dataset (red arrows indicate missed detections, blue arrows indicate false positives):
Row 1: ground-truth labels
Row 2: detection network without CFT
Row 3: detection network with CFT
Quantitative metrics on the three datasets show that the proposed method achieves the best performance: