He Kaiming's new work ViTDet: overturning the hierarchical-backbone concept in object detection
2022-08-11 06:34:00 【pontoon】
Exploring Plain Vision Transformer Backbones for Object Detection
[Paper]:
https://arxiv.org/abs/2203.16527
[Code]: Code will be made available.
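Object detectors are typically built from a generic, task-independent backbone plus modules designed specifically for detection (the neck and head). For a long time, because of the nature of convolutional networks, these backbones have been multi-scale, hierarchical architectures, which has heavily influenced the neck/head designs used to detect objects at multiple scales. Vision Transformers (ViT) differ from typical ConvNets: the original ViT is a simple, non-hierarchical architecture that keeps a single-scale feature map throughout. This "minimalism" runs into challenges when applied to object detection:
First, how can a plain backbone that was pre-trained upstream handle objects of various sizes in a downstream task?
Second, the cost of global attention grows quadratically with the number of image patches, so it is inefficient on high-resolution images.
One solution, which abandons this pursuit, is to reintroduce hierarchical designs into the backbone. Such solutions, e.g., Swin Transformers and related work, can inherit ConvNet-based detector designs and have been successful.
In this work, we pursue a different direction: we explore object detectors that use only plain, non-hierarchical backbones. If this direction succeeds, it will allow the original ViT backbone to be used for object detection; this decouples the pre-training design from the fine-tuning demands and keeps upstream and downstream tasks independent, as was the case for ConvNet-based research. This direction also partly follows the ViT philosophy of "fewer inductive biases" in the pursuit of universal features. As the non-local self-attention computation can learn translation-equivariant features, it may also learn scale-equivariant features from certain forms of supervised or self-supervised pre-training. (Personal understanding: designs such as Swin Transformer imitate ConvNets and thereby add inductive bias.)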
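1. Bottlenecks:
Problems that arise when an object detector uses only a plain, non-hierarchical backbone (ViT):
First, how can a plain backbone that was pre-trained upstream handle objects of various sizes in a downstream task?
Second, the cost of global attention grows quadratically with the number of image patches, so it is inefficient on high-resolution images.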
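To make the second bottleneck concrete, the sketch below counts patch tokens and query-key pairs for global self-attention at a few input sizes. The patch size of 16 and the image sizes are illustrative assumptions, not values taken from this summary, and the helper names are hypothetical.

```python
def num_tokens(height, width, patch=16):
    """Number of patch tokens, assuming non-overlapping square patches
    (patch=16 is an illustrative assumption)."""
    return (height // patch) * (width // patch)


def global_attention_pairs(height, width, patch=16):
    """Query-key pairs scored by one global self-attention layer:
    every token attends to every token, so the count is tokens squared."""
    n = num_tokens(height, width, patch)
    return n * n


if __name__ == "__main__":
    for side in (224, 512, 1024):
        n = num_tokens(side, side)
        pairs = global_attention_pairs(side, side)
        print(f"{side}x{side}: {n} tokens, {pairs:,} attention pairs")
```

Under these assumptions, going from a 224x224 to a 1024x1024 input multiplies the token count by roughly 21 and the number of attention pairs by roughly 437.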
2. Contributions:
(1) Proposes ViTDet, an object detector that uses only a plain, non-hierarchical backbone (ViT). It competes with leading hierarchical-backbone detectors (e.g., Swin, MViT), and with pre-training on ImageNet-1K without labels alone it can surpass hierarchical-backbone detectors pre-trained on ImageNet-21K.
(2) With the plain ViT backbone, the FPN module is dropped and only a single-scale feature map is used.
(3) Window attention is applied in the ViT backbone to address the inefficiency on high-resolution images, with only a small number of cross-window blocks added.
(4) The approach keeps detection-specific design separate from the task-agnostic backbone: detection priors are introduced only during fine-tuning, and the backbone design does not need to be tailored a priori in pre-training. (Personal understanding: otherwise, things such as the number of FPN levels and the hierarchy would have to be set by hand according to target sizes.)
3. Solution:
Our goal is to remove the hierarchical constraint on the backbone and to enable exploration of plain-backbone object detection. To this end, we aim for minimal modifications that adapt a plain backbone to the object detection task only during fine-tuning. After these adaptations, in principle any detector head can be applied; we opt for Mask R-CNN and its extensions. We do not aim to develop new components.
- Proposes a simple feature pyramid (SFP):

Left: a traditional hierarchical backbone + FPN. Right: the plain ViT backbone + SFP.

Only the last feature map from the backbone is used. On this feature map, a set of convolutions or deconvolutions is applied to generate multi-scale feature maps. In a traditional FPN, feature maps at different scales come from convolutional stages with different downsampling ratios; with a plain ViT backbone, we found this is not necessary, and simple deconvolutions are sufficient.
The authors also explored variant (b) and found that it does not work well. (Personal understanding: the original motivation for FPN is to combine low-resolution, semantically strong feature maps with high-resolution, semantically weak ones. When the backbone is plain and has no high-resolution maps, this basis is lost, which may explain why a simple pyramid is sufficient.)
As for why simple deconvolutions or convolutions can work better than a hierarchical design, we believe this is because ViT can rely on positional embeddings to encode position, and possibly because the high-dimensional ViT patch embeddings do not necessarily discard information. (Personal understanding: in FPN, among feature maps at different scales, the large ones are rich in texture and carry strong location information, whereas ViT has positional embeddings and can learn some location information. In addition, ViT has longer attention distances in deeper blocks and more limited distances in shallower blocks.)
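The idea of building every pyramid level from the single last feature map can be sketched as simple shape arithmetic. This is a minimal sketch, not the authors' implementation: the patch size of 16, the scale factors (4, 2, 1, 1/2), the operator labels, and the function name are all assumptions chosen to mirror a common FPN-style set of strides.

```python
from fractions import Fraction


def simple_pyramid_shapes(image_h, image_w, patch=16,
                          scale_factors=(4, 2, 1, Fraction(1, 2))):
    """Describe each pyramid level derived from the last backbone feature map.

    Assumptions (not from the source): patch=16 and the listed scale factors.
    A factor > 1 is labelled as a deconvolution (upsampling), a factor < 1 as
    a strided convolution or pooling (downsampling), and 1 as identity.
    """
    base_h, base_w = image_h // patch, image_w // patch
    levels = []
    for factor in scale_factors:
        factor = Fraction(factor)
        if factor > 1:
            op = "deconv"
        elif factor == 1:
            op = "identity"
        else:
            op = "strided conv or pooling"
        levels.append({
            "op": op,
            "stride": int(patch / factor),  # stride relative to the input image
            "size": (int(base_h * factor), int(base_w * factor)),
        })
    return levels


if __name__ == "__main__":
    for level in simple_pyramid_shapes(1024, 1024):
        print(level)
```

Every level reads from the same stride-16 map; no intermediate backbone stages are involved, which is the contrast with a traditional FPN drawn above.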
- Window attention with a few cross-window blocks:
We focus on the scenario in which the pre-trained backbone performs global self-attention and is then adapted to higher-resolution inputs during fine-tuning. In contrast, some methods (e.g., Swin Transformer) modify the attention computation directly in pre-training to accommodate higher-resolution inputs.
We explore window attention with a few cross-window blocks. During fine-tuning, given a high-resolution feature map, we divide it into regular non-overlapping windows and compute self-attention within each window.
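The window partitioning described here can be sketched on token coordinates alone. This is a rough illustration under assumptions: the 64x64 token grid and the window size of 16 are hypothetical values picked so the grid divides evenly, and the function names are not from the paper. Edge windows are simply truncated here, whereas a real implementation would more likely pad.

```python
def window_partition(grid_h, grid_w, window):
    """Group the (row, col) token coordinates of a grid_h x grid_w feature
    map into regular non-overlapping window x window tiles.
    Tiles at the border are truncated if the grid is not divisible."""
    windows = []
    for top in range(0, grid_h, window):
        for left in range(0, grid_w, window):
            tile = [(r, c)
                    for r in range(top, min(top + window, grid_h))
                    for c in range(left, min(left + window, grid_w))]
            windows.append(tile)
    return windows


def attention_pairs(windows):
    """Query-key pairs when self-attention is computed inside each window."""
    return sum(len(tile) ** 2 for tile in windows)


if __name__ == "__main__":
    grid = 64  # hypothetical token grid, e.g. a 1024x1024 image with 16x16 patches
    windows = window_partition(grid, grid, window=16)
    windowed = attention_pairs(windows)
    global_pairs = (grid * grid) ** 2
    print(len(windows), "windows;", windowed, "pairs vs", global_pairs, "global")
```

With these numbers, restricting attention to windows cuts the number of scored pairs by a factor equal to the number of windows (16), at the price of no information flow between windows inside such a block.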
Unlike Swin, we do not "shift" the windows across layers. To allow information to be exchanged between windows, we use a very small number (4 by default) of blocks that can go across windows (cross-window blocks). We evenly split the pre-trained backbone into 4 subsets of blocks (e.g., 6 in each subset for the 24-block ViT-L) and apply a cross-window block at the last block of each subset. We study two strategies:
(1) Global self-attention
Global self-attention is performed in the last block of each subset. Because the number of global blocks is small, the memory and computation costs are acceptable.
(2) Convolutional
An extra convolutional block is added after each subset. A convolutional block is a residual block consisting of one or more convolutions and an identity shortcut.
This makes detection fine-tuning compatible with global self-attention pre-training; there is no need to redesign the pre-training architecture.
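The placement rule above (split the backbone evenly into subsets and treat the last block of each subset specially) can be written down in a few lines. The function names and the 0-based indexing are illustrative choices; the sketch shows strategy (1), and for strategy (2) the same indices would mark where an extra convolutional block is appended.

```python
def cross_window_block_indices(depth, num_subsets=4):
    """0-based indices of the last block in each evenly sized subset."""
    if depth % num_subsets != 0:
        raise ValueError("depth must be divisible by num_subsets")
    subset_size = depth // num_subsets
    return [subset_size * (i + 1) - 1 for i in range(num_subsets)]


def attention_plan(depth, num_subsets=4):
    """Per-block attention type for strategy (1): global self-attention in
    the last block of each subset, window attention everywhere else."""
    special = set(cross_window_block_indices(depth, num_subsets))
    return ["global" if i in special else "window" for i in range(depth)]


if __name__ == "__main__":
    print(cross_window_block_indices(24))  # 24-block ViT-L
    print(attention_plan(24))
```

For the 24-block ViT-L mentioned above, this yields blocks 5, 11, 17, and 23 (0-based), i.e. the 6th block of each subset.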
Discussion:
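On the detector backbone side, our work follows the spirit of the original plain ViT paper. While the ViT paper's discussion focused on reducing inductive biases on translation equivariance, in our case it is about having fewer or even no inductive biases on scale equivariance in the backbone. We hypothesize that a plain backbone achieves scale equivariance by learning the prior from data, analogous to how it learns translation equivariance and locality without convolutions. (Personal understanding: a hierarchical architecture in effect introduces a scale-equivariance inductive bias.)
4. Experiments:
Ablations are run on the COCO dataset. We train on the train2017 split and evaluate on the val2017 split. The evaluation metrics are object detection (APbox) and instance segmentation (APmask). The backbone is initialized with MAE pre-trained on IN-1K without labels.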

Ablation: Mask R-CNN with a plain ViT backbone, evaluated on COCO, comparing SFP, FPN, and SFP with top-down connections added.

Ablation: Mask R-CNN with a plain ViT backbone, evaluated on COCO.
(a) Compares no cross-window module against global self-attention, convolutional blocks, and Swin Transformer's shifted-window design.
(b) Compares no cross-window module against convolutional cross-window modules with different structures: naive (one 3x3 convolution), basic (two 3x3 convolutions), and bottleneck (1×1→3×3→1×1).
(c) Compares no cross-window module against cross-window modules placed at different locations: first 4 blocks (all 4 cross-window modules placed in the first blocks of the backbone), last 4 blocks (all 4 placed in the last blocks of the backbone), and evenly 4 blocks (the 4 modules placed in the last block of each subset of the backbone).
(d) Compares no cross-window module against different numbers of cross-window modules.

Ablation: Mask R-CNN with a plain ViT-L backbone, comparing no cross-window module against different numbers and types of cross-window modules.
Training memory (per GPU) is measured with a batch size of 1. Test time (per image) is measured on an A100 GPU.
Convolution is the most practical option: it adds only ≤5% memory and time, at the cost of 4% more parameters. Using 4 global blocks is also feasible and does not increase model size. Using global self-attention in all 24 blocks is impractical. Importantly, all of these architectural adaptations are performed only during fine-tuning; there is no need to redesign the pre-training architecture.

Comparison: hierarchical backbones versus the plain backbone proposed by the authors.
We search for the best hyperparameters separately for each backbone. Our Swin results are better than the corresponding results in the original paper; our MViTv2 results are better than or on par with those reported in the original paper.
Following the original papers, Swin and MViTv2 both use relative position biases. For a fairer comparison, we also adopt relative position biases in our ViT backbones, but only during fine-tuning, without affecting pre-training; this addition improves AP by 1 point. Relative position biases are not used in any of the ablations in the previous section.
On why MAE is not compared on hierarchical backbones: we are also curious about the effect of MAE on hierarchical backbones, but that is largely beyond the scope of this paper, because it involves finding good training recipes for hierarchical backbones with MAE. It was also observed that MViTv2-L with MAE pre-training on ImageNet-1K is 1.3 better than MViTv2-L with supervised pre-training on ImageNet-21K (reaching 54.9), but this gain is smaller than that seen for ViT-L and ViT-H.
This suggests that plain ViT backbones may benefit more from MAE pre-training than hierarchical backbones, and that MAE's self-supervised training can compensate for the lack of inductive biases on scale.
Hierarchical backbones often involve enhanced self-attention block designs, for example the shifted-window attention in Swin and the pooling attention in MViT v1/v2. If applied to plain backbones, these block designs could also improve accuracy and parameter efficiency. Although this may give our competitors an advantage, our method remains competitive without these enhancements.

Parameter count, FLOPs, and test time versus APbox for different models. Conclusion: the larger the model, the greater the benefit.

Comparison: the authors' ViTDet versus other mainstream detection networks (hierarchical or single-scale). The detection frameworks are Cascade Mask R-CNN (denoted "Cascade"), Hybrid Task Cascade (HTC), or its extension (HTC++).
Regarding another plain backbone, UViT: its aim is to design a plain backbone suited to detection tasks, whereas the authors' method is not task-specific and is general-purpose.
Our exploration shows that plain-backbone detection is a promising research direction. This approach largely maintains the independence of the general-purpose backbone from downstream task-specific designs.