[Paper Reading] [3D Object Detection] Voxel Transformer for 3D Object Detection
2022-04-23 04:37:00 【Lukas88664】
Paper title: Voxel Transformer for 3D Object Detection
ICCV 2021
Most current approaches operate directly on the point cloud: the points are first grouped, and a transformer is then applied to each group. This paper instead proposes a voxel-based transformer that can be plugged into voxel-based detectors, making it easy to extract global features from the 3D voxels.
As usual, the overview figure comes first!
As the figure shows, the article's main innovation lies in the 3D backbone, which means the module can be applied to any voxel-based one-stage or two-stage detector.
3D convolutions on voxelized point clouds fall into two main categories: sparse and submanifold.
Their computation is essentially the same; they differ only in which attending voxels are used. For both kinds of 3D operations, refer to the SECOND detector.
In short, sparse convolution is used for downsampling, while submanifold convolution preserves the sparsity pattern.
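To make the difference concrete, here is a toy sketch (my own illustration, not SECOND's code) of which output sites each convolution type treats as active on a 1-D sparse grid:

```python
def sparse_active_sites(active, kernel=3, size=10):
    # Regular sparse convolution: every site covered by the kernel of
    # any active input becomes active, so the sparsity pattern dilates
    # at every layer (hence its use for downsampling).
    r = kernel // 2
    out = set()
    for i in active:
        for d in range(-r, r + 1):
            if 0 <= i + d < size:
                out.add(i + d)
    return sorted(out)

def submanifold_active_sites(active, kernel=3, size=10):
    # Submanifold convolution: outputs are computed only at sites that
    # were already active, so the sparsity pattern is preserved exactly.
    return sorted(set(active))

print(sparse_active_sites({4, 7}))       # neighbours become active too
print(submanifold_active_sites({4, 7}))  # pattern unchanged
```

With active inputs at sites 4 and 7, the sparse rule activates 3 through 8, while the submanifold rule keeps only 4 and 7.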
For each non-empty voxel, a transformer operation is carried out over its attending voxels (what an attending voxel is will be defined below). For the positional encoding, relative position encoding is chosen; anyone with a basic grasp of transformers will understand it from the formula below.
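As a concrete stand-in for that formula, here is a minimal dense sketch of single-head attention with relative position encoding. The weight names Wq/Wk/Wv/Wpos and the exact way the positional term enters the keys and values are my assumptions, not the paper's released code:

```python
import numpy as np

def rel_pos_attention(feats, coords, Wq, Wk, Wv, Wpos):
    # feats: (N, C) voxel features; coords: (N, 3) voxel centres.
    # The offset between two voxel centres is projected by Wpos and
    # added to the key and value of every attending voxel.
    q, k, v = feats @ Wq, feats @ Wk, feats @ Wv          # each (N, C)
    rel = coords[:, None, :] - coords[None, :, :]         # (N, N, 3)
    e = rel @ Wpos                                        # (N, N, C)
    # logits[i, j] = q_i . (k_j + e_ij) / sqrt(C)
    logits = np.einsum('ic,ijc->ij', q, k[None, :, :] + e) / np.sqrt(q.shape[1])
    a = np.exp(logits - logits.max(axis=1, keepdims=True))
    a /= a.sum(axis=1, keepdims=True)                     # row-wise softmax
    # out_i = sum_j a_ij * (v_j + e_ij)
    return np.einsum('ij,ijc->ic', a, v[None, :, :] + e)
```

Because only offsets between positions are used, the result is invariant to translating the whole point cloud, which is the main appeal of relative over absolute encoding.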
The submanifold layer
Its querying voxels are all the non-empty voxels. First the two kinds of attention operations are performed, and their output is added to the input (a residual connection), followed by BatchNorm. The result is then fed into the feed-forward layer, which performs a submanifold convolution, followed by another residual connection, another BatchNorm, and finally a ReLU activation before a projection. Note that BatchNorm is used here and dropout (the random inactivation of neurons) is removed; the author argues this helps the learning process. (The two attention operations mentioned here are explained below.)
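The block order described above can be sketched roughly like this (my reading of the text, not the released code; the `attend` callable stands in for the two attention modules, and the BatchNorm is a stateless inference-style simplification):

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    # Per-channel normalisation over the voxel axis; a real BatchNorm
    # also keeps running statistics and learned affine parameters.
    return (x - x.mean(0)) / np.sqrt(x.var(0) + eps)

def submanifold_votr_block(x, attend, W1, W2):
    # attention -> residual -> BatchNorm, then feed-forward ->
    # residual -> BatchNorm -> ReLU.  BatchNorm replaces the usual
    # LayerNorm, and there is no dropout anywhere in the block.
    h = batch_norm(x + attend(x))              # attention + residual + BN
    f = np.maximum(h @ W1, 0.0) @ W2           # two-layer feed-forward
    return np.maximum(batch_norm(h + f), 0.0)  # residual + BN + ReLU
```

Because the querying voxels equal the non-empty input voxels, the residual `x + attend(x)` is well defined; that is exactly what breaks in the sparse layer below.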
The sparse layer
It needs to perform the querying operation at some empty voxels, which have no features of their own, so an estimation function is used. The article says it could, for example, interpolate over the attending voxels, but the network directly adopts max pooling. Obviously the output of the self-attention layer no longer matches the input structure, so this layer drops the first residual connection.
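A minimal sketch of that estimation step, assuming the attending-voxel features are stacked in a 2-D array:

```python
import numpy as np

def init_empty_query(attending_feats):
    # An empty querying voxel has no feature of its own, so a starting
    # feature is pooled from its non-empty attending voxels.  The
    # article mentions alternatives such as interpolation; the network
    # simply takes the channel-wise maximum.
    return np.asarray(attending_feats).max(axis=0)
```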
Now let's explain the two attention modules.
The two modules are distinguished mainly by how their attending voxels are chosen.
Local attention
The voxels attending in this module are the voxels near the current querying voxel, roughly all non-empty voxels within a convolution-kernel-sized window.
A transformer operation is applied to them. Clearly, for the current querying voxel, its feature is a fusion of all voxels in the current receptive field, and compared with convolution the transformer is more receptive to features from nearby voxels.
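The attending-voxel selection for local attention can be sketched as a simple neighbourhood filter (radius 1, i.e. a 3x3x3 window, is my illustrative choice, not a value from the paper):

```python
def local_attending(query, nonempty, radius=1):
    # The attending set of a querying voxel is every non-empty voxel
    # whose integer coordinates fall inside a kernel-sized window
    # around it (Chebyshev distance <= radius).
    qx, qy, qz = query
    return [v for v in nonempty
            if max(abs(v[0] - qx), abs(v[1] - qy), abs(v[2] - qz)) <= radius]
```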
Dilated attention
This part is analogous to dilated convolution, as the similar name suggests; its main purpose is to enlarge the receptive field:
The article says that with a reasonable choice of attending voxels, a single sparse attention layer can push the query range out to 15 m.
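A sketch of how dilated attending voxels might be selected: nearby shells are sampled densely and distant shells ever more coarsely. The (start, end, stride) triples below are illustrative values of my own, not the paper's:

```python
def dilated_attending(query, nonempty, ranges=((1, 3, 1), (3, 9, 2), (9, 27, 6))):
    # For each (start, end, stride) shell, probe strided offsets around
    # the querying voxel and keep the ones that land on non-empty
    # voxels.  Larger strides far away keep the attending set small
    # while the receptive field grows.
    nonempty = set(nonempty)
    qx, qy, qz = query
    out = set()
    for start, end, stride in ranges:
        for dx in range(-end, end + 1, stride):
            for dy in range(-end, end + 1, stride):
                for dz in range(-end, end + 1, stride):
                    d = max(abs(dx), abs(dy), abs(dz))
                    cand = (qx + dx, qy + dy, qz + dz)
                    if start <= d < end and cand in nonempty:
                        out.add(cand)
    return sorted(out)
```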
Finally, the attention operations above can be understood with the help of the article's diagram:
After these two attention operations, the network fuses the local feature with the wide-receptive-field feature.
The author then proposes voxel query, a fast method for fetching non-empty voxels. The main idea is to encode (hash) the non-empty voxels once up front; whenever attention must be performed on a voxel, its attending voxels are fetched directly through their codes. This significantly reduces the model's complexity:
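The voxel-query idea can be sketched with an ordinary hash map standing in for the GPU hash table used in practice:

```python
def build_voxel_hash(nonempty_coords):
    # Hash every non-empty voxel's integer coordinate once; the value
    # is its index into the feature array.
    return {tuple(c): i for i, c in enumerate(nonempty_coords)}

def query_voxels(table, candidates):
    # Each attending-voxel lookup is now an O(1) average hash probe
    # instead of a linear scan over all non-empty voxels.
    return [table[tuple(c)] for c in candidates if tuple(c) in table]
```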
The results are very good:
The ablation study compares the necessity of the different attention operations:
The effect of removing the dropout (random inactivation) layer:
The number of attending voxels:
Finally, the inference speed and model size are compared with those of conventional models.
This is the first time I have seen a transformer applied directly to voxels; it is quite novel.
Copyright notice: this article was written by [Lukas88664]. Please include a link to the original when reposting. Thanks!
https://yzsam.com/2022/04/202204230407539009.html