Efficient and Robust 2D-to-BEV Representation Learning via Geometry-guided Kernel Transformer Paper Notes

2022-08-10 13:07:00 byzy

Original paper: https://arxiv.org/abs/2206.04584

1. Introduction

        Depending on whether geometric information is used explicitly when converting image features into BEV features, current methods can be divided into geometry-based point-wise transformations and geometry-free global transformations.

        The former (left figure) use the camera's calibrated intrinsic and extrinsic parameters to build a correspondence between image pixels and BEV grids. However, these methods rely too heavily on the calibration data: in practice the camera may be offset from its calibrated pose, which makes the correspondence unstable. In addition, they often require complex and time-consuming operations, such as dense depth distribution estimation and propagating features along rays into BEV space.

        The latter (right figure) flatten the image features, and each BEV grid interacts with all image features. The view transformation needs no geometric prior, so it is insensitive to camera offset. However, the computational cost grows with the number of image pixels, which puts efficiency at odds with resolution; and without the guidance of a geometric prior, the model has to mine discriminative information from all views, which makes convergence difficult.

        This paper proposes the Geometry-guided Kernel Transformer (GKT), which uses the camera parameters as guidance without relying on them too heavily. When the camera is offset, the corresponding kernel regions shift as well but can still cover the target, which makes the method insensitive to camera offset. The attention weights within a kernel region are generated dynamically according to the offset.

        GKT uses lookup table indexing, which removes the 2D-3D mapping operations of point-wise transformations and improves runtime efficiency. Compared with global transformations, GKT needs no global interaction and attends only to the geometry-guided kernel regions, so it runs and converges faster. GKT therefore strikes a balance between point-wise and global transformations.

2. Method

2.1 Geometry-guided Kernel Transformer

        The figure above shows the framework of GKT. The multi-view images pass through a shared CNN backbone that extracts multi-scale features. Each grid of the BEV space has a 3D coordinate P_i=(x_i,y_i,z) and a query embedding q_i, where z is a predefined height shared by all grids. P_i is coarsely projected to image coordinates using the camera intrinsic and extrinsic parameters and then rounded, which guides the transformer to attend to the corresponding region:

Q_i^{sv}=K^{sv}\cdot Rt^{sv}\cdot P^{sv}_i;\;\; \; \bar{Q}_i^{sv}=\texttt{round}(Q^{sv}_i)

where s indexes the feature scale and v indexes the view.

        Then a K_h\times K_w kernel region around \bar{Q}^{sv}_i is taken, and each query q_i interacts with all features inside the corresponding kernel regions of every view and every scale (features falling outside the image are set to 0).
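The projection and kernel selection described above can be sketched in NumPy as follows. This is only an illustrative sketch, not the authors' implementation: the function names `project_bev_to_pixels` and `gather_kernel` are hypothetical, a single view and a single scale are assumed, and the perspective division by depth is the standard pinhole step, which the formula above leaves implicit.

```python
import numpy as np

def project_bev_to_pixels(points_xyz, K, Rt):
    """Project N BEV grid centers (N, 3) to rounded pixel coordinates (N, 2).

    Assumptions: K is a 3x3 intrinsic matrix and Rt a 3x4 extrinsic matrix
    (world -> camera). Returns integer (u, v) pixels and a boolean mask of
    points lying in front of the camera.
    """
    n = points_xyz.shape[0]
    homo = np.concatenate([points_xyz, np.ones((n, 1))], axis=1)  # (N, 4)
    cam = (K @ Rt @ homo.T).T                                      # (N, 3)
    depth = cam[:, 2]
    in_front = depth > 1e-6
    safe_depth = np.where(in_front, depth, 1.0)   # avoid division by <= 0
    uv = cam[:, :2] / safe_depth[:, None]          # perspective division
    return np.rint(uv).astype(np.int64), in_front  # rounding as in Q-bar

def gather_kernel(feat, uv, kh, kw):
    """Collect a kh x kw window of features around each pixel in uv.

    feat is an (H, W, C) feature map. Positions outside the map yield zeros.
    Returns an array of shape (N, kh * kw, C). Points behind the camera are
    not filtered here; the caller can drop them with the in_front mask.
    """
    H, W, C = feat.shape
    dv, du = np.meshgrid(np.arange(kh) - kh // 2,
                         np.arange(kw) - kw // 2, indexing="ij")
    us = uv[:, 0:1] + du.reshape(1, -1)   # (N, kh*kw) column indices
    vs = uv[:, 1:2] + dv.reshape(1, -1)   # (N, kh*kw) row indices
    valid = (us >= 0) & (us < W) & (vs >= 0) & (vs < H)
    out = np.zeros((uv.shape[0], kh * kw, C), dtype=feat.dtype)
    out[valid] = feat[vs[valid], us[valid]]
    return out
```

The gathered `(N, kh * kw, C)` tensor is what each query q_i would attend over; repeating the two calls per view and per scale and concatenating along the second axis would give the full set of keys.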

2.2 Robustness to camera offset

        The camera offset is decomposed into a rotation offset and a translation offset. The translation offset is

T_{devi}=\begin{bmatrix} 1 & 0 & 0 & \Delta x\\ 0 & 1 & 0 & \Delta y\\ 0 & 0 & 1 & \Delta z\\ 0 & 0 & 0 & 1 \end{bmatrix}

The rotation offset is

R_{devi}=R_{\theta_x}\cdot R_{\theta_y}\cdot R_{\theta_z}

where

R_{\theta_x}=\begin{bmatrix} 1 & 0 & 0 & 0\\ 0 & \cos(\theta_x) & \sin(\theta_x) & 0\\ 0 & -\sin(\theta_x) & \cos(\theta_x) & 0\\ 0 & 0 & 0 & 1 \end{bmatrix}

R_{\theta_y}=\begin{bmatrix} \cos(\theta_y) & 0 & -\sin(\theta_y) & 0\\ 0 & 1 & 0 & 0\\ \sin(\theta_y) & 0 & \cos(\theta_y) & 0\\ 0 & 0 & 0 & 1 \end{bmatrix}

R_{\theta_z}=\begin{bmatrix} \cos(\theta_z) & \sin(\theta_z) & 0 & 0\\ -\sin(\theta_z) & \cos(\theta_z) & 0 & 0\\ 0 & 0 & 1 & 0\\ 0 & 0 & 0 & 1 \end{bmatrix}

The noise random variables \Delta x,\Delta y,\Delta z,\theta_x,\theta_y,\theta_z satisfy

\Delta x,\Delta y,\Delta z\sim N(0,\sigma_1^2);\; \; \; \theta_x,\theta_y,\theta_z\sim N(0,\sigma^2_2)

After adding the offset noise, the projection formula from Section 2.1 becomes

Q_i^{sv}=K^{sv}\cdot R_{devi}\cdot T_{devi}\cdot Rt^{sv} \cdot P^{sv}_i

        Because the rounding operation tolerates noise, small offsets do not change the kernel region; even with a somewhat larger offset, the kernel region can still cover the target, and the attention weights can be adjusted dynamically according to the offset.
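A minimal sketch of this noise model, following the matrices above, might look like the code below. The helper names are hypothetical, and two conventions are assumed that the notes do not spell out: Rt is taken as a 4x4 homogeneous matrix, and K is padded to 3x4 so that the product K · R_devi · T_devi · Rt · P is dimensionally consistent.

```python
import numpy as np

def translation_offset(dx, dy, dz):
    """4x4 homogeneous translation T_devi."""
    T = np.eye(4)
    T[:3, 3] = [dx, dy, dz]
    return T

def rotation_offset(tx, ty, tz):
    """4x4 rotation R_devi = R_x @ R_y @ R_z with the sign convention above."""
    cx, sx = np.cos(tx), np.sin(tx)
    cy, sy = np.cos(ty), np.sin(ty)
    cz, sz = np.cos(tz), np.sin(tz)
    Rx = np.array([[1, 0, 0, 0], [0, cx, sx, 0], [0, -sx, cx, 0], [0, 0, 0, 1]], dtype=float)
    Ry = np.array([[cy, 0, -sy, 0], [0, 1, 0, 0], [sy, 0, cy, 0], [0, 0, 0, 1]], dtype=float)
    Rz = np.array([[cz, sz, 0, 0], [-sz, cz, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]], dtype=float)
    return Rx @ Ry @ Rz

def sample_deviation(sigma_t, sigma_r, rng):
    """Draw translation noise ~ N(0, sigma_t^2) and rotation noise ~ N(0, sigma_r^2)."""
    dx, dy, dz = rng.normal(0.0, sigma_t, size=3)
    tx, ty, tz = rng.normal(0.0, sigma_r, size=3)
    return rotation_offset(tx, ty, tz) @ translation_offset(dx, dy, dz)

def project(K, Rt, point, deviation=None):
    """Rounded pixel (u, v) of a 3D point; K is 3x3, Rt is 4x4 (assumed)."""
    K34 = np.hstack([K, np.zeros((3, 1))])          # pad K to 3x4
    M = Rt if deviation is None else deviation @ Rt
    q = K34 @ M @ np.append(point, 1.0)
    return np.rint(q[:2] / q[2]).astype(int)        # divide by depth, then round
```

With a sufficiently small deviation the rounded pixel stays the same, so the kernel region is unchanged; with a larger one the kernel center shifts by a few pixels, and a kernel of adequate size still contains the original location.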

2.3 BEV-to-2D lookup table indexing

        The kernel region of each BEV grid is fixed and can be computed offline. Before runtime, the pixel indices corresponding to each BEV grid are stored in a lookup table, so at runtime the features at the corresponding locations can be fetched directly and efficiently.
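The offline/runtime split could be sketched as below. This is one possible realization under stated assumptions rather than the paper's code: `build_lookup_table` and `lookup_features` are hypothetical names, the kernel centers are assumed to be the already rounded (u, v) projections, and out-of-image positions are redirected to an extra padding slot that holds zeros.

```python
import numpy as np

def build_lookup_table(centers_uv, H, W, kh, kw):
    """Offline step: flat feature indices for every BEV grid's kernel region.

    centers_uv is an (N, 2) integer array of (u, v) kernel centers.
    Positions outside the H x W map point to index H * W, a zero padding slot.
    Returns an (N, kh * kw) integer table.
    """
    dv, du = np.meshgrid(np.arange(kh) - kh // 2,
                         np.arange(kw) - kw // 2, indexing="ij")
    us = centers_uv[:, 0:1] + du.reshape(1, -1)
    vs = centers_uv[:, 1:2] + dv.reshape(1, -1)
    inside = (us >= 0) & (us < W) & (vs >= 0) & (vs < H)
    return np.where(inside, vs * W + us, H * W)

def lookup_features(feat, table):
    """Runtime step: a single indexing operation, with no projection involved.

    feat is an (H, W, C) feature map; returns (N, kh * kw, C).
    """
    H, W, C = feat.shape
    flat = np.concatenate([feat.reshape(H * W, C),
                           np.zeros((1, C), feat.dtype)])  # last row = padding
    return flat[table]
```

Because the table depends only on the calibration and the BEV layout, it would be built once and reused for every frame; only `lookup_features` runs per inference.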

2.4 Kernel configuration

        The kernel size can be configured flexibly to balance receptive field against computational cost. Thanks to lookup table indexing, the kernel layout can also be chosen arbitrarily (for example, cross-shaped kernels or dilated kernels).
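Since a lookup table only stores indices, a kernel layout reduces to a list of (row, column) offsets around the projected center. The following sketch illustrates three such layouts; the function names and the exact shapes are assumptions made for illustration and are not taken from the paper.

```python
def rect_kernel(kh, kw):
    """Dense kh x kw window, e.g. rect_kernel(7, 1) is a vertical kernel."""
    return [(dv, du)
            for dv in range(-(kh // 2), kh - kh // 2)
            for du in range(-(kw // 2), kw - kw // 2)]

def cross_kernel(arm):
    """Cross-shaped layout: the center plus `arm` pixels in each direction."""
    offsets = [(0, 0)]
    for k in range(1, arm + 1):
        offsets += [(-k, 0), (k, 0), (0, -k), (0, k)]
    return offsets

def dilated_kernel(kh, kw, dilation):
    """Dense window whose offsets are spread out by a dilation factor."""
    return [(dv * dilation, du * dilation) for dv, du in rect_kernel(kh, kw)]
```

Any of these offset lists could be added to the rounded kernel centers when the lookup table is built, so changing the layout changes only the offline step and the number of entries per BEV grid.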

3. Experiments

        Implementation details: the predefined BEV grid has a low resolution; before segmentation, upsampling and convolution blocks produce a high-resolution BEV grid, which is used for map segmentation.

        Main results: among all real-time methods, the proposed method is the fastest and performs best, although BEVFormer, which is far from real-time, achieves better performance.

        Robustness to camera offset: the experiments examine the performance degradation under different noise variances and find that GKT maintains comparable performance under a certain level of noise. Larger kernels are found to be more robust, and the kernel extent in the vertical direction has the greater effect. This may be because the z of the BEV grids is predefined and therefore carries greater uncertainty.

        When there is no noise, GKT achieves its best performance with a vertical kernel (horizontal width of 1).

        Robustness to the BEV height: because GKT uses only a coarse projection, it is insensitive to the predefined z value.

        Convergence speed: introducing the geometric prior makes GKT converge faster than CVT (a method that uses global transformation), and it reaches better results within a short training schedule.

        Comparison of different GKT implementations

  1. Im2col: the image is unfolded into columns, each column representing a kernel region, and the corresponding kernel region is selected for each BEV query. This approach requires a large amount of storage.
  2. Grid sampling: all features in the kernel region are sampled and concatenated.
  3. Lookup table indexing: as described in Section 2.3.

        In terms of inference speed, the lookup table indexing approach is the fastest.
