
M²BEV: Multi-Camera Joint 3D Detection and Segmentation with Unified Bird’s-Eye View Representation

2022-08-10 13:08:00 byzy

M²BEV: Multi-Camera Joint 3D Detection and Segmentation with Unified Bird's-Eye View Representation — paper notes

Original paper: https://arxiv.org/pdf/2204.05088.pdf

1. Introduction

        This paper designs a unified network that jointly trains multi-view-image 3D object detection and BEV segmentation. In addition, it proposes several techniques that improve accuracy and reduce memory cost:

  1. Efficient BEV encoder: a "spatial-to-channel" (S2C) operation reshapes the 4D voxel tensor into a 3D BEV tensor, avoiding 3D convolutions;
  2. Dynamic box assignment for 3D detection: a learned matching strategy assigns anchor boxes according to the actual distribution of ground-truth boxes;
  3. BEV centerness for BEV segmentation: since distant BEV regions correspond to fewer image pixels, pixels are re-weighted by their distance to the ego vehicle, with larger weights for distant pixels;
  4. 2D detection pre-training and auxiliary supervision in the 2D image encoder: accelerates training and improves performance.

3. Method

        The overall architecture is shown in the figure below.

3.1 M²BEV Pipeline

        Overall pipeline: the multi-view images are first fed into the image encoder to obtain 2D features, which are then projected into 3D space as voxels. The voxels are passed through the efficient BEV encoder to obtain BEV features, on which the detection head and segmentation head finally make their predictions.

        2D image encoder: for each view, a shared CNN (e.g. ResNet) with a feature pyramid network (FPN) extracts multi-level features, which are upsampled to the same size, concatenated, and fused by a 1\times1 convolution into a tensor F.
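The multi-level fusion step can be sketched with NumPy. This is a toy illustration, not the paper's implementation: the shapes, the nearest-neighbour upsampling, and the random 1\times1-convolution weights are all assumptions made for the sketch.

```python
import numpy as np

def upsample_nearest(feat, factor):
    """Nearest-neighbour upsampling of a (C, H, W) feature map."""
    return feat.repeat(factor, axis=1).repeat(factor, axis=2)

def fuse_fpn_levels(levels, proj):
    """Upsample all FPN levels to the finest resolution, concatenate along
    channels, then fuse with a 1x1 convolution (a per-pixel linear map)."""
    target_h = levels[0].shape[1]
    ups = [upsample_nearest(f, target_h // f.shape[1]) for f in levels]
    cat = np.concatenate(ups, axis=0)          # (sum of C_l, H, W)
    return np.einsum('oc,chw->ohw', proj, cat)  # 1x1 conv over channels

# toy multi-level features: (C=8, H, W) at three resolutions
levels = [np.random.rand(8, 32, 32), np.random.rand(8, 16, 16), np.random.rand(8, 8, 8)]
proj = np.random.rand(16, 24)   # 1x1 conv weights: 24 -> 16 channels
F = fuse_fpn_levels(levels, proj)
print(F.shape)  # (16, 32, 32)
```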

        2D-to-3D projection: this module is the key component that makes multi-task training possible. It combines the multi-view features and projects them into the voxel tensor V; see Section 3.2. Since the voxel features incorporate images from different views, they form a unified representation.

        3D BEV encoder: its purpose is to compress the z axis and obtain BEV features. The most direct approach is to apply several strided 3D convolutions along the z axis, but that is slow and inefficient. This paper instead uses the "spatial-to-channel" (S2C) operation; see Section 3.3.

        3D detection head: given BEV features, LiDAR-based detection heads can be used directly. This paper adopts the PointPillars head (generate dense 3D anchor boxes, then predict class, size, and orientation), which consists of only 3 parallel 1\times1 convolutions. For the assignment strategy, however, dynamic box assignment is used; see Section 3.3.

        BEV segmentation head: composed of several 3\times3 convolutions and one 1\times1 convolution. The BEV-centerness strategy is also used to re-weight the per-pixel loss; see Section 3.3.

3.2 Efficient 2D-to-3D Projection

        Let I be the intrinsic matrix, E the extrinsic matrix, P the image, and V the voxel tensor. The projection formula is

[P_{i,j},D]=I\cdot E\cdot V_{i,j,k}

where D is the depth of pixel P_{i,j}. If D is unknown, each pixel of P corresponds to a set of points along the camera ray in V.

        This paper assumes a uniform depth distribution along the camera ray, i.e. all voxels along a pixel's ray receive that pixel's feature unchanged, as shown in the figure below. This improves computational and memory efficiency because no learnable parameters need to be introduced.
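The idea can be sketched by projecting each voxel center to the image and copying the pixel feature it lands on; every voxel on the same ray then shares one pixel's feature. This is a minimal sketch with toy camera matrices and grid, not the paper's implementation.

```python
import numpy as np

def fill_voxels(img_feat, K, E, voxel_centers):
    """Fill each voxel with the feature of the pixel it projects to.

    Uniform-depth assumption: every voxel on a pixel's camera ray receives
    that pixel's feature unchanged (no learned depth weights).
    img_feat: (C, H, W) image features; K: 3x3 intrinsics; E: 3x4 extrinsics;
    voxel_centers: (N, 3) world coordinates of the voxel centers.
    """
    C, H, W = img_feat.shape
    homo = np.concatenate([voxel_centers, np.ones((len(voxel_centers), 1))], axis=1)
    cam = (K @ E @ homo.T).T                 # (N, 3): [u*d, v*d, d]
    d = cam[:, 2]
    uv = np.round(cam[:, :2] / d[:, None]).astype(int)
    vox = np.zeros((len(voxel_centers), C))
    valid = (d > 0) & (uv[:, 0] >= 0) & (uv[:, 0] < W) & (uv[:, 1] >= 0) & (uv[:, 1] < H)
    vox[valid] = img_feat[:, uv[valid, 1], uv[valid, 0]].T
    return vox

# two voxels on the same camera ray receive the same pixel feature
K = np.array([[10.0, 0, 8], [0, 10, 8], [0, 0, 1]])
E = np.hstack([np.eye(3), np.zeros((3, 1))])   # camera at the world origin
centers = np.array([[0.0, 0, 1], [0.0, 0, 2]])  # same ray, different depths
feat = np.random.rand(4, 16, 16)
vox = fill_voxels(feat, K, E, centers)
```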

3.3 Improved Designs

        Efficient BEV encoder: the S2C operation reshapes the 4D voxel tensor V\in\mathbb{R}^{X\times Y\times Z\times C} into a 3D tensor of size X\times Y\times(ZC), after which 2D convolutions reduce the channel dimension.
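The S2C reshape itself is a single tensor operation. A minimal NumPy sketch (the shapes and the random reduction weights are toy values; a 1\times1 2D convolution is written as a per-cell linear map):

```python
import numpy as np

# S2C: fold the z axis of the 4D voxel tensor V (X, Y, Z, C) into the
# channel dimension, giving a 3D BEV tensor (X, Y, Z*C) that cheap 2D
# convolutions can process instead of 3D convolutions.
X, Y, Z, C = 4, 4, 8, 16
V = np.random.rand(X, Y, Z, C)
bev = V.reshape(X, Y, Z * C)          # "spatial-to-channel"

# a 1x1 2D convolution (per-cell linear map) then reduces the channels
W_reduce = np.random.rand(64, Z * C)  # Z*C -> 64 channels
bev_out = np.einsum('oc,xyc->xyo', W_reduce, bev)
print(bev.shape, bev_out.shape)  # (4, 4, 128) (4, 4, 64)
```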

        Dynamic box assignment: the usual hand-crafted IoU-threshold matching may be sub-optimal here, because BEV features do not account for depth and thus contain inaccurate geometric information. This paper therefore uses a learned matching strategy (similar to FreeAnchor): during training, the class and location are first predicted for each anchor box; then, based on IoU, a set of candidate anchor boxes is selected for each ground-truth box; finally, a weighted sum of the classification score and the localization accuracy determines the positive anchor boxes. In this way, anchor boxes with large classification or localization errors receive low scores and become negative anchor boxes.
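The selection logic can be sketched as follows. This is a toy illustration of the idea only; the candidate count `k`, the weight `alpha`, and the threshold `pos_thresh` are made-up values, not the paper's.

```python
import numpy as np

def dynamic_assign(cls_scores, ious, k=3, alpha=0.5, pos_thresh=0.6):
    """Sketch of the learned matching idea (FreeAnchor-style), for one
    ground-truth box: take the top-k anchors by IoU as candidates, score
    each by a weighted sum of classification score and localization
    quality (IoU), and keep only high-scoring candidates as positives."""
    cand = np.argsort(ious)[-k:]                       # top-k by IoU
    quality = alpha * cls_scores[cand] + (1 - alpha) * ious[cand]
    return cand[quality >= pos_thresh]

cls_scores = np.array([0.9, 0.1, 0.8, 0.05])
ious = np.array([0.7, 0.6, 0.5, 0.1])
pos = dynamic_assign(cls_scores, ious)
# anchors 0 and 2 survive: both well-classified and well-localized;
# anchor 1 has high IoU but a poor classification score, so it is dropped
```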

        BEV centerness

The concept of centerness is widely used in 2D detection to re-weight samples.

        Since distant BEV regions correspond to fewer image pixels, the model should pay more attention to those regions. BEV centerness is defined as follows:

\textup{BEV Centerness}=1+\sqrt{\frac{(x_i-x_c)^2+(y_i-y_c)^2}{(\max(x_i)-x_c)^2+(\max(y_i)-y_c)^2}}

where (x_i,y_i) are the coordinates of a BEV point and (x_c,y_c) are the coordinates of the BEV center (i.e. the ego vehicle). The value lies in the range 1 to 2 and is used as a weight when computing the loss, penalizing segmentation errors at distant locations more heavily.
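A minimal NumPy sketch of this weight (the grid coordinates are toy values):

```python
import numpy as np

def bev_centerness(xi, yi, xc, yc, x_max, y_max):
    """Distance-based loss weight in [1, 2]: 1 at the ego-vehicle cell
    (xc, yc), 2 at the farthest grid corner (x_max, y_max)."""
    num = (xi - xc) ** 2 + (yi - yc) ** 2
    den = (x_max - xc) ** 2 + (y_max - yc) ** 2
    return 1.0 + np.sqrt(num / den)

# ego vehicle at the grid center of a 100x100 BEV grid
w_center = bev_centerness(50, 50, 50, 50, 100, 100)    # 1.0
w_corner = bev_centerness(100, 100, 50, 50, 100, 100)  # 2.0
```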

        Experiments show that this weight improves segmentation accuracy at every distance, with larger gains at larger distances.

        2D detection pre-training: as shown in the figure below, pre-training on a large 2D detection dataset improves the subsequent 3D detection performance.

        2D auxiliary supervision: as shown in Figure 1, after the multi-scale image features are obtained, a 2D detection head is added and a loss is computed against the 2D ground-truth boxes (obtained by projecting the 3D ground-truth boxes onto the image; see the figure below). This 2D detection head is used only during training.

        2D detection pre-training and supervision improve the image features' awareness of objects, thereby improving 3D detection performance.

3.4 Training Losses

        The final loss is the sum of the 3D detection loss, the BEV segmentation loss, and the auxiliary 2D detection loss.

        The 3D detection loss is the same as in PointPillars, i.e.

L_{det_{3d}}=\frac{1}{N_{pos}}(\beta_{cls}L_{cls}+\beta_{loc}L_{loc}+\beta_{dir}L_{dir})

where the classification loss L_{cls} is the focal loss; the bounding-box loss L_{loc} is a SmoothL1 loss covering position, size, velocity, and heading angle; and the direction-classification loss L_{dir} is a binary cross-entropy.
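The combination of the three terms is a plain weighted sum. A small sketch of the formula above, with illustrative beta weights (not values confirmed by the paper):

```python
def det3d_loss(l_cls, l_loc, l_dir, n_pos, b_cls=1.0, b_loc=2.0, b_dir=0.2):
    """Weighted sum of the three detection loss terms, normalized by the
    number of positive anchors N_pos. The beta defaults are illustrative."""
    return (b_cls * l_cls + b_loc * l_loc + b_dir * l_dir) / max(n_pos, 1)

total = det3d_loss(l_cls=0.5, l_loc=1.0, l_dir=0.25, n_pos=4)
# (1.0*0.5 + 2.0*1.0 + 0.2*0.25) / 4 = 0.6375
```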

        The BEV segmentation loss is a weighted sum of the Dice loss and a binary cross-entropy loss.
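A minimal sketch of that combination for a binary BEV mask; the term weights and the epsilon smoothing are illustrative assumptions:

```python
import numpy as np

def dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss; pred holds probabilities in [0, 1]."""
    inter = (pred * target).sum()
    return 1.0 - (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

def bce_loss(pred, target, eps=1e-6):
    """Binary cross-entropy, clipped for numerical stability."""
    p = np.clip(pred, eps, 1.0 - eps)
    return -(target * np.log(p) + (1.0 - target) * np.log(1.0 - p)).mean()

def seg_loss(pred, target, w_dice=1.0, w_bce=1.0):
    """Weighted sum of Dice and BCE (the weights are illustrative)."""
    return w_dice * dice_loss(pred, target) + w_bce * bce_loss(pred, target)

mask = np.array([1.0, 0.0, 1.0, 1.0])
loss_perfect = seg_loss(mask, mask)              # near 0: perfect prediction
loss_uniform = seg_loss(np.full(4, 0.5), mask)   # clearly larger
```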

        The auxiliary 2D detection loss is the same as in FCOS: the sum of a classification loss, a bounding-box loss, and a centerness loss.

4. Experiments

4.2 Comparison with SotA

        3D detection: the proposed method outperforms methods that require additional training data.

        BEV segmentation: the method greatly outperforms LSS, showing that depth estimation is not necessary for BEV segmentation.

        In addition, compared with training the two tasks separately, joint training performs slightly worse.

4.3 Ablation Study

        3D detection: dynamic assignment and 2D detection pre-training improve performance significantly; the S2C operation and 2D auxiliary supervision improve it slightly.

        The improvement from pre-training depends on the domain gap between the datasets; other 3D detection models can also be improved by pre-training; in addition, pre-training accelerates network convergence.

        Moreover, after pre-training, using only half of the training data achieves results similar to using all of it. This shows that the more easily obtained 2D annotations can reduce the need for 3D annotations.

        BEV segmentation: 2D detection pre-training and the S2C operation improve performance significantly; BEV centerness improves segmentation of distant regions. More 2D convolutions can be stacked in the efficient BEV encoder to further refine the BEV features, while still remaining more efficient than stacking 3D convolutions.

        Multi-task joint training: 3D detection and BEV segmentation do not promote each other, possibly because the distribution of object positions is not strongly correlated with the map (e.g. many cars are not located in drivable areas). However, the joint network greatly reduces inference time, which should not be overlooked.

        Runtime efficiency: compared with FCOS3D, which fuses results in a post-processing stage, and DETR3D, which uses an expensive transformer decoder, this method is more efficient.

        Robustness to calibration error: at test time the method is fairly robust when the extrinsic noise is low, but further increasing the extrinsic noise degrades performance significantly.

4.4 Limitations

        Erroneous predictions may occur in complex road conditions; there is still a gap compared with LiDAR-based methods; unavoidable noise at test time can degrade performance.

Appendix

A. Other implementation details

        Detection head: 3D rotated non-maximum suppression (NMS) is used to remove redundant bounding boxes.

        Data preprocessing: image normalization is used, but no other data augmentation.
