MinkLoc++: Lidar and Monocular Image Fusion for Place Recognition

Overview

Paper: MinkLoc++: Lidar and Monocular Image Fusion for Place Recognition, accepted at the International Joint Conference on Neural Networks (IJCNN) 2021 (arXiv)

Jacek Komorowski, Monika Wysoczańska, Tomasz Trzciński

Warsaw University of Technology

Our other projects

  • MinkLoc3D: Point Cloud Based Large-Scale Place Recognition (WACV 2021): MinkLoc3D
  • Large-Scale Topological Radar Localization Using Learned Descriptors (ICONIP 2021): RadarLoc
  • EgoNN: Egocentric Neural Network for Point Cloud Based 6DoF Relocalization at the City Scale (IEEE Robotics and Automation Letters, April 2022): EgoNN

Introduction

We present a discriminative multimodal descriptor computed from a pair of sensor readings: a point cloud from a LiDAR and an image from an RGB camera. Our descriptor, named MinkLoc++, can be used for place recognition, re-localization and loop closure in robotics and autonomous vehicle applications. We use a late fusion approach, where each modality is processed separately and the results are fused in the final part of the processing pipeline. The proposed method achieves state-of-the-art performance on standard place recognition benchmarks. We also identify the dominating modality problem when training a multimodal descriptor: the network focuses on the modality that overfits the training data more strongly, which drives the training loss down but leads to suboptimal performance on the evaluation set. In this work we describe how to detect and mitigate this risk when using a deep metric learning approach to train a multimodal neural network.
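
For intuition, the sketch below shows what a late-fusion descriptor looks like in PyTorch. It is a minimal illustration, not the authors' exact architecture: LateFusionDescriptor and the two encoder arguments are hypothetical names, and plain concatenation stands in for the fusion step.

import torch
import torch.nn as nn

class LateFusionDescriptor(nn.Module):
    """Hypothetical late-fusion wrapper: each modality is embedded by its
    own network and the descriptors are combined only at the end."""
    def __init__(self, cloud_encoder: nn.Module, image_encoder: nn.Module):
        super().__init__()
        self.cloud_encoder = cloud_encoder  # e.g. a sparse 3D CNN branch
        self.image_encoder = image_encoder  # e.g. a 2D CNN image branch

    def forward(self, cloud, image):
        cloud_desc = self.cloud_encoder(cloud)  # (batch, D1) global descriptor
        image_desc = self.image_encoder(image)  # (batch, D2) global descriptor
        # Late fusion: the modalities interact only at the descriptor level.
        return torch.cat([cloud_desc, image_desc], dim=1)  # (batch, D1 + D2)

Keeping the branches separate also makes it easy to inspect each modality's descriptor on its own, which helps when diagnosing the dominating modality problem described above.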

Citation

If you find this work useful, please consider citing:

@INPROCEEDINGS{9533373,  
   author={Komorowski, Jacek and Wysoczańska, Monika and Trzcinski, Tomasz},  
   booktitle={2021 International Joint Conference on Neural Networks (IJCNN)},   
   title={MinkLoc++: Lidar and Monocular Image Fusion for Place Recognition},   
   year={2021},  
   doi={10.1109/IJCNN52387.2021.9533373}
}

Environment and Dependencies

Code was tested using Python 3.8 with PyTorch 1.9.1 and MinkowskiEngine 0.5.4 on Ubuntu 20.04 with CUDA 10.2.

The following Python packages are required (an example installation is sketched after the list):

  • PyTorch (version 1.9.1)
  • MinkowskiEngine (version 0.5.4)
  • pytorch_metric_learning (version 1.0 or above)
  • tensorboard
  • colour_demosaicing
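
As a rough installation sketch (an assumption, not an official recipe: exact commands depend on your CUDA and compiler setup, and MinkowskiEngine is compiled against the local CUDA toolkit):

pip install torch==1.9.1
pip install MinkowskiEngine==0.5.4
pip install pytorch-metric-learning tensorboard colour-demosaicing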

Modify the PYTHONPATH environment variable to include the absolute path to the project root folder:

export PYTHONPATH=$PYTHONPATH:/home/.../MinkLocMultimodal

Datasets

MinkLoc++ is a multimodal descriptor based on a pair of inputs:

  • a 3D point cloud constructed by aggregating multiple 2D LiDAR scans from the Oxford RobotCar dataset,
  • a corresponding RGB image from the stereo-center camera.

We use the 3D point clouds built by the authors of the PointNetVLAD: Deep Point Cloud Based Retrieval for Large-Scale Place Recognition paper (link). Each point cloud is built by aggregating 2D LiDAR scans gathered during a 20-meter vehicle traversal. For details see the PointNetVLAD paper or their GitHub repository (link). You can download the training and evaluation point clouds from here (alternative link).

After downloading the dataset, edit the config_baseline_multimodal.txt configuration file (in the config folder). Set the dataset_folder parameter to point to the root folder of the PointNetVLAD dataset with 3D point clouds. The image_path parameter must point to a folder where downsampled RGB images from the Oxford RobotCar dataset will be saved. This folder will be created by the generate_rgb_for_lidar.py script.
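
For orientation, a hypothetical excerpt of config_baseline_multimodal.txt is shown below. Only dataset_folder, image_path and batch_size_limit are parameters named in this README; the section names and values are assumptions, so check the file shipped in the config folder:

[DEFAULT]
dataset_folder = /data/pointnetvlad/benchmark_datasets
image_path = /data/robotcar_rgb_downsampled

[TRAIN]
; adjust to the available GPU memory (see the Training section)
batch_size_limit = 160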

Generate training and evaluation tuples

Run the code below to generate training pickles (with positive and negative point clouds for each anchor point cloud) and evaluation pickles. The training pickle format is optimized and differs from the format used in the PointNetVLAD code.

cd generating_queries/

# Generate training tuples for the Baseline Dataset
python generate_training_tuples_baseline.py --dataset_root <dataset_root_path>

# Generate training tuples for the Refined Dataset
python generate_training_tuples_refine.py --dataset_root <dataset_root_path>

# Generate evaluation tuples
python generate_test_sets.py --dataset_root <dataset_root_path>

<dataset_root_path> is a path to the dataset root folder, e.g. /data/pointnetvlad/benchmark_datasets/. Before running the code, ensure you have read/write rights to <dataset_root_path>, as training and evaluation pickles are saved there.

Downsample RGB images and index RGB images linked with each point cloud

RGB images are taken directly from the Oxford RobotCar dataset. First, download the stereo camera images from the Oxford RobotCar dataset; see the dataset website for details (link). After downloading, run the generate_rgb_for_lidar.py script. The script finds the 20 RGB images closest in time to each 3D point cloud, downsamples them and saves them in the target directory (the image_path parameter in config_baseline_multimodal.txt). During training, the network input consists of a 3D point cloud and one RGB image randomly chosen from these 20 corresponding images. During evaluation, the input consists of a 3D point cloud and the single RGB image with the closest timestamp.

cd scripts/

# Downsample and index RGB images corresponding to each 3D point cloud
python generate_rgb_for_lidar.py --config ../config/config_baseline_multimodal.txt --oxford_root <oxford_root_path>

where <oxford_root_path> is the root folder of the downloaded Oxford RobotCar dataset.
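
The nearest-timestamp lookup performed by generate_rgb_for_lidar.py can be pictured with the small sketch below. It is a hypothetical helper, not the script's actual code: given image timestamps sorted in ascending order, it returns the k images closest in time to a point cloud.

import numpy as np

def k_closest_images(cloud_ts: int, image_ts: np.ndarray, k: int = 20) -> np.ndarray:
    """Return indices of the k images whose timestamps are closest to cloud_ts.
    image_ts must be sorted in ascending order."""
    i = np.searchsorted(image_ts, cloud_ts)            # insertion point of cloud_ts
    lo, hi = max(0, i - k), min(len(image_ts), i + k)  # 2k candidates around it
    window = np.arange(lo, hi)
    order = np.argsort(np.abs(image_ts[window] - cloud_ts))  # sort by time distance
    return window[order[:k]]

Since the k closest timestamps in a sorted array must lie within k positions of the insertion point, only 2k candidates need to be examined per point cloud.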

Training

MinkLoc++ can be used in a unimodal scenario (3D point cloud input only) and a multimodal scenario (3D point cloud + RGB image input). To train the MinkLoc++ network, download and decompress the 3D point cloud dataset and generate training pickles as described above. To train the multimodal model (3D+RGB), also download the original Oxford RobotCar dataset and extract the RGB images corresponding to the 3D point clouds as described above. Edit the configuration files:

  • config_baseline_multimodal.txt when training a multimodal (3D+RGB) model
  • config_baseline.txt and config_refined.txt when training a unimodal (3D only) model

Set the dataset_folder parameter to the dataset root folder where the 3D point clouds are located. When training a multimodal model, set the image_path parameter to the folder with RGB images corresponding to the 3D point clouds, extracted from the Oxford RobotCar dataset using the generate_rgb_for_lidar.py script. Modify the batch_size_limit parameter depending on the available GPU memory; the default limit requires 11GB of GPU RAM.

To train the multimodal model (3D+RGB), run:

cd training

python train.py --config ../config/config_baseline_multimodal.txt --model_config ../models/minklocmultimodal.txt

To train a unimodal (3D only) model, run:

cd training

# Train unimodal (3D only) model on the Baseline Dataset
python train.py --config ../config/config_baseline.txt --model_config ../models/minkloc3d.txt

# Train unimodal (3D only) model on the Refined Dataset
python train.py --config ../config/config_refined.txt --model_config ../models/minkloc3d.txt

Pre-trained Models

Pretrained models are available in the weights directory:

  • minkloc_multimodal.pth: multimodal model (3D+RGB) trained on the Baseline Dataset with corresponding RGB images
  • minkloc3d_baseline.pth: unimodal model (3D only) trained on the Baseline Dataset
  • minkloc3d_refined.pth: unimodal model (3D only) trained on the Refined Dataset

Evaluation

To evaluate the pretrained models, run the following commands:

cd eval

# To evaluate the multimodal model (3D+RGB) trained on the Baseline Dataset
python evaluate.py --config ../config/config_baseline_multimodal.txt --model_config ../models/minklocmultimodal.txt --weights ../weights/minklocmultimodal_baseline.pth

# To evaluate the unimodal model (3D only) trained on the Baseline Dataset
python evaluate.py --config ../config/config_baseline.txt --model_config ../models/minkloc3d.txt --weights ../weights/minkloc3d_baseline.pth

# To evaluate the unimodal model (3D only) trained on the Refined Dataset
python evaluate.py --config ../config/config_refined.txt --model_config ../models/minkloc3d.txt --weights ../weights/minkloc3d_refined.pth

Results

MinkLoc++ performance (measured by Average Recall@1%) compared to the state of the art:
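
The metric in the tables below, Average Recall@1%, is the fraction of queries whose true match appears among the top 1% of retrieved database entries (Recall@1 uses only the single nearest neighbour). A simplified sketch of this computation, with hypothetical inputs and without the per-run averaging done by evaluate.py:

import numpy as np

def recall_at_percent(query_desc, db_desc, gt_matches, percent=1.0):
    """query_desc: (Q, D) and db_desc: (M, D) descriptor arrays;
    gt_matches[i] is the set of database indices that are true
    positives for query i."""
    top_n = max(1, int(np.ceil(len(db_desc) * percent / 100.0)))
    hits = 0
    for q, positives in zip(query_desc, gt_matches):
        dists = np.linalg.norm(db_desc - q, axis=1)  # distances to all database entries
        nearest = np.argsort(dists)[:top_n]          # indices of the top-N candidates
        hits += bool(set(nearest.tolist()) & set(positives))
    return hits / len(query_desc)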

Multimodal model (3D+RGB) trained on the Baseline Dataset extended with RGB images

Method               Oxford (Recall@1)   Oxford (Recall@1%)
CORAL [1]            88.9                96.1
PIC-Net [2]          -                   98.2
MinkLoc++ (3D+RGB)   96.7                99.1

Unimodal model (3D only) trained on the Baseline Dataset

Method                 Oxford (Recall@1%)   U.S. (Recall@1%)   R.A. (Recall@1%)   B.D. (Recall@1%)
PointNetVLAD [3]       80.3                 72.6               60.3               65.3
PCAN [4]               83.8                 79.1               71.2               66.8
DAGC [5]               87.5                 83.5               75.7               71.2
LPD-Net [6]            94.9                 96.0               90.5               89.1
EPC-Net [7]            94.7                 96.5               88.6               84.9
SOE-Net [8]            96.4                 93.2               91.5               88.5
NDT-Transformer [10]   97.7                 -                  -                  -
MinkLoc3D [9]          97.9                 95.0               91.2               88.5
MinkLoc++ (3D-only)    98.2                 94.5               92.1               88.4

Unimodal model (3D only) trained on the Refined Dataset

Method                 Oxford (Recall@1%)   U.S. (Recall@1%)   R.A. (Recall@1%)   B.D. (Recall@1%)
PointNetVLAD [3]       80.1                 94.5               93.1               86.5
PCAN [4]               86.4                 94.1               92.3               87.0
DAGC [5]               87.8                 94.3               93.4               88.5
LPD-Net [6]            94.9                 98.9               96.4               94.4
SOE-Net [8]            96.4                 97.7               95.9               92.6
MinkLoc3D [9]          98.5                 99.7               99.3               96.7
MinkLoc++ (3D-only)    98.4                 99.7               99.3               97.4

  1. Y. Pan et al., "CORAL: Colored structural representation for bi-modal place recognition", preprint arXiv:2011.10934 (2020)
  2. Y. Lu et al., "PIC-Net: Point Cloud and Image Collaboration Network for Large-Scale Place Recognition", preprint arXiv:2008.00658 (2020)
  3. M. A. Uy and G. H. Lee, "PointNetVLAD: Deep Point Cloud Based Retrieval for Large-Scale Place Recognition", 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
  4. W. Zhang and C. Xiao, "PCAN: 3D Attention Map Learning Using Contextual Information for Point Cloud Based Retrieval", 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
  5. Q. Sun et al., "DAGC: Employing Dual Attention and Graph Convolution for Point Cloud based Place Recognition", Proceedings of the 2020 International Conference on Multimedia Retrieval
  6. Z. Liu et al., "LPD-Net: 3D Point Cloud Learning for Large-Scale Place Recognition and Environment Analysis", 2019 IEEE/CVF International Conference on Computer Vision (ICCV)
  7. L. Hui et al., "Efficient 3D Point Cloud Feature Learning for Large-Scale Place Recognition", preprint arXiv:2101.02374 (2021)
  8. Y. Xia et al., "SOE-Net: A Self-Attention and Orientation Encoding Network for Point Cloud based Place Recognition", 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
  9. J. Komorowski, "MinkLoc3D: Point Cloud Based Large-Scale Place Recognition", Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), (2021)
  10. Z. Zhou et al., "NDT-Transformer: Large-scale 3D Point Cloud Localisation Using the Normal Distribution Transform Representation", 2021 IEEE International Conference on Robotics and Automation (ICRA)
  • J. Komorowski, M. Wysoczanska, T. Trzcinski, "MinkLoc++: Lidar and Monocular Image Fusion for Place Recognition", accepted for International Joint Conference on Neural Networks (IJCNN), (2021)

License

Our code is released under the MIT License (see LICENSE file for details).
