[2021][ICCV][FSNet] Full-Duplex Strategy for Video Object Segmentation

Overview

Full-Duplex Strategy for Video Object Segmentation (ICCV, 2021)

Authors: Ge-Peng Ji, Keren Fu, Zhe Wu, Deng-Ping Fan*, Jianbing Shen, & Ling Shao

  • This repository provides code for paper "Full-Duplex Strategy for Video Object Segmentation" accepted by the ICCV-2021 conference (arXiv Version / 中译版本).

  • This project is under construction. If you have any questions about our paper or bugs in our git project, feel free to contact me.

  • If you like our FSNet for your personal research, please cite this paper (BibTeX).

1. News

  • [2021/08/24] Upload the training script for video object segmentation.
  • [2021/08/22] Upload the pre-trained snapshot and the pre-computed results on U-VOS and V-SOD tasks.
  • [2021/08/20] Release inference code, evaluation code (VSOD).
  • [2021/07/20] Create Github page.

2. Introduction

Why?

Appearance and motion are two important sources of information in video object segmentation (VOS). Previous methods mainly focus on using simplex solutions, lowering the upper bound of feature collaboration among and across these two cues.


Figure 1: Visual comparison between the simplex (i.e., (a) appearance-refined motion and (b) motion-refined appear- ance) and our full-duplex strategy. In contrast, our FS- Net offers a collaborative way to leverage the appearance and motion cues under the mutual restraint of full-duplex strategy, thus providing more accurate structure details and alleviating the short-term feature drifting issue.

What?

In this paper, we study a novel framework, termed the FSNet (Full-duplex Strategy Network), which designs a relational cross-attention module (RCAM) to achieve bidirectional message propagation across embedding subspaces. Furthermore, the bidirectional purification module (BPM) is introduced to update the inconsistent features between the spatial-temporal embeddings, effectively improving the model's robustness.


Figure 2: The pipeline of our FSNet. The Relational Cross-Attention Module (RCAM) abstracts more discriminative representations between the motion and appearance cues using the full-duplex strategy. Then four Bidirectional Purification Modules (BPM) are stacked to further re-calibrate inconsistencies between the motion and appearance features. Finally, we utilize the decoder to generate our prediction.

How?

By considering the mutual restraint within the full-duplex strategy, our FSNet performs the cross-modal feature-passing (i.e., transmission and receiving) simultaneously before the fusion and decoding stage, making it robust to various challenging scenarios (e.g., motion blur, occlusion) in VOS. Extensive experiments on five popular benchmarks (i.e., DAVIS16, FBMS, MCL, SegTrack-V2, and DAVSOD19) show that our FSNet outperforms other state-of-the-arts for both the VOS and video salient object detection tasks.


Figure 3: Qualitative results on five datasets, including DAVIS16, MCL, FBMS, SegTrack-V2, and DAVSOD19.

3. Usage

How to Inference?

  • Download the test dataset from Baidu Driver (PSW: aaw8) or Google Driver and save it at ./dataset/*.

  • Install necessary libraries: PyTorch 1.1+, scipy 1.2.2, PIL

  • Download the pre-trained weights from Baidu Driver (psw: 36lm) or Google Driver. Saving the pre-trained weights at ./snapshot/FSNet/2021-ICCV-FSNet-20epoch-new.pth

  • Just run python inference.py to generate the segmentation results.

  • About the post-processing technique DenseCRF we used in the original paper, you can find it here: DSS-CRF.

How to train our model from scratch?

Download the train dataset from Baidu Driver (PSW: u01t) or Google Driver Set1/Google Driver Set2 and save it at ./dataset/*. Our training pipeline consists of three steps:

  • First, train the model using the combination of static SOD dataset (i.e., DUTS) with 12,926 samples and U-VOS datasets (i.e., DAVIS16 & FBMS) with 2,373 samples.

    • Set --train_type='pretrain_rgb' and run python train.py in terminal
  • Second, train the model using the optical-flow map of U-VOS datasets (i.e., DAVIS16 & FBMS).

    • Set --train_type='pretrain_flow' and run python train.py in terminal
  • Third, train the model using the pair of frame and optical flow of U-VOS datasets (i.e., DAVIS16 & FBMS).

    • Set --train_type='finetune' and run python train.py in terminal

4. Benchmark

Unsupervised/Zero-shot Video Object Segmentation (U/Z-VOS) task

NOTE: In the U-VOS, all the prediction results are strictly binary. We only adopt the naive binarization algorithm (i.e., threshold=0.5) in our experiments.

  • Quantitative results (NOTE: The following results have slight improvement compared with the reported results in our conference paper):

    mean-J recall-J decay-J mean-F recall-F decay-F T
    FSNet (w/ CRF) 0.834 0.945 0.032 0.831 0.902 0.026 0.213
    FSNet (w/o CRF) 0.823 0.943 0.033 0.833 0.919 0.028 0.213
  • Pre-Computed Results: Please download the prediction results of FSNet, refer to Baidu Driver (psw: ojsl) or Google Driver.

  • Evaluation Toolbox: We use the standard evaluation toolbox from DAVIS16. (Note that all the pre-computed segmentations are downloaded from this link).

Video Salient Object Detection (V-SOD) task

NOTE: In the V-SOD, all the prediction results are non-binary.

4. Citation

@inproceedings{ji2021FSNet,
  title={Full-Duplex Strategy for Video Object Segmentation},
  author={Ji, Ge-Peng and Fu, Keren and Wu, Zhe and Fan, Deng-Ping and Shen, Jianbing and Shao, Ling},
  booktitle={IEEE ICCV},
  year={2021}
}

5. Acknowledgements

Many thanks to my collaborator Ph.D. Zhe Wu, who provides excellent work SCRN and design inspirations.

Owner
Daniel-Ji
Computer Vision & Medical Imaging
Daniel-Ji
Code for ACM MM 2020 paper "NOH-NMS: Improving Pedestrian Detection by Nearby Objects Hallucination"

NOH-NMS: Improving Pedestrian Detection by Nearby Objects Hallucination The offical implementation for the "NOH-NMS: Improving Pedestrian Detection by

Tencent YouTu Research 64 Nov 11, 2022
This repository is the official implementation of Using Time-Series Privileged Information for Provably Efficient Learning of Prediction Models

Using Time-Series Privileged Information for Provably Efficient Learning of Prediction Models Link to paper Abstract We study prediction of future out

Rickard Karlsson 2 Aug 19, 2022
Config files for my GitHub profile.

Canalyst Candas Data Science Library Name Canalyst Candas Description Built by a former PM / analyst to give anyone with a little bit of Python knowle

Canalyst Candas 13 Jun 24, 2022
Send text to girlfriend in the morning

Girlfriend Text Send text to girlfriend (or really anyone with a phone number) in the morning 1. Configure your settings in utils.py. phone_number = "

Paras Adhikary 199 Oct 25, 2022
learned_optimization: Training and evaluating learned optimizers in JAX

learned_optimization: Training and evaluating learned optimizers in JAX learned_optimization is a research codebase for training learned optimizers. I

Google 533 Dec 30, 2022
Official pytorch implementation of paper "Image-to-image Translation via Hierarchical Style Disentanglement".

HiSD: Image-to-image Translation via Hierarchical Style Disentanglement Official pytorch implementation of paper "Image-to-image Translation

364 Dec 14, 2022
Image Restoration Using Swin Transformer for VapourSynth

SwinIR SwinIR function for VapourSynth, based on https://github.com/JingyunLiang/SwinIR. Dependencies NumPy PyTorch, preferably with CUDA. Note that t

Holy Wu 11 Jun 19, 2022
Using Tensorflow Object Detection API to detect Waymo open dataset

Waymo-2D-Object-Detection Using Tensorflow Object Detection API to detect Waymo open dataset Result CenterNet Training Loss SSD ResNet Training Loss C

76 Dec 12, 2022
A Machine Teaching Framework for Scalable Recognition

MEMORABLE This repository contains the source code accompanying our ICCV 2021 paper. A Machine Teaching Framework for Scalable Recognition Pei Wang, N

2 Dec 08, 2021
BOVText: A Large-Scale, Multidimensional Multilingual Dataset for Video Text Spotting

BOVText: A Large-Scale, Bilingual Open World Dataset for Video Text Spotting Updated on December 10, 2021 (Release all dataset(2021 videos)) Updated o

weijiawu 47 Dec 26, 2022
PASSL包含 SimCLR,MoCo,BYOL,CLIP等基于对比学习的图像自监督算法以及 Vision-Transformer,Swin-Transformer,BEiT,CVT,T2T,MLP_Mixer等视觉Transformer算法

PASSL Introduction PASSL is a Paddle based vision library for state-of-the-art Self-Supervised Learning research with PaddlePaddle. PASSL aims to acce

186 Dec 29, 2022
Educational API for 3D Vision using pose to control carton.

Educational API for 3D Vision using pose to control carton.

41 Jul 10, 2022
i3DMM: Deep Implicit 3D Morphable Model of Human Heads

i3DMM: Deep Implicit 3D Morphable Model of Human Heads CVPR 2021 (Oral) Arxiv | Poject Page This project is the official implementation our work, i3DM

Tarun Yenamandra 60 Jan 03, 2023
LQM - Improving Object Detection by Estimating Bounding Box Quality Accurately

Improving Object Detection by Estimating Bounding Box Quality Accurately Abstract Object detection aims to locate and classify object instances in ima

IM Lab., POSTECH 0 Sep 28, 2022
Learning Spatio-Temporal Transformer for Visual Tracking

STARK The official implementation of the paper Learning Spatio-Temporal Transformer for Visual Tracking Hiring research interns for visual transformer

Multimedia Research 484 Dec 29, 2022
Lightweight plotting to the terminal. 4x resolution via Unicode.

Uniplot Lightweight plotting to the terminal. 4x resolution via Unicode. When working with production data science code it can be handy to have plotti

Olav Stetter 203 Dec 29, 2022
Semi-supervised Domain Adaptation via Minimax Entropy

Semi-supervised Domain Adaptation via Minimax Entropy (ICCV 2019) Install pip install -r requirements.txt The code is written for Pytorch 0.4.0, but s

Vision and Learning Group 243 Jan 09, 2023
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

107 Dec 02, 2022
Meta graph convolutional neural network-assisted resilient swarm communications

Resilient UAV Swarm Communications with Graph Convolutional Neural Network This repository contains the source codes of Resilient UAV Swarm Communicat

62 Dec 06, 2022
Datasets, Transforms and Models specific to Computer Vision

torchvision The torchvision package consists of popular datasets, model architectures, and common image transformations for computer vision. Installat

13.1k Jan 02, 2023