Rethinking Space-Time Networks with Improved Memory Coverage for Efficient Video Object Segmentation

Overview

STCN

Rethinking Space-Time Networks with Improved Memory Coverage for Efficient Video Object Segmentation

Ho Kei Cheng, Yu-Wing Tai, Chi-Keung Tang

[arXiv] [PDF] [Project Page] [Papers with Code]

bmx pigs

We present Space-Time Correspondence Networks (STCN) as the new, effective, and efficient framework to model space-time correspondences in the context of video object segmentation. STCN achieves SOTA results on multiple benchmarks while running fast at 20+ FPS without bells and whistles. Its speed is even higher with mixed precision. Despite its effectiveness, the network itself is very simple with lots of room for improvement. See the paper for technical details.

What do we have here?

  1. A gentle introduction

  2. Quantitative results and precomputed outputs

    1. DAVIS 2016
    2. DAVIS 2017 validation/test-dev
    3. YouTubeVOS 2018/2019
  3. Steps to reproduce

    1. Pretrained models
    2. Inference
    3. Training
  4. If you want to look closer

  5. Citation

A Gentle Introduction

framework

There are two main contributions: STCN framework (above figure), and L2 similarity. We build affinity between images instead of between (image, mask) pairs -- this leads to a significantly speed up, memory saving (because we compute one, instead of multiple affinity matrices), and robustness. We further use L2 similarity to replace dot product, which improves the memory bank utilization by a great deal.

Perks

  • Simple, runs fast (30+ FPS with mixed precision; 20+ without)
  • High performance
  • Still lots of room to improve upon (e.g. locality, memory space compression)
  • Easy to train: just two 11GB GPUs, no V100s needed

Requirements

We used these packages/versions in the development of this project. It is likely that higher versions of the same package will also work. This is not an exhaustive list -- other common python packages (e.g. pillow) are expected and not listed.

  • PyTorch 1.8.1
  • torchvision 0.9.1
  • OpenCV 4.2.0
  • progressbar
  • thinspline for training (pip install git+https://github.com/cheind/py-thin-plate-spline)
  • gitpython for training
  • gdown for downloading pretrained models

Refer to the official PyTorch guide for installing PyTorch/torchvision. The rest can be installed by:

pip install progressbar2 opencv-python gitpython gdown git+https://github.com/cheind/py-thin-plate-spline

Results

Notations

FPS is amortized, computed as total processing time / total number of frames irrespective of the number of objects, aka multi-object FPS, and measured on an RTX 2080 Ti with IO time excluded. We also provide inference speed when Automatic Mixed Precision (AMP) is used. We noticed that the performance is almost identical. Speed in the paper are measured without AMP. All evaluations are done in 480p resolution. FPS for test-dev is measured on the validation set under the same memory setting for consistency.

[Precomputed outputs - Google Drive]

[Precomputed outputs - OneDrive]

s012 denotes models with BL pretraining while s03 denotes those without (used to be called s02 in MiVOS).

Numbers (s012)

Dataset Split J&F J F FPS FPS (AMP)
DAVIS 2016 validation 91.7 90.4 93.0 26.9 40.8
DAVIS 2017 validation 85.3 82.0 88.6 20.2 34.1
DAVIS 2017 test-dev 79.9 76.3 83.5 14.6 22.7
Dataset Split Overall Score J-Seen J-Unseen F-Seen F-Unseen
YouTubeVOS 18 validation 84.3 83.2 79.0 87.9 87.2
YouTubeVOS 19 validation 84.2 82.6 79.4 87.0 87.7
Dataset AUC-J&F J&F @ 60s
DAVIS Interactive 88.4 88.8

For DAVIS interactive, we changed the propagation module of MiVOS from STM to STCN. See this link for details.

Reproducing the results

Pretrained models

We use the same model for YouTubeVOS and DAVIS. You can download them yourself and put them in ./saves/, or use download_model.py.

s012 model (better): [Google Drive] [OneDrive]

s03 model: [Google Drive] [OneDrive]

Inference

  • eval_davis_2016.py for DAVIS 2016 validation set
  • eval_davis.py for DAVIS 2017 validation and test-dev set (controlled by --split)
  • eval_youtube.py for YouTubeVOS 2018/19 validation set (controlled by --yv_path)

The arguments tooltip should give you a rough idea of how to use them. For example, if you have downloaded the datasets and pretrained models using our scripts, you only need to specify the output path: python eval_davis.py --output [somewhere] for DAVIS 2017 validation set evaluation. For YouTubeVOS evaluation, point --yv_path to the version of your choosing.

Training

Data preparation

I recommend either softlinking (ln -s) existing data or use the provided download_datasets.py to structure the datasets as our format. download_datasets.py might download more than what you need -- just comment out things that you don't like. The script does not download BL30K because it is huge (>600GB) and we don't want to crash your harddisks. See below.

├── STCN
├── BL30K
├── DAVIS
│   ├── 2016
│   │   ├── Annotations
│   │   └── ...
│   └── 2017
│       ├── test-dev
│       │   ├── Annotations
│       │   └── ...
│       └── trainval
│           ├── Annotations
│           └── ...
├── static
│   ├── BIG_small
│   └── ...
├── YouTube
│   ├── all_frames
│   │   └── valid_all_frames
│   ├── train
│   ├── train_480p
│   └── valid
└── YouTube2018
    ├── all_frames
    │   └── valid_all_frames
    └── valid

BL30K

BL30K is a synthetic dataset proposed in MiVOS.

You can either use the automatic script download_bl30k.py or download it manually from MiVOS. Note that each segment is about 115GB in size -- 700GB in total. You are going to need ~1TB of free disk space to run the script (including extraction buffer).

Training commands

CUDA_VISIBLE_DEVICES=[a,b] OMP_NUM_THREADS=4 python -m torch.distributed.launch --master_port [cccc] --nproc_per_node=2 train.py --id [defg] --stage [h]

We implemented training with Distributed Data Parallel (DDP) with two 11GB GPUs. Replace a, b with the GPU ids, cccc with an unused port number, defg with a unique experiment identifier, and h with the training stage (0/1/2/3).

The model is trained progressively with different stages (0: static images; 1: BL30K; 2: 300K main training; 3: 150K main training). After each stage finishes, we start the next stage by loading the latest trained weight.

(Models trained on stage 0 only cannot be used directly. See model/model.py: load_network for the required mapping that we do.)

The .pth with _checkpoint as suffix is used to resume interrupted training (with --load_model) which is usually not needed. Typically you only need --load_network and load the last network weights (without checkpoint in its name).

So, to train a s012 model, we launch three training steps sequentially as follows:

Pre-training on static images: CUDA_VISIBLE_DEVICES=0,1 OMP_NUM_THREADS=4 python -m torch.distributed.launch --master_port 9842 --nproc_per_node=2 train.py --id retrain_s0 --stage 0

Pre-training on the BL30K dataset: CUDA_VISIBLE_DEVICES=0,1 OMP_NUM_THREADS=4 python -m torch.distributed.launch --master_port 9842 --nproc_per_node=2 train.py --id retrain_s01 --load_network [path_to_trained_s0.pth] --stage 1

Main training: CUDA_VISIBLE_DEVICES=0,1 OMP_NUM_THREADS=4 python -m torch.distributed.launch --master_port 9842 --nproc_per_node=2 train.py --id retrain_s012 --load_network [path_to_trained_s01.pth] --stage 2

And to train a s03 model, we launch two training steps sequentially as follows:

Pre-training on static images: CUDA_VISIBLE_DEVICES=0,1 OMP_NUM_THREADS=4 python -m torch.distributed.launch --master_port 9842 --nproc_per_node=2 train.py --id retrain_s0 --stage 0

Main training: CUDA_VISIBLE_DEVICES=0,1 OMP_NUM_THREADS=4 python -m torch.distributed.launch --master_port 9842 --nproc_per_node=2 train.py --id retrain_s03 --load_network [path_to_trained_s0.pth] --stage 3

Looking closer

  • To add your datasets, or do something with data augmentations: dataset/static_dataset.py, dataset/vos_dataset.py
  • To work on the similarity function, or memory readout process: model/network.py: MemoryReader, inference_memory_bank.py
  • To work on the network structure: model/network.py, model/modules.py, model/eval_network.py
  • To work on the propagation process: model/model.py, eval_*.py, inference_*.py

Citation

Please cite our paper (MiVOS if you use top-k) if you find this repo useful!

@inproceedings{cheng2021stcn,
  title={Rethinking Space-Time Networks with Improved Memory Coverage for Efficient Video Object Segmentation},
  author={Cheng, Ho Kei and Tai, Yu-Wing and Tang, Chi-Keung},
  booktitle={arXiv:2106.05210},
  year={2021}
}

@inproceedings{cheng2021mivos,
  title={Modular Interactive Video Object Segmentation: Interaction-to-Mask, Propagation and Difference-Aware Fusion},
  author={Cheng, Ho Kei and Tai, Yu-Wing and Tang, Chi-Keung},
  booktitle={CVPR},
  year={2021}
}

And if you want to cite the datasets:

bibtex

@inproceedings{shi2015hierarchicalECSSD,
  title={Hierarchical image saliency detection on extended CSSD},
  author={Shi, Jianping and Yan, Qiong and Xu, Li and Jia, Jiaya},
  booktitle={TPAMI},
  year={2015},
}

@inproceedings{wang2017DUTS,
  title={Learning to Detect Salient Objects with Image-level Supervision},
  author={Wang, Lijun and Lu, Huchuan and Wang, Yifan and Feng, Mengyang 
  and Wang, Dong, and Yin, Baocai and Ruan, Xiang}, 
  booktitle={CVPR},
  year={2017}
}

@inproceedings{FSS1000,
  title = {FSS-1000: A 1000-Class Dataset for Few-Shot Segmentation},
  author = {Li, Xiang and Wei, Tianhan and Chen, Yau Pun and Tai, Yu-Wing and Tang, Chi-Keung},
  booktitle={CVPR},
  year={2020}
}

@inproceedings{zeng2019towardsHRSOD,
  title = {Towards High-Resolution Salient Object Detection},
  author = {Zeng, Yi and Zhang, Pingping and Zhang, Jianming and Lin, Zhe and Lu, Huchuan},
  booktitle = {ICCV},
  year = {2019}
}

@inproceedings{cheng2020cascadepsp,
  title={{CascadePSP}: Toward Class-Agnostic and Very High-Resolution Segmentation via Global and Local Refinement},
  author={Cheng, Ho Kei and Chung, Jihoon and Tai, Yu-Wing and Tang, Chi-Keung},
  booktitle={CVPR},
  year={2020}
}

@inproceedings{xu2018youtubeVOS,
  title={Youtube-vos: A large-scale video object segmentation benchmark},
  author={Xu, Ning and Yang, Linjie and Fan, Yuchen and Yue, Dingcheng and Liang, Yuchen and Yang, Jianchao and Huang, Thomas},
  booktitle = {ECCV},
  year={2018}
}

@inproceedings{perazzi2016benchmark,
  title={A benchmark dataset and evaluation methodology for video object segmentation},
  author={Perazzi, Federico and Pont-Tuset, Jordi and McWilliams, Brian and Van Gool, Luc and Gross, Markus and Sorkine-Hornung, Alexander},
  booktitle={CVPR},
  year={2016}
}

@inproceedings{denninger2019blenderproc,
  title={BlenderProc},
  author={Denninger, Maximilian and Sundermeyer, Martin and Winkelbauer, Dominik and Zidan, Youssef and Olefir, Dmitry and Elbadrawy, Mohamad and Lodhi, Ahsan and Katam, Harinandan},
  booktitle={arXiv:1911.01911},
  year={2019}
}

@inproceedings{shapenet2015,
  title       = {{ShapeNet: An Information-Rich 3D Model Repository}},
  author      = {Chang, Angel Xuan and Funkhouser, Thomas and Guibas, Leonidas and Hanrahan, Pat and Huang, Qixing and Li, Zimo and Savarese, Silvio and Savva, Manolis and Song, Shuran and Su, Hao and Xiao, Jianxiong and Yi, Li and Yu, Fisher},
  booktitle   = {arXiv:1512.03012},
  year        = {2015}
}

Contact: [email protected]

Comments
  • RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemmStridedBatched

    RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemmStridedBatched

    raceback (most recent call last): File "eval_youtube.py", line 133, in processor.interact(with_bg_msk, frame_idx, rgb.shape[1], obj_idx) File "/datasets/MODELS/STCN-ALL/STCN/inference_core_yv.py", line 123, in interact self.do_pass(key_k, key_v, frame_idx, end_idx) File "/datasets/MODELS/STCN-ALL/STCN/inference_core_yv.py", line 84, in do_pass for oi in self.enabled_obj], 0) File "/datasets/MODELS/STCN-ALL/STCN/inference_core_yv.py", line 84, in for oi in self.enabled_obj], 0) File "/datasets/MODELS/STCN-ALL/STCN/model/eval_network.py", line 61, in segment_with_query readout_mem = mem_bank.match_memory(qk16) File "/datasets/MODELS/STCN-ALL/STCN/inference_memory_bank.py", line 60, in match_memory readout_mem = self._readout(affinity.expand(k,-1,-1), mv) File "/datasets/MODELS/STCN-ALL/STCN/inference_memory_bank.py", line 42, in _readout return torch.bmm(mv, affinity) RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasSgemmStridedBatched( handle, opa, opb, m, n, k, &alpha, a, lda, stridea, b, ldb, strideb, &beta, c, ldc, stridec, num_batches)

    opened by JerryX1110 19
  • Question about training speed.

    Question about training speed.

    First of all, thank you for your great work ! ! ! Conduct the training s0 & s2 with 2×2080Ti should take 30h as your paper. But in practice, I will take 100h just for s0 with2×2080ti (or 1*3090). So I wanted to confirm the training speed. Or maybe what's wrong with me?

    opened by PinxueGuo 14
  • Multiple objects when training

    Multiple objects when training

    Hello @hkchengrex , first of all thanks for your dedicated projects, it really inspires me a lot. This is not a bug report, but a small question.

    I have been running and modifying most of the your code base to adapt for my problem and it works acceptably. However, there is a part of the code that is still unclear to me:

    As I understand (from the code and the paper), STCN trains as a binary segmentation task , but I also notice that in the code when training on VOS you use a second additional class to train as well (by randomly choice from list of classes, if I understand correctly). May I ask what is the meaning of this? I cannot find where it is mentioned in the paper. Is it possible to use more classes ? I have tried training on masks with 2 objects on a different dataset, and it seems that the result is far better than when training with a single mask for each object.

    Thanks in advance

    opened by kaylode 11
  • How to take the prediction mask of each target in the previous frame?

    How to take the prediction mask of each target in the previous frame?

    @hkchengrex Because two targets are fixed during training, prev_mask[:,0:1] and prev_mask[:,1:2] can be used to take the masks of the two targets in the previous frame. However, the number of targets during testing is not fixed. How to take the prediction mask of each target in the previous frame?

    for i, oi in enumerate(self.enabled_obj): self.prob[self.enabled_obj,ti-1].cuda()

    image

    image

    opened by longmalongma 9
  • Can I train your codes with 4 Gpus?

    Can I train your codes with 4 Gpus?

    Can I train code with 4 Gpus? What changes need to be made for the training of 4 Gpus? Also, should 4 Gpus train faster? But will the segmentation accuracy be reduced?

    opened by longmalongma 8
  •  when training, do I only use davis2017 to train or use davis2017 and davis2016 to train together?

    when training, do I only use davis2017 to train or use davis2017 and davis2016 to train together?

    Hello, when training, do I only use davis2017 to train or use davis2017 and davis2016 to train together? By the way, how can I know whether to use davis2017 or davis2017 and davis2016 to train together?

    opened by longmalongma 8
  • Where can I find the key and value of the feature code of the labeled first frame in your test code?

    Where can I find the key and value of the feature code of the labeled first frame in your test code?

    I thought you said this was the key and value coding for the first frame with the label, but why do I encode the key for frame_idx and the value for self.images[:,frame_idx]?

    image

    There is another problem. The dimension of the key here is: torch.Size([1, 512, 1, 30, 54]), and the dimension of the value is torch.Size([3, 512, 1, 30, 54]), The first dimension here is k (num_objects), what does the third dimension represent? Are the key and value here the feature encoding of the first frame with the label? Looking forward to your reply, thank you very much. image

    opened by longmalongma 7
  • Some questions about ablation study

    Some questions about ablation study

    I am curious about the performance under the following setting. Have you tried them before?

    1. STCN w/o V.S. W/ Top-k filtering
    2. STCN V.S. STM with the same training strategy (I find that the pretraining of STCN maybe uses more data and some augmentations.
    opened by JerryX1110 7
  • how to get the figure 6?

    how to get the figure 6?

    Hi, Recently i'm trying to get the correspondences of the rest frames using the first frame of my own video, but the results are bad. Could you open source the codes that can get the correspondences like figure6 in your paper? Thanks in advance.

    opened by Liesy 6
  • Are there any other training techniques that might be useful for replicating your results?

    Are there any other training techniques that might be useful for replicating your results?

    Hi, @hkchengrex ,I have trained and tested several times according to your instructions, but I still cannot reproduce the accuracy in your paper. Are there any other training techniques that might be useful for replicating your results? By the way, will adding -no_amp during training improve the accuracy even further?

    opened by longmalongma 6
  • Number of instances

    Number of instances

    Thanks for your great work and code.

    I am new to this area, it seems that the maximum instance number is 2 for your code. How to deal with images with more than 2 instances?

    opened by Trainingzy 6
  • About loss being NaN,  lr_scheduler.step(), optimizer.step()

    About loss being NaN, lr_scheduler.step(), optimizer.step()

    I've read issue #44. Like that case, I change the ResNet50 to another backbone. So I check the link you mentioned: https://discuss.pytorch.org/t/optimizer-step-before-lr-scheduler-step-error-using-gradscaler/92930/7

    And therefore i change codes as below: change

    But seems losses( all 3 losses) still being NaN and the warning of "UserWarning: Detected call of lr_scheduler.step() before optimizer.step(). In PyTorch 1.1.0 and later, you should call them in the opposite order: optimizer.step() before lr_scheduler.step(). Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate" still exsists.

    problem

    Is this normal or do I need to modify losses.py?

    Thank you.

    opened by FlyDre 2
The official implementation of CVPR 2021 Paper: Improving Weakly Supervised Visual Grounding by Contrastive Knowledge Distillation.

Improving Weakly Supervised Visual Grounding by Contrastive Knowledge Distillation This repository is the official implementation of CVPR 2021 paper:

9 Nov 14, 2022
DeepMind Alchemy task environment: a meta-reinforcement learning benchmark

The DeepMind Alchemy environment is a meta-reinforcement learning benchmark that presents tasks sampled from a task distribution with deep underlying structure.

DeepMind 188 Dec 25, 2022
Exploring Image Deblurring via Blur Kernel Space (CVPR'21)

Exploring Image Deblurring via Encoded Blur Kernel Space About the project We introduce a method to encode the blur operators of an arbitrary dataset

VinAI Research 118 Dec 19, 2022
A small demonstration of using WebDataset with ImageNet and PyTorch Lightning

A small demonstration of using WebDataset with ImageNet and PyTorch Lightning

Tom 50 Dec 16, 2022
YOLOv5 detection interface - PyQt5 implementation

所有代码已上传,直接clone后,运行yolo_win.py即可开启界面。 2021/9/29:加入置信度选择 界面是在ultralytics的yolov5基础上建立的,界面使用pyqt5实现,内容较简单,娱乐而已。 功能: 模型选择 本地文件选择(视频图片均可) 开关摄像头

487 Dec 27, 2022
Adjust Decision Boundary for Class Imbalanced Learning

Adjusting Decision Boundary for Class Imbalanced Learning This repository is the official PyTorch implementation of WVN-RS, introduced in Adjusting De

Peyton Byungju Kim 16 Jan 04, 2023
The source code of the ICCV2021 paper "PIRenderer: Controllable Portrait Image Generation via Semantic Neural Rendering"

Website | ArXiv | Get Start | Video PIRenderer The source code of the ICCV2021 paper "PIRenderer: Controllable Portrait Image Generation via Semantic

Ren Yurui 261 Jan 09, 2023
Next-Best-View Estimation based on Deep Reinforcement Learning for Active Object Classification

next_best_view_rl Setup Clone the repository: git clone --recurse-submodules ... In 'third_party/zed-ros-wrapper': git checkout devel Install mujoco `

Christian Korbach 1 Feb 15, 2022
HarDNeXt: Official HarDNeXt repository

HarDNeXt-Pytorch HarDNeXt: A Stage Receptive Field and Connectivity Aware Convolution Neural Network HarDNeXt-MSEG for Medical Image Segmentation in 0

5 May 26, 2022
This is the implementation of GGHL (A General Gaussian Heatmap Labeling for Arbitrary-Oriented Object Detection)

GGHL: A General Gaussian Heatmap Labeling for Arbitrary-Oriented Object Detection This is the implementation of GGHL 👋 👋 👋 [Arxiv] [Google Drive][B

551 Dec 31, 2022
A basic implementation of Layer-wise Relevance Propagation (LRP) in PyTorch.

Layer-wise Relevance Propagation (LRP) in PyTorch Basic unsupervised implementation of Layer-wise Relevance Propagation (Bach et al., Montavon et al.)

Kai Fabi 28 Dec 26, 2022
TrTr: Visual Tracking with Transformer

TrTr: Visual Tracking with Transformer We propose a novel tracker network based on a powerful attention mechanism called Transformer encoder-decoder a

趙 漠居(Zhao, Moju) 66 Dec 27, 2022
Code accompanying the paper "Wasserstein GAN"

Wasserstein GAN Code accompanying the paper "Wasserstein GAN" A few notes The first time running on the LSUN dataset it can take a long time (up to an

3.1k Jan 01, 2023
CompilerGym is a library of easy to use and performant reinforcement learning environments for compiler tasks

CompilerGym is a library of easy to use and performant reinforcement learning environments for compiler tasks

Facebook Research 721 Jan 03, 2023
Cross-modal Deep Face Normals with Deactivable Skip Connections

Cross-modal Deep Face Normals with Deactivable Skip Connections Victoria Fernández Abrevaya*, Adnane Boukhayma*, Philip H. S. Torr, Edmond Boyer (*Equ

72 Nov 27, 2022
Hierarchical User Intent Graph Network for Multimedia Recommendation

Hierarchical User Intent Graph Network for Multimedia Recommendation This is our Pytorch implementation for the paper: Hierarchical User Intent Graph

6 Jan 05, 2023
Model search is a framework that implements AutoML algorithms for model architecture search at scale

Model search (MS) is a framework that implements AutoML algorithms for model architecture search at scale. It aims to help researchers speed up their exploration process for finding the right model a

Google 3.2k Dec 31, 2022
Equivariant CNNs for the sphere and SO(3) implemented in PyTorch

Equivariant CNNs for the sphere and SO(3) implemented in PyTorch

Jonas Köhler 893 Dec 28, 2022
Experiments with the Robust Binary Interval Search (RBIS) algorithm, a Query-Based prediction algorithm for the Online Search problem.

OnlineSearchRBIS Online Search with Best-Price and Query-Based Predictions This is the implementation of the Robust Binary Interval Search (RBIS) algo

S. K. 1 Apr 16, 2022
Official repository for "Restormer: Efficient Transformer for High-Resolution Image Restoration". SOTA for motion deblurring, image deraining, denoising (Gaussian/real data), and defocus deblurring.

Restormer: Efficient Transformer for High-Resolution Image Restoration Syed Waqas Zamir, Aditya Arora, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan,

Syed Waqas Zamir 906 Dec 30, 2022