Temporally Efficient Vision Transformer for Video Instance Segmentation, CVPR 2022, Oral

Overview

Temporally Efficient Vision Transformer for Video Instance Segmentation

Temporally Efficient Vision Transformer for Video Instance Segmentation (CVPR 2022, Oral)

by Shusheng Yang1,3, Xinggang Wang1 πŸ“§ , Yu Li4, Yuxin Fang1, Jiemin Fang1,2, Wenyu Liu1, Xun Zhao3, Ying Shan3.

1 School of EIC, HUST, 2 AIA, HUST, 3 ARC Lab, Tencent PCG, 4 IDEA.

( πŸ“§ ) corresponding author.


  • This repo provides code, models and training/inference recipes for TeViT(Temporally Efficient Vision Transformer for Video Instance Segmentation).
  • TeViT is a transformer-based end-to-end video instance segmentation framework. We build our framework upon the query-based instance segmentation methods, i.e., QueryInst.
  • We propose a messenger shift mechanism in the transformer backbone, as well as a spatiotemporal query interaction head in the instance heads. These two designs fully utlizes both frame-level and instance-level temporal context information and obtains strong temporal modeling capacity with negligible extra computational cost.

Overall Arch

Models and Main Results

  • We provide both checkpoints and codalab server submissions on YouTube-VIS-2019 dataset.
Name AP [email protected] [email protected] [email protected] [email protected] model submission
TeViT_MsgShifT 46.3 70.6 50.9 45.2 54.3 link link
TeViT_MsgShifT_MST 46.9 70.1 52.9 45.0 53.4 link link
  • We have conducted multiple runs due to the training instability and checkpoints above are all the best one among multiple runs. The average performances are reported in our paper.
  • Besides basic models, we also provide TeViT with ResNet-50 and Swin-L backbone, models are also trained on YouTube-VIS-2019 dataset.
  • MST denotes multi-scale traning.
Name AP [email protected] [email protected] [email protected] [email protected] model submission
TeViT_R50 42.1 67.8 44.8 41.3 49.9 link link
TeViT_Swin-L_MST 56.8 80.6 63.1 52.0 63.3 link link
  • Due to backbone limitations, TeViT models with ResNet-50 and Swin-L backbone are conducted with STQI Head only (i.e., without our proposed messenger shift mechanism).
  • With Swin-L as backbone network, we apply more instance queries (i.e., from 100 to 300) and stronger data augmentation strategies. Both of them can further boost the final performance.

Installation

Prerequisites

  • Linux
  • Python 3.7+
  • CUDA 10.2+
  • GCC 5+

Prepare

  • Clone the repository locally:
git clone https://github.com/hustvl/TeViT.git
  • Create a conda virtual environment and activate it:
conda create --name tevit python=3.7.7
conda activate tevit
pip install git+https://github.com/youtubevos/cocoapi.git#"egg=pycocotools&subdirectory=PythonAPI
  • Install Python requirements
torch==1.9.0
torchvision==0.10.0
mmcv==1.4.8
pip install -r requirements.txt
  • Please follow Docs to install MMDetection
python setup.py develop
  • Download YouTube-VIS 2019 dataset from here, and organize dataset as follows:
TeViT
β”œβ”€β”€ data
β”‚   β”œβ”€β”€ youtubevis
β”‚   β”‚   β”œβ”€β”€ train
β”‚   β”‚   β”‚   β”œβ”€β”€ 003234408d
β”‚   β”‚   β”‚   β”œβ”€β”€ ...
β”‚   β”‚   β”œβ”€β”€ val
β”‚   β”‚   β”‚   β”œβ”€β”€ ...
β”‚   β”‚   β”œβ”€β”€ annotations
β”‚   β”‚   β”‚   β”œβ”€β”€ train.json
β”‚   β”‚   β”‚   β”œβ”€β”€ valid.json

Inference

python tools/test_vis.py configs/tevit/tevit_msgshift.py $PATH_TO_CHECKPOINT

After inference process, the predicted results is stored in results.json, submit it to the evaluation server to get the final performance.

Training

  • Download the COCO pretrained QueryInst with PVT-B1 backbone from here.
  • Train TeViT with 8 GPUs:
./tools/dist_train.sh configs/tevit/tevit_msgshift.py 8 --no-validate --cfg-options load_from=$PATH_TO_PRETRAINED_WEIGHT
  • Train TeViT with multi-scale data augmentation:
./tools/dist_train.sh configs/tevit/tevit_msgshift_mstrain.py 8 --no-validate --cfg-options load_from=$PATH_TO_PRETRAINED_WEIGHT
  • The whole training process will cost about three hours with 8 TESLA V100 GPUs.
  • To train TeViT with ResNet-50 or Swin-L backbone, please download the COCO pretrained weights from QueryInst.

Acknowledgement ❀️

This code is mainly based on mmdetection and QueryInst, thanks for their awesome work and great contributions to the computer vision community!

Citation

If you find our paper and code useful in your research, please consider giving a star ⭐ and citation πŸ“ :

@inproceedings{yang2022tevit,
  title={Temporally Efficient Vision Transformer for Video Instance Segmentation,
  author={Yang, Shusheng and Wang, Xinggang and Li, Yu and Fang, Yuxin and Fang, Jiemin and Liu and Zhao, Xun and Shan, Ying},
  booktitle =   {Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR)},
  year      =   {2022}
}
Owner
Hust Visual Learning Team
Hust Visual Learning Team belongs to the Artificial Intelligence Research Institute in the School of EIC in HUST, Lead by @xinggangw
Hust Visual Learning Team
[CVPR2021 Oral] UP-DETR: Unsupervised Pre-training for Object Detection with Transformers

UP-DETR: Unsupervised Pre-training for Object Detection with Transformers This is the official PyTorch implementation and models for UP-DETR paper: @a

dddzg 430 Dec 23, 2022
Summary of related papers on visual attention

This repo is built for paper: Attention Mechanisms in Computer Vision: A Survey paper Vision-Attention-Papers Channel attention Spatial attention Temp

MenghaoGuo 2.1k Dec 30, 2022
DL & CV-based indicator toolset for the vehicle drivers via live dash-cam footage.

Vehicle Indicator Toolset Deep Learning and Computer Vision based indicator toolset for vehicle drivers using live dash-cam footages. Tracking of vehi

Alex Xu 12 Dec 28, 2021
This is the official github repository of the Met dataset

The Met dataset This is the official github repository of the Met dataset. The official webpage of the dataset can be found here. What is it? This cod

Nikolaos-Antonios Ypsilantis 35 Dec 17, 2022
Multi Camera Calibration

Multi Camera Calibration 'modules/camera_calibration/app/camera_calibration.cpp' is for calculating extrinsic parameter of each individual cameras. 'm

7 Dec 01, 2022
[EMNLP 2020] Keep CALM and Explore: Language Models for Action Generation in Text-based Games

Contextual Action Language Model (CALM) and the ClubFloyd Dataset Code and data for paper Keep CALM and Explore: Language Models for Action Generation

Princeton Natural Language Processing 43 Dec 16, 2022
R-Drop: Regularized Dropout for Neural Networks

R-Drop: Regularized Dropout for Neural Networks R-drop is a simple yet very effective regularization method built upon dropout, by minimizing the bidi

756 Dec 27, 2022
Official Implementation of SWAGAN: A Style-based Wavelet-driven Generative Model

Official Implementation of SWAGAN: A Style-based Wavelet-driven Generative Model SWAGAN: A Style-based Wavelet-driven Generative Model Rinon Gal, Dana

55 Dec 06, 2022
DeepHyper: Scalable Asynchronous Neural Architecture and Hyperparameter Search for Deep Neural Networks

What is DeepHyper? DeepHyper is a software package that uses learning, optimization, and parallel computing to automate the design and development of

DeepHyper Team 214 Jan 08, 2023
MHFormer: Multi-Hypothesis Transformer for 3D Human Pose Estimation

MHFormer: Multi-Hypothesis Transformer for 3D Human Pose Estimation This repo is the official implementation of "MHFormer: Multi-Hypothesis Transforme

Vegetabird 281 Jan 07, 2023
Yolov5+SlowFast: Realtime Action Detection Based on PytorchVideo

Yolov5+SlowFast: Realtime Action Detection A realtime action detection frame work based on PytorchVideo. Here are some details about our modification:

WuFan 181 Dec 30, 2022
PyTorch implementation for 3D human pose estimation

Towards 3D Human Pose Estimation in the Wild: a Weakly-supervised Approach This repository is the PyTorch implementation for the network presented in:

Xingyi Zhou 579 Dec 22, 2022
This repository contains the code for the CVPR 2020 paper "Differentiable Volumetric Rendering: Learning Implicit 3D Representations without 3D Supervision"

Differentiable Volumetric Rendering Paper | Supplementary | Spotlight Video | Blog Entry | Presentation | Interactive Slides | Project Page This repos

697 Jan 06, 2023
Official Keras Implementation for UNet++ in IEEE Transactions on Medical Imaging and DLMIA 2018

UNet++: A Nested U-Net Architecture for Medical Image Segmentation UNet++ is a new general purpose image segmentation architecture for more accurate i

Zongwei Zhou 1.8k Dec 27, 2022
A state-of-the-art semi-supervised method for image recognition

Mean teachers are better role models Paper ---- NIPS 2017 poster ---- NIPS 2017 spotlight slides ---- Blog post By Antti Tarvainen, Harri Valpola (The

Curious AI 1.4k Jan 06, 2023
FCN (Fully Convolutional Network) is deep fully convolutional neural network architecture for semantic pixel-wise segmentation

FCN_via_Keras FCN FCN (Fully Convolutional Network) is deep fully convolutional neural network architecture for semantic pixel-wise segmentation. This

Kento Watanabe 48 Aug 30, 2022
Collection of generative models in Tensorflow

tensorflow-generative-model-collections Tensorflow implementation of various GANs and VAEs. Related Repositories Pytorch version Pytorch version of th

3.8k Dec 30, 2022
Code for the Image similarity challenge.

ISC 2021 This repository contains code for the Image Similarity Challenge 2021. Getting started The docs subdirectory has step-by-step instructions on

Facebook Research 173 Dec 12, 2022
This is the official implementation of our proposed SwinMR

SwinMR This is the official implementation of our proposed SwinMR: Swin Transformer for Fast MRI Please cite: @article{huang2022swin, title={Swi

A Yang Lab (led by Dr Guang Yang) 27 Nov 17, 2022
Crab is a flexible, fast recommender engine for Python that integrates classic information filtering recommendation algorithms in the world of scientific Python packages (numpy, scipy, matplotlib).

Crab - A Recommendation Engine library for Python Crab is a flexible, fast recommender engine for Python that integrates classic information filtering r

python-recsys 1.2k Dec 21, 2022