Temporally Efficient Vision Transformer for Video Instance Segmentation, CVPR 2022, Oral

Overview

Temporally Efficient Vision Transformer for Video Instance Segmentation

Temporally Efficient Vision Transformer for Video Instance Segmentation (CVPR 2022, Oral)

by Shusheng Yang1,3, Xinggang Wang1 πŸ“§ , Yu Li4, Yuxin Fang1, Jiemin Fang1,2, Wenyu Liu1, Xun Zhao3, Ying Shan3.

1 School of EIC, HUST, 2 AIA, HUST, 3 ARC Lab, Tencent PCG, 4 IDEA.

( πŸ“§ ) corresponding author.


  • This repo provides code, models and training/inference recipes for TeViT(Temporally Efficient Vision Transformer for Video Instance Segmentation).
  • TeViT is a transformer-based end-to-end video instance segmentation framework. We build our framework upon the query-based instance segmentation methods, i.e., QueryInst.
  • We propose a messenger shift mechanism in the transformer backbone, as well as a spatiotemporal query interaction head in the instance heads. These two designs fully utlizes both frame-level and instance-level temporal context information and obtains strong temporal modeling capacity with negligible extra computational cost.

Overall Arch

Models and Main Results

  • We provide both checkpoints and codalab server submissions on YouTube-VIS-2019 dataset.
Name AP [email protected] [email protected] [email protected] [email protected] model submission
TeViT_MsgShifT 46.3 70.6 50.9 45.2 54.3 link link
TeViT_MsgShifT_MST 46.9 70.1 52.9 45.0 53.4 link link
  • We have conducted multiple runs due to the training instability and checkpoints above are all the best one among multiple runs. The average performances are reported in our paper.
  • Besides basic models, we also provide TeViT with ResNet-50 and Swin-L backbone, models are also trained on YouTube-VIS-2019 dataset.
  • MST denotes multi-scale traning.
Name AP [email protected] [email protected] [email protected] [email protected] model submission
TeViT_R50 42.1 67.8 44.8 41.3 49.9 link link
TeViT_Swin-L_MST 56.8 80.6 63.1 52.0 63.3 link link
  • Due to backbone limitations, TeViT models with ResNet-50 and Swin-L backbone are conducted with STQI Head only (i.e., without our proposed messenger shift mechanism).
  • With Swin-L as backbone network, we apply more instance queries (i.e., from 100 to 300) and stronger data augmentation strategies. Both of them can further boost the final performance.

Installation

Prerequisites

  • Linux
  • Python 3.7+
  • CUDA 10.2+
  • GCC 5+

Prepare

  • Clone the repository locally:
git clone https://github.com/hustvl/TeViT.git
  • Create a conda virtual environment and activate it:
conda create --name tevit python=3.7.7
conda activate tevit
pip install git+https://github.com/youtubevos/cocoapi.git#"egg=pycocotools&subdirectory=PythonAPI
  • Install Python requirements
torch==1.9.0
torchvision==0.10.0
mmcv==1.4.8
pip install -r requirements.txt
  • Please follow Docs to install MMDetection
python setup.py develop
  • Download YouTube-VIS 2019 dataset from here, and organize dataset as follows:
TeViT
β”œβ”€β”€ data
β”‚   β”œβ”€β”€ youtubevis
β”‚   β”‚   β”œβ”€β”€ train
β”‚   β”‚   β”‚   β”œβ”€β”€ 003234408d
β”‚   β”‚   β”‚   β”œβ”€β”€ ...
β”‚   β”‚   β”œβ”€β”€ val
β”‚   β”‚   β”‚   β”œβ”€β”€ ...
β”‚   β”‚   β”œβ”€β”€ annotations
β”‚   β”‚   β”‚   β”œβ”€β”€ train.json
β”‚   β”‚   β”‚   β”œβ”€β”€ valid.json

Inference

python tools/test_vis.py configs/tevit/tevit_msgshift.py $PATH_TO_CHECKPOINT

After inference process, the predicted results is stored in results.json, submit it to the evaluation server to get the final performance.

Training

  • Download the COCO pretrained QueryInst with PVT-B1 backbone from here.
  • Train TeViT with 8 GPUs:
./tools/dist_train.sh configs/tevit/tevit_msgshift.py 8 --no-validate --cfg-options load_from=$PATH_TO_PRETRAINED_WEIGHT
  • Train TeViT with multi-scale data augmentation:
./tools/dist_train.sh configs/tevit/tevit_msgshift_mstrain.py 8 --no-validate --cfg-options load_from=$PATH_TO_PRETRAINED_WEIGHT
  • The whole training process will cost about three hours with 8 TESLA V100 GPUs.
  • To train TeViT with ResNet-50 or Swin-L backbone, please download the COCO pretrained weights from QueryInst.

Acknowledgement ❀️

This code is mainly based on mmdetection and QueryInst, thanks for their awesome work and great contributions to the computer vision community!

Citation

If you find our paper and code useful in your research, please consider giving a star ⭐ and citation πŸ“ :

@inproceedings{yang2022tevit,
  title={Temporally Efficient Vision Transformer for Video Instance Segmentation,
  author={Yang, Shusheng and Wang, Xinggang and Li, Yu and Fang, Yuxin and Fang, Jiemin and Liu and Zhao, Xun and Shan, Ying},
  booktitle =   {Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR)},
  year      =   {2022}
}
Owner
Hust Visual Learning Team
Hust Visual Learning Team belongs to the Artificial Intelligence Research Institute in the School of EIC in HUST, Lead by @xinggangw
Hust Visual Learning Team
A stable algorithm for GAN training

DRAGAN (Deep Regret Analytic Generative Adversarial Networks) Link to our paper - https://arxiv.org/abs/1705.07215 Pytorch implementation (thanks!) -

195 Oct 10, 2022
This repository contains the code for EMNLP-2021 paper "Word-Level Coreference Resolution"

Word-Level Coreference Resolution This is a repository with the code to reproduce the experiments described in the paper of the same name, which was a

79 Dec 27, 2022
Optimizing Value-at-Risk and Conditional Value-at-Risk of Black Box Functions with Lacing Values (LV)

BayesOpt-LV Optimizing Value-at-Risk and Conditional Value-at-Risk of Black Box Functions with Lacing Values (LV) About This repository contains the s

1 Nov 11, 2021
DeepMind's software stack for physics-based simulation and Reinforcement Learning environments, using MuJoCo.

dm_control: DeepMind Infrastructure for Physics-Based Simulation. DeepMind's software stack for physics-based simulation and Reinforcement Learning en

DeepMind 3k Dec 31, 2022
Using machine learning to predict undergrad college admissions.

College-Prediction Project- Overview: Many have tried, many have failed. Few trailblazers are ambitious enought to chase acceptance into the top 15 un

John H Klinges 1 Jan 05, 2022
A package to predict protein inter-residue geometries from sequence data

trRosetta This package is a part of trRosetta protein structure prediction protocol developed in: Improved protein structure prediction using predicte

Ivan Anishchenko 185 Jan 07, 2023
(Preprint) Official PyTorch implementation of "How Do Vision Transformers Work?"

(Preprint) Official PyTorch implementation of "How Do Vision Transformers Work?"

xxxnell 656 Dec 30, 2022
Count the MACs / FLOPs of your PyTorch model.

THOP: PyTorch-OpCounter How to install pip install thop (now continously intergrated on Github actions) OR pip install --upgrade git+https://github.co

Ligeng Zhu 3.9k Dec 29, 2022
Converts given image (png, jpg, etc) to amogus gif.

Image to Amogus Converter Converts given image (.png, .jpg, etc) to an amogus gif! Usage Place image in the /target/ folder (or anywhere realistically

Hank Magan 1 Nov 24, 2021
BanditPAM: Almost Linear-Time k-Medoids Clustering

BanditPAM: Almost Linear-Time k-Medoids Clustering This repo contains a high-performance implementation of BanditPAM from BanditPAM: Almost Linear-Tim

254 Dec 12, 2022
This is the official implementation of Elaborative Rehearsal for Zero-shot Action Recognition (ICCV2021)

Elaborative Rehearsal for Zero-shot Action Recognition This is an official implementation of: Shizhe Chen and Dong Huang, Elaborative Rehearsal for Ze

DeLightCMU 26 Sep 24, 2022
Paper: Cross-View Kernel Similarity Metric Learning Using Pairwise Constraints for Person Re-identification

Cross-View Kernel Similarity Metric Learning Using Pairwise Constraints for Person Re-identification T M Feroz Ali, Subhasis Chaudhuri, ICVGIP-20-21

T M Feroz Ali 3 Jun 17, 2022
[WACV21] Code for our paper: Samuel, Atzmon and Chechik, "From Generalized zero-shot learning to long-tail with class descriptors"

DRAGON: From Generalized zero-shot learning to long-tail with class descriptors Paper Project Website Video Overview DRAGON learns to correct the bias

Dvir Samuel 25 Dec 06, 2022
AgeGuesser: deep learning based age estimation system. Powered by EfficientNet and Yolov5

AgeGuesser AgeGuesser is an end-to-end, deep-learning based Age Estimation system, presented at the CAIP 2021 conference. You can find the related pap

5 Nov 10, 2022
CLIP: Connecting Text and Image (Learning Transferable Visual Models From Natural Language Supervision)

CLIP (Contrastive Language–Image Pre-training) Experiments (Evaluation) Model Dataset Acc (%) ViT-B/32 (Paper) CIFAR100 65.1 ViT-B/32 (Our) CIFAR100 6

Myeongjun Kim 52 Jan 07, 2023
Code for layerwise detection of linguistic anomaly paper (ACL 2021)

Layerwise Anomaly This repository contains the source code and data for our ACL 2021 paper: "How is BERT surprised? Layerwise detection of linguistic

6 Dec 07, 2022
Detecting drunk people through thermal images using Deep Learning (CNN)

Drunk Detection CNN Detecting drunk people through thermal images using Deep Learning (CNN) Dataset We used thermal images provided by Electronics Lab

Giacomo Ferretti 3 Oct 27, 2022
Adjust Decision Boundary for Class Imbalanced Learning

Adjusting Decision Boundary for Class Imbalanced Learning This repository is the official PyTorch implementation of WVN-RS, introduced in Adjusting De

Peyton Byungju Kim 16 Jan 04, 2023
TensorFlow implementation of Style Transfer Generative Adversarial Networks: Learning to Play Chess Differently.

Adversarial Chess TensorFlow implementation of Style Transfer Generative Adversarial Networks: Learning to Play Chess Differently. Requirements To run

Muthu Chidambaram 30 Sep 07, 2021
ColBERT: Contextualized Late Interaction over BERT (SIGIR'20)

Update: if you're looking for ColBERTv2 code, you can find it alongside a new simpler API, in the branch new_api. ColBERT ColBERT is a fast and accura

Stanford Future Data Systems 637 Jan 08, 2023