End-to-End Referring Video Object Segmentation with Multimodal Transformers

This repo contains the official implementation of the paper:


End-to-End Referring Video Object Segmentation with Multimodal Transformers

How to Run the Code

First, clone this repo to your local machine using:

git clone https://github.com/mttr2021/MTTR.git

Dataset Requirements

A2D-Sentences

Follow the instructions here to download the dataset. Then, extract and organize the files inside your cloned repo directory as follows (note that only the necessary files are shown):

MTTR/
└── a2d_sentences/ 
    ├── Release/
    │   ├── videoset.csv  (videos metadata file)
    │   └── CLIPS320/
    │       └── *.mp4     (video files)
    └── text_annotations/
        ├── a2d_annotation.txt  (actual text annotations)
        ├── a2d_missed_videos.txt
        └── a2d_annotation_with_instances/ 
            └── */ (video folders)
                └── *.h5 (annotations files) 

JHMDB-Sentences

Follow the instructions here to download the dataset. Then, extract and organize the files inside your cloned repo directory as follows (note that only the necessary files are shown):

MTTR/
└── jhmdb_sentences/ 
    ├── Rename_Images/  (frame images)
    │   └── */ (action dirs)
    ├── puppet_mask/  (mask annotations)
    │   └── */ (action dirs)
    └── jhmdb_annotation.txt  (text annotations)

Refer-YouTube-VOS

Download the dataset from the competition's website here.

Note that you may be required to sign up to the competition in order to get access to the dataset. This registration process is free and short.

Then, extract and organize the files inside your cloned repo directory as follows (note that only the necessary files are shown):

MTTR/
└── refer_youtube_vos/ 
    ├── train/
    │   ├── JPEGImages/
    │   │   └── */ (video folders)
    │   │       └── *.jpg (frame image files) 
    │   └── Annotations/
    │       └── */ (video folders)
    │           └── *.png (mask annotation files) 
    ├── valid/
    │   └── JPEGImages/
    │       └── */ (video folders)
    │           └── *.jpg (frame image files) 
    └── meta_expressions/
        ├── train/
        │   └── meta_expressions.json  (text annotations)
        └── valid/
            └── meta_expressions.json  (text annotations)
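Before running anything, it can help to confirm that the three dataset folders match the trees above. The following is a minimal sketch, not part of the repo: the helper name `missing_paths` and the `EXPECTED_LAYOUT` table are hypothetical, and only the fixed files and folders from the trees are checked (the `*` patterns are not).

```python
from pathlib import Path

# Required paths per dataset, copied from the directory trees above.
# Wildcard entries (video folders, frame files) are intentionally not checked.
EXPECTED_LAYOUT = {
    "a2d_sentences": [
        "Release/videoset.csv",
        "Release/CLIPS320",
        "text_annotations/a2d_annotation.txt",
        "text_annotations/a2d_missed_videos.txt",
        "text_annotations/a2d_annotation_with_instances",
    ],
    "jhmdb_sentences": [
        "Rename_Images",
        "puppet_mask",
        "jhmdb_annotation.txt",
    ],
    "refer_youtube_vos": [
        "train/JPEGImages",
        "train/Annotations",
        "valid/JPEGImages",
        "meta_expressions/train/meta_expressions.json",
        "meta_expressions/valid/meta_expressions.json",
    ],
}


def missing_paths(repo_root, dataset):
    """Return the expected paths of `dataset` that do not exist under `repo_root`."""
    base = Path(repo_root) / dataset
    return [rel for rel in EXPECTED_LAYOUT[dataset] if not (base / rel).exists()]


if __name__ == "__main__":
    # Run from inside the cloned MTTR/ directory.
    for name in EXPECTED_LAYOUT:
        missing = missing_paths(".", name)
        status = "ok" if not missing else "missing: " + ", ".join(missing)
        print(f"{name}: {status}")
```

Only the datasets you actually intend to use need to pass this check.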

Environment Installation

The code was tested in a Conda environment installed on Ubuntu 18.04. Install Conda and then create an environment as follows:

conda create -n mttr python=3.9.7 pip -y

conda activate mttr

  • PyTorch 1.10:

conda install pytorch==1.10.0 torchvision==0.11.1 -c pytorch -c conda-forge

Note that the command above does not pin a cudatoolkit version; you might have to add one that matches your system's CUDA version.

  • Hugging Face transformers 4.11.3:

pip install transformers==4.11.3

  • COCO API (for mAP calculations):

pip install -U 'git+https://github.com/cocodataset/cocoapi.git#subdirectory=PythonAPI'

  • Additional required packages:

pip install h5py wandb opencv-python protobuf av einops ruamel.yaml timm joblib

conda install -c conda-forge pandas matplotlib cython scipy cupy
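Once the environment is set up, the pinned versions can be compared with what is actually installed. The sketch below is an illustration under assumptions rather than part of the repo: `PINNED` and `find_mismatches` are hypothetical names, and it assumes the packages expose their metadata under the distribution names `torch`, `torchvision` and `transformers`.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions pinned by the installation commands above.
PINNED = {
    "torch": "1.10.0",
    "torchvision": "0.11.1",
    "transformers": "4.11.3",
}


def find_mismatches(pinned, get_version=version):
    """Map each package whose installed version differs from its pin to (pinned, installed)."""
    mismatches = {}
    for name, wanted in pinned.items():
        try:
            installed = get_version(name)
        except PackageNotFoundError:
            installed = None  # the package is not installed at all
        # Ignore local build suffixes such as "+cu113" when comparing.
        if installed is None or installed.split("+")[0] != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches


if __name__ == "__main__":
    for name, (wanted, installed) in find_mismatches(PINNED).items():
        print(f"{name}: expected {wanted}, found {installed}")
```

An empty result means the three pinned packages match; the remaining packages are installed without version pins above, so they are not checked.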

Running Configuration

The following table lists the parameters which can be configured directly from the command line.

The rest of the running/model parameters for each dataset can be configured in configs/DATASET_NAME.yaml.

Note that in order to run the code the path of the relevant .yaml config file needs to be supplied using the -c parameter.

| Command | Description |
|---------|-------------|
| -c | path to dataset configuration file |
| -rm | running mode (train/eval) |
| -ws | window size |
| -bs | training batch size per GPU |
| -ebs | eval batch size per GPU (if not provided, training batch size is used) |
| -ng | number of GPUs to run on |
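To make the flag semantics concrete, here is a small argparse sketch that mirrors the table, including the -ebs fallback to -bs. It is not the parser used in main.py: the destination names, the choice to make -c and -rm required, and the absence of defaults are all assumptions made for illustration.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the documented flags (illustrative only)."""
    parser = argparse.ArgumentParser(description="Sketch of the documented MTTR flags")
    parser.add_argument("-c", dest="config_path", required=True,
                        help="path to dataset configuration file")
    parser.add_argument("-rm", dest="running_mode", choices=["train", "eval"],
                        required=True, help="running mode")
    parser.add_argument("-ws", dest="window_size", type=int, help="window size")
    parser.add_argument("-bs", dest="batch_size", type=int,
                        help="training batch size per GPU")
    parser.add_argument("-ebs", dest="eval_batch_size", type=int,
                        help="eval batch size per GPU")
    parser.add_argument("-ng", dest="num_gpus", type=int,
                        help="number of GPUs to run on")
    return parser


def parse_args(argv=None):
    """Parse `argv` and apply the documented -ebs fallback."""
    args = build_parser().parse_args(argv)
    # Without -ebs, the training batch size is reused for evaluation.
    if args.eval_batch_size is None:
        args.eval_batch_size = args.batch_size
    return args


if __name__ == "__main__":
    demo = parse_args(["-c", "configs/a2d_sentences.yaml", "-rm", "eval",
                       "-ws", "10", "-bs", "3", "-ng", "2"])
    print(demo.running_mode, demo.window_size, demo.eval_batch_size)
```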

Evaluation

The following commands can be used to reproduce the main results of our paper using the supplied checkpoint files.

The commands were tested on RTX 3090 24GB GPUs, but it may be possible to run some of them using GPUs with less memory by decreasing the batch-size -bs parameter.

A2D-Sentences

| Window Size | Command | Checkpoint File | mAP Result |
|-------------|---------|-----------------|------------|
| 10 | python main.py -rm eval -c configs/a2d_sentences.yaml -ws 10 -bs 3 -ckpt CHECKPOINT_PATH -ng 2 | Link | 46.1 |
| 8 | python main.py -rm eval -c configs/a2d_sentences.yaml -ws 8 -bs 3 -ckpt CHECKPOINT_PATH -ng 2 | Link | 44.7 |

JHMDB-Sentences

The following commands evaluate our A2D-Sentences-pretrained model on JHMDB-Sentences without additional training.

For this purpose, as explained in our paper, we uniformly sample three frames from each video. To ensure proper reproduction of our results on other machines we include the metadata of the sampled frames under datasets/jhmdb_sentences/jhmdb_sentences_samples_metadata.json. This file is automatically loaded during the evaluation process with the commands below.

To avoid using this file and force sampling different frames, change the seed and generate_new_samples_metadata parameters under MTTR/configs/jhmdb_sentences.yaml.
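The idea of fixing the sampled frames through a seed can be sketched as follows. This is a hypothetical illustration, not the repo's sampling code: it assumes that "uniformly sample" means drawing distinct frame indices uniformly at random, and the function name, the video identifiers and the JSON layout are made up and need not match jhmdb_sentences_samples_metadata.json.

```python
import json
import random


def build_samples_metadata(video_lengths, num_samples=3, seed=0):
    """Pick `num_samples` distinct frame indices per video, reproducibly for a given seed."""
    rng = random.Random(seed)  # local generator, leaves the global random state untouched
    metadata = {}
    for video_id in sorted(video_lengths):  # fixed iteration order keeps draws reproducible
        num_frames = video_lengths[video_id]
        k = min(num_samples, num_frames)  # short videos yield fewer samples
        metadata[video_id] = sorted(rng.sample(range(num_frames), k))
    return metadata


if __name__ == "__main__":
    # Hypothetical video ids and frame counts.
    lengths = {"video_a": 40, "video_b": 25, "video_c": 2}
    print(json.dumps(build_samples_metadata(lengths, seed=42), indent=2))
```

Storing the result once and reloading it, as the supplied metadata file does, removes any dependence on how a particular machine or library version generates random numbers.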

| Window Size | Command | Checkpoint File | mAP Result |
|-------------|---------|-----------------|------------|
| 10 | python main.py -rm eval -c configs/jhmdb_sentences.yaml -ws 10 -bs 3 -ckpt CHECKPOINT_PATH -ng 2 | Link | 39.2 |
| 8 | python main.py -rm eval -c configs/jhmdb_sentences.yaml -ws 8 -bs 3 -ckpt CHECKPOINT_PATH -ng 2 | Link | 36.6 |

Refer-YouTube-VOS

The following command evaluates our model on the public validation subset of the Refer-YouTube-VOS dataset. Since annotations are not publicly available for this subset, our code generates a zip file with the predicted masks under MTTR/runs/[RUN_DATE_TIME]/validation_outputs/submission_epoch_0.zip. This zip needs to be uploaded to the competition server for evaluation. For your convenience, we supply this zip file here as well.

| Window Size | Command | Checkpoint File | Output Zip | J&F Result |
|-------------|---------|-----------------|------------|------------|
| 12 | python main.py -rm eval -c configs/refer_youtube_vos.yaml -ws 12 -bs 1 -ckpt CHECKPOINT_PATH -ng 8 | Link | Link | 55.32 |

Training

First, download the Kinetics-400 pretrained weights of Video Swin Transformer from this link. Note that these weights were originally published in the Video Swin Transformer repo here.

Place the downloaded file inside your cloned repo directory as MTTR/pretrained_swin_transformer/swin_tiny_patch244_window877_kinetics400_1k.pth.

Next, the following commands can be used to train MTTR as described in our paper.

Note that it may be possible to run some of these commands on GPUs with less memory than the ones mentioned below by decreasing the batch-size -bs or window-size -ws parameters. However, changing these parameters may also affect the final performance of the model.

A2D-Sentences

  • The command for the following configuration was tested on 2 A6000 48GB GPUs:

| Window Size | Command |
|-------------|---------|
| 10 | python main.py -rm train -c configs/a2d_sentences.yaml -ws 10 -bs 3 -ng 2 |

  • The command for the following configuration was tested on 3 RTX 3090 24GB GPUs:

| Window Size | Command |
|-------------|---------|
| 8 | python main.py -rm train -c configs/a2d_sentences.yaml -ws 8 -bs 2 -ng 3 |

Refer-YouTube-VOS

  • The command for the following configuration was tested on 4 A6000 48GB GPUs:

| Window Size | Command |
|-------------|---------|
| 12 | python main.py -rm train -c configs/refer_youtube_vos.yaml -ws 12 -bs 1 -ng 4 |

  • The command for the following configuration was tested on 8 RTX 3090 24GB GPUs:

| Window Size | Command |
|-------------|---------|
| 8 | python main.py -rm train -c configs/refer_youtube_vos.yaml -ws 8 -bs 1 -ng 8 |

Note that this last configuration was not mentioned in our paper. However, it is more memory efficient than the original configuration (window size 12) while producing a model which is almost as good (J&F of 54.56 in our experiments).
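Because -bs is a per-GPU value, the number of windows processed per training step depends on both -bs and -ng. The arithmetic for the four training commands above can be sketched as follows, assuming each GPU processes its own batch of -bs windows in parallel; the names `TRAIN_CONFIGS` and `total_batch_size` are hypothetical.

```python
# (dataset, window size, -bs, -ng), taken from the training commands above.
TRAIN_CONFIGS = [
    ("a2d_sentences", 10, 3, 2),
    ("a2d_sentences", 8, 2, 3),
    ("refer_youtube_vos", 12, 1, 4),
    ("refer_youtube_vos", 8, 1, 8),
]


def total_batch_size(per_gpu_batch_size, num_gpus):
    """Windows per step across all GPUs, assuming one batch of -bs windows per GPU."""
    return per_gpu_batch_size * num_gpus


if __name__ == "__main__":
    for dataset, window_size, bs, ng in TRAIN_CONFIGS:
        total = total_batch_size(bs, ng)
        frames = total * window_size  # frames seen per step across all GPUs
        print(f"{dataset} (ws={window_size}): {total} windows, {frames} frames per step")
```

Under this assumption both A2D-Sentences commands train with 6 windows per step, while the two Refer-YouTube-VOS commands differ (4 versus 8), which is worth keeping in mind when lowering -bs or -ng to fit smaller GPUs.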

JHMDB-Sentences

As explained in our paper, JHMDB-Sentences is used exclusively for evaluation, so training on this dataset is not supported at this time.
