[CVPR'22] Official PyTorch Implementation of Collaborative Transformers for Grounded Situation Recognition

Overview

[CVPR'22] Collaborative Transformers for Grounded Situation Recognition

Paper | Model Checkpoint

  • This is the official PyTorch implementation of Collaborative Transformers for Grounded Situation Recognition.
  • CoFormer (Collaborative Glance-Gaze TransFormer) achieves state-of-the-art accuracy in every evaluation metric on the SWiG dataset.
  • This repository contains instructions, code and model checkpoint.

prediction_results


Overview

Grounded situation recognition is the task of predicting the main activity, entities playing certain roles within the activity, and bounding-box groundings of the entities in the given image. To effectively deal with this challenging task, we introduce a novel approach where the two processes for activity classification and entity estimation are interactive and complementary. To implement this idea, we propose Collaborative Glance-Gaze TransFormer (CoFormer) that consists of two modules: Glance transformer for activity classification and Gaze transformer for entity estimation. Glance transformer predicts the main activity with the help of Gaze transformer that analyzes entities and their relations, while Gaze transformer estimates the grounded entities by focusing only on the entities relevant to the activity predicted by Glance transformer. Our CoFormer achieves the state of the art in all evaluation metrics on the SWiG dataset.

overall_architecture Following conventions in the literature, we call an activity verb and an entity noun. Glance transformer predicts a verb with the help of Gaze-Step1 transformer that analyzes nouns and their relations by leveraging role features, while Gaze-Step2 transformer estimates the grounded nouns for the roles associated with the predicted verb. Prediction results are obtained by feed forward networks (FFNs).

Environment Setup

We provide instructions for environment setup.

# Clone this repository and navigate into the repository
git clone https://github.com/jhcho99/CoFormer.git    
cd CoFormer                                          

# Create a conda environment, activate the environment and install PyTorch via conda
conda create --name CoFormer python=3.9              
conda activate CoFormer                             
conda install pytorch==1.8.0 torchvision==0.9.0 cudatoolkit=11.1 -c pytorch -c conda-forge 

# Install requirements via pip
pip install -r requirements.txt                   

SWiG Dataset

Annotations are given in JSON format, and annotation files are under "SWiG/SWiG_jsons/" directory. Images can be downloaded here. Please download the images and store them in "SWiG/images_512/" directory.

In the SWiG dataset, each image is associated with Verb, Frame and Groundings.

A) Verb: each image is paired with a verb. In the annotation file, "verb" denotes the salient action for an image.

B) Frame: a frame denotes the set of semantic roles for a verb. For example, the frame for verb "Drinking" denotes the set of semantic roles "Agent", "Liquid", "Container" and "Place". In the annotation file, "frames" show the set of semantic roles for a verb, and noun annotations for each role. There are three noun annotations for each role, which are given by three different annotators.

C) Groundings: each grounding is described in [x1, y1, x2, y2] format. In the annotation file, "bb" denotes bounding-box groundings for roles. Note that nouns can be labeled without groundings, e.g., in the case of occluded objects. When there is no grounding for a role, [-1, -1, -1, -1] is given.

# an example of annotation for an image

"drinking_235.jpg": {
    "verb": "drinking",
    "height": 512, 
    "width": 657, 
    "bb": {"agent": [0, 1, 654, 512], 
           "liquid": [128, 273, 293, 382], 
           "container": [111, 189, 324, 408],
           "place": [-1, -1, -1, -1]},
    "frames": [{"agent": "n10787470", "liquid": "n14845743", "container": "n03438257", "place": ""}, 
               {"agent": "n10129825", "liquid": "n14845743", "container": "n03438257", "place": ""}, 
               {"agent": "n10787470", "liquid": "n14845743", "container": "n03438257", "place": ""}]
    }

In imsitu_space.json file, there is additional information for verb and noun.

# an example of additional verb information

"drinking": {
    "framenet": "Ingestion", 
    "abstract": "the AGENT drinks a LIQUID from a CONTAINER at a PLACE", 
    "def": "take (a liquid) into the mouth and swallow", 
    "order": ["agent", "liquid", "container", "place"], 
    "roles": {"agent": {"framenet": "ingestor", "def": "The entity doing the drink action"},
              "liquid": {"framenet": "ingestibles", "def": "The entity that the agent is drinking"}
              "container": {"framenet": "source", "def": "The container in which the liquid is in"}        
              "place": {"framenet": "place", "def": "The location where the drink event is happening"}}
    }
# an example of additional noun information

"n14845743": {
    "gloss": ["water", "H2O"], 
    "def": "binary compound that occurs at room temperature as a clear colorless odorless tasteless liquid; freezes into ice below 0 degrees centigrade and boils above 100 degrees centigrade; widely used as a solvent"
    }

Additional Details

  • All images should be under "SWiG/images_512/" directory.
  • train.json file is for train set.
  • dev.json file is for development set.
  • test.json file is for test set.

Training

To train CoFormer on a single node with 4 GPUs for 40 epochs, run:

python -m torch.distributed.launch --nproc_per_node=4 --use_env main.py \
           --backbone resnet50 --batch_size 16 --dataset_file swig --epochs 40 \
           --num_workers 4 --num_glance_enc_layers 3 --num_gaze_s1_dec_layers 3 \
           --num_gaze_s1_enc_layers 3 --num_gaze_s2_dec_layers 3 --dropout 0.15 --hidden_dim 512 \
           --output_dir CoFormer

To train CoFormer on a Slurm cluster with submitit using 4 RTX 3090 GPUs for 40 epochs, run:

python run_with_submitit.py --ngpus 4 --nodes 1 --job_dir CoFormer \
        --backbone resnet50 --batch_size 16 --dataset_file swig --epochs 40 \
        --num_workers 4 --num_glance_enc_layers 3 --num_gaze_s1_dec_layers 3 \
        --num_gaze_s1_enc_layers 3 --num_gaze_s2_dec_layers 3 --dropout 0.15 --hidden_dim 512 \
        --partition rtx3090
  • A single epoch takes about 45 minutes. Training CoFormer for 40 epochs takes around 30 hours on a single machine with 4 RTX 3090 GPUs.
  • We use AdamW optimizer with learning rate 10-4 (10-5 for backbone), weight decay 10-4 and β = (0.9, 0.999).
    • Those learning rates are divided by 10 at epoch 30.
  • Random Color Jittering, Random Gray Scaling, Random Scaling and Random Horizontal Flipping are used for augmentation.

Evaluation

To evaluate CoFormer on the dev set with the saved model, run:

python main.py --saved_model CoFormer_checkpoint.pth --output_dir CoFormer --dev

To evaluate CoFormer on the test set with the saved model, run:

python main.py --saved_model CoFormer_checkpoint.pth --output_dir CoFormer --test
  • Model checkpoint can be downloaded here.

Inference

To run an inference on a custom image, run:

python inference.py --image_path inference/filename.jpg \
                    --saved_model CoFormer_checkpoint.pth \
                    --output_dir inference

Results

We provide several experimental results.

quantitative qualitative_1 qualitative_2

Our Previous Work

We proposed GSRTR for this task using a simple transformer encoder-decoder architecture:

Acknowledgements

Our code is modified and adapted from these amazing repositories:

Contact

Junhyeong Cho ([email protected])

Citation

If you find our work useful for your research, please cite our paper:

@InProceedings{cho2022CoFormer,
    title={Collaborative Transformers for Grounded Situation Recognition},
    author={Junhyeong Cho and Youngseok Yoon and Suha Kwak},
    booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    year={2022}
}

License

CoFormer is released under the Apache 2.0 license. Please see the LICENSE file for more information.

Owner
Junhyeong Cho
Studied @ POSTECH, Stanford, UIUC, UC Berkeley
Junhyeong Cho
Official PyTorch code of Holistic 3D Scene Understanding from a Single Image with Implicit Representation (CVPR 2021)

Implicit3DUnderstanding (Im3D) [Project Page] Holistic 3D Scene Understanding from a Single Image with Implicit Representation Cheng Zhang, Zhaopeng C

Cheng Zhang 149 Jan 08, 2023
this is a lite easy to use virtual keyboard project for anyone to use

virtual_Keyboard this is a lite easy to use virtual keyboard project for anyone to use motivation I made this for this year's recruitment for RobEn AA

Mohamed Emad 3 Oct 23, 2021
Pytorch implementation of paper: "NeurMiPs: Neural Mixture of Planar Experts for View Synthesis"

NeurMips: Neural Mixture of Planar Experts for View Synthesis This is the official repo for PyTorch implementation of paper "NeurMips: Neural Mixture

James Lin 101 Dec 13, 2022
A numpy-based implementation of RANSAC for fundamental matrix and homography estimation. The degeneracy updating and local optimization components are included and optional.

Description A numpy-based implementation of RANSAC for fundamental matrix and homography estimation. The degeneracy updating and local optimization co

AoxiangFan 9 Nov 10, 2022
Viewmaker Networks: Learning Views for Unsupervised Representation Learning

Viewmaker Networks: Learning Views for Unsupervised Representation Learning Alex Tamkin, Mike Wu, and Noah Goodman Paper link: https://arxiv.org/abs/2

Alex Tamkin 31 Dec 01, 2022
A highly modular PyTorch framework with a focus on Neural Architecture Search (NAS).

UniNAS A highly modular PyTorch framework with a focus on Neural Architecture Search (NAS). under development (which happens mostly on our internal Gi

Cognitive Systems Research Group 19 Nov 23, 2022
A crossplatform menu bar application using mpv as DLNA Media Renderer.

Macast Chinese README A menu bar application using mpv as DLNA Media Renderer. Install MacOS || Windows || Debian Download link: Macast release latest

4.4k Jan 01, 2023
This is the latest version of the PULP SDK

PULP-SDK This is the latest version of the PULP SDK, which is under active development. The previous (now legacy) version, which is no longer supporte

78 Dec 07, 2022
[CVPR2021 Oral] FFB6D: A Full Flow Bidirectional Fusion Network for 6D Pose Estimation.

FFB6D This is the official source code for the CVPR2021 Oral work, FFB6D: A Full Flow Biderectional Fusion Network for 6D Pose Estimation. (Arxiv) Tab

Yisheng (Ethan) He 201 Dec 28, 2022
Official implementation of "Variable-Rate Deep Image Compression through Spatially-Adaptive Feature Transform", ICCV 2021

Variable-Rate Deep Image Compression through Spatially-Adaptive Feature Transform This repository is the implementation of "Variable-Rate Deep Image C

Myungseo Song 47 Dec 13, 2022
The implementation our EMNLP 2021 paper "Enhanced Language Representation with Label Knowledge for Span Extraction".

LEAR The implementation our EMNLP 2021 paper "Enhanced Language Representation with Label Knowledge for Span Extraction". See below for an overview of

杨攀 93 Jan 07, 2023
Machine learning, in numpy

numpy-ml Ever wish you had an inefficient but somewhat legible collection of machine learning algorithms implemented exclusively in NumPy? No? Install

David Bourgin 11.6k Dec 30, 2022
Image super-resolution through deep learning

srez Image super-resolution through deep learning. This project uses deep learning to upscale 16x16 images by a 4x factor. The resulting 64x64 images

David Garcia 5.3k Dec 28, 2022
Provide partial dates and retain the date precision through processing

Prefix date parser This is a helper class to parse dates with varied degrees of precision. For example, a data source might state a date as 2001, 2001

Friedrich Lindenberg 13 Dec 14, 2022
Paddle pit - Rethinking Spatial Dimensions of Vision Transformers

基于Paddle实现PiT ——Rethinking Spatial Dimensions of Vision Transformers,arxiv 官方原版代

Hongtao Wen 4 Jan 15, 2022
This script runs neural style transfer against the provided content image.

Neural Style Transfer Content Style Output Description: This script runs neural style transfer against the provided content image. The content image m

Martynas Subonis 0 Nov 25, 2021
K-PLUG: Knowledge-injected Pre-trained Language Model for Natural Language Understanding and Generation in E-Commerce (EMNLP Founding 2021)

Introduction K-PLUG: Knowledge-injected Pre-trained Language Model for Natural Language Understanding and Generation in E-Commerce. Installation PyTor

Xu Song 21 Nov 16, 2022
Additional environments compatible with OpenAI gym

Decentralized Control of Quadrotor Swarms with End-to-end Deep Reinforcement Learning A codebase for training reinforcement learning policies for quad

Zhehui Huang 40 Dec 06, 2022
This repository contains the source codes for the paper AtlasNet V2 - Learning Elementary Structures.

AtlasNet V2 - Learning Elementary Structures This work was build upon Thibault Groueix's AtlasNet and 3D-CODED projects. (you might want to have a loo

Théo Deprelle 123 Nov 11, 2022
iBOT: Image BERT Pre-Training with Online Tokenizer

Image BERT Pre-Training with iBOT Official PyTorch implementation and pretrained models for paper iBOT: Image BERT Pre-Training with Online Tokenizer.

Bytedance Inc. 435 Jan 06, 2023