[CVPR 2022 Oral] TubeDETR: Spatio-Temporal Video Grounding with Transformers

Overview


Website | STVG Demo | Paper


This repository provides the code for our paper. This includes:

  • Software setup, data downloading and preprocessing instructions for the VidSTG, HC-STVG1 and HC-STVG2.0 datasets
  • Training scripts and pretrained checkpoints
  • Evaluation scripts and demo

Setup

Download FFMPEG and add it to the PATH environment variable. The code was tested with version ffmpeg-4.2.2-amd64-static. Then create a conda environment and install the requirements with the following commands:

conda create -n tubedetr_env python=3.8
conda activate tubedetr_env
pip install -r requirements.txt
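
You can then check that FFMPEG is visible from the environment with a quick sanity check (not part of the original setup, just a verification step):

ffmpeg -version   # the code was tested with ffmpeg-4.2.2-amd64-static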

Data Downloading

Set up the paths where you are going to download videos and annotations in the config JSON files.

VidSTG: Download the VidOR videos and annotations from the VidOR dataset providers, then download the VidSTG annotations from the VidSTG dataset providers. The vidstg_vid_path folder should contain a folder named video containing the unzipped video folders. The vidstg_ann_path folder should contain both the VidOR and VidSTG annotations.

HC-STVG: Download the HC-STVG1 and HC-STVG2.0 videos and annotations from the HC-STVG dataset providers. The hcstvg_vid_path folder should contain a folder named video containing the unzipped video folders. The hcstvg_ann_path folder should contain both the HC-STVG1 and HC-STVG2.0 annotations.
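
For reference, the directory layout implied by the descriptions above looks as follows (the exact annotation file names depend on the downloads):

vidstg_vid_path/
  video/           # unzipped VidOR video folders
vidstg_ann_path/   # VidOR and VidSTG annotation files
hcstvg_vid_path/
  video/           # unzipped HC-STVG video folders
hcstvg_ann_path/   # HC-STVG1 and HC-STVG2.0 annotation files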

Data Preprocessing

To preprocess annotation files, run:

python preproc/preproc_vidstg.py
python preproc/preproc_hcstvg.py
python preproc/preproc_hcstvgv2.py

Training

Download the pretrained RoBERTa tokenizer and model weights to the TRANSFORMERS_CACHE folder, the pretrained ResNet-101 model weights to the TORCH_HOME folder, and the MDETR pretrained model weights with ResNet-101 backbone to the current folder.
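
One way to populate the two caches is sketched below; it assumes the standard roberta-base and torchvision ResNet-101 weights, which is our reading of the instructions above. The MDETR checkpoint itself must still be fetched from the MDETR providers.

export TRANSFORMERS_CACHE=/path/to/transformers_cache
export TORCH_HOME=/path/to/torch_home
# Cache the RoBERTa tokenizer and model weights (assumes roberta-base)
python -c "from transformers import RobertaTokenizerFast, RobertaModel; RobertaTokenizerFast.from_pretrained('roberta-base'); RobertaModel.from_pretrained('roberta-base')"
# Cache the ImageNet-pretrained ResNet-101 weights via torchvision
python -c "import torchvision; torchvision.models.resnet101(pretrained=True)"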

VidSTG

To train on VidSTG, run:

python -m torch.distributed.launch --nproc_per_node=NUM_GPUS --use_env main.py --ema \
--load=pretrained_resnet101_checkpoint.pth --combine_datasets=vidstg --combine_datasets_val=vidstg \
--dataset_config config/vidstg.json --output-dir=OUTPUT_DIR

HC-STVG2.0

To train on HC-STVG2.0, run:

python -m torch.distributed.launch --nproc_per_node=NUM_GPUS --use_env main.py --ema \
--load=pretrained_resnet101_checkpoint.pth --combine_datasets=hcstvg --combine_datasets_val=hcstvg \
--v2 --dataset_config config/hcstvg.json --epochs=20 --output-dir=OUTPUT_DIR

HC-STVG1

To train on HC-STVG1, run:

python -m torch.distributed.launch --nproc_per_node=NUM_GPUS --use_env main.py --ema \
--load=pretrained_resnet101_checkpoint.pth --combine_datasets=hcstvg --combine_datasets_val=hcstvg \
--dataset_config config/hcstvg.json --epochs=40 --eval_skip=40 --output-dir=OUTPUT_DIR

Baselines

  • To remove time encoding, add --no_time_embed.
  • To remove the temporal self-attention in the space-time decoder, add --no_tsa.
  • To train from ImageNet initialization, pass an empty string to the argument --load and add --sted_loss_coef=5 --lr=2e-5 --text_encoder_lr=2e-5 --epochs=20 --lr_drop=20 for VidSTG or --epochs=60 --lr_drop=60 for HC-STVG1.
  • To train with a randomly initialized temporal self-attention, add --rd_init_tsa.
  • To train with a different spatial resolution (e.g. res=224) or temporal stride (e.g. k=2), add --resolution=224 or --stride=2.
  • To train with the slow-only variant, add --no_fast (see the example command after this list).
  • To train with alternative designs for the fast branch, add --fast=VARIANT.
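
For instance, the slow-only ablation on VidSTG combines the flags above with the training command (a sketch that reuses the VidSTG command from the Training section):

python -m torch.distributed.launch --nproc_per_node=NUM_GPUS --use_env main.py --ema \
--load=pretrained_resnet101_checkpoint.pth --combine_datasets=vidstg --combine_datasets_val=vidstg \
--dataset_config config/vidstg.json --output-dir=OUTPUT_DIR --no_fast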

Available Checkpoints

Training data             Parameters    URL    Size
MDETR init + VidSTG       k=4, res=352  Drive  3.0GB
MDETR init + VidSTG       k=2, res=224  Drive  3.0GB
ImageNet init + VidSTG    k=4, res=352  Drive  3.0GB
MDETR init + HC-STVG2.0   k=4, res=352  Drive  3.0GB
MDETR init + HC-STVG2.0   k=2, res=224  Drive  3.0GB
MDETR init + HC-STVG1     k=4, res=352  Drive  3.0GB
ImageNet init + HC-STVG1  k=4, res=352  Drive  3.0GB

Evaluation

To run evaluation only, use the same commands as for training and add --resume=CHECKPOINT --eval. To evaluate on the test set, also add --test (in this case, predictions and attention weights are saved as well).
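
For example, evaluating a VidSTG checkpoint on the test set would look like this (a sketch combining the VidSTG training command with the flags above):

python -m torch.distributed.launch --nproc_per_node=NUM_GPUS --use_env main.py --ema \
--load=pretrained_resnet101_checkpoint.pth --combine_datasets=vidstg --combine_datasets_val=vidstg \
--dataset_config config/vidstg.json --output-dir=OUTPUT_DIR --resume=CHECKPOINT --eval --test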

Spatio-Temporal Video Grounding Demo

You can also use a pretrained model to infer a spatio-temporal tube on a video of your choice (VIDEO_PATH, with optional START and END timestamps), given a natural language query of your choice (CAPTION), with the following command:

python demo_stvg.py --load=CHECKPOINT --caption_example CAPTION --video_example VIDEO_PATH --start_example=START --end_example=END --output-dir OUTPUT_PATH
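
For example (the checkpoint name, video file, query and timestamps below are placeholders, not files shipped with this repository):

python demo_stvg.py --load=checkpoint.pth --caption_example "a person riding a bike" --video_example my_video.mp4 --start_example=2 --end_example=10 --output-dir demo_output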

Note that we also host an online demo at this link; its code is available in server_stvg.py and server_stvg.html.

Acknowledgements

This codebase is built on the MDETR codebase. The code for video spatial data augmentation is inspired by torch_videovision.

Citation

If you found this work useful, consider giving this repository a star and citing our paper as follows:

@inproceedings{yang2022tubedetr,
  title={TubeDETR: Spatio-Temporal Video Grounding with Transformers},
  author={Yang, Antoine and Miech, Antoine and Sivic, Josef and Laptev, Ivan and Schmid, Cordelia},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2022}
}