PyTorch implementation of a collections of scalable Video Transformer Benchmarks.

Overview

PyTorch implementation of Video Transformer Benchmarks

This repository is mainly built upon Pytorch and Pytorch-Lightning. We wish to maintain a collections of scalable video transformer benchmarks, and discuss the training recipes of how to train a big video transformer model.

Now, we implement the TimeSformer and ViViT. And we have pre-trained the TimeSformer-B on Kinetics600, but still can't guarantee the performance reported in the paper. However, we find some relevant hyper-parameters which may help us to reach the target performance.

Table of Contents

  1. Difference
  2. TODO
  3. Setup
  4. Usage
  5. Result
  6. Acknowledge
  7. Contribution

Difference

In order to share the basic divided spatial-temporal attention module to different video transformer, we make some changes in the following apart.

1. Position embedding

We split the position embedding from R(nt*h*w×d) mentioned in the ViViT paper into R(nh*w×d) and R(nt×d) to stay the same as TimeSformer.

2. Class token

In order to make clear whether to add the class_token into the module forward computation, we only compute the interaction between class_token and query when the current layer is the last layer (except FFN) of each transformer block.

3. Initialize from the pre-trained model

  • Tokenization: the token embedding filter can be chosen either Conv2D or Conv3D, and the initializing weights of Conv3D filters from Conv2D can be replicated along temporal dimension and averaging them or initialized with zeros along the temporal positions except at the center t/2.
  • Temporal MSA module weights: one can choose to copy the weights from spatial MSA module or initialize all weights with zeros.
  • Initialize from the MAE pre-trained model provided by ZhiLiang, where the class_token that does not appear in the MAE pre-train model is initialized from truncated normal distribution.
  • Initialize from the ViT pre-trained model can be found here.

TODO

  • add more TimeSformer and ViViT variants pre-trained weights.
    • A larger version and other operation types.
  • add linear prob and partial fine-tune.
    • Make available to transfer the pre-trained model to downstream task.
  • add more scalable Video Transformer benchmarks.
    • We will also extend to multi-modality version, e.g Perceiver is coming soon.
  • add more diverse objective functions.
    • Pre-train on larger dataset through the dominated self-supervised methods, e.g Contrastive Learning and MAE.

Setup

pip install -r requirements.txt

Usage

Training

# path to Kinetics600 train set
TRAIN_DATA_PATH='/path/to/Kinetics600/train_list.txt'
# path to root directory
ROOT_DIR='/path/to/work_space'

python model_pretrain.py \
	-lr 0.005 \
	-pretrain 'vit' \
	-epoch 15 \
	-batch_size 8 \
	-num_class 600 \
	-frame_interval 32 \
	-root_dir ROOT_DIR \
	-train_data_path TRAIN_DATA_PATH

The minimal folder structure will look like as belows.

root_dir
├── pretrain_model
│   ├── pretrain_mae_vit_base_mask_0.75_400e.pth
│   ├── vit_base_patch16_224.pth
├── results
│   ├── experiment_tag
│   │   ├── ckpt
│   │   ├── log

Inference

# path to Kinetics600 pre-trained model
PRETRAIN_PATH='/path/to/pre-trained model'
# path to the test video sample
VIDEO_PATH='/path/to/video sample'

python model_inference.py \
	-pretrain PRETRAIN_PATH \
	-video_path VIDEO_PATH \
	-num_frames 8 \
	-frame_interval 32 \

Result

Kinetics-600

1. Model Zoo

name pretrain epochs num frames spatial crop top1_acc top5_acc weight log
TimeSformer-B ImageNet-21K 15e 8 224 78.4 93.6 Google drive or BaiduYun(code: yr4j) log

2. Train Recipe(ablation study)

2.1 Acc

operation top1_acc top5_acc top1_acc (three crop)
base 68.2 87.6 -
+ frame_interval 4 -> 16 (span more time) 72.9(+4.7) 91.0(+3.4) -
+ RandomCrop, flip (overcome overfit) 75.7(+2.8) 92.5(+1.5) -
+ batch size 16 -> 8 (more iterations) 75.8(+0.1) 92.4(-0.1) -
+ frame_interval 16 -> 24 (span more time) 77.7(+1.9) 93.3(+0.9) 78.4
+ frame_interval 24 -> 32 (span more time) 78.4(+0.7) 94.0(+0.7) 79.1

tips: frame_interval and data augment counts for the validation accuracy.


2.2 Time

operation epoch_time
base (start with DDP) 9h+
+ speed up training recipes 1h+
+ switch from get_batch first to sample_Indice first 0.5h
+ batch size 16 -> 8 33.32m
+ num_workers 8 -> 4 35.52m
+ frame_interval 16 -> 24 44.35m

tips: Improve the frame_interval will drop a lot on time performance.

1.speed up training recipes:

  • More GPU device.
  • pin_memory=True.
  • Avoid CPU->GPU Device transfer (such as .item(), .numpy(), .cpu() operations on tensor or log to disk).

2.get_batch first means that we firstly read all frames through the video reader, and then get the target slice of frames, so it largely slow down the data-loading speed.


Acknowledge

this repo is built on top of Pytorch-Lightning, decord and kornia. I also learn many code designs from MMaction2. I thank the authors for releasing their code.

Contribution

I look forward to seeing one can provide some ideas about the repo, please feel free to report it in the issue, or even better, submit a pull request.

And your star is my motivation, thank u~

Owner
Xin Ma
Xin Ma
DNA-RECON { Automatic Web Reconnaissance Tool }

ABOUT TOOL : DNA-RECON is an automatic web reconnaissance tool written in python. This tool made for reconnaissance and information gathering with an

NIKUNJ BHATT 25 Aug 11, 2021
CRLT: A Unified Contrastive Learning Toolkit for Unsupervised Text Representation Learning

CRLT: A Unified Contrastive Learning Toolkit for Unsupervised Text Representation Learning This repository contains the code and relevant instructions

XiaoMing 5 Aug 19, 2022
Virtual hand gesture mouse using a webcam

NonMouse 日本語のREADMEはこちら This is an application that allows you to use your hand itself as a mouse. The program uses a web camera to recognize your han

Yuki Takeyama 55 Jan 01, 2023
Official pytorch implementation of paper "Inception Convolution with Efficient Dilation Search" (CVPR 2021 Oral).

IC-Conv This repository is an official implementation of the paper Inception Convolution with Efficient Dilation Search. Getting Started Download Imag

Jie Liu 111 Dec 31, 2022
A framework for Quantification written in Python

QuaPy QuaPy is an open source framework for quantification (a.k.a. supervised prevalence estimation, or learning to quantify) written in Python. QuaPy

41 Dec 14, 2022
Domain Generalization with MixStyle, ICLR'21.

MixStyle This repo contains the code of our ICLR'21 paper, "Domain Generalization with MixStyle". The OpenReview link is https://openreview.net/forum?

Kaiyang 208 Dec 28, 2022
PyTorch Implementation for Deep Metric Learning Pipelines

Easily Extendable Basic Deep Metric Learning Pipeline Karsten Roth ([email 

Karsten Roth 543 Jan 04, 2023
Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers

Segmentation Transformer Implementation of Segmentation Transformer in PyTorch, a new model to achieve SOTA in semantic segmentation while using trans

Abhay Gupta 161 Dec 08, 2022
Code for the paper "Adversarial Generator-Encoder Networks"

This repository contains code for the paper "Adversarial Generator-Encoder Networks" (AAAI'18) by Dmitry Ulyanov, Andrea Vedaldi, Victor Lempitsky. Pr

Dmitry Ulyanov 279 Jun 26, 2022
Plug and play transformer you can find network structure and official complete code by clicking List

Plug-and-play Module Plug and play transformer you can find network structure and official complete code by clicking List The following is to quickly

8 Mar 27, 2022
Code for "MetaMorph: Learning Universal Controllers with Transformers", Gupta et al, ICLR 2022

MetaMorph: Learning Universal Controllers with Transformers This is the code for the paper MetaMorph: Learning Universal Controllers with Transformers

Agrim Gupta 50 Jan 03, 2023
Material for my PyConDE & PyData Berlin 2022 Talk "5 Steps to Speed Up Your Data-Analysis on a Single Core"

5 Steps to Speed Up Your Data-Analysis on a Single Core Material for my talk at the PyConDE & PyData Berlin 2022 Description Your data analysis pipeli

Jonathan Striebel 9 Dec 12, 2022
Official PyTorch implementation of "Camera Distance-aware Top-down Approach for 3D Multi-person Pose Estimation from a Single RGB Image", ICCV 2019

PoseNet of "Camera Distance-aware Top-down Approach for 3D Multi-person Pose Estimation from a Single RGB Image" Introduction This repo is official Py

Gyeongsik Moon 677 Dec 25, 2022
[CVPR 2022 Oral] Rethinking Minimal Sufficient Representation in Contrastive Learning

Rethinking Minimal Sufficient Representation in Contrastive Learning PyTorch implementation of Rethinking Minimal Sufficient Representation in Contras

36 Nov 23, 2022
MATLAB codes of the book "Digital Image Processing Fourth Edition" converted to Python

Digital Image Processing Python MATLAB codes of the book "Digital Image Processing Fourth Edition" converted to Python TO-DO: Refactor scripts, curren

Merve Noyan 24 Oct 16, 2022
On Out-of-distribution Detection with Energy-based Models

On Out-of-distribution Detection with Energy-based Models This repository contains the code for the experiments conducted in the paper On Out-of-distr

Sven 19 Aug 07, 2022
The Agriculture Domain of ERPNext comes with features to record crops and land

Agriculture The Agriculture Domain of ERPNext comes with features to record crops and land, track plant, soil, water, weather analytics, and even trac

Frappe 21 Jan 02, 2023
Tool for installing and updating MiSTer cores and other files

MiSTer Downloader This tool installs and updates all the cores and other extra files for your MiSTer. It also updates the menu core, the MiSTer firmwa

72 Dec 24, 2022
Learning to Draw: Emergent Communication through Sketching

Learning to Draw: Emergent Communication through Sketching This is the official code for the paper "Learning to Draw: Emergent Communication through S

19 Jul 22, 2022
Unofficial PyTorch Implementation of AHDRNet (CVPR 2019)

AHDRNet-PyTorch This is the PyTorch implementation of Attention-guided Network for Ghost-free High Dynamic Range Imaging (CVPR 2019). The official cod

Yutong Zhang 4 Sep 08, 2022