PyTorch implementation of our paper: Decoupling and Recoupling Spatiotemporal Representation for RGB-D-based Motion Recognition

Overview

Decoupling and Recoupling Spatiotemporal Representation for RGB-D-based Motion Recognition, arxiv

This is a PyTorch implementation of our paper.

1. Requirements

torch>=1.7.0; torchvision>=0.8.0; Visdom(optional)

data prepare: Database with the following folder structure:

│NTURGBD/
├──dataset_splits/
│  ├── @CS
│  │   ├── train.txt
                video name               total frames    label
│  │   │    ├──S001C001P001R001A001_rgb      103          0 
│  │   │    ├──S001C001P001R001A004_rgb      99           3 
│  │   │    ├──...... 
│  │   ├── valid.txt
│  ├── @CV
│  │   ├── train.txt
│  │   ├── valid.txt
├──Images/
│  │   ├── S001C002P001R001A002_rgb
│  │   │   ├──000000.jpg
│  │   │   ├──000001.jpg
│  │   │   ├──......
├──nturgb+d_depth_masked/
│  │   ├── S001C002P001R001A002
│  │   │   ├──MDepth-00000000.png
│  │   │   ├──MDepth-00000001.png
│  │   │   ├──......

It is important to note that due to the RGB video resolution in the NTU dataset is relatively high, so we are not directly to resize the image from the original resolution to 320x240, but first crop the object-centered ROI area (640x480), and then resize it to 320x240 for training and testing.

2. Methodology

We propose to decouple and recouple spatiotemporal representation for RGB-D-based motion recognition. The Figure in the first line illustrates the proposed multi-modal spatiotemporal representation learning framework. The RGB-D-based motion recognition can be described as spatiotemporal information decoupling modeling, compact representation recoupling learning, and cross-modal representation interactive learning. The Figure in the second line shows the process of decoupling and recoupling saptiotemporal representation of a unimodal data.

3. Train and Evaluate

All of our models are pre-trained on the 20BN Jester V1 dataset and the pretrained model can be download here. Before cross-modal representation interactive learning, we first separately perform unimodal representation learning on RGB and depth data modalities.

Unimodal Training

Take training an RGB model with 8 GPUs on the NTU-RGBD dataset as an example, some basic configuration:

common:
  dataset: NTU 
  batch_size: 6
  test_batch_size: 6
  num_workers: 6
  learning_rate: 0.01
  learning_rate_min: 0.00001
  momentum: 0.9
  weight_decay: 0.0003
  init_epochs: 0
  epochs: 100
  optim: SGD
  scheduler:
    name: cosin                     # Represent decayed learning rate with the cosine schedule
    warm_up_epochs: 3 
  loss:
    name: CE                        # cross entropy loss function
    labelsmooth: True
  MultiLoss: True                   # Enable multi-loss training strategy.
  loss_lamdb: [ 1, 0.5, 0.5, 0.5 ]  # The loss weight coefficient assigned for each sub-branch.
  distill: 1.                       # The loss weight coefficient assigned for distillation task.

model:
  Network: I3DWTrans                # I3DWTrans represent unimodal training, set FusionNet for multi-modal fusion training.
  sample_duration: 64               # Sampled frames in a video.
  sample_size: 224                  # The image is croped into 224x224.
  grad_clip: 5.
  SYNC_BN: 1                        # Utilize SyncBatchNorm.
  w: 10                             # Sliding window size.
  temper: 0.5                       # Distillation temperature setting.
  recoupling: True                  # Enable recoupling strategy during training.
  knn_attention: 0.7                # Hyperparameter used in k-NN attention: selecting Top-70% tokens.
  sharpness: True                   # Enable sharpness for each sub-branch's output.
  temp: [ 0.04, 0.07 ]              # Temperature parameter follows a cosine schedule from 0.04 to 0.07 during the training.
  frp: True                         # Enable FRP module.
  SEHeads: 1                        # Number of heads used in RCM module.
  N: 6                              # Number of Transformer blochs configured for each sub-branch.

dataset:
  type: M                           # M: RGB modality, K: Depth modality.
  flip: 0.5                         # Horizontal flip.
  rotated: 0.5                      # Horizontal rotation
  angle: (-10, 10)                  # Rotation angle
  Blur: False                       # Enable random blur operation for each video frame.
  resize: (320, 240)                # The input is spatially resized to 320x240 for NTU dataset.
  crop_size: 224                
  low_frames: 16                    # Number of frames sampled for small Transformer.       
  media_frames: 32                  # Number of frames sampled for medium Transformer.  
  high_frames: 48                   # Number of frames sampled for large Transformer.
bash run.sh tools/train.py config/NTU.yml 0,1,2,3,4,5,6,7 8

or

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m torch.distributed.launch --nproc_per_node=8 --master_port=1234 train.py --config config/NTU.yml --nprocs 8  

Cross-modal Representation Interactive Learning

Take training a fusion model with 8 GPUs on the NTU-RGBD dataset as an example.

bash run.sh tools/fusion.py config/NTU.yml 0,1,2,3,4,5,6,7 8

or

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m torch.distributed.launch --nproc_per_node=8 --master_port=1234 fusion.py --config config/NTU.yml --nprocs 8  

Evaluation

CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 --master_port=1234 train.py --config config/NTU.yml --nprocs 1 --eval_only --resume /path/to/model_best.pth.tar 

4. Models Download

Dataset Modality Accuracy Download
NvGesture RGB 89.58 Google Drive
NvGesture Depth 90.62 Google Drive
NvGesture RGB-D 91.70 Google Drive
THU-READ RGB 81.25 Google Drive
THU-READ Depth 77.92 Google Drive
THU-READ RGB-D 87.04 Google Drive
NTU-RGBD(CS) RGB 90.3 Google Drive
NTU-RGBD(CS) Depth 92.7 Google Drive
NTU-RGBD(CS) RGB-D 94.2 Google Drive
NTU-RGBD(CV) RGB 95.4 Google Drive
NTU-RGBD(CV) Depth 96.2 Google Drive
NTU-RGBD(CV) RGB-D 97.3 Google Drive
IsoGD RGB 60.87 Google Drive
IsoGD Depth 60.17 Google Drive
IsoGD RGB-D 66.79 Google Drive

Citation

@inproceedings{zhou2021DRSR,
      title={Decoupling and Recoupling Spatiotemporal Representation for RGB-D-based Motion Recognition}, 
      author={Benjia Zhou and Pichao Wang and Jun Wan and Yanyan Liang and Fan Wang and Du Zhang and Zhen Lei and Hao Li and Rong Jin},
      journal={arXiv preprint arXiv:2112.09129},
      year={2021},
}

LICENSE

The code is released under the MIT license.

Copyright

Copyright (C) 2010-2021 Alibaba Group Holding Limited.

Owner
DamoCV
CV team of DAMO academy
DamoCV
Kaggleship: Kaggle Notebooks

Kaggleship: Kaggle Notebooks This repository contains my Kaggle notebooks. They are generally about data science, machine learning, and deep learning.

Erfan Sobhaei 1 Jan 25, 2022
A python library to build Model Trees with Linear Models at the leaves.

A python library to build Model Trees with Linear Models at the leaves.

Marco Cerliani 212 Dec 30, 2022
Code for MentorNet: Learning Data-Driven Curriculum for Very Deep Neural Networks

MentorNet: Learning Data-Driven Curriculum for Very Deep Neural Networks This is the code for the paper: MentorNet: Learning Data-Driven Curriculum fo

Google 302 Dec 23, 2022
Official PyTorch code for the paper: "Point-Based Modeling of Human Clothing" (ICCV 2021)

Point-Based Modeling of Human Clothing Paper | Project page | Video This is an official PyTorch code repository of the paper "Point-Based Modeling of

Visual Understanding Lab @ Samsung AI Center Moscow 64 Nov 22, 2022
Learning Saliency Propagation for Semi-supervised Instance Segmentation

Learning Saliency Propagation for Semi-supervised Instance Segmentation PyTorch Implementation This repository contains: the PyTorch implementation of

Berkeley DeepDrive 68 Oct 18, 2022
Feature extraction made simple with torchextractor

torchextractor: PyTorch Intermediate Feature Extraction Introduction Too many times some model definitions get remorselessly copy-pasted just because

Antoine Broyelle 89 Oct 31, 2022
Denoising images with Fourier Ring Correlation loss

Denoising images with Fourier Ring Correlation loss The python code accompanies the working manuscript Image quality measurements and denoising using

2 Mar 12, 2022
Code & Data for Enhancing Photorealism Enhancement

Enhancing Photorealism Enhancement Stephan R. Richter, Hassan Abu AlHaija, Vladlen Koltun Paper | Website (with side-by-side comparisons) | Video (Pap

Intelligent Systems Lab Org 1.1k Dec 31, 2022
CS583: Deep Learning

CS583: Deep Learning

Shusen Wang 2.6k Dec 30, 2022
MHFormer: Multi-Hypothesis Transformer for 3D Human Pose Estimation

MHFormer: Multi-Hypothesis Transformer for 3D Human Pose Estimation This repo is the official implementation of "MHFormer: Multi-Hypothesis Transforme

Vegetabird 281 Jan 07, 2023
Implementation of Heterogeneous Graph Attention Network

HetGAN Implementation of Heterogeneous Graph Attention Network This is the code repository of paper "Prediction of Metro Ridership During the COVID-19

5 Dec 28, 2021
This reporistory contains the test-dev data of the paper "xGQA: Cross-lingual Visual Question Answering".

This reporistory contains the test-dev data of the paper "xGQA: Cross-lingual Visual Question Answering".

AdapterHub 18 Dec 09, 2022
Some experiments with tennis player aging curves using Hilbert space GPs in PyMC. Only experimental for now.

NOTE: This is still being developed! Setup notes This document uses Jeff Sackmann's tennis data. You can obtain it as follows: git clone https://githu

Martin Ingram 1 Jan 20, 2022
Code repository for paper `Skeleton Merger: an Unsupervised Aligned Keypoint Detector`.

Skeleton Merger Skeleton Merger, an Unsupervised Aligned Keypoint Detector. The paper is available at https://arxiv.org/abs/2103.10814. A map of the r

北海若 48 Nov 14, 2022
Official Repository for the ICCV 2021 paper "PixelSynth: Generating a 3D-Consistent Experience from a Single Image"

PixelSynth: Generating a 3D-Consistent Experience from a Single Image (ICCV 2021) Chris Rockwell, David F. Fouhey, and Justin Johnson [Project Website

Chris Rockwell 95 Nov 22, 2022
Homepage of paper: Paint Transformer: Feed Forward Neural Painting with Stroke Prediction, ICCV 2021.

Paint Transformer: Feed Forward Neural Painting with Stroke Prediction [Paper] [PaddlePaddle Implementation] Homepage of paper: Paint Transformer: Fee

442 Dec 16, 2022
A stock generator that assess a list of stocks and returns the best stocks for investing and money allocations based on users choices of volatility, duration and number of stocks

Stock-Generator Please visit "Stock Generator.ipynb" for a clearer view and "Stock Generator.py" for scripts. The stock generator is designed to allow

jmengnyay 1 Aug 02, 2022
Official code for "Simpler is Better: Few-shot Semantic Segmentation with Classifier Weight Transformer. ICCV2021".

Simpler is Better: Few-shot Semantic Segmentation with Classifier Weight Transformer. ICCV2021. Introduction We proposed a novel model training paradi

Lucas 103 Dec 14, 2022
PyTorch implementation of Rethinking Positional Encoding in Language Pre-training

TUPE PyTorch implementation of Rethinking Positional Encoding in Language Pre-training. Quickstart Clone this repository. git clone https://github.com

Jake Tae 5 Jan 27, 2022
Official pytorch implementation of paper "Image-to-image Translation via Hierarchical Style Disentanglement".

HiSD: Image-to-image Translation via Hierarchical Style Disentanglement Official pytorch implementation of paper "Image-to-image Translation

364 Dec 14, 2022