PyTorch implementation for our NeurIPS 2021 Spotlight paper "Long Short-Term Transformer for Online Action Detection".

Last update: Dec 16, 2022

Overview

Long Short-Term Transformer for Online Action Detection

Introduction

This is a PyTorch implementation for our NeurIPS 2021 Spotlight paper "Long Short-Term Transformer for Online Action Detection".

Environment

The code is developed with CUDA 10.2, Python >= 3.7.7, and PyTorch >= 1.7.1.
1. [Optional but recommended] create a new conda environment.
```
conda create -n lstr python=3.7.7
```
  And activate the environment.
```
conda activate lstr
```
2. Install the requirements
```
pip install -r requirements.txt
```

Data Preparation

Download the THUMOS'14 and TVSeries datasets.
Extract feature representations for video frames.
- For ActivityNet pretrained features, we use the ResNet-50 model for the RGB and optical flow inputs. We recommend using this checkpoint in MMAction2.
- For Kinetics pretrained features, we use the ResNet-50 model for the RGB inputs. We recommend using this checkpoint in MMAction2. We use the BN-Inception model for the optical flow inputs. We recommend using the model here.
Note: We compute the optical flow using DenseFlow.

If you want to use our dataloaders, please arrange the files in the following structure:

THUMOS'14 dataset:

$YOUR_PATH_TO_THUMOS_DATASET
├── rgb_kinetics_resnet50/
│   ├── video_validation_0000051.npy (of size L x 2048)
│   ├── ...
├── flow_kinetics_bninception/
│   ├── video_validation_0000051.npy (of size L x 1024)
│   ├── ...
├── target_perframe/
│   ├── video_validation_0000051.npy (of size L x 22)
│   ├── ...

TVSeries dataset:

$YOUR_PATH_TO_TVSERIES_DATASET
├── rgb_kinetics_resnet50/
│   ├── Breaking_Bad_ep1.npy (of size L x 2048)
│   ├── ...
├── flow_kinetics_bninception/
│   ├── Breaking_Bad_ep1.npy (of size L x 1024)
│   ├── ...
├── target_perframe/
│   ├── Breaking_Bad_ep1.npy (of size L x 31)
│   ├── ...
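
Before training, it can help to confirm that every video has a file in all three directories. The following is a minimal sketch using only the standard library; the helper name `find_incomplete_videos` is hypothetical and not part of this repository, and the directory names are the ones from the layouts above.

```python
import os

# Directory names taken from the layouts above.
FEATURE_DIRS = ("rgb_kinetics_resnet50", "flow_kinetics_bninception", "target_perframe")


def find_incomplete_videos(root):
    """Return sorted names of videos missing from at least one directory.

    Hypothetical helper for illustration; it only compares file names and
    does not load the arrays or check their shapes.
    """
    names_per_dir = []
    for sub in FEATURE_DIRS:
        folder = os.path.join(root, sub)
        files = os.listdir(folder) if os.path.isdir(folder) else []
        # Strip the ".npy" suffix to get the video name.
        names_per_dir.append({f[:-4] for f in files if f.endswith(".npy")})
    all_names = set().union(*names_per_dir)
    complete = set.intersection(*names_per_dir)
    return sorted(all_names - complete)
```

For example, `find_incomplete_videos("data/THUMOS")` would return an empty list when the three directories contain the same set of videos.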

Create softlinks of datasets:

cd long-short-term-transformer
ln -s $YOUR_PATH_TO_THUMOS_DATASET data/THUMOS
ln -s $YOUR_PATH_TO_TVSERIES_DATASET data/TVSeries

Training

Training LSTR with a 512-second long-term memory and an 8-second short-term memory requires less than 3 GB of GPU memory.

The commands are as follows.

cd long-short-term-transformer
# Training from scratch
python tools/train_net.py --config_file $PATH_TO_CONFIG_FILE --gpu $CUDA_VISIBLE_DEVICES
# Finetuning from a pretrained model
python tools/train_net.py --config_file $PATH_TO_CONFIG_FILE --gpu $CUDA_VISIBLE_DEVICES \
    MODEL.CHECKPOINT $PATH_TO_CHECKPOINT

Online Inference

There are three kinds of evaluation methods in our code.

First, you can use the config SOLVER.PHASES "['train', 'test']" during training. This process divides each test video into non-overlapping samples and makes predictions on all the frames in the short-term memory as if they were the latest frame. Note that this evaluation result is not the final performance, since (1) for most of the frames, their short-term memory is not fully utilized and (2) for simplicity, samples at the boundaries are mostly ignored.
```
cd long-short-term-transformer
# Inference along with training
python tools/train_net.py --config_file $PATH_TO_CONFIG_FILE --gpu $CUDA_VISIBLE_DEVICES \
    SOLVER.PHASES "['train', 'test']"
```
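
To illustrate the idea of non-overlapping samples, here is a minimal sketch that splits a video into consecutive windows. The function name and the choice to drop the trailing partial window are assumptions made for this example; the repository's dataloader may handle boundaries differently.

```python
def non_overlapping_windows(num_frames, short_len):
    """Split frame indices [0, num_frames) into consecutive windows.

    Each window is a half-open (start, end) pair covering short_len frames.
    The trailing partial window is dropped (an assumption of this sketch).
    """
    if short_len <= 0:
        raise ValueError("short_len must be positive")
    return [(start, start + short_len)
            for start in range(0, num_frames - short_len + 1, short_len)]
```

With 10 frames and a window of 4 frames, this yields the windows `(0, 4)` and `(4, 8)`, and the last two frames are left out.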
Second, you can run the online inference in batch mode. This process evaluates all video frames by considering each of them as the latest frame and filling the long- and short-term memories by tracing back in time. Note that this evaluation result matches the numbers reported in the paper, but batch mode cannot be further accelerated as described in Sec. 3.6 of the paper. On the other hand, this mode runs faster when you use a large batch size, and we recommend using it for performance benchmarking.
```
cd long-short-term-transformer
# Online inference in batch mode
python tools/test_net.py --config_file $PATH_TO_CONFIG_FILE --gpu $CUDA_VISIBLE_DEVICES \
    MODEL.CHECKPOINT $PATH_TO_CHECKPOINT MODEL.LSTR.INFERENCE_MODE batch
```
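
The "tracing back in time" step can be sketched as index arithmetic. The function below is illustrative only: its name, the frame-based lengths, and the use of `None` for positions before the start of the video are assumptions, not the repository's implementation.

```python
def memory_indices(t, long_len, short_len):
    """Return (long_idx, short_idx) frame indices for the memories ending at frame t.

    The short-term memory covers the short_len most recent frames up to and
    including t; the long-term memory covers the long_len frames before that.
    Positions before the start of the video are marked with None.
    """
    def to_valid(i):
        return i if i >= 0 else None

    short_start = t - short_len + 1
    long_start = short_start - long_len
    short_idx = [to_valid(i) for i in range(short_start, t + 1)]
    long_idx = [to_valid(i) for i in range(long_start, short_start)]
    return long_idx, short_idx
```

Because each frame's memories depend only on `t`, many frames can be gathered into one batch, which is why a large batch size speeds up this mode.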
Third, you can run the online inference in stream mode. This process tests frame by frame along the entire video, from beginning to end. Note that this evaluation result matches both LSTR's performance and runtime reported in the paper. It processes the entire video the way LSTR would be applied in real-world scenarios. However, it currently supports testing only one video at a time.
```
cd long-short-term-transformer
# Online inference in stream mode
python tools/test_net.py --config_file $PATH_TO_CONFIG_FILE --gpu $CUDA_VISIBLE_DEVICES \
    MODEL.CHECKPOINT $PATH_TO_CHECKPOINT MODEL.LSTR.INFERENCE_MODE stream DATA.TEST_SESSION_SET "['$VIDEO_NAME']"
```
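
Frame-by-frame processing can be pictured as two first-in-first-out queues that are updated as each frame arrives. The class below is a rough sketch under that assumption; `StreamingMemory` is a hypothetical name and does not correspond to a class in this repository.

```python
from collections import deque


class StreamingMemory:
    """Keep the most recent frames in two FIFO queues (illustrative sketch)."""

    def __init__(self, long_len, short_len):
        if long_len <= 0 or short_len <= 0:
            raise ValueError("memory lengths must be positive")
        self.long = deque(maxlen=long_len)
        self.short = deque(maxlen=short_len)

    def push(self, feature):
        # When the short-term queue is full, its oldest frame moves into
        # the long-term queue before the new frame is appended.
        if len(self.short) == self.short.maxlen:
            self.long.append(self.short[0])
        self.short.append(feature)
```

Each call to `push` does a constant amount of queue work, and the oldest long-term frame is discarded automatically once the queue reaches `long_len`.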

Evaluation

Evaluate LSTR's performance for online action detection using per-frame mAP or mcAP.

cd long-short-term-transformer
python tools/eval/eval_perframe --pred_scores_file $PRED_SCORES_FILE
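
For readers unfamiliar with the two metrics, the sketch below computes per-class average precision from per-frame scores, with an optional calibrated variant in which false positives are down-weighted by the negative-to-positive ratio. mAP and mcAP are then the means of these values over the action classes. This is an illustrative pure-Python version written for this README, not the code in `tools/eval`; the function name and the handling of edge cases are assumptions.

```python
def average_precision(scores, labels, calibrated=False):
    """Average precision of one class from per-frame scores and 0/1 labels.

    With calibrated=True, precision is computed as TP / (TP + FP / w),
    where w is the ratio of negative to positive frames.
    """
    num_pos = sum(labels)
    num_neg = len(labels) - num_pos
    if num_pos == 0:
        return 0.0  # no positive frames: AP is undefined, report 0 here
    if num_neg == 0:
        return 1.0  # every frame is positive, so precision is always 1
    w = num_neg / num_pos if calibrated else 1.0
    # Rank frames from the highest score to the lowest.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    total = 0.0
    for i in order:
        if labels[i]:
            tp += 1
            total += tp / (tp + fp / w)
        else:
            fp += 1
    return total / num_pos
```

The calibrated variant makes results comparable across classes with very different amounts of background frames.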

Evaluate LSTR's performance at different action stages by evaluating each decile (ten-percent interval) of the video frames separately.

cd long-short-term-transformer
python tools/eval/eval_perstage --pred_scores_file $PRED_SCORES_FILE
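
The decile assignment can be sketched as follows, assuming the ten-percent intervals are taken over the frames of each action instance. The function name is hypothetical and the repository's script may bin frames differently.

```python
def decile_index(position, length):
    """Map a frame position inside an action instance to a decile in 0..9.

    position is the zero-based offset from the first frame of the instance,
    and length is the number of frames in the instance.
    """
    if not 0 <= position < length:
        raise ValueError("position must lie inside the action instance")
    return position * 10 // length
```

Frames with the same decile index across all instances would then be evaluated together, giving one score per action stage.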

Citations

If you are using the data/code/model provided here in a publication, please cite our paper:

@inproceedings{xu2021long,
	title={Long Short-Term Transformer for Online Action Detection},
	author={Xu, Mingze and Xiong, Yuanjun and Chen, Hao and Li, Xinyu and Xia, Wei and Tu, Zhuowen and Soatto, Stefano},
	booktitle={Conference on Neural Information Processing Systems (NeurIPS)},
	year={2021}
}

Security

See CONTRIBUTING for more information.

License

This project is licensed under the Apache-2.0 License.
