TSP: Temporally-Sensitive Pretraining of Video Encoders for Localization Tasks

Last update: Dec 21, 2022

Related tags

Deep Learning TSP

Overview

TSP: Temporally-Sensitive Pretraining of Video Encoders for Localization Tasks

[Paper] [Project Website]

This repository holds the source code, pretrained models, and pre-extracted features for the TSP method.

Please cite this work if you find TSP useful for your research.

@article{alwassel2020tsp,
  title={TSP: Temporally-Sensitive Pretraining of Video Encoders for Localization Tasks},
  author={Alwassel, Humam and Giancola, Silvio and Ghanem, Bernard},
  journal={arXiv preprint arXiv:2011.11479},
  year={2020}
}

Pre-extracted TSP Features

We provide pre-extracted features for ActivityNet v1.3 and THUMOS14 videos. The feature files are saved in H5 format, where we map each video-name to a features tensor of size N x 512, where N is the number of features and 512 is the feature size. Use h5py python package to read the feature files. Not familiar with H5 files or h5py? here is a quick start guide.

For ActivityNet v1.3 dataset

Download: [train subset] [valid subset] [test subset]

Details: The features are extracted from the R(2+1)D-34 encoder pretrained with TSP on ActivityNet (released model) using clips of 16 frames at a frame rate of 15 fps and a stride of 16 frames (i.e., non-overlapping clips). This gives one feature vector per 16/15 ~= 1.067 seconds.

For THUMOS14 dataset

Download: [valid subset] [test subset]

Details: The features are extracted from the R(2+1)D-34 encoder pretrained with TSP on THUMOS14 (released model) using clips of 16 frames at a frame rate of 15 fps and a stride of 1 frame (i.e., dense overlapping clips). This gives one feature vector per 1/15 ~= 0.067 seconds.

Setup

Clone this repository and create the conda environment.

git clone https://github.com/HumamAlwassel/TSP.git
cd TSP
conda env create -f environment.yml
conda activate tsp

Data Preprocessing

Follow the instructions here to download and preprocess the input data.

Training

We provide training scripts for the TSP models and the TAC baselines here.

Feature Extraction

You can extract features from released pretrained models or from local checkpoints using the scripts here.

Acknowledgment: Our source code borrows implementation ideas from pytorch/vision and facebookresearch/VMZ repositories.

Comments

LOSS does not decrease during training

My data set is small, 1500 videos, all under 10 seconds in length. The current training results of this model are as follows:

The experimental Settings adopted are: Batch_size=32,FACTOR=2. Is such a situation normal? If it is abnormal, what should be done?

opened by ZChengLong578 5
H5 files generated about GVF features

Hi, @HumamAlwassel Thanks for your excellent work and for sharing the code. When I was training my dataset, I read your explanation on GVF feature generation. Do I need to combine .pkl files generated by the training set and valid set into .h5 files when I go to step 3?

opened by ZChengLong578 5
The LOSS value is too large and does not decrease

Hi, @HumamAlwassel, I'm sorry to bother you again. I did it without or very little background (no action). Now I have added more background (no Action), but the LOSS value is very large and does not decrease. The specific situation is shown in the following figure: Here are the files for the training set and validation set: What can I do to solve this problem?

opened by ZChengLong578 3
Use the pretraining model to train other datasets

Hi, @HumamAlwassel After downloading the pre-training model as you said, I overwrote the value of epoch to 0. The following changes were then made in the code: I would like you to take a look, is the change I made in the code correct? Or should I replace the initial tac-on-kinetics Pretrained weights with this instead of using it in the resume?

opened by ZChengLong578 2
Inference unseen video using pretrained model

Hi @HumamAlwassel, Thanks for your excellent work. I really appreciated it. I've trained your work on my own dataset. However, I am thinking about how to use trained model to inference unseen videos. Could you give me some examples that export result of a video such as action label and its start or end time.

Best regards,

opened by t2kien 2
Data sampling problems

Hi, @HumamAlwassel I'm sorry to trouble you again. The duration of my dataset action was short and many partitions were removed, as shown below: However, after observation, I find that it does not seem to be the problem with the length of the video. Actions with a length of 0-1.5 seconds are in the video, but actions with a length of 1.5-3 seconds are not in the video. Why is this?

opened by ZChengLong578 2
$RuntimeError(f'<UntrimmedVideoDataset>: got clip of length {vframes.shape[0]} != {self.clip_length}.'$

RuntimeError(f': got clip of length {vframes.shape[0]} != {self.clip_length}.'

Traceback (most recent call last): File "train.py", line 290, in <module> main(args) File "train.py", line 260, in main train_one_epoch(model=model, criterion=criterion, optimizer=optimizer, lr_scheduler=lr_scheduler, File "train.py", line 63, in train_one_epoch for sample in metric_logger.log_every(data_loader, print_freq, header, device=device): File "/media/bruce/2T/projects/TSP/train/../common/utils.py", line 137, in log_every for obj in iterable: File "/home/bruce/anaconda2/envs/tsp/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 345, in __next__ data = self._next_data() File "/home/bruce/anaconda2/envs/tsp/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 856, in _next_data return self._process_data(data) File "/home/bruce/anaconda2/envs/tsp/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 881, in _process_data data.reraise() File "/home/bruce/anaconda2/envs/tsp/lib/python3.8/site-packages/torch/_utils.py", line 394, in reraise raise self.exc_type(msg) RuntimeError: Caught RuntimeError in DataLoader worker process 0. Original Traceback (most recent call last): File "/home/bruce/anaconda2/envs/tsp/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 178, in _worker_loop data = fetcher.fetch(index) File "/home/bruce/anaconda2/envs/tsp/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch data = [self.dataset[idx] for idx in possibly_batched_index] File "/home/bruce/anaconda2/envs/tsp/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp> data = [self.dataset[idx] for idx in possibly_batched_index] File "/media/bruce/2T/projects/TSP/train/untrimmed_video_dataset.py", line 86, in __getitem__ raise RuntimeError(f'<UntrimmedVideoDataset>: got clip of length {vframes.shape[0]} != {self.clip_length}.' RuntimeError: <UntrimmedVideoDataset>: got clip of length 15 != 16.filename=/mnt/nas/bruce14t/THUMOS14/valid/video_validation_0000420.mp4, clip_t_start=526.7160991305855, clip_t_end=527.7827657972522, fps=30.0, t_start=498.2, t_end=546.9

I am very impressed by your wonderful work. When I try to reproduce the bash train_tsp_on_thumos14.sh for the THUMOS14 dataset, I got the above data loading issue. The calculation of the start and end of input clips seems not to work well for all the clips (code Line 74-78 of train/untrimmed_video_dataset.py). Could you provide some help with it? Thank you very much in advance.

opened by bruceyo 2
How do I calculate mean and std for a new dataset?

Thanks for your inspiring code with detailed explanations! I have learnt a lot from that and now I'm trying to do some experiments in another dataset. But some implementation details confuse me.

I notice that in the dataset transform part, there is a normalizing step. normalize = T.Normalize(mean=[0.43216, 0.394666, 0.37645], std=[0.22803, 0.22145, 0.216989])

So how do I calculate the mean and std for a new dataset? Should I extract frames from videos first, then calculate mean & std inside all the frames in all videos for each RGB channel?

opened by xjtupanda 1
$Similar to issue #11 getting RuntimeError(f'<UntrimmedVideoDataset>: got clip of length {vframes.shape[0]} != {self.clip_length}.'$
Similar to issue #11 getting RuntimeError(f': got clip of length {vframes.shape[0]} != {self.clip_length}.'
I am working with ActivityNet-v1.3 data converted to grayscale.

I followed the preprocessing step highlighted here.

However, I am still facing this issue similar to #11 , wanted to check if I am missing something or if there are any known fixes.

Example from the log:

RuntimeError: <UntrimmedVideoDataset>: got clip of length 15 != 16.filename=~/ActivityNet/grayscale_split/train/v_bNuRrXSjJl0.mp4, clip_t_start=227.63093165194988, clip_t_end=228.69759831861654, fps=30.0, t_start=219.1265882558503, t_end=228.7

RuntimeError: <UntrimmedVideoDataset>: got clip of length 13 != 16.filename=~/ActivityNet/grayscale_split/train/v_nTNkGOtp7aQ.mp4, clip_t_start=33.341372258903775, clip_t_end=34.408038925570445, fps=30.0, t_start=25.58139772698908, t_end=34.53333333333333

RuntimeError: <UntrimmedVideoDataset>: got clip of length 1 != 16.filename=~/ActivityNet/grayscale_split/train/v_7Iy7Cjv2SAE.mp4, clip_t_start=190.79558490339477, clip_t_end=191.86225157006143, fps=30.0, t_start=131.42849249141963, t_end=195.0

Also, is there a recommended way to skip these files instead of raising the issue while training. The above issues came for different runs and at different epochs.
opened by vc-30 1
Accuracy don't increase

Thank you for your reply! I used the above code to train my data set and found that the accuracy rate has not changed much and has remained around 3. Here is the output of the training: Do you know what caused it?

opened by ZChengLong578 1
question about pretrain-model

Hi, thank you for your excellent work. I have a problem with your model. It is extracted TSP Features in ActivityNet. When the objects present in my video are not in ActivityNet, the model fails to recognize. As an example, ActivityNet's animals are only dogs and horses, but when my video is a cat, I run into trouble. I'm guessing because the model hasn't seen cats, one of my solution is to use ImageNet-22k pretrained weights and then do extracted TSP Features in ActivityNet. I don't know if my thinking is right. If it is correct, could you please update your code about using ImageNet-22k pretrained weights? Thank you very much for your excellent work.

opened by qt2139 1

Releases(thumos14_features)

thumos14_features(Apr 14, 2021)

Pre-extracted TSP features for THUMOS14
Source code(tar.gz)
Source code(zip)
r2plus1d_34-tsp_on_thumos14-test_features.h5(1212.17 MB)
r2plus1d_34-tsp_on_thumos14-valid_features.h5(1098.67 MB)
model_weights(Apr 14, 2021)

This a release of TSP pretrained models as well as baseline models.
Source code(tar.gz)
Source code(zip)
r2plus1d_18-tac_on_activitynet-backbone_lr_0.0001-fc_lr_0.004-epoch_5-9f56941a.pth(239.74 MB)
r2plus1d_18-tac_on_kinetics-76ce975c.pth(120.31 MB)
r2plus1d_18-tsp_on_activitynet-max_gvf-backbone_lr_0.0001-fc_lr_0.002-epoch_6-22835b73.pth(239.76 MB)
r2plus1d_34-tac_on_activitynet-backbone_lr_0.0001-fc_lr_0.002-epoch_5-98ccac94.pth(485.49 MB)
r2plus1d_34-tac_on_kinetics-0547130e.pth(243.26 MB)
r2plus1d_34-tac_on_thumos14-backbone_lr_0.00001-fc_lr_0.002-epoch_3-54b5c8aa.pth(484.79 MB)
r2plus1d_34-tsp_on_activitynet-avg_gvf-backbone_lr_0.0001-fc_lr_0.004-epoch_5-8b74eaa2.pth(485.51 MB)
r2plus1d_34-tsp_on_activitynet-max_gvf-backbone_lr_0.0001-fc_lr_0.002-epoch_5-0d2cf854.pth(485.51 MB)
r2plus1d_34-tsp_on_activitynet-no_gvf-backbone_lr_0.0001-fc_lr_0.004-epoch_5-fb38fdd2.pth(485.50 MB)
r2plus1d_34-tsp_on_thumos14-max_gvf-backbone_lr_0.0001-fc_lr_0.004-epoch_4-e6a30b2f.pth(484.81 MB)
r3d_18-tac_on_activitynet-backbone_lr_0.001-fc_lr_0.01-epoch_5-31fd6e95.pth(253.89 MB)
r3d_18-tac_on_kinetics-dcd952c6.pth(127.36 MB)
r3d_18-tsp_on_activitynet-max_gvf-backbone_lr_0.0001-fc_lr_0.002-epoch_6-85584422.pth(253.91 MB)
activitynet_features(Apr 14, 2021)

Pre-extracted TSP features for Activitynet v1.3
Source code(tar.gz)
Source code(zip)
r2plus1d_34-tsp_on_activitynet-test_features.h5(962.38 MB)
r2plus1d_34-tsp_on_activitynet-train_features.h5(1969.45 MB)
r2plus1d_34-tsp_on_activitynet-valid_features.h5(975.15 MB)

Owner

Humam Alwassel

PhD Student, Computer Vision Researcher, and Deep Learning "Hacker".

GitHub Repository http://humamalwassel.com/publication/tsp/

Benchmarks for the Optimal Power Flow Problem

Power Grid Lib - Optimal Power Flow This benchmark library is curated and maintained by the IEEE PES Task Force on Benchmarks for Validation of Emergi

207 Dec 08, 2022

Code for "Reconstructing 3D Human Pose by Watching Humans in the Mirror", CVPR 2021 oral

Reconstructing 3D Human Pose by Watching Humans in the Mirror Qi Fang*, Qing Shuai*, Junting Dong, Hujun Bao, Xiaowei Zhou CVPR 2021 Oral The videos a

178 Dec 13, 2022

i3DMM: Deep Implicit 3D Morphable Model of Human Heads

i3DMM: Deep Implicit 3D Morphable Model of Human Heads CVPR 2021 (Oral) Arxiv | Poject Page This project is the official implementation our work, i3DM

60 Jan 03, 2023

Facial detection, landmark tracking and expression transfer library for Windows, Linux and Mac

Welcome to the CSIRO Face Analysis SDK. Documentation for the SDK can be found in doc/documentation.html. All code in this SDK is provided according t

7 Jul 16, 2020

General Vision Benchmark, a project from OpenGVLab

Introduction We build GV-B(General Vision Benchmark) on Classification, Detection, Segmentation and Depth Estimation including 26 datasets for model e

174 Dec 27, 2022

Multilingual Image Captioning

Multilingual Image Captioning Authors: Bhavitvya Malik, Gunjan Chhablani Demo Link: https://huggingface.co/spaces/flax-community/multilingual-image-ca

32 Nov 25, 2022

TFOD-MASKRCNN - Tensorflow MaskRCNN With Python

Tensorflow- MaskRCNN Steps git clone https://github.com/amalaj7/TFOD-MASKRCNN.gi

2 Jan 18, 2022

Video-Music Transformer

VMT Video-Music Transformer (VMT) is an attention-based multi-modal model, which generates piano music for a given video. Paper https://arxiv.org/abs/

5 Jul 13, 2022

Learning nonlinear operators via DeepONet

DeepONet: Learning nonlinear operators The source code for the paper Learning nonlinear operators via DeepONet based on the universal approximation th

239 Jan 02, 2023

Progressive Coordinate Transforms for Monocular 3D Object Detection

Progressive Coordinate Transforms for Monocular 3D Object Detection This repository is the official implementation of PCT. Introduction In this paper,

58 Nov 06, 2022

This is the official repository of Music Playlist Title Generation: A Machine-Translation Approach.

PlyTitle_Generation This is the official repository of Music Playlist Title Generation: A Machine-Translation Approach. The paper has been accepted by

6 Jan 03, 2022

✅ How Robust are Fact Checking Systems on Colloquial Claims?. In NAACL-HLT, 2021.

How Robust are Fact Checking Systems on Colloquial Claims? Official PyTorch implementation of our NAACL paper: Byeongchang Kim*, Hyunwoo Kim*, Seokhee

19 Mar 15, 2022

Process JSON files for neural recording sessions using Medtronic's BrainSense Percept PC neurostimulator

percept_processing This code processes JSON files for streamed neural data using Medtronic's Percept PC neurostimulator with BrainSense Technology for

3 Jun 06, 2022

Tutel MoE: An Optimized Mixture-of-Experts Implementation

Project Tutel Tutel MoE: An Optimized Mixture-of-Experts Implementation. Supported Framework: Pytorch Supported GPUs: CUDA(fp32 + fp16), ROCm(fp32) Ho

344 Dec 29, 2022

Synthesizing Long-Term 3D Human Motion and Interaction in 3D in CVPR2021

Long-term-Motion-in-3D-Scenes This is an implementation of the CVPR'21 paper "Synthesizing Long-Term 3D Human Motion and Interaction in 3D". Please ch

76 Dec 13, 2022

Code Impementation for "Mold into a Graph: Efficient Bayesian Optimization over Mixed Spaces"

Code Impementation for "Mold into a Graph: Efficient Bayesian Optimization over Mixed Spaces" This repo contains the implementation of GEBO algorithm.

2 Mar 22, 2022

TriMap: Large-scale Dimensionality Reduction Using Triplets

TriMap TriMap is a dimensionality reduction method that uses triplet constraints to form a low-dimensional embedding of a set of points. The triplet c

235 Dec 24, 2022

Code for the Weighted, Accelerated and Restarted Primal-dual algorithm. This algorithm achieves stable linear convergence for reconstruction from undersampled noisy measurements under an approximate sharpness condition. See the paper for details.

WARPd Code for the Weighted, Accelerated and Restarted Primal-dual algorithm. This algorithm achieves stable linear convergence for reconstruction fro

1 Apr 08, 2022

LSTM-VAE Implementation and Relevant Evaluations

LSTM-VAE Implementation and Relevant Evaluations Before using any file in this repository, please create two directories under the root directory name

5 Oct 08, 2022

Official PyTorch Implementation of paper EAN: Event Adaptive Network for Efficient Action Recognition

27 Nov 07, 2022

TSP: Temporally-Sensitive Pretraining of Video Encoders for Localization Tasks

Related tags

Overview

TSP: Temporally-Sensitive Pretraining of Video Encoders for Localization Tasks

Pre-extracted TSP Features

For ActivityNet v1.3 dataset

For THUMOS14 dataset

Setup

Data Preprocessing

Training

Feature Extraction

Comments

Releases(thumos14_features)

thumos14_features(Apr 14, 2021)

model_weights(Apr 14, 2021)

activitynet_features(Apr 14, 2021)

Owner

Humam Alwassel

Benchmarks for the Optimal Power Flow Problem

Code for "Reconstructing 3D Human Pose by Watching Humans in the Mirror", CVPR 2021 oral

i3DMM: Deep Implicit 3D Morphable Model of Human Heads

Facial detection, landmark tracking and expression transfer library for Windows, Linux and Mac

General Vision Benchmark, a project from OpenGVLab

Multilingual Image Captioning

TFOD-MASKRCNN - Tensorflow MaskRCNN With Python

Video-Music Transformer

Learning nonlinear operators via DeepONet

Progressive Coordinate Transforms for Monocular 3D Object Detection

This is the official repository of Music Playlist Title Generation: A Machine-Translation Approach.

✅ How Robust are Fact Checking Systems on Colloquial Claims?. In NAACL-HLT, 2021.

Process JSON files for neural recording sessions using Medtronic's BrainSense Percept PC neurostimulator

Tutel MoE: An Optimized Mixture-of-Experts Implementation

Synthesizing Long-Term 3D Human Motion and Interaction in 3D in CVPR2021

Code Impementation for "Mold into a Graph: Efficient Bayesian Optimization over Mixed Spaces"

TriMap: Large-scale Dimensionality Reduction Using Triplets

Code for the Weighted, Accelerated and Restarted Primal-dual algorithm. This algorithm achieves stable linear convergence for reconstruction from undersampled noisy measurements under an approximate sharpness condition. See the paper for details.

LSTM-VAE Implementation and Relevant Evaluations

Official PyTorch Implementation of paper EAN: Event Adaptive Network for Efficient Action Recognition