Starter Code for VALUE Benchmark

This is the starter code for the VALUE Benchmark [website], [paper].

Overview of VALUE Benchmark

This repository currently supports all baseline models in the VALUE paper, including training with different video-subtitle fusion methods, different input channels, different visual representations, and multi-task training. You can also perform transfer evaluation between different tasks with our evaluation code.

Before diving into the baseline models mentioned above, please familiarize yourself with the codebase by going through the examples in Quick Start and Single Task Finetuning.

The code in this repo is copied/modified from open-source implementations made available by HERO.

Updates

  • [7/27/2021] Please re-download violin_test_private.db at this link if you downloaded it via scripts/download_violin.sh prior to 7/27/2021. The previous version is not consistent with our release; we apologize for the inconvenience.

Requirements

We use the Docker image provided in HERO for easier reproduction. Please follow the Requirements section in HERO to set up the environment.

Quick Start

NOTE: Please run bash scripts/download_pretrained.sh $PATH_TO_STORAGE to get the latest pretrained checkpoints from HERO.

We use TVR as an end-to-end example for single-task finetuning.

  1. Download processed data and pretrained models with the following command.

    bash scripts/download_tvr.sh $PATH_TO_STORAGE

    After downloading, you should see the following folder structure:

    ├── video_db
    │   ├── tv
    ├── pretrained
    │   └── hero-tv-ht100.pt
    └── txt_db
        ├── tv_subtitles.db
        ├── tvr_train.db
        ├── tvr_val.db
        └── tvr_test.db
    
  2. Launch the Docker container for running the experiments.

    # docker image should be automatically pulled
    source launch_container.sh $PATH_TO_STORAGE/txt_db $PATH_TO_STORAGE/video_db \
        $PATH_TO_STORAGE/finetune $PATH_TO_STORAGE/pretrained

    The launch script respects the $CUDA_VISIBLE_DEVICES environment variable. Note that the source code is mounted into the container under /src instead of being built into the image, so that user modifications are reflected without re-building the image. (Data folders are mounted into the container separately for flexibility on folder structures.)

  3. Run finetuning for the TVR task.

    # inside the container
    horovodrun -np 8 python train_retrieval.py --config config/train-tvr-8gpu.json \
        --output_dir $YOUR_TVR_OUTPUT_DIR
    
    # for single gpu
    python train_retrieval.py --config $YOUR_CONFIG_JSON
  4. Run inference for the TVR task.

    # inference, inside the container
    python eval_vcmr.py --query_txt_db /txt/tvr_val.db/ --split val \
        --vfeat_db /video/tv/ --sub_txt_db /txt/tv_subtitles.db/ \
        --output_dir $YOUR_TVR_OUTPUT_DIR --checkpoint $BEST_CKPT_STEP \
        --task tvr
    

    The result file will be written at ${YOUR_TVR_OUTPUT_DIR}/results_val/results_${BEST_CKPT_STEP}_all.json. Change to --query_txt_db /txt/tvr_test.db/ --split test for inference on the test split. Please format the result file as requested in VALUE Evaluation Tools for submission; this repository does not include the formatting step.

  5. Misc. Follow the steps below in case you would like to reproduce the whole preprocessing pipeline.

  • Text annotation and subtitle preprocessing

    # outside of the container
    # make sure you have downloaded/constructed the video dbs for TV dataset
    # the prepro of tv_subtitles.db requires information from video_db/tv
    bash scripts/create_txtdb.sh $PATH_TO_STORAGE/txt_db \
        $PATH_TO_STORAGE/ann $PATH_TO_STORAGE/video_db
  • Video feature extraction

    We follow the feature extraction code at HERO_Video_Feature_Extractor. Please follow the link for instructions on extracting video features with the ResNet, SlowFast, S3D (MIL-NCE) and CLIP-ViT models. These features are saved as separate .npz files per video.

  • Video feature preprocessing and saving to lmdb

    # inside of the container
    
    # Use resnet_slowfast as an example
    # Gather slowfast/resnet feature paths
    python scripts/collect_video_feature_paths.py \
        --feature_dir $PATH_TO_STORAGE/vis_feat_dir \
        --output $PATH_TO_STORAGE/video_db --dataset $DATASET_NAME \
        --feat_version resnet_slowfast
    
    # Convert to lmdb
    python scripts/convert_videodb.py \
        --vfeat_info_file $PATH_TO_STORAGE/video_db/$DATASET_NAME/resnet_slowfast_info.pkl \
        --output $PATH_TO_STORAGE/video_db --dataset $DATASET_NAME --frame_length 1.5 \
        --feat_version resnet_slowfast
    • --frame_length: 1 feature per frame_length seconds; we use 1.5 in our implementation. Set it to be consistent with the value used in feature extraction.
    • --compress: enable compression of lmdb
    • --feat_version: choose from resnet_slowfast, resnet_mil-nce (ResNet+S3D in paper), clip-vit_slowfast, clip-vit_mil-nce (CLIP-ViT+S3D in paper); see the example below.
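
As a concrete example, the same two preprocessing steps with the CLIP-ViT+SlowFast features would look like the sketch below, assuming the info file follows the same <feat_version>_info.pkl naming pattern as above:

# inside of the container
# same pipeline as above, with the clip-vit_slowfast feature version
python scripts/collect_video_feature_paths.py \
    --feature_dir $PATH_TO_STORAGE/vis_feat_dir \
    --output $PATH_TO_STORAGE/video_db --dataset $DATASET_NAME \
    --feat_version clip-vit_slowfast

python scripts/convert_videodb.py \
    --vfeat_info_file $PATH_TO_STORAGE/video_db/$DATASET_NAME/clip-vit_slowfast_info.pkl \
    --output $PATH_TO_STORAGE/video_db --dataset $DATASET_NAME --frame_length 1.5 \
    --feat_version clip-vit_slowfast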

VALUE Single Task Finetuning

Video Retrieval Tasks

All video retrieval tasks can be finetuned with train_retrieval.py. We use YC2R as an additional example to show how to perform single-task finetuning on video retrieval tasks.

  1. download data
    # outside of the container
    bash scripts/download_yc2.sh $PATH_TO_STORAGE
  2. train
    # inside the container
    horovodrun -np 4 python train_retrieval.py --config config/train-yc2r-4gpu.json \
        --output_dir $YC2R_EXP
  3. inference
    # inside the container
    python eval_vr.py --query_txt_db /txt/yc2r_test.db/ --split test \
        --vfeat_db /video/yc2/ --sub_txt_db /txt/yc2_subtitles.db/ \
        --output_dir $YC2R_EXP --checkpoint $ckpt --task yc2r
    The result file will be written at $YC2R_EXP/results_test/results_${ckpt}_all.json, which can be submitted to the evaluation server. Please format the result file as requested in VALUE Evaluation Tools for submission.

Video QA Tasks

All video question answering models can be finetuned with train_qa.py. We use TVQA to demonstrate how to perform single-task finetuning on video question answering tasks.

  1. download data

    # outside of the container
    bash scripts/download_tvqa.sh $PATH_TO_STORAGE
  2. train

    # inside the container
    horovodrun -np 8 python train_qa.py --config config/train-tvqa-8gpu.json \
        --output_dir $TVQA_EXP
  3. inference

    # inside the container
    horovodrun -np 8 python eval_videoQA.py --query_txt_db /txt/tvqa_test.db/ --split test \
        --vfeat_db /video/tv/ --sub_txt_db /txt/tv_subtitles.db/ \
        --output_dir $TVQA_EXP --checkpoint $ckpt --task tvqa

    The result file will be written at $TVQA_EXP/results_test/results_${ckpt}_all.json, which can be submitted to the evaluation server. Please format the result file as requested in VALUE Evaluation Tools for submission.

    Use eval_violin.py for inference on the VIOLIN task.
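
    A hypothetical invocation is sketched below, assuming eval_violin.py takes the same flags as eval_videoQA.py; the db mount paths and $VIOLIN_EXP are placeholders, adjust them to your own download and output locations:

    # inside the container (flags/paths assumed to mirror eval_videoQA.py)
    horovodrun -np 8 python eval_violin.py --query_txt_db /txt/violin_test_private.db/ --split test \
        --vfeat_db /video/violin/ --sub_txt_db /txt/violin_subtitles.db/ \
        --output_dir $VIOLIN_EXP --checkpoint $ckpt --task violin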

Captioning Tasks

All video captioning models can be finetuned with train_captioning.py. We use TVC to demonstrate how to perform single-task finetuning on video captioning tasks.

  1. download data

    # outside of the container
    bash scripts/download_tvc.sh $PATH_TO_STORAGE
  2. train

    # inside the container
    horovodrun -np 8 python train_captioning.py --config config/train-tvc-8gpu.json \
        --output_dir $TVC_EXP
  3. inference

    # inside the container
    python inf_tvc.py --model_dir $TVC_EXP --ckpt_step $ckpt \
        --target_clip /txt/tvc_val_release.jsonl --output tvc_val_output.jsonl
    • The result file will be written at $TVC_EXP/tvc_val_output.jsonl.
    • Change to --target_clip /txt/tvc_test_release.jsonl for test results.
    • See scripts/prepro_tvc.sh for LMDB preprocessing.

    Use inf_vatex_en_c.py / inf_yc2c.py for inference on the VATEX_EN_C / YC2C tasks.
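
    A hypothetical YC2C invocation is sketched below, assuming inf_yc2c.py takes the same flags as inf_tvc.py; the target clip file name and $YC2C_EXP are placeholders:

    # inside the container (flags/paths assumed to mirror inf_tvc.py)
    python inf_yc2c.py --model_dir $YC2C_EXP --ckpt_step $ckpt \
        --target_clip /txt/yc2c_test_release.jsonl --output yc2c_test_output.jsonl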

VALUE Multi-Task Finetuning

  1. download data

    # outside of the container
    bash scripts/download_all.sh $PATH_TO_STORAGE
  2. train

    # inside the container
    horovodrun -np 8 python train_all_multitask.py \
        --config config/train-all-multitask-8gpu.json \
        --output_dir $AT_PT_FT_EXP
    • --config: change the config file for different multi-task settings (a concrete example follows this list).
      • MT by domain group: config/train-tv_domain-multitask-8gpu.json / config/train-youtube_domain-multitask-8gpu.json
      • MT by task type: config/train-retrieval-multitask-8gpu.json / config/train-qa-multitask-8gpu.json / config/train-caption-multitask-8gpu.json
      • AT: config/train-all-multitask-8gpu.json
    • For multi-task baselines without pre-training, refer to the configs under config/FT_only_configs.
  3. inference

    Follow the inference instructions above for each task.
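
As referenced under --config above, here is a sketch of the TV-domain multi-task run, assuming the same train_all_multitask.py entry point also applies to the domain-group configs ($TV_MT_EXP is a placeholder output directory):

# inside the container
horovodrun -np 8 python train_all_multitask.py \
    --config config/train-tv_domain-multitask-8gpu.json \
    --output_dir $TV_MT_EXP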

Training with Different Input Channels

To reproduce our experiments with different input channels, change the training config via --config. Take TVR as an example:

  1. Video-only
    # inside the container
    horovodrun -np 8 python train_retrieval.py \
        --config config/FT_only_configs/train-tvr_video_only-8gpu.json \
        --output_dir $TVR_V_only_EXP
  2. Subtitle-only
    # inside the container
    
    horovodrun -np 8 python train_retrieval.py \
        --config config/FT_only_configs/train-tvr_sub_only-8gpu.json \
        --output_dir $TVR_S_only_EXP
  3. Video + Subtitle
    # inside the container
    
    horovodrun -np 8 python train_retrieval.py \
        --config config/FT_only_configs/train-tvr-8gpu.json \
        --output_dir $TVR_EXP

Training with Different Video-Subtitle Fusion Methods

To reproduce our experiments with different video-subtitle fusion methods, change the fusion methods via --model_config for training. Take TVR as an example:

# Training, inside the container
horovodrun -np 8 python train_retrieval.py --config config/FT_only_configs/train-tvr-8gpu.json \
    --output_dir $TVR_EXP --model_config config/model_config/hero_finetune.json
  • config/model_config/hero_finetune.json: default temporal align + cross-modal transformer
  • config/model_config/video_sub_sequence_finetune.json: sequence concatenation
  • config/model_config/video_sub_feature_add_finetune.json: temporal align + summation
  • config/model_config/video_sub_feature_concat_finetune.json: temporal align + concatenation
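
For example, to train with the sequence-concatenation fusion, only the --model_config path changes ($TVR_SEQ_EXP is a placeholder output directory):

# Training, inside the container
horovodrun -np 8 python train_retrieval.py --config config/FT_only_configs/train-tvr-8gpu.json \
    --output_dir $TVR_SEQ_EXP --model_config config/model_config/video_sub_sequence_finetune.json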

For the two-stream experiments in our paper, please train video-only and subtitle-only models following the Video-only and Subtitle-only settings in Training with Different Input Channels, and use the evaluation scripts in two_stream_eval. Take TVR as an example:

# Evaluation, inside the container
python eval_vcmr.py --query_txt_db /txt/tvr_val.db/ --split val \
    --vfeat_db /video/tv/ --sub_txt_db /txt/tv_subtitles.db/ \
    --video_only_model_dir $TVR_V_only_EXP --video_only_checkpoint $BEST_V_only_CKPT_STEP \
    --sub_only_model_dir $TVR_S_only_EXP --sub_only_checkpoint $BEST_S_only_CKPT_STEP \
    --task tvr

Training with Different Visual Representations

To reproduce our experiments with different visual representations, change the visual representations via --vfeat_version for training. Take TVR as an example:

# inside the container
horovodrun -np 8 python train_retrieval.py --config config/FT_only_configs/train-tvr-8gpu.json \
    --output_dir $TVR_EXP --vfeat_version resnet

We provide all feature variations used in the paper, including:

  • 2D features: resnet and clip-vit
  • 3D features: mil-nce (S3D in paper) and slowfast
  • 2D+3D features: resnet_slowfast, resnet_mil-nce (ResNet+S3D in paper), clip-vit_mil-nce (CLIP-ViT+S3D in paper), clip-vit_slowfast
  • --vfeat_version: the default is set to resnet_slowfast
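
For example, to train TVR with the 2D-only CLIP-ViT features, simply swap the --vfeat_version value ($TVR_CLIP_EXP is a placeholder output directory):

# inside the container
horovodrun -np 8 python train_retrieval.py --config config/FT_only_configs/train-tvr-8gpu.json \
    --output_dir $TVR_CLIP_EXP --vfeat_version clip-vit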

Task Transferability Evaluation

To reproduce our experiments on task transferability, you will need to first train a model on the source task and then run evaluation on the target task. Take TVR->How2R as an example:

  1. Train on TVR task
    # inside the container
    horovodrun -np 8 python train_retrieval.py --config config/FT_only_configs/train-tvr-8gpu.json \
        --output_dir $TVR_EXP 
  2. Evaluate the trained model on How2R task:
    # inside the container
    python eval_vcmr.py --query_txt_db /txt/how2r_val_1k.db/ --split val \
        --vfeat_db /video/how2/ --sub_txt_db /txt/how2_subtitles.db/ \
        --output_dir $TVR_EXP --checkpoint $BEST_TVR_CKPT_STEP \
        --task how2r

Pre-training

All VALUE baselines are based on the pre-trained checkpoint released in HERO. The pre-training experiments are not tested in this codebase.

If you wish to perform pre-training, please refer to instructions in HERO.

Citation

If you find this code useful for your research, please consider citing:

@inproceedings{li2021value,
  title={VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation},
  author={Li, Linjie and Lei, Jie and Gan, Zhe and Yu, Licheng and Chen, Yen-Chun and Pillai, Rohit and Cheng, Yu and Zhou, Luowei and Wang, Xin Eric and Wang, William Yang and others},
  booktitle={35th Conference on Neural Information Processing Systems (NeurIPS 2021) Track on Datasets and Benchmarks},
  year={2021}
}

@inproceedings{li2020hero,
  title={HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training},
  author={Li, Linjie and Chen, Yen-Chun and Cheng, Yu and Gan, Zhe and Yu, Licheng and Liu, Jingjing},
  booktitle={EMNLP},
  year={2020}
}

License

MIT
