A Joint Video and Image Encoder for End-to-End Retrieval

Overview

Frozen️ in Time ❄️ ️️️️

A Joint Video and Image Encoder for End-to-End Retrieval

project page | arXiv | webvid-data alt text Repository containing the code, models, data for end-to-end retrieval. WebVid data can be found here


📝 Preparation

  1. Create conda env conda env create -f requirements/frozen.yml

  2. Create data / experiment folders mkdir data; mkdir exps, note this can just be a symlink to where you want to store big data.

🔧 Finetuning (benchmarks: MSR-VTT)

  1. wget https://www.robots.ox.ac.uk/~maxbain/frozen-in-time/data/MSRVTT.zip -P data; unzip data/MSRVTT.zip -d data

  2. Change num_gpus in the config file accordingly.

  3. Train python train.py --config configs/msrvtt_4f_i21k.json

  4. Test python test.py --resume exps/models/{EXP_NAME}/{EXP_TIMESTAMP}/model_best.pth

For finetuning a pretrained model, set "load_checkpoint": "PATH_TO_MODEL" in the config file.

🏋 ️‍️ Pretraining

  1. Download WebVid-2M (see https://github.com/m-bain/webvid)

  2. Download CC-3M (see https://ai.google.com/research/ConceptualCaptions/download)

  3. Train. python train.py --config CONFIG_PATH. Here are the different options:

    a. Dataset combinations

     i. CC-3M + WebVid2M: configs/cc-webvid2m-pt-i2k.json
     ii. WebVid2M : configs/webvid2m-pt-i2k.json
    

    You can add in an arbitrary number of image/video datasets for pre-training by adding as many dataloaders to the config file dataloader list as your heart desires. Adding more datasets will likely to higher downstream performance.

    b. Number of frames

    For image datasets, this should always be set to video_params": {"num_frames": 1, ...}.

    For video datasets, set this to what you want. N.B. More frames requires = more gpu memory.

    If, like us, you are not a big company and have limited compute, then you will benefit by training via a curriculum on the number of frames. A lot of the knowledge can be learned in the 1-frame setting, as we show in the paper. You can then finetune with more frames. See curriculum learning section

    c. Finetuning

    Set "load_checkpoint": "FULL_MODEL_PATH" in the config file. You can now use different experiment params, such as num_frames, to do curriculum learning for example.

🗄 Pretrained Weights

📚 Curriculum Learning on #frames

Curriculum learning on the number of frames in pretraining achieves similar performance with significant reduction in compute (both memory and training time). This is because model has higher throughput for fewer frames, as well as allowing a bigger batch size for the same gpu memory.

Our best model was trained on 1-frame then finetuned on 4-frames on CC+WebVid2M.

Train on 1-frame until the training loss converges, then finetune on 4-frames with the same config, from the 1-frame checkpoint via setting load_checkpoint in config file. 4-frame finetuning needs much less iterations (~10% of 1-frame setting is sufficient) since most of the knowledge is learned in the 1-frame setting.

📈 Experiment Logging and Visualising

This repository uses a sacred backbone for logging and tracking experiments, with a neptune front end. It makes life a lot easier. If you want to activate this:

  1. Create a neptune.ai account.
  2. Create a project, copy in your credentials in train.py and remove the ValueError
  3. Set neptune: true in your config files.

🎓 Cite

If you use this code in your research, please cite:

@misc{bain2021frozen,
      title={Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval}, 
      author={Max Bain and Arsha Nagrani and Gül Varol and Andrew Zisserman},
      year={2021},
      eprint={2104.00650},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

🙏 Acknowledgements

This code is based off the pytorch-template https://github.com/victoresque/pytorch-template

As well as many good practices adopted from Samuel Albanie's https://github.com/albanie/collaborative-experts

Owner
PhD Student, VGG, Oxford
PyTorch implementation of DCT fast weight RNNs

DCT based fast weights This repository contains the official code for the paper: Training and Generating Neural Networks in Compressed Weight Space. T

Kazuki Irie 4 Dec 24, 2022
Deep RGB-D Saliency Detection with Depth-Sensitive Attention and Automatic Multi-Modal Fusion (CVPR'2021, Oral)

DSA^2 F: Deep RGB-D Saliency Detection with Depth-Sensitive Attention and Automatic Multi-Modal Fusion (CVPR'2021, Oral) This repo is the official imp

如今我已剑指天涯 46 Dec 21, 2022
《Fst Lerning of Temporl Action Proposl vi Dense Boundry Genertor》(AAAI 2020)

Update 2020.03.13: Release tensorflow-version and pytorch-version DBG complete code. 2019.11.12: Release tensorflow-version DBG inference code. 2019.1

Tencent 338 Dec 16, 2022
My freqtrade strategies

My freqtrade-strategies Hi there! This is repo for my freqtrade-strategies. My name is Ilya Zelenchuk, I'm a lecturer at the SPbU university (https://

171 Dec 05, 2022
Start-to-finish tutorial for interactive music co-creation in PyTorch and Tensorflow.js

Start-to-finish tutorial for interactive music co-creation in PyTorch and Tensorflow.js

Chris Donahue 98 Dec 14, 2022
Read and write layered TIFF ImageSourceData and ImageResources tags

Read and write layered TIFF ImageSourceData and ImageResources tags Psdtags is a Python library to read and write the Adobe Photoshop(r) specific Imag

Christoph Gohlke 4 Feb 05, 2022
AMTML-KD: Adaptive Multi-teacher Multi-level Knowledge Distillation

AMTML-KD: Adaptive Multi-teacher Multi-level Knowledge Distillation

Frank Liu 26 Oct 13, 2022
Code for "Learning the Best Pooling Strategy for Visual Semantic Embedding", CVPR 2021

Learning the Best Pooling Strategy for Visual Semantic Embedding Official PyTorch implementation of the paper Learning the Best Pooling Strategy for V

Jiacheng Chen 106 Jan 06, 2023
City-Scale Multi-Camera Vehicle Tracking Guided by Crossroad Zones Code

City-Scale Multi-Camera Vehicle Tracking Guided by Crossroad Zones Requirements Python 3.8 or later with all requirements.txt dependencies installed,

88 Dec 12, 2022
Classifying cat and dog images using Kaggle dataset

PyTorch Image Classification Classifies an image as containing either a dog or a cat (using Kaggle's public dataset), but could easily be extended to

Robert Coleman 74 Nov 22, 2022
The authors' official PyTorch SigWGAN implementation

The authors' official PyTorch SigWGAN implementation This repository is the official implementation of [Sig-Wasserstein GANs for Time Series Generatio

9 Jun 16, 2022
Code for "Unsupervised State Representation Learning in Atari"

Unsupervised State Representation Learning in Atari Ankesh Anand*, Evan Racah*, Sherjil Ozair*, Yoshua Bengio, Marc-Alexandre Côté, R Devon Hjelm This

Mila 217 Jan 03, 2023
The repo for reproducing Seed-driven Document Ranking for Systematic Reviews: A Reproducibility Study

ECIR Reproducibility Paper: Seed-driven Document Ranking for Systematic Reviews: A Reproducibility Study This code corresponds to the reproducibility

ielab 3 Mar 31, 2022
SlotRefine: A Fast Non-Autoregressive Model forJoint Intent Detection and Slot Filling

SlotRefine: A Fast Non-Autoregressive Model for Joint Intent Detection and Slot Filling Reference Main paper to be cited (Di Wu et al., 2020) @article

Moore 34 Nov 03, 2022
Official code for UnICORNN (ICML 2021)

UnICORNN (Undamped Independent Controlled Oscillatory RNN) [ICML 2021] This repository contains the implementation to reproduce the numerical experime

Konstantin Rusch 21 Dec 22, 2022
A torch.Tensor-like DataFrame library supporting multiple execution runtimes and Arrow as a common memory format

TorchArrow (Warning: Unstable Prototype) This is a prototype library currently under heavy development. It does not currently have stable releases, an

Facebook Research 536 Jan 06, 2023
Square Root Bundle Adjustment for Large-Scale Reconstruction

RootBA: Square Root Bundle Adjustment Project Page | Paper | Poster | Video | Code Table of Contents Citation Dependencies Installing dependencies on

Nikolaus Demmel 205 Dec 20, 2022
This is the official pytorch implementation for our ICCV 2021 paper "TRAR: Routing the Attention Spans in Transformers for Visual Question Answering" on VQA Task

🌈 ERASOR (RA-L'21 with ICRA Option) Official page of "ERASOR: Egocentric Ratio of Pseudo Occupancy-based Dynamic Object Removal for Static 3D Point C

Hyungtae Lim 225 Dec 29, 2022
Pytorch implementation of NeurIPS 2021 paper: Geometry Processing with Neural Fields.

Geometry Processing with Neural Fields Pytorch implementation for the NeurIPS 2021 paper: Geometry Processing with Neural Fields Guandao Yang, Serge B

Guandao Yang 162 Dec 16, 2022
Survival analysis (SA) is a well-known statistical technique for the study of temporal events.

DAGSurv Survival analysis (SA) is a well-known statistical technique for the study of temporal events. In SA, time-to-an-event data is modeled using a

Rahul Kukreja 1 Sep 05, 2022