QVHighlights: Detecting Moments and Highlights in Videos via Natural Language Queries

Last update: Dec 22, 2022

Moment-DETR

Jie Lei, Tamara L. Berg, Mohit Bansal

For dataset details, please check data/README.md

Getting Started

Prerequisites

Clone this repo

git clone https://github.com/jayleicn/moment_detr.git
cd moment_detr

Prepare feature files

Download moment_detr_features.tar.gz (8GB) and extract it under the project root directory:

tar -xf path/to/moment_detr_features.tar.gz

Install dependencies.

This code requires Python 3.7, PyTorch, and a few other Python libraries. We recommend creating a conda environment and installing all the dependencies as follows:

# create conda env
conda create --name moment_detr python=3.7
# activate env
conda activate moment_detr
# install pytorch with CUDA 11.0
conda install pytorch torchvision torchaudio cudatoolkit=11.0 -c pytorch
# install other python packages
pip install tqdm ipython easydict tensorboard tabulate scikit-learn pandas
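
After installation, a quick import check can confirm the environment is complete. The sketch below is not part of the repository; the module list is an assumption derived from the install commands above, using each package's import name rather than its pip or conda name.

```python
import importlib.util

# Import names for the packages installed above. Some differ from their
# pip/conda names (ipython -> IPython, scikit-learn -> sklearn,
# pytorch -> torch).
REQUIRED_MODULES = [
    "torch", "torchvision", "torchaudio", "tqdm", "IPython",
    "easydict", "tensorboard", "tabulate", "sklearn", "pandas",
]


def missing_modules(names):
    """Return the subset of `names` that cannot be located for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    print("missing:", missing if missing else "none")
```

Run it inside the activated moment_detr environment; any name it reports still needs to be installed.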

Training

Training can be launched by running the following command:

bash moment_detr/scripts/train.sh

This will train Moment-DETR for 200 epochs on the QVHighlights train split, with SlowFast and OpenAI CLIP features. Training is fast: it finishes within 4 hours on a single RTX 2080Ti GPU. The checkpoints and other experiment log files will be written into results. To train under different settings, append additional command line flags to the command above. For example, to train the model without the saliency loss (by setting the corresponding loss weight to 0):

bash moment_detr/scripts/train.sh --lw_saliency 0

For more configurable options, please check out our config file moment_detr/config.py.
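
To illustrate why a loss weight of 0 disables a loss term, here is a minimal sketch of a weighted loss sum. The term names and values are hypothetical and only mirror the idea behind --lw_saliency; the actual loss terms and weights are defined in the repository and may be combined differently.

```python
def weighted_total(losses, weights):
    """Sum each loss term scaled by its weight; a zero weight removes the term."""
    return sum(weights[name] * value for name, value in losses.items())


# Toy per-term loss values, not taken from any real run.
losses = {"span": 0.5, "saliency": 0.25}

default_total = weighted_total(losses, {"span": 2.0, "saliency": 1.0})
no_saliency_total = weighted_total(losses, {"span": 2.0, "saliency": 0.0})
print(default_total, no_saliency_total)  # 1.25 1.0
```

With the saliency weight at 0, the total depends only on the remaining term, so that term alone drives the gradient.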

Inference

Once the model is trained, you can use the following command for inference:

bash moment_detr/scripts/inference.sh CHECKPOINT_PATH SPLIT_NAME

where CHECKPOINT_PATH is the path to the saved checkpoint and SPLIT_NAME is the split to run inference on, either val or test.
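
When inference is launched from a Python script, a thin wrapper can validate the split name before calling the shell script. This is a sketch under assumptions: the helper names are hypothetical, and it only assembles the command shown above.

```python
import subprocess

VALID_SPLITS = ("val", "test")


def build_inference_command(checkpoint_path, split_name):
    """Assemble the inference command, rejecting unknown split names."""
    if split_name not in VALID_SPLITS:
        raise ValueError(
            f"split_name must be one of {VALID_SPLITS}, got {split_name!r}"
        )
    return ["bash", "moment_detr/scripts/inference.sh", checkpoint_path, split_name]


def run_inference(checkpoint_path, split_name):
    """Run the script; call this from the directory the README commands use."""
    command = build_inference_command(checkpoint_path, split_name)
    return subprocess.run(command, check=True)
```

For example, build_inference_command("path/to/model.ckpt", "val") returns the argument list for the val split, where the checkpoint path is a placeholder.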

Pretraining and Finetuning

Moment-DETR utilizes ASR captions for weakly supervised pretraining. To launch pretraining, run:

bash moment_detr/scripts/pretrain.sh

This will pretrain the Moment-DETR model on the ASR captions for 100 epochs. The pretrained checkpoints and other experiment log files will be written into results. To launch finetuning from a pretrained checkpoint PRETRAIN_CHECKPOINT_PATH, run:

bash moment_detr/scripts/train.sh --resume ${PRETRAIN_CHECKPOINT_PATH}

Note that this finetuning process is the same as standard training except that it initializes weights from a pretrained checkpoint.

Evaluation and Codalab Submission

Please check standalone_eval/README.md for details.

Acknowledgement

We thank Linjie Li for the helpful discussions. This code is based on detr and TVRetrieval XML. We used resources from mdetr, MMAction2, CLIP, SlowFast and HERO_Video_Feature_Extractor. We thank the authors for their awesome open-source contributions.

LICENSE

The annotation files are under the CC BY-NC-SA 4.0 license, see ./data/LICENSE. All the code is under the MIT license, see LICENSE.

Comments

About experiments on CharadesSTA dataset

Hi, I noticed that you also conducted experiments on the CharadesSTA dataset. How did you prepare the video features for CharadesSTA? Could you share the feature files you prepared?

opened by xljh0520 8

About the annotations

Hi @jayleicn, thanks for your great work! I notice that in the annotation files, as shown below, the duration of a video (126s) does not match the actual duration (810s - 660s = 150s). Should I crop the original video to 126s before processing in this case?

{
    "qid": 8737, 
    "query": "A family is playing basketball together on a green court outside.", 
    "duration": 126, 
    "vid": "bP5KfdFJzC4_660.0_810.0", 
    "relevant_windows": [[0, 16]],
    "relevant_clip_ids": [0, 1, 2, 3, 4, 5, 6, 7], 
    "saliency_scores": [[4, 1, 1], [4, 1, 1], [4, 2, 1], [4, 3, 2], [4, 3, 2], [4, 3, 3], [4, 3, 3], [4, 3, 2]]
}

opened by yeliudev 4
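
The fields in this record can be cross-checked with a few lines of Python. The sketch below assumes 2-second clips (based on the "2-sec clips" wording in the demo output quoted in a later comment) and assumes the vid string ends with the start and end offsets in seconds. The helper names are hypothetical. It only reproduces the mismatch described in the question and does not settle which duration to trust.

```python
import json

CLIP_LEN = 2  # seconds per clip; an assumption, see the note above

record = json.loads("""
{
    "qid": 8737,
    "query": "A family is playing basketball together on a green court outside.",
    "duration": 126,
    "vid": "bP5KfdFJzC4_660.0_810.0",
    "relevant_windows": [[0, 16]],
    "relevant_clip_ids": [0, 1, 2, 3, 4, 5, 6, 7],
    "saliency_scores": [[4, 1, 1], [4, 1, 1], [4, 2, 1], [4, 3, 2],
                        [4, 3, 2], [4, 3, 3], [4, 3, 3], [4, 3, 2]]
}
""")


def clip_ids_to_windows(clip_ids, clip_len=CLIP_LEN):
    """Merge consecutive clip ids into [start, end] windows in seconds."""
    windows = []
    for cid in sorted(clip_ids):
        start, end = cid * clip_len, (cid + 1) * clip_len
        if windows and windows[-1][1] == start:
            windows[-1][1] = end  # extend the current window
        else:
            windows.append([start, end])
    return windows


def span_from_vid(vid):
    """Split 'name_start_end' from the right, so names may contain '_'."""
    name, start, end = vid.rsplit("_", 2)
    return name, float(start), float(end)


name, start, end = span_from_vid(record["vid"])
print(clip_ids_to_windows(record["relevant_clip_ids"]))  # [[0, 16]]
print(end - start, record["duration"])  # 150.0 126
```

The clip ids and windows agree with each other, and there is one saliency triple per relevant clip; only the duration field disagrees with the span encoded in vid.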

CodaLab Submission Error

Hi, I recently generated the test and validation results and submitted them to CodaLab with the following structure:

Submit.zip
    hl_val_submission.jsonl
    hl_test_submission.jsonl

CodaLab gave me the following error:

IOError: [Errno 2] No such file or directory: '/tmp/codalab/tmphfqu8Q/run/input/res/hl_test_submission.jsonl'

How can I solve this problem?

opened by vateye 3
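
One possible cause of this error, offered as an assumption and not a confirmed diagnosis, is that the two files sit inside a folder within the archive, so the evaluator does not find them at the archive root. The sketch below builds a flat zip with Python's zipfile module and lists its members for inspection. The helper names are hypothetical; standalone_eval/README.md remains the authoritative reference for the submission format.

```python
import os
import zipfile


def make_flat_submission(zip_path, prediction_files):
    """Zip the given files so each sits at the archive root (no folders)."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in prediction_files:
            # arcname drops any directory prefix from the stored name
            zf.write(path, arcname=os.path.basename(path))
    return zip_path


def archive_members(zip_path):
    """Return the sorted member names stored in the archive."""
    with zipfile.ZipFile(zip_path) as zf:
        return sorted(zf.namelist())
```

If archive_members shows names with a directory prefix, the archive is nested and can be rebuilt with make_flat_submission.
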

Video feature extraction

Hi, thanks for your excellent work! I found that the provided video features include both CLIP and SlowFast features. However, run_on_video/run.py only extracts the CLIP features. Is this a mistake? Also, could you please provide a run.py that extracts both CLIP and SlowFast features? Thank you.

opened by fxqzb 2

About paper

Hi, we think that mdetr has great potential. However, Table 6 in the paper shows that the moment retrieval metrics on the Charades-STA dataset are not much higher than those of IVG-DCL, even though IVG-DCL uses C3D features for the video extractor and GloVe for text embedding, while your work uses CLIP + SlowFast features. Have you tested on other video grounding datasets, such as ActivityNet?

opened by BMEI1314 2

About dataset?

Good job. I have read the paper and the GitHub repository, but I still don't understand how the features under the features folder (clip_features, clip_sub_features, clip_text_features, slowfast_features, etc.) were extracted. Could you describe the extraction details?

opened by dourcer 2

[Request for the approval in competition] Hello. Can you approve the request?

Hello.

Thanks for the great work. Motivated by this work and the interesting topic, we sincerely hope to be approved to join the competition.

Thank you! By the way, sorry for bothering you.

Regards.

opened by wjun0830 1

Meaning of GT saliency scores

Thank you for your great work and open-source code.

I have a question about the GT saliency scores (only localized 2-sec clips). Could you briefly explain what they mean? Also, how do the predicted saliency scores (for all 2-sec clips) correspond to them?

Thanks!

Best, Kevin

Build models...
Loading feature extractors...
Loading CLIP models
Loading trained Moment-DETR model...
Run prediction...
------------------------------idx0
>> query: Chef makes pizza and cuts it up.
>> video_path: run_on_video/example/RoripwjYFp8_60.0_210.0.mp4
>> GT moments: [[106, 122]]
>> Predicted moments ([start_in_seconds, end_in_seconds, score]): [
    [49.967, 64.9129, 0.9421], 
    [66.4396, 81.0731, 0.9271], 
    [105.9434, 122.0372, 0.9234], 
    [93.2057, 103.3713, 0.2222], 
    ..., 
    [45.3834, 52.2183, 0.0005]
   ]
>> GT saliency scores (only localized 2-sec clips):  # what it means?
    [[2, 3, 3], [2, 3, 3], ...]
>> Predicted saliency scores (for all 2-sec clip):  # how this correspond to the GT saliency scores?
    [-0.9258, -0.8115, -0.7598, ..., 0.0739, 0.1068]

opened by QinghongLin 1
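
One way to line the two lists up, sketched under assumptions: each GT entry appears to hold three ratings for one localized clip (the annotation example in an earlier comment pairs eight relevant_clip_ids with eight triples), while the predictions cover every clip. The helper below is hypothetical, uses toy values, and averages the ratings purely for illustration. It places the GT ratings on the full clip timeline so they can be compared position by position with the predicted list.

```python
def gt_saliency_per_clip(relevant_clip_ids, saliency_scores, num_clips):
    """Place the mean GT rating of each annotated clip on the full timeline.

    Clips without annotations stay None, since GT scores are listed only
    for the localized clips.
    """
    timeline = [None] * num_clips
    for clip_id, ratings in zip(relevant_clip_ids, saliency_scores):
        timeline[clip_id] = sum(ratings) / len(ratings)
    return timeline


# Toy input: clips 2 and 3 are annotated out of 5 clips (hypothetical values).
timeline = gt_saliency_per_clip([2, 3], [[2, 3, 3], [4, 4, 4]], num_clips=5)
print(timeline)
```

Index i of the returned list then refers to the same clip as index i of the predicted saliency scores, assuming both use the same clip length.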

How do I make my dataset?

Hi, congrats on the amazing work. I want to build a dataset similar to QVHighlights for my research direction, and I have a few questions:

1. What annotation tools did you use, and what were the details of the annotation process?
2. How do you use CLIP to extract the QVHighlights text features? Can you provide the specific code?

opened by Yangaiei 1

About File missing in run_on_video

Thank you for your wonderful work! However, when I tried to run your demo in folder run_on_video, the file bpe_simple_vocab_16e6.txt.gz for the tokenizer is missing. Can you provide this file?

FileNotFoundError: [Errno 2] No such file or directory: 'moment_detr/run_on_video/clip/bpe_simple_vocab_16e6.txt.gz'

opened by lmfethan 1

The meaning of "tef"

Hi, I have a question about the "tef" in vision feature:

if self.use_tef:
    tef_st = torch.arange(0, ctx_l, 1.0) / ctx_l
    tef_ed = tef_st + 1.0 / ctx_l
    tef = torch.stack([tef_st, tef_ed], dim=1)  # (Lv, 2)
    if self.use_video:
        model_inputs["video_feat"] = torch.cat(
            [model_inputs["video_feat"], tef], dim=1)  # (Lv, Dv+2)
    else:
        model_inputs["video_feat"] = tef

What does "tef" mean in the visual feature? Thanks in advance.

opened by vateye 1
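
Reading the snippet above, each row of tef holds the normalized start and end position of one clip within the video, so the two extra feature dimensions tell the model where a clip sits in time. The name is often read as "temporal endpoint feature"; that reading is an assumption here, not something the snippet states. A dependency-free sketch of the same arithmetic, with plain lists standing in for tensors:

```python
def temporal_endpoints(ctx_l):
    """Return ctx_l rows of [start, end], each a fraction of the video length."""
    starts = [i / ctx_l for i in range(ctx_l)]
    return [[st, st + 1.0 / ctx_l] for st in starts]


print(temporal_endpoints(4))
# [[0.0, 0.25], [0.25, 0.5], [0.5, 0.75], [0.75, 1.0]]
```

For a video with four clips, the first clip spans the first quarter of the timeline and the last clip ends at 1.0.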

Slowfast config setting

Hi, thanks for your good work and released code!

I have a question regarding the feature extractor: which setting did you adopt for the QVHighlights SlowFast features? e.g., SLOWFAST_8x8_R50.

Thanks!

Kevin

opened by QinghongLin 0

Predicted saliency scores

How are the predicted saliency scores (for all 2-sec clips) calculated?

>> Predicted saliency scores (for all 2-sec clip): [-0.9258, -0.8115, -0.7598, ..., 0.0739, 0.1068]

Is it the average of the three annotators' scores? And why are the predicted saliency scores (for all 2-sec clips) negative?

opened by Yangaiei 0
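
The negative values suggest that the predictions are raw, unnormalized scores and not ratings on the annotators' scale; this is an inference from the log, not something confirmed here. Under that assumption, only the relative order of the scores matters. The hypothetical helper below ranks clips by score and converts clip indices to seconds, assuming 2-second clips.

```python
def rank_clips(scores, clip_len=2):
    """Return (start_sec, end_sec, score) tuples, highest score first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i * clip_len, (i + 1) * clip_len, scores[i]) for i in order]


# Toy scores; hypothetical values, not the elided list from the log above.
toy_scores = [-0.9, -0.8, 0.1, -0.2]
print(rank_clips(toy_scores)[:2])  # [(4, 6, 0.1), (6, 8, -0.2)]
```

A clip with a negative score can still rank among the most salient if the other clips score lower.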

Releases(checkpoints)

checkpoints (Mar 11, 2022)

We release the following Moment-DETR checkpoints for the QVHighlights dataset:

ASR-pretrained checkpoint: pt_model_e50.ckpt

Finetuned model: ft_model_from_pt_model_e50.ckpt, finetuned from pt_model_e50.ckpt

Model trained from scratch: scratch_model.ckpt

ft_model_from_pt_model_e50.ckpt(55.22 MB)
pt_model_e50.ckpt(55.22 MB)
scratch_model.ckpt(55.22 MB)

Owner

Jie Lei 雷杰

UNC CS PhD student, vision+language.

