End-to-End Dense Video Captioning with Parallel Decoding (ICCV 2021)

Last update: Dec 16, 2022

Overview

PDVC

Official implementation for End-to-End Dense Video Captioning with Parallel Decoding (ICCV 2021)

[paper] [VALSE paper digest (in Chinese)]

This repo supports:

two video captioning tasks: dense video captioning and video paragraph captioning
two datasets: ActivityNet Captions and YouCook2
three types of video features: C3D, TSN, and TSP
visualization of the generated captions of your own videos

Table of Contents:

Updates
Introduction
Preparation
Running PDVC on Your Own Videos
Training and Validation
Performance
- Dense video captioning
- Video paragraph captioning
Citation
Acknowledgement

Updates

(2021.11.19) add code for running PDVC on raw videos and visualizing the generated captions (supports Chinese and other non-English languages)
(2021.11.19) add pretrained models with TSP features. PDVC with TSP features achieves 9.03 METEOR (2021) and 6.05 SODA_c, a very competitive result on ActivityNet Captions without self-critical sequence training.
(2021.08.29) add TSN pretrained models and support YouCook2

Introduction

PDVC is a simple yet effective framework for end-to-end dense video captioning with parallel decoding. It formulates dense caption generation as a set prediction task. Without bells and whistles, extensive experiments on ActivityNet Captions and YouCook2 show that PDVC produces high-quality captions, surpassing state-of-the-art methods when its localization accuracy is on par with theirs.
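To make "set prediction" concrete, here is a minimal sketch of one-to-one matching between predicted event segments and ground-truth segments, in the style of DETR-like set predictors. It is illustrative only and is not the matcher used in this repo: the function names are hypothetical, the cost is temporal IoU alone, and the search is brute force, which only works for a handful of segments.

```python
from itertools import permutations

def temporal_iou(a, b):
    """IoU of two (start, end) segments given in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def match_predictions(preds, targets):
    """Assign each target to a distinct prediction, maximizing total IoU.

    Assumes len(preds) >= len(targets). Returns (pred_idx, target_idx) pairs.
    """
    best, best_score = None, -1.0
    for perm in permutations(range(len(preds)), len(targets)):
        score = sum(temporal_iou(preds[p], targets[t]) for t, p in enumerate(perm))
        if score > best_score:
            best, best_score = perm, score
    return [(p, t) for t, p in enumerate(best)]

# Four predicted segments, two ground-truth events (made-up numbers).
preds = [(0.0, 10.0), (30.0, 50.0), (8.0, 22.0), (55.0, 60.0)]
targets = [(10.0, 20.0), (32.0, 48.0)]
matches = match_predictions(preds, targets)
print(matches)  # [(2, 0), (1, 1)]
```

Predictions left unmatched (indices 0 and 3 here) would be treated as "no event" during training. A practical implementation would use the Hungarian algorithm rather than enumerating permutations.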

Preparation

Environment: Linux, GCC>=5.4, CUDA >= 9.2, Python>=3.7, PyTorch>=1.5.1
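Before installing, it can help to compare your local versions against the minimums listed above. The sketch below is not part of the repo and its helper names are hypothetical. It only shows a numeric version comparison, which matters because a plain string comparison would rank "10.1" below "9.2".

```python
import re
import sys

# Minimum versions copied from the Environment line above.
MINIMUMS = {"Python": "3.7", "PyTorch": "1.5.1", "CUDA": "9.2", "GCC": "5.4"}

def parse_version(text):
    """Turn a string such as '1.7.1+cu101' into a comparable tuple (1, 7, 1)."""
    numbers = [int(n) for n in re.findall(r"\d+", text.split("+")[0])][:3]
    return tuple(numbers + [0] * (3 - len(numbers)))

def meets_minimum(found, minimum):
    """Compare versions numerically, so that '10.1' counts as newer than '9.2'."""
    return parse_version(found) >= parse_version(minimum)

python_version = "%d.%d.%d" % sys.version_info[:3]
print("Python", python_version, "ok:", meets_minimum(python_version, MINIMUMS["Python"]))
```

Once PyTorch is installed, the same check can be applied to `torch.__version__` and `torch.version.cuda`.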

Clone the repo

git clone --recursive https://github.com/ttengwang/PDVC.git

Create a virtual environment with conda

conda create -n PDVC python=3.7
source activate PDVC
conda install pytorch==1.7.1 torchvision==0.8.2 cudatoolkit=10.1 -c pytorch
conda install ffmpeg
pip install -r requirement.txt

Compile the deformable attention layer (requires GCC >= 5.4).

cd pdvc/ops
sh make.sh

Running PDVC on Your Own Videos

Download a pretrained model (GoogleDrive) with TSP features and put it into ./save. Then run:

video_folder=visualization/videos
output_folder=visualization/output
pdvc_model_path=save/anet_tsp_pdvc/model-best.pth
output_language=en
bash test_and_visualize.sh $video_folder $output_folder $pdvc_model_path $output_language

Check $output_folder, where you will find a new video with embedded captions. Note that non-English captions are generated by translating the English captions with GoogleTranslate. To produce Chinese captions, set output_language=zh-cn. For other languages, find the abbreviation of your language at this url; you may also need to download a font that supports your language and put it into ./visualization.
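If you prefer to drive the visualization script from Python, the command can be assembled as an argument list. This is a minimal sketch: `build_visualize_command` is a hypothetical helper, not part of the repo, and it only reproduces the positional argument order of the shell command above.

```python
import shlex

def build_visualize_command(video_folder, output_folder, model_path, language="en"):
    """Return the argument list for test_and_visualize.sh in the order shown above."""
    return ["bash", "test_and_visualize.sh", video_folder, output_folder, model_path, language]

cmd = build_visualize_command(
    "visualization/videos",
    "visualization/output",
    "save/anet_tsp_pdvc/model-best.pth",
    language="zh-cn",  # Chinese captions, as described above
)
print(shlex.join(cmd))
```

To actually run it from the repository root, pass the list to `subprocess.run(cmd, check=True)`.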

Training and Validation

Download Video Features

cd data/anet/features
bash download_anet_c3d.sh
# bash download_anet_tsn.sh
# bash download_i3d_vggish_features.sh
# bash download_tsp_features.sh

Dense Video Captioning

PDVC with learnt proposals

# Training
config_path=cfgs/anet_c3d_pdvc.yml
python train.py --cfg_path ${config_path} --gpu_id ${GPU_ID}
# The script evaluates the model after every epoch. Results and logs are saved in `./save`.

# Evaluation
eval_folder=anet_c3d_pdvc # specify the folder to be evaluated
python eval.py --eval_folder ${eval_folder} --eval_transformer_input_type queries --gpu_id ${GPU_ID}

PDVC with ground-truth proposals

# Training
config_path=cfgs/anet_c3d_pdvc_gt.yml
python train.py --cfg_path ${config_path} --gpu_id ${GPU_ID}

# Evaluation
eval_folder=anet_c3d_pdvc_gt
python eval.py --eval_folder ${eval_folder} --eval_transformer_input_type gt_proposals --gpu_id ${GPU_ID}

Video Paragraph Captioning

PDVC with learnt proposals

# Training
config_path=cfgs/anet_c3d_pdvc.yml
python train.py --cfg_path ${config_path} --criteria_for_best_ckpt pc --gpu_id ${GPU_ID} 

# Evaluation
eval_folder=anet_c3d_pdvc # specify the folder to be evaluated
python eval.py --eval_folder ${eval_folder} --eval_transformer_input_type queries --gpu_id ${GPU_ID}

PDVC with ground-truth proposals

# Training
config_path=cfgs/anet_c3d_pdvc_gt.yml
python train.py --cfg_path ${config_path} --criteria_for_best_ckpt pc --gpu_id ${GPU_ID}

# Evaluation
eval_folder=anet_c3d_pdvc_gt
python eval.py --eval_folder ${eval_folder} --eval_transformer_input_type gt_proposals --gpu_id ${GPU_ID}
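The four recipes above differ only in the config file, the `--criteria_for_best_ckpt pc` flag for paragraph captioning, and the `--eval_transformer_input_type` value. The sketch below builds those commands programmatically. The helper names are hypothetical and not part of the repo, and it assumes, based on the examples above, that the folder under ./save is named after the config file stem.

```python
from pathlib import Path

def train_command(cfg_path, gpu_id, paragraph=False):
    """Build the train.py call; paragraph captioning selects checkpoints with 'pc'."""
    cmd = ["python", "train.py", "--cfg_path", cfg_path, "--gpu_id", str(gpu_id)]
    if paragraph:
        cmd += ["--criteria_for_best_ckpt", "pc"]
    return cmd

def eval_command(cfg_path, gpu_id, gt_proposals=False):
    """Build the eval.py call for the run trained from cfg_path."""
    # Assumption: the save folder name equals the config file stem.
    eval_folder = Path(cfg_path).stem
    input_type = "gt_proposals" if gt_proposals else "queries"
    return ["python", "eval.py", "--eval_folder", eval_folder,
            "--eval_transformer_input_type", input_type, "--gpu_id", str(gpu_id)]

# Video paragraph captioning with ground-truth proposals on GPU 0.
cfg = "cfgs/anet_c3d_pdvc_gt.yml"
print(" ".join(train_command(cfg, 0, paragraph=True)))
print(" ".join(eval_command(cfg, 0, gt_proposals=True)))
```

Swapping `cfg` for one of the TSN or TSP configs listed in the Performance section would produce the corresponding commands for those features.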

Performance

Dense video captioning

Model	Features	config_path	Url	Recall	Precision	BLEU4	METEOR2018	METEOR2021	CIDEr	SODA_c
PDVC_light	C3D	cfgs/anet_c3d_pdvcl.yml	Google Drive	55.30	58.42	1.55	7.13	7.66	24.80	5.23
PDVC	C3D	cfgs/anet_c3d_pdvc.yml	Google Drive	55.20	57.36	1.82	7.48	8.09	28.16	5.47
PDVC_light	TSN	cfgs/anet_tsn_pdvcl.yml	Google Drive	55.34	57.97	1.66	7.41	7.97	27.23	5.51
PDVC	TSN	cfgs/anet_tsn_pdvc.yml	Google Drive	56.21	57.46	1.92	8.00	8.63	29.00	5.68
PDVC_light	TSP	cfgs/anet_tsp_pdvcl.yml	Google Drive	55.24	57.78	1.77	7.94	8.55	28.25	5.95
PDVC	TSP	cfgs/anet_tsp_pdvc.yml	Google Drive	55.79	57.39	2.17	8.37	9.03	31.14	6.05

Notes:

In the paper, we follow most previous methods and use the evaluation toolkit from the ActivityNet Challenge 2018. Note that the latest evaluation toolkit (METEOR2021) gives the same CIDEr/BLEU4 but a higher METEOR score.
In the paper, we use an old version of SODA_c implementation, while here we use an updated version for convenience.
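To see how much the newer toolkit raises METEOR, the gap can be computed directly from the table above. This short sketch only uses the METEOR2018 and METEOR2021 values copied from that table.

```python
# (model, features) -> (METEOR2018, METEOR2021), copied from the table above.
meteor = {
    ("PDVC_light", "C3D"): (7.13, 7.66),
    ("PDVC", "C3D"): (7.48, 8.09),
    ("PDVC_light", "TSN"): (7.41, 7.97),
    ("PDVC", "TSN"): (8.00, 8.63),
    ("PDVC_light", "TSP"): (7.94, 8.55),
    ("PDVC", "TSP"): (8.37, 9.03),
}

# Difference between the two toolkits for each model, rounded to two decimals.
gaps = {key: round(new - old, 2) for key, (old, new) in meteor.items()}
for (model, feats), gap in gaps.items():
    print(f"{model:<10} {feats}: +{gap:.2f}")
```

For these six models the gap ranges from 0.53 to 0.66 METEOR points, so scores from the two toolkits should not be compared directly.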

Video paragraph captioning

Model	Features	config_path	BLEU4	METEOR	CIDEr
PDVC	C3D	cfgs/anet_c3d_pdvc.yml	9.67	14.74	16.43
PDVC	TSN	cfgs/anet_tsn_pdvc.yml	10.18	15.96	20.66
PDVC	TSP	cfgs/anet_tsp_pdvc.yml	10.46	16.42	20.91

Notes:

Paragraph-level scores are evaluated on the ActivityNet Entity ae-val set.

Citation

If you find this repo helpful, please consider citing:

@inproceedings{wang2021end,
  title={End-to-End Dense Video Captioning with Parallel Decoding},
  author={Wang, Teng and Zhang, Ruimao and Lu, Zhichao and Zheng, Feng and Cheng, Ran and Luo, Ping},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages={6847--6857},
  year={2021}
}

@ARTICLE{wang2021echr,
  author={Wang, Teng and Zheng, Huicheng and Yu, Mingjing and Tian, Qian and Hu, Haifeng},
  journal={IEEE Transactions on Circuits and Systems for Video Technology}, 
  title={Event-Centric Hierarchical Representation for Dense Video Captioning}, 
  year={2021},
  volume={31},
  number={5},
  pages={1890-1900},
  doi={10.1109/TCSVT.2020.3014606}}

Acknowledgement

The implementation of Deformable Transformer is mainly based on Deformable DETR. The implementation of the captioning head is based on ImageCaptioning.pytorch. We thank the authors for their efforts.


Owner

Teng Wang
