Videocaptioning.pytorch - A simple implementation of video captioning

Last update: Jan 01, 2022

Related tags

Deep Learning videocaptioning.pytorch

Overview

pytorch implementation of video captioning

recommend installing pytorch and python packages using Anaconda

This code is based on video-caption.pytorch

requirements (my environment, other versions of pytorch and torchvision should also support this code (not been verified!))

cuda
pytorch 1.7.1
torchvision 0.8.2
python 3
ffmpeg (can install using anaconda)

python packages

tqdm
pillow
nltk

Data

MSR-VTT. Download and put them in ./data/msr-vtt-data directory

|-data
  |-msr-vtt-data
    |-train-video
    |-test-video
    |-annotations
      |-train_val_videodatainfo.json
      |-test_videodatainfo.json

MSVD. Download and put them in ./data/msvd-data directory

|-data
  |-msvd-data
    |-YouTubeClips
    |-annotations
      |-AllVideoDescriptions.txt

Options

all default options are defined in opt.py or corresponding code file, change them for your like.

Acknowledgements

Some code refers to ImageCaptioning.pytorch

Usage

(Optional) c3d features (not verified)

you can use video-classification-3d-cnn-pytorch to extract features from video.

Steps

preprocess MSVD annotations (convert txt file to json file)

refer to data/msvd-data/annotations/prepro_annotations.ipynb

preprocess videos and labels

# For MSR-VTT dataset
# Train and Validata set
CUDA_VISIBLE_DEVICES=0 python prepro_feats.py \
    --video_path ./data/msr-vtt-data/train-video \
    --video_suffix mp4 \
    --output_dir ./data/msr-vtt-data/resnet152 \
    --model resnet152 \
    --n_frame_steps 40

# Test set
CUDA_VISIBLE_DEVICES=0 python prepro_feats.py \
    --video_path ./data/msr-vtt-data/test-video \
    --video_suffix mp4 \
    --output_dir ./data/msr-vtt-data/resnet152 \
    --model resnet152 \
    --n_frame_steps 40

python prepro_vocab.py \
    --input_json data/msr-vtt-data/annotations/train_val_videodatainfo.json data/msr-vtt-data/annotations/test_videodatainfo.json \
    --info_json data/msr-vtt-data/info.json \
    --caption_json data/msr-vtt-data/caption.json \
    --word_count_threshold 4

# For MSVD dataset
CUDA_VISIBLE_DEVICES=0 python prepro_feats.py \
    --video_path ./data/msvd-data/YouTubeClips \
    --video_suffix avi \
    --output_dir ./data/msvd-data/resnet152 \
    --model resnet152 \
    --n_frame_steps 40

python prepro_vocab.py \
    --input_json data/msvd-data/annotations/MSVD_annotations.json \
    --info_json data/msvd-data/info.json \
    --caption_json data/msvd-data/caption.json \
    --word_count_threshold 2

Training a model

# For MSR-VTT dataset
CUDA_VISIBLE_DEVICES=0 python train.py \
    --epochs 1000 \
    --batch_size 300 \
    --checkpoint_path data/msr-vtt-data/save \
    --input_json data/msr-vtt-data/annotations/train_val_videodatainfo.json \
    --info_json data/msr-vtt-data/info.json \
    --caption_json data/msr-vtt-data/caption.json \
    --feats_dir data/msr-vtt-data/resnet152 \
    --model S2VTAttModel \
    --with_c3d 0 \
    --dim_vid 2048

# For MSVD dataset
CUDA_VISIBLE_DEVICES=0 python train.py \
    --epochs 1000 \
    --batch_size 300 \
    --checkpoint_path data/msvd-data/save \
    --input_json data/msvd-data/annotations/train_val_videodatainfo.json \
    --info_json data/msvd-data/info.json \
    --caption_json data/msvd-data/caption.json \
    --feats_dir data/msvd-data/resnet152 \
    --model S2VTAttModel \
    --with_c3d 0 \
    --dim_vid 2048

test

opt_info.json will be in same directory as saved model.

# For MSR-VTT dataset
CUDA_VISIBLE_DEVICES=0 python eval.py \
    --input_json data/msr-vtt-data/annotations/test_videodatainfo.json \
    --recover_opt data/msr-vtt-data/save/opt_info.json \
    --saved_model data/msr-vtt-data/save/model_xxx.pth \
    --batch_size 100

# For MSVD dataset
CUDA_VISIBLE_DEVICES=0 python eval.py \
    --input_json data/msvd-data/annotations/test_videodatainfo.json \
    --recover_opt data/msvd-data/save/opt_info.json \
    --saved_model data/msvd-data/save/model_xxx.pth \
    --batch_size 100

NOTE

This code is just a simple implementation of video captioning. And I have not verify whether the SCST training process and C3D feature are useful!

Acknowledgements

Some code refers to ImageCaptioning.pytorch

Videocaptioning.pytorch - A simple implementation of video captioning

Related tags

Overview

pytorch implementation of video captioning

requirements (my environment, other versions of pytorch and torchvision should also support this code (not been verified!))

python packages

Data

Options

Acknowledgements

Usage

(Optional) c3d features (not verified)

Steps

NOTE

Acknowledgements

Owner

Yiyu Wang

Video Swin Transformer - PyTorch

How to Train a GAN? Tips and tricks to make GANs work

render sprites into your desktop environment as shaped windows using GTK

UCSD Oasis platform

The easiest tool for extracting radiomics features and training ML models on them.

ColBERT: Contextualized Late Interaction over BERT (SIGIR'20)

Taming Transformers for High-Resolution Image Synthesis

Unsupervised CNN for Single View Depth Estimation: Geometry to the Rescue

JORLDY an open-source Reinforcement Learning (RL) framework provided by KakaoEnterprise

《LXMERT: Learning Cross-Modality Encoder Representations from Transformers》(EMNLP 2020)

A python package simulating the quasi-2D pseudospin-1/2 Gross-Pitaevskii equation with NVIDIA GPU acceleration.

CTRL-C: Camera calibration TRansformer with Line-Classification

Lane assist for ETS2, built with the ultra-fast-lane-detection model.

Unofficial implementation of "Coordinate Attention for Efficient Mobile Network Design"

Tree-based Search Graph for Approximate Nearest Neighbor Search

The code of NeurIPS 2021 paper "Scalable Rule-Based Representation Learning for Interpretable Classification".

The aim of this project is to build an AI bot that can play the Wordle game, or more generally Squabble

Code for ACL 2019 Paper: "COMET: Commonsense Transformers for Automatic Knowledge Graph Construction"

AdaNet is a lightweight TensorFlow-based framework for automatically learning high-quality models with minimal expert intervention

High-performance moving least squares material point method (MLS-MPM) solver.