Code, Models and Datasets for OpenViDial Dataset

Overview

OpenViDial

This repo contains downloading instructions for the OpenViDial dataset in 《OpenViDial: A Large-Scale, Open-Domain Dialogue Dataset with Visual Contexts》 along with the code to reproduce results in the paper (See Section Baselines).

Introduction

When humans converse, what a speaker will say next significantly depends on what he sees. OpenViDial is a largescale multi-module dialogue dataset for this purpose. The dialogue turns and visual contexts are extracted from movies and TV series, where each dialogue turn is paired with the corresponding visual context in which it takes place. OpenViDial contains a total number of 1.1 million dialogue turns, and thus 1.1 million visual contexts stored in images.

The following are two short conversations where visual contexts are crucial.

Detailed statistics for OpenViDial

Attribute value
Number of turns 1.1M
Number of images 1.1M
Vocab size before BPE 70K
Vocab size after BPE 30K
Average length of each episode 14
Average length of each turn 7.6

Download the Dataset

The main folder origin_dir contains training/valid/test sets, each of which is made up by the following files:

├──origin_dir
      └── train.dialogue.jsonl // each line is an episode of dialogue, which a list of IDs.    
      └── train.origin.txt // each line corresponds to a dialogue text utterence, with the ID being its line number (staring with 0).
      └── train_images // containing images (visual contexts) in which the text utterence take place, with ID being the image filename (0,1,2, etc)
            └── 0.jpg
            └── 1.jpg
            └── ...
      └── valid.* (i.e., valid.dialogue.jsonl, valid.origin.txt, valid_images)
      └── test.*  (i.e., test.dialogue.jsonl, test.origin.txt, test_images)

If you'd like to take a glance at the a sample of the dataset instead of downloading the full dataset, we provide a data sample here

Data download:

  1. Download [train|valid|test].origin.txt and [train|valid|test].dialogue.jsonl here
  2. Download test_images (~ 20G) here
  3. Download valid_images (~ 20G) here
  4. Download train_images: Since train_images is too big (~ 170G), we split it to 11 zip files (each of which is 17G). Download seperate files zip_train here. Then download and run cat.sh here to include all files in the same directory.
  5. Move all files to origin_dir.

Models

We proposed three models for this dataset. Please refer to the paper for details:

  • Model #1 - NoVisual: use only dialog texts without visual information
  • Model #2 - CoarseVisual: use texts and a pretrained ResNet50 on ImageNet to compute 1000-d feature from each picture
  • Model #3 - FineVisual: use texts and a pretrained Faster R-CNN on Genome to compute 2048-d * K objects features from each picture

Faster R-CNN is an object detection framework. The detection sample and attention over objects during text decoding is shown below.

Requirements

  • python >= 3.6
  • pip install -r requirements.txt

Preprocess directory structure

preprocessed_data_dir is a directory that contains all the preprocessed files (text, image feature mmap, offsets, etc.) generated from origin_data_dir and we use them in training models. The directory structure is shown below.

Note: every train* file or directory should have a 'valid' and a 'test' counterpart, we ignore them below for simplicity.

├──preprocessed_data_dir
      └── train.features.mmap  // numpy mmap array file of shape [num_sents, 1000], each row is a 1000-d ResNet-50 feature
      └── train.objects.mmap  // numpy mmap array file of shape [num_sents, 20, 2048],  faster-rcnn object feature file, each row contain 20 objects feature, which is 2048-d
      └── train.objects_mask.mmap  // numpy mmap array file of shape [num_sents, 20],  faster-rcnn mask file, each row contain 20 objects mask, 1 for valid, 0 for mask
      └── train.offsets.npy  // numpy array file of shape [num_episodes], each item is the offsets of one episode
      └── train.sent_num.npy // numpy array file of shape [num_episodes], each item is the sentence number of one episode

Preprocess text data

We use Moses Tokenizer to tokenize texts and generate offsets arrays: bash ./scripts/preprocess_video_data.sh and followed with byte-pair-encoding and fairseq-preprocess binarization: bash ./scripts/preprocess_text_data.sh

Note: You need to change DATA_DIR, ORIGIN_DIR, OUTPUT_DIR to your own path

Prepare pre-computed CNN features and Faster-RCNN features

Download CNN-pooling features(Used for Model #2 - CoarseVisual)

Preprocessed ResNet50 features (*.features.mmap) (~4G) can be downloaded from here and move them under preprocessed_data_dir/

Download Faster R-CNN features(Used for Model #3 - FineVisual)

Preprocessed Faster R-CNN objects features (*objects.mmap, *objects_mask.mmap) (~160G) can be downloaded from here then move them under preprocessed_data_dir/

Since file train.objects.mmap is too large(100G+), we splitted it to many small pieces like train.objects.mmap.split*, and you need another step to merge all those files together: cat * train.objects.mmap.split* >train.objects.mmap

(Optional) Extract features on your own

If you want to extract some feature on your own, or you'd like to know details of extracting visual features, see video_dialogue_model/extract_features/extract_features.md

Train and Evaluate Model #1 - NoVisual

bash scripts/reproduce_baselines/text_only.sh will train and evaluate NoVisual, Remember to change MODEL_DIR and DATA_DIR for your setup

Train and Evaluate Model #2 - CoarseVisual

bash scripts/reproduce_baselines/text_and_img_feature.sh will train and evaluate CoarseVisual. Remember to change MODEL_DIR and DATA_DIR for your setup

Train and Evaluate Model #3 - FineVisual

bash scripts/reproduce_baselines/text_and_img_objects.sh will train and evaluate FineVisual, Remember to change MODEL_DIR and DATA_DIR for your setup

Other Statistics

  • get length/diversity/stopwords% statistics of system output: train/stats.py

Model benchmark

Model BLEU-1 BLEU-2 BLEU-4 Stopword% Dis-1 Dis-2 Dis-3 Dis-4
1-NV 14.01 3.98 1.07 58.1% 0.0091 0.0355 0.0682 0.1018
2-CV 14.58 4.35 1.14 54.2% 0.0108 0.0448 0.0915 0.1465
3-FV 15.61 4.71 1.22 52.9% 0.0118 0.0502 0.1082 0.1778
NitroFE is a Python feature engineering engine which provides a variety of modules designed to internally save past dependent values for providing continuous calculation.

NitroFE is a Python feature engineering engine which provides a variety of modules designed to internally save past dependent values for providing continuous calculation.

100 Sep 28, 2022
This is a JAX implementation of Neural Radiance Fields for learning purposes.

learn-nerf This is a JAX implementation of Neural Radiance Fields for learning purposes. I've been curious about NeRF and its follow-up work for a whi

Alex Nichol 62 Dec 20, 2022
Implementing DropPath/StochasticDepth in PyTorch

%load_ext memory_profiler Implementing Stochastic Depth/Drop Path In PyTorch DropPath is available on glasses my computer vision library! Introduction

Francesco Saverio Zuppichini 13 Jan 05, 2023
An Easy-to-use, Modular and Prolongable package of deep-learning based Named Entity Recognition Models.

DeepNER An Easy-to-use, Modular and Prolongable package of deep-learning based Named Entity Recognition Models. This repository contains complex Deep

Derrick 9 May 30, 2022
Spatial-Temporal Transformer for Dynamic Scene Graph Generation, ICCV2021

Spatial-Temporal Transformer for Dynamic Scene Graph Generation Pytorch Implementation of our paper Spatial-Temporal Transformer for Dynamic Scene Gra

Yuren Cong 119 Jan 01, 2023
Code and description for my BSc Project, September 2021

BSc-Project Disclaimer: This repo consists of only the additional python scripts necessary to run the agent. To run the project on your own personal d

Matin Tavakoli 20 Jul 19, 2022
Visual Question Answering in Pytorch

Visual Question Answering in pytorch /!\ New version of pytorch for VQA available here: https://github.com/Cadene/block.bootstrap.pytorch This repo wa

Remi 672 Jan 01, 2023
Implementation of ETSformer, state of the art time-series Transformer, in Pytorch

ETSformer - Pytorch Implementation of ETSformer, state of the art time-series Transformer, in Pytorch Install $ pip install etsformer-pytorch Usage im

Phil Wang 121 Dec 30, 2022
Image Captioning using CNN ,LSTM and Attention

Image Captioning using CNN ,LSTM and Attention This is a deeplearning model which tries to summarize an image into a text . Installation Install this

ASUTOSH GHANTO 1 Dec 16, 2021
Scalable Optical Flow-based Image Montaging and Alignment

SOFIMA SOFIMA (Scalable Optical Flow-based Image Montaging and Alignment) is a tool for stitching, aligning and warping large 2d, 3d and 4d microscopy

Google Research 16 Dec 21, 2022
Implementation of Uformer, Attention-based Unet, in Pytorch

Uformer - Pytorch Implementation of Uformer, Attention-based Unet, in Pytorch. It will only offer the concat-cross-skip connection. This repository wi

Phil Wang 72 Dec 19, 2022
Code for ACL2021 paper Consistency Regularization for Cross-Lingual Fine-Tuning.

xTune Code for ACL2021 paper Consistency Regularization for Cross-Lingual Fine-Tuning. Environment DockerFile: dancingsoul/pytorch:xTune Install the f

Bo Zheng 42 Dec 09, 2022
Tello Drone Trajectory Tracking

With this library you can track the trajectory of your tello drone or swarm of drones in real time.

Kamran Asgarov 2 Oct 12, 2022
Official Implementation of "Tracking Grow-Finish Pigs Across Large Pens Using Multiple Cameras"

Multi Camera Pig Tracking Official Implementation of Tracking Grow-Finish Pigs Across Large Pens Using Multiple Cameras CVPR2021 CV4Animals Workshop P

44 Jan 06, 2023
Autonomous Ground Vehicle Navigation and Control Simulation Examples in Python

Autonomous Ground Vehicle Navigation and Control Simulation Examples in Python THIS PROJECT IS CURRENTLY A WORK IN PROGRESS AND THUS THIS REPOSITORY I

Joshua Marshall 14 Dec 31, 2022
Python-kafka-reset-consumergroup-offset-example - Python Kafka reset consumergroup offset example

Python Kafka reset consumergroup offset example This is a simple example of how

Willi Carlsen 1 Feb 16, 2022
Weakly- and Semi-Supervised Panoptic Segmentation (ECCV18)

Weakly- and Semi-Supervised Panoptic Segmentation by Qizhu Li*, Anurag Arnab*, Philip H.S. Torr This repository demonstrates the weakly supervised gro

Qizhu Li 159 Dec 20, 2022
PyTorch implementation of SMODICE: Versatile Offline Imitation Learning via State Occupancy Matching

SMODICE: Versatile Offline Imitation Learning via State Occupancy Matching This is the official PyTorch implementation of SMODICE: Versatile Offline I

Jason Ma 14 Aug 30, 2022
A Text Attention Network for Spatial Deformation Robust Scene Text Image Super-resolution (CVPR2022)

A Text Attention Network for Spatial Deformation Robust Scene Text Image Super-resolution (CVPR2022) https://arxiv.org/abs/2203.09388 Jianqi Ma, Zheto

MA Jianqi, shiki 104 Jan 05, 2023
This repository contains the official code of the paper Equivariant Subgraph Aggregation Networks (ICLR 2022)

Equivariant Subgraph Aggregation Networks (ESAN) This repository contains the official code of the paper Equivariant Subgraph Aggregation Networks (IC

Beatrice Bevilacqua 59 Dec 13, 2022