Visual dialog agents with pre-trained vision-and-language encoders.

Overview

Learning Better Visual Dialog Agents with Pretrained Visual-Linguistic Representation

Also known as READ-UP: Referring Expression Agent Dialog with Unified Pretraining.

This repo includes the training and testing code for our paper Learning Better Visual Dialog Agents with Pretrained Visual-Linguistic Representation, accepted at CVPR 2021.

Please cite the following paper if you use the code in this repository:

@inproceedings{tu2021learning,
  title={Learning Better Visual Dialog Agents with Pretrained Visual-Linguistic Representation},
  author={Tu, Tao and Ping, Qing and Thattai, Govindarajan and Tur, Gokhan and Natarajan, Prem},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={5622--5631},
  year={2021}
}

Repository Setup

Environment

The following environment is recommended:

Instance storage: > 800 GB
PyTorch 1.4.0
CUDA 10.0

Set up virtual environment and install pytorch:

$ conda create -n read_up python=3.6
$ conda activate read_up
$ git clone https://github.com/amazon-research/read-up.git

# [IMPORTANT] pytorch 1.4.0 has no issues with parallel training
$ conda install pytorch==1.4.0 torchvision==0.5.0 cudatoolkit=10.0 -c pytorch
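
As a quick, optional sanity check (a sketch, not part of the original setup), verify that PyTorch sees the GPUs:

import torch

print(torch.__version__)          # expect 1.4.0
print(torch.cuda.is_available())  # expect True with CUDA 10.0 drivers
print(torch.cuda.device_count())  # GPUs visible for parallel training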

Install dependencies:

# Install general dependencies
$ sudo apt-get install build-essential libcap-dev
$ pip install -r requirement.txt

# Install vqa-maskrcnn-benchmark (for feature extraction only)
$ git clone https://gitlab.com/vedanuj/vqa-maskrcnn-benchmark.git
$ cd vqa-maskrcnn-benchmark
$ python setup.py build develop

Install Apex (used for both Faster-RCNN feature extraction and distributed training):

$ git clone https://github.com/NVIDIA/apex
$ cd apex
$ pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
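
If the build succeeded, a minimal import check should pass (a sketch; it only confirms the package loads, not that the CUDA kernels run):

# Run inside the read_up environment after building Apex.
from apex import amp

print("apex.amp imported:", amp is not None)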

Dataset

Meta-data

Download the GuessWhat?! dataset:

$ wget https://florian-strub.com/guesswhat.train.jsonl.gz -P data/
$ wget https://florian-strub.com/guesswhat.valid.jsonl.gz -P data/
$ wget https://florian-strub.com/guesswhat.test.jsonl.gz -P data/
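
To confirm the downloads, you can peek at the first training game (a sketch; each file is gzipped JSON-lines with one game per line):

import gzip
import json

# Each line of guesswhat.train.jsonl.gz is one JSON-encoded game.
with gzip.open("data/guesswhat.train.jsonl.gz", "rt") as f:
    game = json.loads(f.readline())
print(sorted(game.keys()))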

Prepare dict.json:

  1. Set up the repo as instructed in https://github.com/GuessWhatGame/guesswhat
  2. Generate the dict.json file:
$ python src/guesswhat/preprocess_data/create_dictionary.py -data_dir data -dict_file dict.json -min_occ 3
  3. Copy the dict.json file to the read-up repo:
$ mkdir read-up/tf-pretrained-model
$ cp guesswhat/data/dict.json read-up/tf-pretrained-model/
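
A quick check that dict.json landed in the right place (a sketch; the exact key layout depends on the GuessWhat?! preprocessing script, so inspect the file if it differs):

import json

with open("read-up/tf-pretrained-model/dict.json") as f:
    vocab = json.load(f)
print(type(vocab), len(vocab))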

Dataset for Oracle models

1. Dataset for baseline Oracle + Faster-RCNN visual features.

Under vqa-maskrcnn-benchmark/data/, download the RCNN model and the COCO images:

# download RCNN model
$ wget https://dl.fbaipublicfiles.com/vilbert-multi-task/detectron_model.pth
$ wget https://dl.fbaipublicfiles.com/vilbert-multi-task/detectron_config.yaml

# download COCO data  
$ wget http://images.cocodataset.org/zips/train2014.zip
$ wget http://images.cocodataset.org/zips/val2014.zip
$ wget http://images.cocodataset.org/zips/test2014.zip
$ unzip -j train2014.zip
$ unzip -j val2014.zip
$ unzip -j test2014.zip

Copy guesswhat.train/valid/test.jsonl to vqa-maskrcnn-benchmark/data/. Place the unzipped COCO images in a folder image_dir/COCO_2014/images/, then prepare an npy file for the feature-extraction step below:

$ python bin/prepare_extract_gt_features_gw.py \
    --src vqa-maskrcnn-benchmark/data/guesswhat.train.jsonl \
    --img-dir vqa-maskrcnn-benchmark/image_dir/COCO_2014/images/ \
    --out vqa-maskrcnn-benchmark/image_dir/COCO_2014/npy_files/guesswhat.train.npy

Repeat the same process for val and test. The generated file looks like the following:

[
    {
        'file_name': 'name_of_image_file',
        'file_path': '<path_to_image_file_on_your_disk>',
        'bbox': array([
                        [ x1, y1, width1, height1],
                        [ x2, y2, width2, height2],
                        ...
                    ]),
        'num_box': 2
    },
    ...
]
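
The file can be inspected with NumPy (a sketch; allow_pickle is required because the array stores Python dicts rather than raw numbers):

import numpy as np

entries = np.load(
    "vqa-maskrcnn-benchmark/image_dir/COCO_2014/npy_files/guesswhat.train.npy",
    allow_pickle=True,
)
print(len(entries))
print(entries[0]["file_name"], entries[0]["num_box"])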

Extract features from the ground-truth bounding boxes generated above:

$ python bin/extract_features_from_gt.py \
    --model_file vqa-maskrcnn-benchmark/data/detectron_model.pth \
    --config_file vqa-maskrcnn-benchmark/data/detectron_config.yaml \
    --imdb_gt_file vqa-maskrcnn-benchmark/image_dir/COCO_2014/npy_files/guesswhat.train.npy \
    --output_folder data/rcnn/from_gt_gw_xyxy_scale/train

Repeat this process for val and test data.
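
The exact output layout depends on the script; assuming it follows the common vqa-maskrcnn-benchmark convention of one .npy dict per image (an assumption, and the file name below is a placeholder; check extract_features_from_gt.py for the real keys), the features can be inspected like this:

import numpy as np

# Placeholder file name; use any file the extraction script actually wrote.
info = np.load(
    "data/rcnn/from_gt_gw_xyxy_scale/train/COCO_train2014_000000000009.npy",
    allow_pickle=True,
).item()
print(sorted(info.keys()))     # assumed keys, e.g. 'features', 'bbox'
print(info["features"].shape)  # typically (num_boxes, 2048)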

2. Dataset for our Oracle model.

Download a pretrained VilBERT model (the vanilla and 12-in-1 variants perform similarly in our experiments):

# download vanilla pretrained model
$ mkdir -p vilbert-pretrained-model && cd vilbert-pretrained-model
$ wget https://dl.fbaipublicfiles.com/vilbert-multi-task/pretrained_model.bin

# download 12-in-1 pretrained model
$ wget https://dl.fbaipublicfiles.com/vilbert-multi-task/multi_task_model.bin

Download the features for COCO:

$ mkdir -p COCO_trainval_resnext152_faster_rcnn_genome.lmdb COCO_test_resnext152_faster_rcnn_genome.lmdb
$ wget https://dl.fbaipublicfiles.com/vilbert-multi-task/datasets/coco/features_100/COCO_trainval_resnext152_faster_rcnn_genome.lmdb/data.mdb && mv data.mdb COCO_trainval_resnext152_faster_rcnn_genome.lmdb/
$ wget https://dl.fbaipublicfiles.com/vilbert-multi-task/datasets/coco/features_100/COCO_test_resnext152_faster_rcnn_genome.lmdb/data.mdb && mv data.mdb COCO_test_resnext152_faster_rcnn_genome.lmdb/
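
To verify the lmdb files, a minimal read-only scan with the lmdb Python package (a sketch, not part of the original instructions) reports the number of stored records:

import lmdb

# Open read-only; lmdb memory-maps the file instead of loading it.
env = lmdb.open(
    "COCO_trainval_resnext152_faster_rcnn_genome.lmdb",
    readonly=True, lock=False, subdir=True,
)
with env.begin() as txn:
    print("entries:", txn.stat()["entries"])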

Dataset for Q-Gen models

1. Dataset for baseline Q-Gen model [1]

$ wget www.florian-strub.com/github/ft_vgg_img.zip
$ unzip ft_vgg_img.zip -d img/

2. Dataset for VDST Q-Gen model [2]

$ python bin/extract_features.py \
    --model_file vqa-maskrcnn-benchmark/data/detectron_model.pth \
    --config_file vqa-maskrcnn-benchmark/data/detectron_config.yaml \
    --image_dir vqa-maskrcnn-benchmark/image_dir/COCO_2014/images/ \
    --output_folder data/rcnn/from_rcnn/ \
    --batch_size 8

Dataset for Guesser models

1. Dataset for baseline Guesser model [1]

$ cd data/vilbert-multi-task
$ wget https://dl.fbaipublicfiles.com/vilbert-multi-task/datasets.tar.gz
$ tar -I pigz -xvf datasets.tar.gz datasets/guesswhat/

Model Training & Evaluation

Oracle

To train our Oracle model:

$ python -m torch.distributed.launch \
    --nproc_per_node=4 \
    --nnodes=1 \
    --node_rank=0 \
    main.py \
    --command train-oracle-vilbert \
    --config config_files/oracle_vilbert.yaml \
    --n-jobs 8

To evaluate our Oracle model:

$ python main.py \
    --command test-oracle-vilbert \
    --config config_files/oracle_vilbert.yaml \
    --load ckpt/oracle_vilbert-sd0/epoch-3.pth

This repo also implements other Oracle models:

  • Baseline Oracle model [1]
  • Baseline Oracle model + Faster-RCNN visual features (our ablation model)

To train and evaluate these models, run main.py with the corresponding config file and command.

Guesser

To train our Guesser model:

$ python main.py \
    --command train-guesser-vilbert \
    --config config_files/guesser_vilbert.yaml \
    --n-jobs 8

To evaluate our Guesser model:

$ python main.py \
    --command test-guesser-vilbert \
    --config config_files/guesser_vilbert.yaml \
    --n-jobs 8 \
    --load ckpt/guesser_vilbert-sd0/best.pth

This repo also implements other Guesser models:

  • Baseline Guesser model [1]

To train and evaluate this model, run main.py with the corresponding config file and command.

Q-Gen

To train our Q-Gen model:

# Distributed training
$ python -m torch.distributed.launch \
    --nproc_per_node=4 \
    --nnodes=1 \
    --node_rank=0 \
    main.py \
    --command train-qgen-vilbert \
    --config config_files/qgen_vilbert.yaml \
    --n-jobs 8 

# Non-distributed training
$ python main.py \
    --command train-qgen-vilbert \
    --config config_files/qgen_vilbert.yaml \
    --n-jobs 8 

To evaluate our Q-Gen model:

$ python main.py \
    --command test-self-play-all-vilbert \
    --config config_files/self_play_all_vilbert.yaml \
    --n-jobs 8

This repo also implements other Q-Gen models:

  • Baseline Q-Gen model [1]
  • VDST Q-Gen model [2]

To train and evaluate these models, run main.py with the corresponding config file and command.

References

[1] Strub, F., De Vries, H., Mary, J., Piot, B., Courville, A., & Pietquin, O. (2017, August). End-to-end optimization of goal-driven and visually grounded dialogue systems. In Proceedings of the 26th International Joint Conference on Artificial Intelligence (pp. 2765-2771).

[2] Pang, W., & Wang, X. (2020, April). Visual dialogue state tracking for question generation. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 34, No. 07, pp. 11831-11838).
