An efficient PyTorch implementation of the winning entry of the 2017 VQA Challenge.

Overview

Bottom-Up and Top-Down Attention for Visual Question Answering

An efficient PyTorch implementation of the winning entry of the 2017 VQA Challenge.

The implementation follows the VQA system described in "Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering" (https://arxiv.org/abs/1707.07998) and "Tips and Tricks for Visual Question Answering: Learnings from the 2017 Challenge" (https://arxiv.org/abs/1708.02711).

Results

Model Validation Accuracy Training Time
Reported Model 63.15 12 - 18 hours (Tesla K40)
Implemented Model 63.58 40 - 50 minutes (Titan Xp)

The accuracy was calculated using the VQA evaluation metric.

About

This is part of a project done at CMU for the course 11-777 Advanced Multimodal Machine Learning and a joint work between Hengyuan Hu, Alex Xiao, and Henry Huang.

As part of our project, we implemented bottom up attention as a strong VQA baseline. We were planning to integrate object detection with VQA and were very glad to see that Peter Anderson and Damien Teney et al. had already done that beautifully. We hope this clean and efficient implementation can serve as a useful baseline for future VQA explorations.

Implementation Details

Our implementation follows the overall structure of the papers but with the following simplifications:

  1. We don't use extra data from Visual Genome.
  2. We use only a fixed number of objects per image (K=36).
  3. We use a simple, single stream classifier without pre-training.
  4. We use the simple ReLU activation instead of gated tanh.

The first two points greatly reduce the training time. Our implementation takes around 200 seconds per epoch on a single Titan Xp while the one described in the paper takes 1 hour per epoch.

The third point is simply because we feel the two stream classifier and pre-training in the original paper is over-complicated and not necessary.

For the non-linear activation unit, we tried gated tanh but couldn't make it work. We also tried gated linear unit (GLU) and it works better than ReLU. Eventually we choose ReLU due to its simplicity and since the gain from using GLU is too small to justify the fact that GLU doubles the number of parameters.

With these simplifications we would expect the performance to drop. For reference, the best result on validation set reported in the paper is 63.15. The reported result without extra data from visual genome is 62.48, the result using only 36 objects per image is 62.82, the result using two steam classifier but not pre-trained is 62.28 and the result using ReLU is 61.63. These numbers are cited from the Table 1 of the paper: "Tips and Tricks for Visual Question Answering: Learnings from the 2017 Challenge". With all the above simplification aggregated, our first implementation got around 59-60 on validation set.

To shrink the gap, we added some simple but powerful modifications. Including:

  1. Add dropout to alleviate overfitting
  2. Double the number of neurons
  3. Add weight normalization (BN seems not work well here)
  4. Switch to Adamax optimizer
  5. Gradient clipping

These small modifications bring the number back to ~62.80. We further change the concatenation based attention module in the original paper to a projection based module. This new attention module is inspired by the paper "Modeling Relationships in Referential Expressions with Compositional Modular Networks" (https://arxiv.org/pdf/1611.09978.pdf), but with some modifications (implemented in attention.NewAttention). With the help of this new attention, we boost the performance to ~63.58, surpassing the reported best result with no extra data and less computation cost.

Usage

Prerequisites

Make sure you are on a machine with a NVIDIA GPU and Python 2 with about 70 GB disk space.

  1. Install PyTorch v0.3 with CUDA and Python 2.7.
  2. Install h5py.

Data Setup

All data should be downloaded to a 'data/' directory in the root directory of this repository.

The easiest way to download the data is to run the provided script tools/download.sh from the repository root. The features are provided by and downloaded from the original authors' repo. If the script does not work, it should be easy to examine the script and modify the steps outlined in it according to your needs. Then run tools/process.sh from the repository root to process the data to the correct format.

Training

Simply run python main.py to start training. The training and validation scores will be printed every epoch, and the best model will be saved under the directory "saved_models". The default flags should give you the result provided in the table above.

Owner
Hengyuan Hu
Hengyuan Hu
ONNX Command-Line Toolbox

ONNX Command Line Toolbox Aims to improve your experience of investigating ONNX models. Use it like onnx infershape /path/to/model.onnx. (See the usag

黎明灰烬 (王振华 Zhenhua WANG) 23 Nov 13, 2022
A Closer Look at Reference Learning for Fourier Phase Retrieval

A Closer Look at Reference Learning for Fourier Phase Retrieval This repository contains code for our NeurIPS 2021 Workshop on Deep Learning and Inver

Tobias Uelwer 1 Oct 28, 2021
UnpNet - Rethinking 3-D LiDAR Point Cloud Segmentation(IEEE TNNLS)

UnpNet Citation Please cite the following paper if you use this repository in your reseach. @article {PMID:34914599, Title = {Rethinking 3-D LiDAR Po

Shijie Li 4 Jul 15, 2022
CCCL: Contrastive Cascade Graph Learning.

CCGL: Contrastive Cascade Graph Learning This repo provides a reference implementation of Contrastive Cascade Graph Learning (CCGL) framework as descr

Xovee Xu 19 Dec 05, 2022
Mengzi Pretrained Models

中文 | English Mengzi 尽管预训练语言模型在 NLP 的各个领域里得到了广泛的应用,但是其高昂的时间和算力成本依然是一个亟需解决的问题。这要求我们在一定的算力约束下,研发出各项指标更优的模型。 我们的目标不是追求更大的模型规模,而是轻量级但更强大,同时对部署和工业落地更友好的模型。

Langboat 424 Jan 04, 2023
Contains code for Deep Kernelized Dense Geometric Matching

DKM - Deep Kernelized Dense Geometric Matching Contains code for Deep Kernelized Dense Geometric Matching We provide pretrained models and code for ev

Johan Edstedt 83 Dec 23, 2022
Ranking Models in Unlabeled New Environments (iccv21)

Ranking Models in Unlabeled New Environments Prerequisites This code uses the following libraries Python 3.7 NumPy PyTorch 1.7.0 + torchivision 0.8.1

14 Dec 17, 2021
Multi-robot collaborative exploration and mapping through Voronoi partition and DRL in unknown environment

Voronoi Multi_Robot Collaborate Exploration Introduction In the unknown environment, the cooperative exploration of multiple robots is completed by Vo

PeaceWord 6 Nov 22, 2022
minimizer-space de Bruijn graphs (mdBG) for whole genome assembly

rust-mdbg: Minimizer-space de Bruijn graphs (mdBG) for whole-genome assembly rust-mdbg is an ultra-fast minimizer-space de Bruijn graph (mdBG) impleme

Barış Ekim 148 Dec 01, 2022
Julia package for multiway (inverse) covariance estimation.

TensorGraphicalModels TensorGraphicalModels.jl is a suite of Julia tools for estimating high-dimensional multiway (tensor-variate) covariance and inve

Wayne Wang 3 Sep 23, 2022
DeLighT: Very Deep and Light-Weight Transformers

DeLighT: Very Deep and Light-weight Transformers This repository contains the source code of our work on building efficient sequence models: DeFINE (I

Sachin Mehta 440 Dec 18, 2022
General Assembly Capstone: NBA Game Predictor

Project 6: Predicting NBA Games Problem Statement Can I predict the results of NBA games from the back-half of a season from the opening half of the s

Adam Muhammad Klesc 1 Jan 14, 2022
Code for MarioNette: Self-Supervised Sprite Learning, in NeurIPS 2021

MarioNette | Webpage | Paper | Video MarioNette: Self-Supervised Sprite Learning Dmitriy Smirnov, Michaël Gharbi, Matthew Fisher, Vitor Guizilini, Ale

Dima Smirnov 28 Nov 18, 2022
Point detection through multi-instance deep heatmap regression for sutures in endoscopy

Suture detection PyTorch This repo contains the reference implementation of suture detection model in PyTorch for the paper Point detection through mu

artificial intelligence in the area of cardiovascular healthcare 3 Jul 16, 2022
Generative Query Network (GQN) in PyTorch as described in "Neural Scene Representation and Rendering"

Update 2019/06/24: A model trained on 10% of the Shepard-Metzler dataset has been added, the following notebook explains the main features of this mod

Jesper Wohlert 313 Dec 27, 2022
The code for our paper "NSP-BERT: A Prompt-based Zero-Shot Learner Through an Original Pre-training Task —— Next Sentence Prediction"

The code for our paper "NSP-BERT: A Prompt-based Zero-Shot Learner Through an Original Pre-training Task —— Next Sentence Prediction"

Sun Yi 201 Nov 21, 2022
Open CV - Convert a picture to look like a cartoon sketch in python

Use the video https://www.youtube.com/watch?v=k7cVPGpnels for initial learning.

Sammith S Bharadwaj 3 Jan 29, 2022
1st place solution to the Satellite Image Change Detection Challenge hosted by SenseTime

1st place solution to the Satellite Image Change Detection Challenge hosted by SenseTime

Lihe Yang 209 Jan 01, 2023
PyTorch implementaton of our CVPR 2021 paper "Bridging the Visual Gap: Wide-Range Image Blending"

Bridging the Visual Gap: Wide-Range Image Blending PyTorch implementaton of our CVPR 2021 paper "Bridging the Visual Gap: Wide-Range Image Blending".

Chia-Ni Lu 69 Dec 20, 2022
🎃 Core identification module of AI powerful point reading system platform.

ppReader-Kernel Intro Core identification module of AI powerful point reading system platform. Usage 硬件: Windows10、GPU:nvdia GTX 1060 、普通RBG相机 软件: con

CrashKing 1 Jan 11, 2022