project page for VinVL

Last update: Jan 09, 2023


Overview

VinVL: Revisiting Visual Representations in Vision-Language Models

Updates

02/28/2021: Project page built.

Introduction

This repository is the project page for VinVL and contains the instructions needed to reproduce the results presented in the paper. We present a detailed study of improving visual representations for vision-language (VL) tasks and develop an improved object detection model to provide object-centric representations of images. Compared to the most widely used bottom-up and top-down model (code), the new model is bigger, better designed for VL tasks, and pre-trained on much larger training corpora that combine multiple public annotated object detection datasets. It can therefore generate representations of a richer collection of visual objects and concepts. While previous VL research focuses mainly on improving the vision-language fusion model and leaves the object detection model untouched, we show that visual features matter significantly in VL models. In our experiments, we feed the visual features generated by the new object detection model into OSCAR (code), a Transformer-based VL fusion model, and use an improved approach to pre-train the VL model and fine-tune it on a wide range of downstream VL tasks. Our results show that the new visual features significantly improve performance across all VL tasks, setting new state-of-the-art results on seven public benchmarks.

Performance

Task	t2i	t2i	i2t	i2t	IC	IC	IC	IC	NoCaps	NoCaps	VQA	NLVR2	GQA
Metric	R@1	R@5	R@1	R@5	B@4	M	C	S	C	S	test-std	test-P	test-std
SoTA_S	39.2	68.0	56.6	84.5	38.9	29.2	129.8	22.4	61.5	9.2	70.92	58.80	63.17
SoTA_B	54.0	80.8	70.0	91.1	40.5	29.7	137.6	22.8	86.58	12.38	73.67	79.30	61.62
SoTA_L	57.5	82.8	73.5	92.2	41.7	30.6	140.0	24.5	-	-	74.93	81.47	-
-----	---	---	---	---	---	---	---	---	---	---	---	---	---
VinVL_B	58.1	83.2	74.6	92.6	40.9	30.9	140.6	25.1	92.46	13.07	76.12	83.08	64.65
VinVL_L	58.8	83.5	75.4	92.9	41.0	31.1	140.9	25.2	-	-	76.62	83.98	-
gain	1.3	0.7	1.9	0.6	-0.7	0.5	0.9	0.7	5.9	0.7	1.69	2.51	1.48

t2i: text-to-image retrieval; i2t: image-to-text retrieval; IC: image captioning on COCO.

Leaderboard results

VinVL has achieved the top position on several VL leaderboards, including Visual Question Answering (VQA), Microsoft COCO Image Captioning, Novel Object Captioning (nocaps), and Visual Commonsense Reasoning (VCR).

Comparison with image features from the bottom-up and top-down model (code).

We observe uniform improvements on seven VL tasks when replacing the visual features from the bottom-up and top-down model with ours. The NoCaps baseline is from VIVO, and our results are obtained by directly replacing the visual features. The baselines for the remaining tasks are from OSCAR, and our results are obtained by replacing the visual features and performing OSCAR+ pre-training. All models are BERT-Base size. As analyzed in Section 5.2 of the VinVL paper, the new visual features contribute 95% of the improvement.

Task	t2i	t2i	i2t	i2t	IC	IC	IC	IC	NoCaps	NoCaps	VQA	NLVR2	GQA
Metric	R@1	R@5	R@1	R@5	B@4	M	C	S	C	S	test-std	test-P	test-std
bottom-up and top-down model	54.0	80.8	70.0	91.1	40.5	29.7	137.6	22.8	86.58	12.38	73.16	78.07	61.62
VinVL (ours)	58.1	83.2	74.6	92.6	40.9	30.9	140.6	25.1	92.46	13.07	75.95	83.08	64.65
gain	4.1	2.4	4.6	1.5	0.4	1.2	3.0	2.3	5.9	0.7	2.79	4.71	3.03

Please see the following two figures for visual comparison.

Source code

Pretrained Faster-RCNN model and feature extraction

The pretrained X152-C4 object-attribute detection model can be downloaded here. With code from our Scene Graph Benchmark Repo (to be released soon), one can extract features with the following command:

python tools/test_sg_net.py --config-file sgg_configs/vgattr/vinvl_x152c4.yaml TEST.IMS_PER_BATCH 2 MODEL.WEIGHT models/vinvl/vinvl_vg_x152c4.pth MODEL.ROI_HEADS.NMS_FILTER 1 MODEL.ROI_HEADS.SCORE_THRESH 0.2 DATA_DIR "../maskrcnn-benchmark-1/datasets1" TEST.IGNORE_BOX_REGRESSION True MODEL.ATTRIBUTE_ON True TEST.OUTPUT_FEATURE True

The output feature will be encoded as base64.
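
As a rough illustration of what working with base64-encoded features can look like, the sketch below round-trips a small feature matrix through base64 using only the Python standard library. The float32 element type, the row-major layout, the toy feature dimension, and the helper names `encode_features` and `decode_features` are all assumptions made for this example; they are not taken from the Scene Graph Benchmark code, so check the actual output files for the real layout.

```python
import base64
from array import array


def encode_features(rows):
    """Flatten equal-length float rows and encode them as base64 text.

    Hypothetical helper: assumes 32-bit floats stored row by row.
    """
    flat = array("f", [value for row in rows for value in row])  # "f" = 32-bit float
    return base64.b64encode(flat.tobytes()).decode("ascii")


def decode_features(encoded, num_boxes):
    """Decode base64 text back into num_boxes rows of 32-bit floats.

    Hypothetical helper: the number of boxes must be known separately,
    because the raw buffer carries no shape information.
    """
    flat = array("f")
    flat.frombytes(base64.b64decode(encoded))
    if num_boxes <= 0 or len(flat) % num_boxes != 0:
        raise ValueError("buffer length is not a multiple of num_boxes")
    dim = len(flat) // num_boxes
    return [list(flat[i * dim:(i + 1) * dim]) for i in range(num_boxes)]


# Toy data: 2 boxes with 4-dimensional features (real features are much wider).
features = [[0.5, 1.0, -2.0, 3.25], [4.0, 0.0, 0.125, -1.5]]
encoded = encode_features(features)
decoded = decode_features(encoded, num_boxes=2)
print(decoded == features)
```

With NumPy installed, the decoding step is commonly written as `np.frombuffer(base64.b64decode(encoded), dtype=np.float32).reshape(num_boxes, -1)`, under the same assumptions about element type and layout.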

Find more pretrained models in DOWNLOAD.

Pre-extracted Image Features

For ease of use, we make pre-extracted features and predictions available for all pretraining datasets and downstream tasks. Please find the instructions to download them in DOWNLOAD.

Pretrained Oscar+ models and VL downstream tasks

The code to produce all vision-language results (both pretraining and downstream task finetuning) can be found in our OSCAR repo. One can find the model zoo for vision-language tasks here.

Citations

Please consider citing these papers if you use the code:

@article{li2020oscar,
  title={Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks},
  author={Li, Xiujun and Yin, Xi and Li, Chunyuan and Hu, Xiaowei and Zhang, Pengchuan and Zhang, Lei and Wang, Lijuan and Hu, Houdong and Dong, Li and Wei, Furu and Choi, Yejin and Gao, Jianfeng},
  journal={ECCV 2020},
  year={2020}
}

@article{zhang2021vinvl,
  title={VinVL: Making Visual Representations Matter in Vision-Language Models},
  author={Zhang, Pengchuan and Li, Xiujun and Hu, Xiaowei and Yang, Jianwei and Zhang, Lei and Wang, Lijuan and Choi, Yejin and Gao, Jianfeng},
  journal={CVPR 2021},
  year={2021}
}
