Negative Sample is Negative in Its Own Way: Tailoring Negative Sentences for Image-Text Retrieval

Last update: Nov 07, 2022

NSGDC

Some code in this repo is copied or modified from open-source implementations made available by UNITER, PyTorch, HuggingFace, OpenNMT, and Nvidia. The image features are extracted using BUTD.

Requirements

The requirements follow UNITER. We provide a Docker image for easier reproduction. Please install the following:

Our scripts require the user to be a member of the docker group so that docker commands can be run without sudo. We only support Linux with NVIDIA GPUs. We test on Ubuntu 18.04 and V100 cards. We use mixed-precision training, so GPUs with Tensor Cores are recommended.
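The prerequisites described in this section can be checked before launching anything. The following is a minimal sketch, not part of this repository: the helper names `have_cmd`, `in_docker_group`, and `preflight` are hypothetical, and the script only checks that `docker` and `nvidia-smi` are on the PATH and that the current user belongs to the `docker` group. It does not verify driver or Docker versions.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight helpers (not part of this repository).

# Succeeds if command $1 is available on PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# Succeeds if the space-separated group list in $1 contains "docker".
in_docker_group() {
    local g
    for g in $1; do
        if [ "$g" = "docker" ]; then
            return 0
        fi
    done
    return 1
}

# Print one line per missing prerequisite; print nothing if all checks pass.
preflight() {
    have_cmd docker || echo "docker not found on PATH"
    have_cmd nvidia-smi || echo "nvidia-smi not found (is the NVIDIA driver installed?)"
    in_docker_group "$(id -nG)" || echo "current user is not in the docker group"
    return 0
}

preflight
```

If the last check fails, the user can typically be added to the group with `sudo usermod -aG docker $USER`, followed by logging out and back in.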

Image-Text Retrieval

Download Data

bash scripts/download_itm.sh $PATH_TO_STORAGE

Launch the Docker Container

# docker image should be automatically pulled
source launch_container.sh $PATH_TO_STORAGE/txt_db $PATH_TO_STORAGE/img_db \
$PATH_TO_STORAGE/finetune $PATH_TO_STORAGE/pretrained

In case you would like to reproduce the whole preprocessing pipeline.

The launch script respects the $CUDA_VISIBLE_DEVICES environment variable. Note that the source code is mounted into the container under /src instead of being built into the image, so that user modifications are reflected without rebuilding the image. (Data folders are mounted into the container separately for flexibility in folder structure.)
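As an illustration of the `$CUDA_VISIBLE_DEVICES` behaviour, the sketch below restricts the container to a chosen set of GPUs before launching. The wrapper name `launch_on_gpus` is hypothetical and not part of this repository. It assumes the variable holds a comma-separated list of GPU indices (the usual CUDA convention), that `$PATH_TO_STORAGE` is already set, and that the command is run from the repository root.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper (not part of this repository): expose only the GPUs
# listed in $1 (e.g. "0,1"), then run the launch script shown above.
launch_on_gpus() {
    export CUDA_VISIBLE_DEVICES="$1"
    echo "Using GPUs: $CUDA_VISIBLE_DEVICES"
    if [ -f launch_container.sh ]; then
        # Same arguments as the launch command in this section.
        source launch_container.sh \
            "$PATH_TO_STORAGE/txt_db" "$PATH_TO_STORAGE/img_db" \
            "$PATH_TO_STORAGE/finetune" "$PATH_TO_STORAGE/pretrained"
    else
        echo "launch_container.sh not found; run this from the repository root"
    fi
}
```

For example, `launch_on_gpus 0,1` would start the container with only the first two GPUs visible, under the assumptions stated above.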

Image-Text Retrieval (Flickr30k)

# Train with the base setting
bash run_cmds/tran_pnsgd_base_flickr.sh
bash run_cmds/tran_pnsgd2_base_flickr.sh

# Train with the large setting
bash run_cmds/tran_pnsgd_large_flickr.sh
bash run_cmds/tran_pnsgd2_large_flickr.sh

Image-Text Retrieval (COCO)

# Train with the base setting
bash run_cmds/tran_pnsgd_base_coco.sh
bash run_cmds/tran_pnsgd2_base_coco.sh

# Train with the large setting
bash run_cmds/tran_pnsgd_large_coco.sh
bash run_cmds/tran_pnsgd2_large_coco.sh

Run Inference

bash run_cmds/inf_nsgd.sh

Results

Our models achieve the following performance.

MS-COCO

Model	Image-to-Text			Text-to-Image
Model	R@1	R@5	R@10	R@1	R@5	R@10
NSGDC-Base	66.6	88.6	94.0	51.6	79.1	87.5
NSGDC-Large	67.8	89.6	94.2	53.3	80.0	88.0

Flickr30K

Model	Image-to-Text			Text-to-Image
Model	R@1	R@5	R@10	R@1	R@5	R@10
NSGDC-Base	87.9	98.1	99.3	74.5	93.3	96.3
NSGDC-Large	90.6	98.8	99.1	77.3	94.3	97.3

Owner

Zhihao Fan
