Codes and scripts for "Explainable Semantic Space by Grounding Languageto Vision with Cross-Modal Contrastive Learning"

Related tags

Deep LearningVG-Bert
Overview

Visually Grounded Bert Language Model

This repository is the official implementation of Explainable Semantic Space by Grounding Language to Vision with Cross-Modal Contrastive Learning.

To cite this work: Zhang, Y., Choi, M., Han, K., & Liu, Z. Explainable Semantic Space by Grounding Language toVision with Cross-Modal Contrastive Learning. (accepted by Neurips 2021).

Abstract

In natural language processing, most models try to learn semantic representa- tions merely from texts. The learned representations encode the “distributional semantics” but fail to connect to any knowledge about the physical world. In contrast, humans learn language by grounding concepts in perception and action and the brain encodes “grounded semantics” for cognition. Inspired by this notion and recent work in vision-language learning, we design a two-stream model for grounding language learning in vision. The model includes a VGG-based visual stream and a Bert-based language stream. The two streams merge into a joint representational space. Through cross-modal contrastive learning, the model first learns to align visual and language representations with the MS COCO dataset. The model further learns to retrieve visual objects with language queries through a cross-modal attention module and to infer the visual relations between the retrieved objects through a bilinear operator with the Visual Genome dataset. After training, the model’s language stream is a stand-alone language model capable of embedding concepts in a visually grounded semantic space. This semantic space manifests principal dimensions explainable with human intuition and neurobiological knowl- edge. Word embeddings in this semantic space are predictive of human-defined norms of semantic features and are segregated into perceptually distinctive clusters. Furthermore, the visually grounded language model also enables compositional language understanding based on visual knowledge and multimodal image search with queries based on images, texts, or their combinations.

Requirements

The model was trained with Python version: 3.7.4.

numpy==1.17.2, scipy==1.3.1, torch==1.7.1, torchvision==0.8.2, transformers==4.2.1, Pillow==6.2.0, tokenizers==0.9.4

Training

To train the model(s) in the paper, run the following commands:

stage-1: visual stream pretraining:

python visual_stream_pretraining.py \
-a vgg16_attention \
--pretrained \
--batch-size 200 \
--use-position learn \
--lr 0.01 \
/directory-to-ImageNet-dataset/ImageNet2012/

stage-2: two-stream grounding on MS COCO dataset with corss-modal contrastive loss:

python run.py \
--stage two_stream_pretraining \
--data-train /directory-to-COCO-dataset/COCO_train2017.json \
--data-val /directory-to-COCO-dataset/COCO_val2017.json \
--optim adam \
--learning-rate 5e-5 \
--batch-size 180 \
--n_epochs 100 \
--pretrained-vgg \
--image-model VGG16_Attention \
--use-position learn \
--language-model Bert_base \
--embedding-dim 768 \
--sigma 0.1 \
--dropout-rate 0.3 \
--base-model-learnable-layers 8 \
--load-pretrained /directory-to-pretrain-image-model/ \
--exp-dir /output-directory/two-stream-pretraining/

stage-3: visual relational grounding on Visual Genome dataset:

python run.py \
--stage relational_grounding \
--data-train /directory-to-Visual-Genome-dataset/VG_train_dataset_finalized.json \
--data-val /directory-to-Visual-Genome-dataset/VG_val_dataset_finalized.json \
--optim adam \
--learning-rate 1e-5 \
--batch-size 180 \
--n_epochs 150 \
--pretrained-vgg \
--image-model VGG16_Attention \
--use-position learn \
--language-model Bert_object \
--num-heads 8 \
--embedding-dim 768 \
--subspace-dim 32 \
--relation-num 115 \
--temperature 1 \
--dropout-rate 0.1 \
--base-model-learnable-layers 2 \
--load-pretrained /directory-to-pretrain-two-stream-model/ \
--exp-dir /output-directory/relational-grounding/

Transfer learning for cross-modal image search:

python transfer_cross_modal_retrieval.py \
--data-train /directory-to-COCO-dataset/COCO_train2017.json \
--data-val /directory-to-COCO-dataset/COCO_val2017.json \
--optim adam \
--learning-rate 5e-5 \
--batch-size 300 \
--n_epochs 150 \
--pretrained-vgg \
--image-model VGG16_Attention \
--use-position absolute_learn \
--language-model Bert_object \
--num-heads 8 \
--embedding-dim 768 \
--subspace-dim 32 \
--relation-num 115 \
--sigma 0.1 \
--dropout-rate 0.1 \
--load-pretrained /directory-to-pretrain-two-stream-model/ \
--exp-dir /output-directory/cross-modal-retrieval/

Evaluation

We include the jupyter notebook scripts for running evaluation tasks in our paper. See README in evaluation/.

Rank1 Conversation Emotion Detection Task

Rank1-Conversation_Emotion_Detection_Task accuracy macro-f1 recall 0.826 0.7544 0.719 基于预训练模型和时序预测模型的对话情感探测任务 1 摘要 针对对话情感探测任务,本文将其分为文本分类和时间序列预测两个子任务,分

Yuchen Han 2 Nov 28, 2021
Notification Triggers for Python

Notipyer Notification triggers for Python Send async email notifications via Python. Get updates/crashlogs from your scripts with ease. Installation p

Chirag Jain 17 May 16, 2022
Code for "Hierarchical Skills for Efficient Exploration" HSD-3 Algorithm and Baselines

Hierarchical Skills for Efficient Exploration This is the source code release for the paper Hierarchical Skills for Efficient Exploration. It contains

Facebook Research 38 Dec 06, 2022
In real-world applications of machine learning, reliable and safe systems must consider measures of performance beyond standard test set accuracy

PixMix Introduction In real-world applications of machine learning, reliable and safe systems must consider measures of performance beyond standard te

Andy Zou 79 Dec 30, 2022
Implementation of algorithms for continuous control (DDPG and NAF).

DEPRECATION This repository is deprecated and is no longer maintaned. Please see a more recent implementation of RL for continuous control at jax-sac.

Ilya Kostrikov 288 Dec 31, 2022
An implementation of the BADGE batch active learning algorithm.

Batch Active learning by Diverse Gradient Embeddings (BADGE) An implementation of the BADGE batch active learning algorithm. Details are provided in o

125 Dec 24, 2022
Vision Transformer and MLP-Mixer Architectures

Vision Transformer and MLP-Mixer Architectures Update (2.7.2021): Added the "When Vision Transformers Outperform ResNets..." paper, and SAM (Sharpness

Google Research 6.4k Jan 04, 2023
Plugin adapted from Ultralytics to bring YOLOv5 into Napari

napari-yolov5 Plugin adapted from Ultralytics to bring YOLOv5 into Napari. Training and detection can be done using the GUI. Training dataset must be

2 May 05, 2022
PyKaldi GOP-DNN on Epa-DB

PyKaldi GOP-DNN on Epa-DB This repository has the tools to run a PyKaldi GOP-DNN algorithm on Epa-DB, a database of non-native English speech by Spani

18 Dec 14, 2022
This reposityory contains the PyTorch implementation of our paper "Generative Dynamic Patch Attack".

Generative Dynamic Patch Attack This reposityory contains the PyTorch implementation of our paper "Generative Dynamic Patch Attack". Requirements PyTo

Xiang Li 8 Nov 17, 2022
Tensorflow implementation of "BEGAN: Boundary Equilibrium Generative Adversarial Networks"

BEGAN in Tensorflow Tensorflow implementation of BEGAN: Boundary Equilibrium Generative Adversarial Networks. Requirements Python 2.7 or 3.x Pillow tq

Taehoon Kim 922 Dec 21, 2022
Pytorch code for semantic segmentation using ERFNet

ERFNet (PyTorch version) This code is a toolbox that uses PyTorch for training and evaluating the ERFNet architecture for semantic segmentation. For t

Edu 394 Jan 01, 2023
CURL: Contrastive Unsupervised Representations for Reinforcement Learning

CURL Rainbow Status: Archive (code is provided as-is, no updates expected) This is an implementation of CURL: Contrastive Unsupervised Representations

Aravind Srinivas 46 Dec 12, 2022
Task-related Saliency Network For Few-shot learning

Task-related Saliency Network For Few-shot learning This is an official implementation in Tensorflow of TRSN. Abstract An essential cue of human wisdo

1 Nov 18, 2021
Pytorch implementation of

EfficientTTS Unofficial Pytorch implementation of "EfficientTTS: An Efficient and High-Quality Text-to-Speech Architecture"(arXiv). Disclaimer: Somebo

Liu Songxiang 109 Nov 16, 2022
Revitalizing CNN Attention via Transformers in Self-Supervised Visual Representation Learning

Revitalizing CNN Attention via Transformers in Self-Supervised Visual Representation Learning This repository is the official implementation of CARE.

ChongjianGE 89 Dec 02, 2022
Dynamic Neural Representational Decoders for High-Resolution Semantic Segmentation

Dynamic Neural Representational Decoders for High-Resolution Semantic Segmentation Requirements This repository needs mmsegmentation Training To train

20 May 28, 2022
Help you understand Manual and w/ Clutch point while driving.

简体中文 forza_auto_gear forza_auto_gear is a tool for Forza Horizon 5. It will help us understand the best gear shift point using Manual or w/ Clutch in

15 Oct 08, 2022
Local Multi-Head Channel Self-Attention for FER2013

LHC-Net Local Multi-Head Channel Self-Attention This repository is intended to provide a quick implementation of the LHC-Net and to replicate the resu

12 Jan 04, 2023
This repository contains the source code for the paper First Order Motion Model for Image Animation

!!! Check out our new paper and framework improved for articulated objects First Order Motion Model for Image Animation This repository contains the s

13k Jan 09, 2023