VisualGPT: Data-efficient Adaptation of Pretrained Language Models for Image Captioning

Last update: Dec 28, 2022

Overview

VisualGPT

This repository accompanies our paper "VisualGPT: Data-efficient Adaptation of Pretrained Language Models for Image Captioning".

Main Architecture of Our VisualGPT

Download the GPT-2 pretrained weights

curl --output gpt2-pytorch_model.bin https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-pytorch_model.bin

Environment setup

Clone the repository and create the visualgpt conda environment

conda env create -f environment.yml
conda activate visualgpt

Then download the spaCy English data

python -m spacy download en

Data preparation

We provide the COCO dataset for download. Download the annotations file annotations.zip and extract it, then download coco_detections.hdf5. In that file, each key is an image id and each value is a tensor of shape (N, 2048), where N is the number of detections.
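As a quick sanity check on the feature file, the sketch below counts the detections per image and verifies that every entry has shape (N, 2048). It is illustrative only: the function names are hypothetical, the file is assumed to be a flat collection of datasets, and the exact key naming inside coco_detections.hdf5 is not stated here, so the demo keys are made up. The helper accepts any mapping whose values expose `.shape`, which lets it run on stand-in data without the real file.

```python
from collections import namedtuple


def summarize_features(store, feature_dim=2048):
    """Return {key: N} for every entry shaped (N, feature_dim).

    `store` is any mapping from key to an object with a `.shape`
    attribute, e.g. an open h5py.File whose values are datasets.
    """
    counts = {}
    for key in store:
        shape = tuple(store[key].shape)
        if len(shape) != 2 or shape[1] != feature_dim:
            raise ValueError(f"{key}: expected (N, {feature_dim}), got {shape}")
        counts[key] = shape[0]
    return counts


def summarize_file(path):
    """Open an HDF5 file read-only with h5py and summarize its entries."""
    import h5py  # imported lazily so the helper above runs without h5py

    with h5py.File(path, "r") as f:
        return summarize_features(f)


# Stand-in data so the sketch runs without the real file (keys are made up).
FakeDataset = namedtuple("FakeDataset", ["shape"])
demo = {
    "image_1": FakeDataset((36, 2048)),
    "image_2": FakeDataset((10, 2048)),
}
demo_counts = summarize_features(demo)
print(demo_counts)
```

With the real file in place, `summarize_file("coco_detections.hdf5")` would run the same check, assuming h5py is installed in the environment.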

Code structure

Create the log folder with mkdir logs, then start the training.

Train the model

python train_visualGPT.py --batch_size 50 --head 12 --features_path coco_detections.hdf5 --annotation_folder annotations --lr 1e-4 --gpt_model_type gpt --random_seed 42 --log_file logs/log --exp_name experiment_log --decoder_layer 12 --optimizer_type adamw --gradient_accumulation_steps 2 --train_percentage 0.001 --split_train_data
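To make the numbers in the command above concrete, here is a small sketch of what they imply. It rests on two assumptions that are not confirmed by this README: that gradient accumulation applies one optimizer update per `gradient_accumulation_steps` mini-batches, and that `--train_percentage` is a fraction of the training set (so 0.001 means 0.1%). The function names and the total of 100,000 examples are hypothetical.

```python
def effective_batch_size(batch_size, gradient_accumulation_steps):
    """Examples contributing to one optimizer update (assumed semantics)."""
    return batch_size * gradient_accumulation_steps


def subset_size(total_examples, train_percentage):
    """Training examples kept when train_percentage is read as a fraction."""
    return max(1, round(total_examples * train_percentage))


# Values taken from the training command: --batch_size 50,
# --gradient_accumulation_steps 2, --train_percentage 0.001.
print(effective_batch_size(50, 2))   # 100
print(subset_size(100_000, 0.001))   # 100 (100_000 is a hypothetical total)
```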

Acknowledgement

This code uses resources from Meshed Memory Transformer and Transformers.

Please cite our paper using the following BibTeX

@article{chen2021visualgpt,
  title={VisualGPT: Data-efficient Image Captioning by Balancing Visual Input and Linguistic Knowledge from Pretraining},
  author={Chen, Jun and Guo, Han and Yi, Kai and Li, Boyang and Elhoseiny, Mohamed},
  journal={arXiv preprint arXiv:2102.10407},
  year={2021}
}

@article{chen2021visualgpt,
  title={VisualGPT: Data-efficient Adaptation of Pretrained Language Models for Image Captioning},
  author={Chen, Jun and Guo, Han and Yi, Kai and Li, Boyang and Elhoseiny, Mohamed},
  journal={arXiv preprint arXiv:2102.10407},
  year={2021}
}

Owner

Vision CAIR Research Group, KAUST
