PyTorch code for BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation

Last update: Dec 31, 2022

Overview

BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation

This is the PyTorch code of the BLIP paper. The code has been tested on PyTorch 1.10. To install the dependencies, run:

pip install -r requirements.txt

Catalog:

Inference demo
Pre-trained and finetuned checkpoints
Finetuning code for Image-Text Retrieval, Image Captioning, VQA, and NLVR2
Pre-training code
Download of bootstrapped pre-training datasets

Inference demo:

Run our interactive demo in a Colab notebook (no GPU needed). The demo includes code for (1) image captioning, (2) open-ended visual question answering, and (3) multimodal / unimodal feature extraction.

The demo is also integrated into Hugging Face Spaces 🤗 using Gradio. Try out the Web Demo.

Pre-trained checkpoints:

Num. pre-train images	BLIP w/ ViT-B	BLIP w/ ViT-B and CapFilt-L	BLIP w/ ViT-L
14M	Download	-	-
129M	Download	Download	Download

Finetuned checkpoints:

Task	BLIP w/ ViT-B	BLIP w/ ViT-B and CapFilt-L	BLIP w/ ViT-L
Image-Text Retrieval (COCO)	Download	-	Download
Image-Text Retrieval (Flickr30k)	Download	-	Download
Image Captioning (COCO)	-	Download	Download
VQA	Download	Download	-
NLVR2	Download	-	-

Image-Text Retrieval:

Download the COCO and Flickr30k datasets from the original websites, and set 'image_root' in configs/retrieval_{dataset}.yaml accordingly.
To evaluate the finetuned BLIP model on COCO, run:

python -m torch.distributed.run --nproc_per_node=8 train_retrieval.py \
--config ./configs/retrieval_coco.yaml \
--output_dir output/retrieval_coco \
--evaluate

To finetune the pre-trained checkpoint using 8 A100 GPUs, first set 'pretrained' in configs/retrieval_coco.yaml as "https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base.pth". Then run:

python -m torch.distributed.run --nproc_per_node=8 train_retrieval.py \
--config ./configs/retrieval_coco.yaml \
--output_dir output/retrieval_coco
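
Setting 'pretrained' or 'image_root' can be done by hand in an editor. For scripted setups, the following is a minimal sketch that rewrites a top-level `key: value` line in the config text using only the standard library. The helper `set_yaml_value` and the config excerpt are hypothetical and only for illustration; the sketch assumes the key sits on its own unindented line, and it drops any trailing comment on that line.

```python
import re


def set_yaml_value(text, key, value):
    """Replace the value of a top-level `key: ...` line, leaving other lines unchanged.

    Hypothetical helper: it assumes a flat `key: value` line and writes the
    new value as a single-quoted string.
    """
    pattern = re.compile(rf"^{re.escape(key)}\s*:.*$", flags=re.MULTILINE)
    new_line = f"{key}: '{value}'"
    # A function replacement avoids backslash/group interpretation in `value`.
    new_text, count = pattern.subn(lambda match: new_line, text)
    if count == 0:
        raise KeyError(f"key {key!r} not found at top level")
    return new_text


# Illustrative excerpt only; the real config file contains more keys.
example_config = (
    "image_root: '/old/path/to/coco/'\n"
    "pretrained: ''\n"
    "batch_size_train: 32\n"
)

updated = set_yaml_value(
    example_config,
    "pretrained",
    "https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base.pth",
)
updated = set_yaml_value(updated, "image_root", "/path/to/coco/images/")
print(updated)
```

To apply this to a real file, read configs/retrieval_coco.yaml into a string, call the helper, and write the result back. A full YAML library is the safer choice if the config uses nested keys.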

Image-Text Captioning:

Download the COCO and NoCaps datasets from the original websites, and set 'image_root' in configs/caption_coco.yaml and configs/nocaps.yaml accordingly.
To evaluate the finetuned BLIP model on COCO, run:

python -m torch.distributed.run --nproc_per_node=8 train_caption.py --evaluate

To evaluate the finetuned BLIP model on NoCaps, generate results with the command below (the evaluation itself must be performed on the official server):

python -m torch.distributed.run --nproc_per_node=8 eval_nocaps.py

To finetune the pre-trained checkpoint using 8 A100 GPUs, first set 'pretrained' in configs/caption_coco.yaml as "https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model*_base.pth". Then run:

python -m torch.distributed.run --nproc_per_node=8 train_caption.py

VQA:

Download the VQA v2 and Visual Genome datasets from the original websites, and set 'vqa_root' and 'vg_root' in configs/vqa.yaml.
To evaluate the finetuned BLIP model, generate results with the command below (the evaluation itself must be performed on the official server):

python -m torch.distributed.run --nproc_per_node=8 train_vqa.py --evaluate

To finetune the pre-trained checkpoint using 16 A100 GPUs, first set 'pretrained' in configs/vqa.yaml as "https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model*_base.pth". Then run:

python -m torch.distributed.run --nproc_per_node=16 train_vqa.py

NLVR2:

Download the NLVR2 dataset from the original website, and set 'image_root' in configs/nlvr.yaml.
To evaluate the finetuned BLIP model, run:

python -m torch.distributed.run --nproc_per_node=8 train_nlvr.py --evaluate

To finetune the pre-trained checkpoint using 16 A100 GPUs, first set 'pretrained' in configs/nlvr.yaml as "https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base.pth". Then run:

python -m torch.distributed.run --nproc_per_node=16 train_nlvr.py
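
The evaluation and finetuning commands in the task sections above share one pattern: `python -m torch.distributed.run`, a GPU count, a training script, and optional `--config`, `--output_dir`, and `--evaluate` flags. The sketch below assembles such a command as an argument list, for example to pass to `subprocess.run`. The function `build_launch_command` is a hypothetical helper, not part of the repository; the script names and flags are the ones shown in this README.

```python
import shlex


def build_launch_command(script, nproc_per_node=8, config=None,
                         output_dir=None, evaluate=False):
    """Build the argument list for a torch.distributed.run launch.

    Hypothetical helper: optional flags that are not given are simply
    left out, as in the shorter commands in this README.
    """
    cmd = [
        "python", "-m", "torch.distributed.run",
        f"--nproc_per_node={nproc_per_node}",
        script,
    ]
    if config is not None:
        cmd += ["--config", config]
    if output_dir is not None:
        cmd += ["--output_dir", output_dir]
    if evaluate:
        cmd.append("--evaluate")
    return cmd


# Evaluation of the finetuned retrieval model on COCO.
retrieval_eval = build_launch_command(
    "train_retrieval.py",
    config="./configs/retrieval_coco.yaml",
    output_dir="output/retrieval_coco",
    evaluate=True,
)

# NLVR2 finetuning with 16 processes.
nlvr_finetune = build_launch_command("train_nlvr.py", nproc_per_node=16)

print(shlex.join(retrieval_eval))
print(shlex.join(nlvr_finetune))
```

Passing the list to `subprocess.run(cmd, check=True)` would start the job; printing it first makes it easy to confirm the command matches the one intended.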

Pre-train:

Prepare training json files, each containing a list. Each item in the list is a dictionary with two key-value pairs: {'image': path_of_image, 'caption': text_of_image}.
In configs/pretrain.yaml, set 'train_file' to the paths of the json files.
Pre-train the model using 8 A100 GPUs:

python -m torch.distributed.run --nproc_per_node=8 pretrain.py --config ./configs/pretrain.yaml --output_dir output/Pretrain
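
The training json format described above can be produced with a few lines of Python. The following is a minimal sketch; the function name `write_pretrain_annotations`, the image paths, and the captions are made up for illustration.

```python
import json
import tempfile
from pathlib import Path


def write_pretrain_annotations(pairs, out_path):
    """Write (image_path, caption) pairs as a json list of
    {'image': ..., 'caption': ...} dictionaries and return the list."""
    records = [
        {"image": str(image), "caption": caption}
        for image, caption in pairs
    ]
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with out_path.open("w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False)
    return records


# Hypothetical image paths and captions.
example_pairs = [
    ("/data/images/0001.jpg", "a dog running on the beach"),
    ("/data/images/0002.jpg", "two people riding bicycles"),
]

# Written to a temporary directory here; use any location you like.
out_file = Path(tempfile.mkdtemp()) / "example_pretrain.json"
records = write_pretrain_annotations(example_pairs, out_file)
print(out_file, len(records))
```

The resulting file path is what goes into 'train_file' in configs/pretrain.yaml.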

Pre-training datasets download:

We provide bootstrapped pre-training datasets as json files. Each json file contains a list. Each item in the list is a dictionary with two key-value pairs: {'url': url_of_image, 'caption': text_of_image}.

Image source	Filtered web caption	Filtered synthetic caption	Filtered synthetic caption by ViT-L
CC3M+CC12M+SBU	Download	Download	Download
LAION115M	Download	Download	Download
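
The downloadable files store image URLs, while the pre-training json files described in the Pre-train section expect local image paths. One possible way to bridge the two, once the images have been downloaded separately, is sketched below. The naming scheme (SHA-1 of the URL plus the URL's file extension) and the helper names are assumptions made for this example, not something the repository prescribes.

```python
import hashlib
from pathlib import Path, PurePosixPath
from urllib.parse import urlparse


def local_image_path(url, image_dir):
    """Map an image URL to a local file path.

    Hypothetical scheme: the file name is the SHA-1 hex digest of the URL,
    keeping the URL's extension (or '.jpg' if it has none).
    """
    suffix = PurePosixPath(urlparse(url).path).suffix or ".jpg"
    name = hashlib.sha1(url.encode("utf-8")).hexdigest() + suffix
    return str(Path(image_dir) / name)


def to_pretrain_records(url_records, image_dir):
    """Convert {'url', 'caption'} items into {'image', 'caption'} items."""
    return [
        {"image": local_image_path(r["url"], image_dir), "caption": r["caption"]}
        for r in url_records
    ]


# Made-up records in the {'url', 'caption'} format.
example_url_records = [
    {"url": "http://example.com/a/cat.png", "caption": "a cat on a sofa"},
    {"url": "http://example.com/photo?id=7", "caption": "a mountain lake"},
]

converted = to_pretrain_records(example_url_records, "/data/blip_images")
for record in converted:
    print(record)
```

Whatever downloader is used must save each image under the same name that `local_image_path` computes, otherwise the paths in the converted file will not resolve.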

Citation

If you find this code to be useful for your research, please consider citing.

@misc{li2022blip,
      title={BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation}, 
      author={Junnan Li and Dongxu Li and Caiming Xiong and Steven Hoi},
      year={2022},
      eprint={2201.12086},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

Acknowledgement

The implementation of BLIP relies on resources from ALBEF, Huggingface Transformers, and timm. We thank the original authors for open-sourcing their work.

Owner

Salesforce
