Easy-to-use, efficient code for extracting OpenAI CLIP (global/grid) features from images and text.

Last update: Jan 06, 2023

Overview

Extracting OpenAI CLIP (Global/Grid) Features from Image and Text

This repo provides easy-to-use, efficient code for extracting image and text features with the official OpenAI CLIP models. It is also optimized for feature extraction with multiple GPU processes.

The official OpenAI CLIP repo only supports extracting global visual features, but the local grid features from CLIP visual models may contain more detailed semantic information that can benefit multiple vision-and-language downstream tasks [1][2]. This repo therefore wraps lightly modified CLIP code to extract both global visual features and local grid visual features from different CLIP visual models. It is also designed in an object-oriented fashion, so users can easily add their own visual_extractor classes to customize the input and output grid resolution.

To verify that the extracted visual grid features are semantically meaningful, we used grid features of MSCOCO images, extracted with different official CLIP models, for the standard image captioning task. Simply replacing BUTD features with the extracted CLIP grid features gave comparable or superior results with a transformer baseline, without heavy hyperparameter tuning. Surprisingly, with the ViT-B/32 CLIP model we obtained a CIDEr score of 116.9 in the teacher-forcing setting and 129.6 in the reinforcement learning setting. This conflicts with the experimental results in the CLIP-ViL paper [1], where the authors observed that CLIP-ViT-B with grid features suffers a large performance degradation compared with other models (58.0 CIDEr in the CLIP-ViT-B_Transformer setting on COCO Captioning).

We provide supported CLIP models, results on MSCOCO image captioning, and other information below. We believe this repo can facilitate the usage of powerful CLIP models.

1. Supported CLIP Models

Currently this repo supports five visual extractor settings: three standard pipelines used in the official OpenAI CLIP repo and two customized pipelines that support a larger input resolution. You can refer to this file for more details about customizing your own visual backbones for different input and output resolutions. To improve training efficiency in the image captioning task, we apply AvgPool2d to the output feature map in some settings, which reduces the grid feature size without a large performance degradation. We will support more CLIP models in the future.

Setting	Visual Backbone	CLIP Model	Input Resolution	Output Resolution	Feature Map Downsample	Grid Feature Shape	Global Feature Shape
Standard	RN101	RN101	224 x 224	7 x 7	None	49 x 2048	1 x 512
Standard	ViT-B/32	ViT-B/32	224 x 224	7 x 7	None	49 x 768	1 x 512
Standard	ViT-B/16	ViT-B/16	224 x 224	14 x 14	AvgPool2d(kernel_size=(2,2), stride=2)	49 x 768	1 x 512
Customized	RN101_448	RN101	448 x 448	14 x 14	AvgPool2d(kernel_size=(2,2), stride=2)	49 x 2048	1 x 512
Customized	ViT-B/32_448	ViT-B/32	448 x 448	14 x 14	AvgPool2d(kernel_size=(2,2), stride=2)	49 x 768	1 x 512
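
The shapes in the table follow from simple arithmetic, which the sketch below reproduces. It is only an illustration: the function name grid_feature_shape and the SETTINGS dictionary are hypothetical, and the total downsampling factors (32 for RN101 and ViT-B/32, 16 for ViT-B/16) are inferred from the input and output resolutions listed above rather than read from the repo's code.

```python
from typing import Optional, Tuple


def grid_feature_shape(
    input_res: int,
    stride: int,
    channels: int,
    pool: Optional[int] = None,
) -> Tuple[int, int, int]:
    """Return (output_res, num_grid_cells, channels) for a square input.

    stride is the backbone's total downsampling factor (assumed: 32 for
    RN101 and ViT-B/32, 16 for ViT-B/16). pool is the kernel size and
    stride of the optional average pooling step.
    """
    output_res = input_res // stride
    pooled_res = output_res // pool if pool else output_res
    return output_res, pooled_res * pooled_res, channels


# One entry per row of the table above (hypothetical names).
SETTINGS = {
    "RN101": dict(input_res=224, stride=32, channels=2048, pool=None),
    "ViT-B/32": dict(input_res=224, stride=32, channels=768, pool=None),
    "ViT-B/16": dict(input_res=224, stride=16, channels=768, pool=2),
    "RN101_448": dict(input_res=448, stride=32, channels=2048, pool=2),
    "ViT-B/32_448": dict(input_res=448, stride=32, channels=768, pool=2),
}

for name, cfg in SETTINGS.items():
    out_res, cells, ch = grid_feature_shape(**cfg)
    print(f"{name}: {out_res} x {out_res} feature map -> {cells} x {ch} grid features")
```

Every setting ends up with 49 grid cells, which is why the 14 x 14 feature maps are pooled with a 2 x 2 kernel and stride 2.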

2. Results on MSCOCO Image Captioning (Karpathy's Splits)

We ran image captioning experiments on X-modaler with the extracted CLIP grid features. Using the default hyperparameters of X-modaler's transformer baseline, we easily obtained comparable or superior results; the only change was SOLVER.BASE_LR=2e-4 in the ViT-B/16 and ViT-B/32_448 teacher-forcing settings. The performance of the transformer baseline with BUTD features is taken from X-modaler's paper.

2.1 Teacher-forcing

Name	BLEU@1	BLEU@2	BLEU@3	BLEU@4	METEOR	ROUGE-L	CIDEr-D	SPICE
BUTD	76.4	60.3	46.5	35.8	28.2	56.7	116.6	21.3
RN101	77.3	61.3	47.7	36.9	28.7	57.5	120.6	21.8
ViT-B/32	76.4	60.3	46.5	35.6	28.1	56.7	116.9	21.2
ViT-B/16	78.0	62.1	48.2	37.2	28.8	57.6	122.3	22.1
RN101_448	78.1	62.3	48.4	37.5	29.0	58.0	122.9	22.2
ViT-B/32_448	75.8	59.6	45.9	35.1	27.8	56.3	114.2	21.0

2.2 Self-critical Reinforcement Learning

Name	BLEU@1	BLEU@2	BLEU@3	BLEU@4	METEOR	ROUGE-L	CIDEr-D	SPICE
BUTD	80.5	65.4	51.1	39.2	29.1	58.7	130.0	23.0
RN101	-	-	-	-	-	-	-	-
ViT-B/32	79.9	64.6	50.4	38.5	29.0	58.6	129.6	22.8
ViT-B/16	82.0	67.3	53.1	41.1	29.9	59.8	136.6	23.8
RN101_448	81.7	66.9	52.6	40.5	29.9	59.7	136.1	23.9
ViT-B/32_448	-	-	-	-	-	-	-	-
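
To make "comparable or superior" concrete, the short sketch below computes the CIDEr-D difference of each CLIP setting against the BUTD baseline. The scores are copied from the two tables above; the names CIDER_D and delta_vs_butd are hypothetical helpers for illustration only.

```python
from typing import Dict, Optional

# CIDEr-D scores copied from the two tables above; None marks a run
# that is not reported.
CIDER_D: Dict[str, Dict[str, Optional[float]]] = {
    "teacher_forcing": {
        "BUTD": 116.6, "RN101": 120.6, "ViT-B/32": 116.9,
        "ViT-B/16": 122.3, "RN101_448": 122.9, "ViT-B/32_448": 114.2,
    },
    "self_critical": {
        "BUTD": 130.0, "RN101": None, "ViT-B/32": 129.6,
        "ViT-B/16": 136.6, "RN101_448": 136.1, "ViT-B/32_448": None,
    },
}


def delta_vs_butd(scores: Dict[str, Optional[float]]) -> Dict[str, float]:
    """Return the CIDEr-D difference to BUTD for every reported setting."""
    base = scores["BUTD"]
    return {
        name: round(score - base, 1)
        for name, score in scores.items()
        if name != "BUTD" and score is not None
    }


for stage, scores in CIDER_D.items():
    print(stage, delta_vs_butd(scores))
```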

3. Get Started

Note: The extracted feature files are compatible with X-modaler, where you can conveniently set up your cross-modal analytics experiments.

3.1 Requirements

PyTorch ≥ 1.9 and a torchvision version that matches the PyTorch installation. Installing them together from pytorch.org ensures this.
timm ≥ 0.4.5

3.2 Examples

Use CLIP ViT-B/32 model to extract global textual features of MSCOCO sentences from dataset_coco.json in Karpathy's released annotations.

CUDA_VISIBLE_DEVICES=0 python3 clip_textual_feats.py \
    --anno dataset_coco.json \
    --output_dir ${TXT_OUTPUT_DIR} \
    --model_type_or_path 'ViT-B/32'
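
For orientation, here is a minimal sketch of how the captions in such an annotation file can be enumerated. It assumes the usual layout of Karpathy's dataset_coco.json (a top-level "images" list whose entries carry "split" and a "sentences" list with "sentid" and "raw" fields). The function iter_captions is hypothetical and is not part of clip_textual_feats.py.

```python
import json
from typing import Any, Dict, Iterator, Tuple


def iter_captions(anno: Dict[str, Any]) -> Iterator[Tuple[int, str, str]]:
    """Yield (sentid, split, raw_caption) for every caption in the file."""
    for image in anno["images"]:
        for sentence in image["sentences"]:
            yield sentence["sentid"], image["split"], sentence["raw"]


# Tiny stand-in for dataset_coco.json; the ids and captions are made up.
SAMPLE_JSON = """
{"images": [
  {"filename": "a.jpg", "split": "train",
   "sentences": [{"sentid": 0, "raw": "A dog runs."},
                 {"sentid": 1, "raw": "A dog on grass."}]},
  {"filename": "b.jpg", "split": "test",
   "sentences": [{"sentid": 2, "raw": "Two cats sleep."}]}
]}
"""

captions = list(iter_captions(json.loads(SAMPLE_JSON)))
for sentid, split, raw in captions:
    print(sentid, split, raw)
```

With the real file, the same function would be fed the result of json.load on the opened dataset_coco.json.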

Use CLIP ViT-B/16 model to extract global and grid visual features of MSCOCO images.

CUDA_VISIBLE_DEVICES=0 python3 clip_visual_feats.py \
    --image_list 'example/MSCOCO/image_list_2017.txt' \
    --image_dir ${IMG_DIR} \
    --output_dir ${IMG_OUTPUT_DIR} \
    --ve_name 'ViT-B/16' \
    --model_type_or_path 'ViT-B/16'

Use CLIP RN101 model to extract global and grid visual features of MSCOCO images.

CUDA_VISIBLE_DEVICES=0 python3 clip_visual_feats.py \
    --image_list 'example/MSCOCO/image_list_2017.txt' \
    --image_dir ${IMG_DIR} \
    --output_dir ${IMG_OUTPUT_DIR} \
    --ve_name 'RN101' \
    --model_type_or_path 'RN101'

Use CLIP RN101 model to extract global and grid visual features of MSCOCO images with 448 x 448 resolution.

CUDA_VISIBLE_DEVICES=0 python3 clip_visual_feats.py \
    --image_list 'example/MSCOCO/image_list_2017.txt' \
    --image_dir ${IMG_DIR} \
    --output_dir ${IMG_OUTPUT_DIR} \
    --ve_name 'RN101_448' \
    --model_type_or_path 'RN101'

3.3 Speeding up feature extraction with Multiple GPUs

You can run the same script with the same input list (i.e. --image_list or --anno) on another GPU. That GPU can be on a different machine, provided that the disk the features are written to is shared between the machines. The script starts a new feature extraction process that only handles items that have not been processed yet, so it does not overlap with extraction processes that are already running.
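
How the script avoids overlapping work is not described here. One common way to get this behavior on a shared disk is to let each process claim an item by atomically creating a marker file before working on it. The sketch below illustrates that general technique only: the functions try_claim and run_worker and the .claimed file suffix are hypothetical and may differ from what the repo actually does.

```python
import os
import tempfile
from typing import Iterable, List


def try_claim(output_dir: str, item_id: str) -> bool:
    """Atomically claim item_id; return False if it is already claimed."""
    marker = os.path.join(output_dir, f"{item_id}.claimed")
    try:
        # O_CREAT | O_EXCL fails if the marker already exists, so only
        # one process can create it.
        fd = os.open(marker, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    os.close(fd)
    return True


def run_worker(output_dir: str, items: Iterable[str]) -> List[str]:
    """Process every item this worker manages to claim; return their ids."""
    done = []
    for item_id in items:
        if try_claim(output_dir, item_id):
            # Placeholder for the real feature extraction and saving.
            done.append(item_id)
    return done


with tempfile.TemporaryDirectory() as shared_dir:
    items = ["img_0", "img_1", "img_2"]
    first = run_worker(shared_dir, items[:2])  # worker A handles two items
    second = run_worker(shared_dir, items)     # worker B skips A's claims
    print(first, second)
```

Exclusive file creation may not be reliable on every network file system, so a real deployment would need to check this for its shared disk.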

4. License

MIT

5. Acknowledgement

This repo uses resources from OpenAI CLIP, timm, CLIP-ViL, and X-modaler, and is implemented using PyTorch. We thank the authors for open-sourcing their awesome projects.

6. References

[1] How Much Can CLIP Benefit Vision-and-Language Tasks? Sheng Shen, Liunian Harold Li, Hao Tan, Mohit Bansal, Anna Rohrbach, Kai-Wei Chang, Zhewei Yao, Kurt Keutzer. arXiv, 2021.

[2] In Defense of Grid Features for Visual Question Answering. Huaizu Jiang, Ishan Misra, Marcus Rohrbach, Erik Learned-Miller, Xinlei Chen. In CVPR, 2020.

Owner

Jianjie(JJ) Luo