A concise but complete implementation of CLIP with various experimental improvements from recent papers

Last update: Dec 26, 2022

Overview

x-clip (wip)

Install

$ pip install x-clip

Usage

import torch
from x_clip import CLIP

clip = CLIP(
    dim_text = 512,
    dim_image = 512,
    dim_latent = 512,
    num_text_tokens = 10000,
    text_enc_depth = 6,
    text_seq_len = 256,
    text_heads = 8,
    num_visual_tokens = 512,
    visual_enc_depth = 6,
    visual_image_size = 256,
    visual_patch_size = 32,
    visual_heads = 8,
    use_all_token_embeds = True   # whether to use fine-grained contrastive learning (FILIP)
)

text = torch.randint(0, 10000, (4, 256))
images = torch.randn(4, 3, 256, 256)
mask = torch.ones_like(text).bool()

loss = clip(text, images, text_mask = mask, return_loss = True)
loss.backward()
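As a quick sanity check on the configuration above, the sketch below works out how many patches the image encoder would see. It assumes the usual vision-transformer convention of non-overlapping square patches, which is an assumption about the encoder rather than something stated here. The helper name `patch_grid` is hypothetical.

```python
def patch_grid(image_size: int, patch_size: int, channels: int = 3):
    """Return (patches_per_side, num_patches, values_per_patch)."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side, per_side * per_side, patch_size * patch_size * channels

# Values taken from the usage example: 256x256 images, 32x32 patches.
per_side, num_patches, patch_dim = patch_grid(256, 32)
print(per_side, num_patches, patch_dim)  # 8 64 3072
```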

Citations

@misc{radford2021learning,
    title   = {Learning Transferable Visual Models From Natural Language Supervision}, 
    author  = {Alec Radford and Jong Wook Kim and Chris Hallacy and Aditya Ramesh and Gabriel Goh and Sandhini Agarwal and Girish Sastry and Amanda Askell and Pamela Mishkin and Jack Clark and Gretchen Krueger and Ilya Sutskever},
    year    = {2021},
    eprint  = {2103.00020},
    archivePrefix = {arXiv},
    primaryClass = {cs.CV}
}

@misc{yao2021filip,
    title   = {FILIP: Fine-grained Interactive Language-Image Pre-Training}, 
    author  = {Lewei Yao and Runhui Huang and Lu Hou and Guansong Lu and Minzhe Niu and Hang Xu and Xiaodan Liang and Zhenguo Li and Xin Jiang and Chunjing Xu},
    year    = {2021},
    eprint  = {2111.07783},
    archivePrefix = {arXiv},
    primaryClass = {cs.CV}
}

Comments

Model forward outputs to text/image similarity score

Any insight on how to go from the image/text embeddings (or the nominal model forward output) to a simple similarity score, as done in the Hugging Face implementation? HF example here

In the original paper I see that the dot products of the image/text encoder outputs were used, but here I was having trouble with the dimensions of the outputs.
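A minimal sketch of the dot-product scoring described in the original paper, written in plain Python so the shapes stay explicit. It assumes each text and image has already been mapped to a latent vector of the same length. The names `l2_normalize` and `similarity_scores` and the `temperature` default are illustrative, not x-clip's API.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (assumes it is not all zeros)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

def similarity_scores(text_latent, image_latents, temperature=1.0):
    """Cosine similarity of one text against several images, then softmax."""
    t = l2_normalize(text_latent)
    sims = []
    for img in image_latents:
        i = l2_normalize(img)
        sims.append(sum(a * b for a, b in zip(t, i)) / temperature)
    # Softmax over the candidate images, shifted by the max for stability.
    m = max(sims)
    exps = [math.exp(s - m) for s in sims]
    total = sum(exps)
    return sims, [e / total for e in exps]

sims, probs = similarity_scores([1.0, 0.0], [[2.0, 0.0], [0.0, 3.0]])
print(sims)  # [1.0, 0.0]
```

The second return value is a probability over the candidate images, which is roughly what a "which image matches this caption" score needs.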

opened by paulcjh 12

Using different encoders in CLIP

Hi, I am wondering whether it is possible to use different encoders in CLIP, for example a ResNet instead of a ViT for images. Is it also possible to replace the text encoder with a features encoder? If I have a vector of features for a given image and I want to use x-clip, how should I do that? I wrote a code example that doesn't seem to work; here is what I did:

import torch
from x_clip import CLIP
import torch.nn as nn
from torchvision import models

class Image_Encoder(torch.nn.Module):
    #output size is (bs,512)
    def __init__(self):
        super(Image_Encoder, self).__init__()
        self.model_pre = models.resnet18(pretrained=False)
        self.base=nn.Sequential(*list(self.model_pre.children()))
        self.base[0]=nn.Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
        self.resnet=self.base[:-1]

    def forward(self, x):
        out=self.resnet(x).squeeze()
        return out


class features_encoder(torch.nn.Module):
    #output size is (bs,512)
    def __init__(self):
        super(features_encoder, self).__init__()
        self.model =nn.Linear(2048,512)

    def forward(self, x):
        out=self.model(x)
        return out

images_encoder=Image_Encoder()
features_encoder=features_encoder()

clip = CLIP(
    image_encoder = images_encoder,
    text_encoder = features_encoder,
    dim_image = 512,
    dim_text = 512,
    dim_latent = 512
)

features= torch.randn(4,2048)
images = torch.randn(4, 3, 256, 256)

loss = clip(features, images, return_loss = True)
loss.backward()

But I got the following error: forward() takes 2 positional arguments but 3 were given

Thanks
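The error message is ordinary Python: a method defined as `forward(self, x)` was called with one more positional argument than it accepts. One plausible reading, and it is only an assumption here, is that the caller passes a second positional value (for example a mask) to the text encoder. The sketch below reproduces that failure with plain classes and shows a wrapper that tolerates the extra argument. `FeaturesEncoder`, `IgnoreExtraArgs` and `call_encoder` are hypothetical stand-ins, not x-clip code.

```python
class FeaturesEncoder:
    """Stand-in for an encoder whose forward takes only the input."""
    def forward(self, x):
        return [2 * v for v in x]

class IgnoreExtraArgs:
    """Wraps an encoder and drops any extra positional/keyword arguments."""
    def __init__(self, encoder):
        self.encoder = encoder

    def forward(self, x, *args, **kwargs):
        return self.encoder.forward(x)

def call_encoder(encoder, x, mask):
    # Hypothetical caller that passes the mask positionally.
    return encoder.forward(x, mask)

try:
    call_encoder(FeaturesEncoder(), [1, 2], mask=None)
except TypeError as err:
    print("fails:", err)

print(call_encoder(IgnoreExtraArgs(FeaturesEncoder()), [1, 2], mask=None))  # [2, 4]
```

With a real `nn.Module` the same idea would mean giving `forward` an optional second parameter, if the assumption about the caller holds.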

opened by ethancohen123 8

Visual ssl with channels different than 3

Hi, there seems to be a bug when trying to use visual SSL with a number of channels other than 3. I think the error comes from the visual SSL type, around row 280 here:

# send a mock image tensor to instantiate parameters
self.forward(torch.randn(1, 3, image_size, image_size))
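A sketch of the kind of change the report points toward: derive the mock input shape from a `channels` parameter instead of a literal 3. `mock_image_shape` and `channels` are hypothetical names. With PyTorch, the resulting tuple could be passed as `torch.randn(*shape)`.

```python
def mock_image_shape(image_size: int, channels: int = 3, batch: int = 1):
    """Shape of a dummy image batch used only to instantiate parameters."""
    if channels < 1 or image_size < 1:
        raise ValueError("channels and image_size must be positive")
    return (batch, channels, image_size, image_size)

print(mock_image_shape(256))              # (1, 3, 256, 256)
print(mock_image_shape(256, channels=1))  # (1, 1, 256, 256)
```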

opened by ethancohen123 4

Allow other types of visual SSL when initiating CLIP

In the following code as part of CLIP.__init__

        if use_visual_ssl:
            if visual_ssl_type == 'simsiam':
                ssl_type = SimSiam
            elif visual_ssl_type == 'simclr':
                ssl_type = partial(SimCLR, temperature = simclr_temperature)
            else:
                raise ValueError(f'unknown visual_ssl_type')

            self.visual_ssl = ssl_type(
                self.visual_transformer,
                image_size = visual_image_size,
                hidden_layer = visual_ssl_hidden_layer
            )

the visual self-supervised learning is hardcoded. I would suggest changing this to accept the visual SSL module as an argument when instantiating CLIP to allow flexibility in the same manner as it does for the image encoder and text encoder.

Example:

barlow = BarlowTwins(augmentation_fns)
clip = CLIP(..., visual_ssl=barlow)
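A sketch of how the constructor logic could accept either a ready-made module or one of the built-in names. The `SimSiam`, `SimCLR` and `BarlowTwins` classes below are empty stand-ins defined only so the example runs, the default values are arbitrary, and `resolve_visual_ssl` is a hypothetical helper, not part of x-clip.

```python
from functools import partial

class SimSiam:  # stand-in, not the real implementation
    def __init__(self, net, image_size, hidden_layer):
        self.net = net

class SimCLR:  # stand-in, not the real implementation
    def __init__(self, net, image_size, hidden_layer, temperature=0.1):
        self.net, self.temperature = net, temperature

class BarlowTwins:  # stand-in for a user-supplied SSL module
    def __init__(self, augmentation_fns):
        self.augmentation_fns = augmentation_fns

def resolve_visual_ssl(net, image_size, visual_ssl=None, visual_ssl_type="simsiam",
                       hidden_layer=-1, simclr_temperature=0.1):
    """Return the caller's module if given, else build a built-in one by name."""
    if visual_ssl is not None:
        return visual_ssl
    builders = {
        "simsiam": SimSiam,
        "simclr": partial(SimCLR, temperature=simclr_temperature),
    }
    if visual_ssl_type not in builders:
        raise ValueError(f"unknown visual_ssl_type: {visual_ssl_type!r}")
    return builders[visual_ssl_type](net, image_size=image_size, hidden_layer=hidden_layer)

barlow = BarlowTwins(augmentation_fns=[])
print(resolve_visual_ssl("vit", 256, visual_ssl=barlow) is barlow)               # True
print(type(resolve_visual_ssl("vit", 256, visual_ssl_type="simclr")).__name__)  # SimCLR
```

Keeping the string-based path as a fallback means existing calls would continue to work while custom modules bypass the hardcoded branch.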

opened by Froskekongen 4

Extract Text and Image Latents

Hi, in the current implementation we can only extract the text and image embeddings (by setting return_encodings=True), which are obtained before the latent linear layers are applied. Wouldn't it be better to add an option to extract the latent embeddings? This also matters because, with the current code, it is impossible to extract the similarity matrix between a batch of images and a batch of text.
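To make the request concrete, the sketch below walks through the two steps in question with tiny hand-written matrices: project each encoder output into the shared latent space, then take pairwise dot products to get a text-by-image similarity matrix. The weights and the helper names `project` and `similarity_matrix` are made up for illustration. Normalization and temperature are omitted to keep it short.

```python
def project(vec, weight):
    """Apply a bias-free linear layer; weight has shape (out_dim, in_dim)."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]

def similarity_matrix(text_latents, image_latents):
    """Entry [i][j] is the dot product of text i with image j."""
    return [[sum(a * b for a, b in zip(t, im)) for im in image_latents]
            for t in text_latents]

to_text_latent = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # 3-dim encoding -> 2-dim latent
to_image_latent = [[0.0, 1.0], [1.0, 0.0]]           # 2-dim encoding -> 2-dim latent

text_encodings = [[1.0, 2.0, 9.0], [0.0, 1.0, 9.0]]  # batch of two texts
image_encodings = [[3.0, 4.0]]                       # batch of one image

text_latents = [project(t, to_text_latent) for t in text_encodings]
image_latents = [project(i, to_image_latent) for i in image_encodings]
print(similarity_matrix(text_latents, image_latents))  # [[10.0], [3.0]]
```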

opened by mmsamiei 2

NaN with mock data

Hi lucidrains,

Try this and the loss will become NaN within 100 steps (latest GitHub code). The loss looks fine before the NaN.

import torch
torch.backends.cudnn.allow_tf32 = True
torch.backends.cuda.matmul.allow_tf32 = True    
torch.backends.cudnn.benchmark = True

import random
import numpy as np
seed = 42
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)

num_text_tokens = 10000
batch_sz = 12
text_seq_len = 256
visual_image_size = 256

# mock data

data_sz = 1000
all_text = torch.randint(0, num_text_tokens, (data_sz, text_seq_len)).cuda()
all_images = torch.randn(data_sz, 3, visual_image_size, visual_image_size).cuda()

text = torch.zeros((batch_sz, text_seq_len), dtype=torch.long).cuda()
images = torch.zeros((batch_sz, 3, visual_image_size, visual_image_size)).cuda()

##########################################################################################

import wandb
import datetime
wandb.init(project="Test", name=datetime.datetime.today().strftime('%Y-%m-%d-%H-%M-%S'), save_code=False)

from x_clip import CLIP

clip = CLIP(
    dim_text = 512,
    dim_image = 512,
    dim_latent = 512,
    num_text_tokens = num_text_tokens,
    text_enc_depth = 6,
    text_seq_len = text_seq_len,
    text_heads = 8,
    visual_enc_depth = 6,
    visual_image_size = visual_image_size,
    visual_patch_size = 32,
    visual_heads = 8,
    use_all_token_embeds = False,           # whether to use fine-grained contrastive learning (FILIP)
    decoupled_contrastive_learning = True,  # use decoupled contrastive learning (DCL) objective function, removing positive pairs from the denominator of the InfoNCE loss (CLOOB + DCL)
    extra_latent_projection = True,         # whether to use separate projections for text-to-image vs image-to-text comparisons (CLOOB)
    use_visual_ssl = True,                  # whether to do self supervised learning on images
    visual_ssl_type = 'simclr',             # can be either 'simclr' or 'simsiam', depending on using DeCLIP or SLIP
    use_mlm = False,                        # use masked language learning (MLM) on text (DeCLIP)
    text_ssl_loss_weight = 0.05,            # weight for text MLM loss
    image_ssl_loss_weight = 0.05            # weight for image self-supervised learning loss
).cuda()

optimizer = torch.optim.Adam(clip.parameters(), lr=1e-4, betas=(0.9, 0.99))

for step in range(999999):
    for i in range(batch_sz):
        data_id = random.randrange(0, data_sz - 1)
        text[i] = all_text[data_id]
        images[i] = all_images[data_id]

    loss = clip(
        text,
        images,
        freeze_image_encoder = False,   # whether to freeze image encoder if using a pretrained image net, proposed by LiT paper
        return_loss = True              # needs to be set to True to return contrastive loss
    )
    clip.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(clip.parameters(), 1.0)
    optimizer.step()

    now_loss = loss.item()
    wandb.log({"loss": now_loss}, step = step)
    print(step, now_loss)

    if torch.isnan(loss):  # stop at the first NaN loss
        break

opened by BlinkDL 1

Unable to train to convergence (small dataset)

Hi, nice work with x-clip. I'm hoping to play around with it and eventually combine it into your DALLE2 work.

Currently having some trouble training on roughly 30k image-text pairs. The loss eventually goes negative and starts producing NaNs. I've dropped the learning rate (to 1e-4) and I'm clipping gradients (max_norm=0.5).

Any thoughts on what are sane training params/configs on such a small dataset using x-clip?
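One thing worth checking, offered as an assumption since the report does not list its settings: with a standard InfoNCE objective the per-pair loss cannot drop below zero, whereas a decoupled variant (positive pair removed from the denominator) can. The sketch below computes both for one row of similarities. `info_nce` and `decoupled` are illustrative helpers, not x-clip functions, and the example says nothing about where the NaNs come from.

```python
import math

def info_nce(sims, pos_index, temperature=1.0):
    """-log softmax of the positive pair; never below zero."""
    logits = [s / temperature for s in sims]
    m = max(logits)
    log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denom - logits[pos_index]

def decoupled(sims, pos_index, temperature=1.0):
    """Same as info_nce, but the positive is left out of the denominator."""
    logits = [s / temperature for s in sims]
    negatives = [l for i, l in enumerate(logits) if i != pos_index]
    m = max(negatives)
    log_denom = m + math.log(sum(math.exp(l - m) for l in negatives))
    return log_denom - logits[pos_index]

row = [0.9, 0.1, -0.2]  # similarity of one text to three images; index 0 is the match
print(info_nce(row, 0, temperature=0.1) >= 0)  # True
print(decoupled(row, 0, temperature=0.1) < 0)  # True
```

So a negative loss on its own may simply reflect the objective in use rather than a bug.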

opened by jacobwjs 9

Releases (0.12.0)

0.12.0 (Dec 2, 2022)
0.11.0 (Oct 16, 2022)
0.10.0 (Sep 14, 2022)
0.9.0 (Aug 17, 2022)
0.8.4 (Aug 5, 2022)
0.8.3 (Aug 4, 2022)
0.8.2 (Aug 3, 2022)
0.8.1 (Aug 3, 2022)
0.8.0 (Aug 3, 2022)
0.7.4 (Aug 3, 2022)
0.7.3 (Aug 3, 2022)
0.7.2 (Aug 3, 2022)
0.7.1 (Jul 30, 2022)
v0.7.0 (Jun 23, 2022)
v0.6.1 (May 24, 2022)
0.6.1 (May 24, 2022)
0.6.0 (May 24, 2022)
0.5.1 (Apr 29, 2022)
0.5.0 (Apr 15, 2022)
0.4.6 (Apr 13, 2022)
0.4.5 (Apr 13, 2022)
0.4.4 (Apr 13, 2022)
0.4.3 (Apr 12, 2022)
0.4.2 (Apr 12, 2022)
0.4.1 (Apr 12, 2022)
0.4.0 (Apr 6, 2022)
0.3.0 (Mar 1, 2022)
0.2.4 (Mar 1, 2022)
0.2.3 (Feb 5, 2022)
0.2.2 (Jan 27, 2022)

Owner

Phil Wang

Working with Attention. It's all we need
