GLIP: Grounded Language-Image Pre-training

Updates

12/06/2021: GLIP paper on arXiv: https://arxiv.org/abs/2112.03857. Code and model are under internal review and will be released soon. Stay tuned!

11/23/2021: Project page built.

Introduction

This repository is the project page for GLIP and contains the instructions needed to reproduce the results presented in the paper. The paper presents a grounded language-image pre-training (GLIP) model for learning object-level, language-aware, and semantic-rich visual representations. GLIP unifies object detection and phrase grounding for pre-training. The unification brings two benefits: 1) it allows GLIP to learn from both detection and grounding data, improving both tasks and bootstrapping a good grounding model; 2) GLIP can leverage massive image-text pairs by generating grounding boxes in a self-training fashion, making the learned representations semantic-rich. In our experiments, we pre-train GLIP on 27M grounding data, including 3M human-annotated and 24M web-crawled image-text pairs. The learned representations demonstrate strong zero-shot and few-shot transferability to various object-level recognition tasks:

When directly evaluated on COCO and LVIS (without seeing any images in COCO during pre-training), GLIP achieves 49.8 AP and 26.9 AP, respectively, surpassing many supervised baselines.
After fine-tuning on COCO, GLIP achieves 60.8 AP on val and 61.5 AP on test-dev, surpassing the prior SoTA.
When transferred to 13 downstream object detection tasks, a few-shot GLIP rivals a fully supervised Dynamic Head.

Supervised baselines on COCO object detection (AP): Faster-RCNN w/ ResNet50 (40.2) or ResNet101 (42.0) from Detectron2, and DyHead w/ Swin-Tiny (49.7).
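
The unification of detection and phrase grounding mentioned in the introduction can be illustrated with a small sketch. This is not the released GLIP code (which is still under review); it only assumes, following the paper's description, that detection is cast as grounding by joining the class names into one text prompt and scoring every image region against every phrase with a dot product. The function names (`build_prompt`, `alignment_scores`, `classify_regions`) and the toy feature vectors are hypothetical, and a real model would use learned region and token features from image and language encoders.

```python
from typing import List, Sequence, Tuple


def build_prompt(class_names: Sequence[str]) -> Tuple[str, List[Tuple[int, int]]]:
    """Join class names into a single text prompt.

    Returns the prompt and, for each class, the (start, end) character
    span of its name inside the prompt.
    """
    separator = ". "
    spans: List[Tuple[int, int]] = []
    cursor = 0
    for index, name in enumerate(class_names):
        if index > 0:
            cursor += len(separator)
        spans.append((cursor, cursor + len(name)))
        cursor += len(name)
    return separator.join(class_names), spans


def alignment_scores(
    region_feats: Sequence[Sequence[float]],
    phrase_feats: Sequence[Sequence[float]],
) -> List[List[float]]:
    """Region-phrase alignment: scores[i][j] = dot(region i, phrase j)."""
    return [
        [sum(r * p for r, p in zip(region, phrase)) for phrase in phrase_feats]
        for region in region_feats
    ]


def classify_regions(
    region_feats: Sequence[Sequence[float]],
    phrase_feats: Sequence[Sequence[float]],
    class_names: Sequence[str],
) -> List[str]:
    """Assign each region the class whose phrase has the highest score."""
    scores = alignment_scores(region_feats, phrase_feats)
    return [class_names[row.index(max(row))] for row in scores]


if __name__ == "__main__":
    names = ["person", "bicycle", "hair drier"]
    prompt, spans = build_prompt(names)
    print(prompt)  # person. bicycle. hair drier
    print(spans)   # [(0, 6), (8, 15), (17, 27)]

    # Toy features (hypothetical): one vector per phrase, one per region.
    phrase_feats = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    region_feats = [[0.9, 0.1, 0.0], [0.1, 0.2, 0.7]]
    print(classify_regions(region_feats, phrase_feats, names))
    # ['person', 'hair drier']
```

Because the class list is just text, the same scoring applies unchanged to free-form captions in grounding data, which is what lets a single model train on both kinds of supervision.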

Citations

Please consider citing this paper if you use the code:

@inproceedings{harold_GLIP2021,
      title={Grounded Language-Image Pre-training},
      author={Liunian Harold Li* and Pengchuan Zhang* and Haotian Zhang* and Jianwei Yang and Chunyuan Li and Yiwu Zhong and Lijuan Wang and Lu Yuan and Lei Zhang and Jenq-Neng Hwang and Kai-Wei Chang and Jianfeng Gao},
      year={2021},
      booktitle={arXiv preprint arXiv:2112.03857},
}

Owner

Microsoft
