Towers of Babel: Combining Images, Language, and 3D Geometry for Learning Multimodal Vision. ICCV 2021.

Download links and PyTorch implementation of "Towers of Babel: Combining Images, Language, and 3D Geometry for Learning Multimodal Vision", ICCV 2021.

Xiaoshi Wu, Hadar Averbuch-Elor, Jin Sun, Noah Snavely. ICCV 2021.

Project Page | Paper

The WikiScenes Dataset

  1. Image and Textual Descriptions: WikiScenes contains 63K images with captions of 99 cathedrals. We provide two versions for download:

    • Low-res version used in our experiments (maximum width 200 px, aspect ratio preserved): (1.9GB .zip file)
    • Higher-res version (longer side at most 1200 px, aspect ratio preserved): (19.4GB .zip file)

    Licenses for the images are provided here: (LicenseInfo.json file)

    Data Structure

    WikiScenes is organized recursively, following the tree structure in Wikimedia. Each semantic category (e.g. cathedral) contains the following recursive structure:

    ----0 (e.g., "milano cathedral duomo milan milano italy italia")
    --------0 (e.g., "Exterior of the Duomo (Milan)")
    ----------------0 (e.g., "Duomo (Milan) in art - exterior")
    ----------------1
    ----------------...
    ----------------K0-0
    ----------------category.json
    ----------------pictures (contains all pictures in current hierarchy level)
    --------1
    --------...
    --------K0
    --------category.json
    --------pictures (contains all pictures in current hierarchy level)
    ----1
    ----2
    ----...
    ----N
    ----category.json
    

    category.json is a dictionary of the following format:

    {
        "max_index": SUB-DIR-NUMBER,
        "pairs" :    {
                        CATEGORY-NAME: SUB-DIR-NAME
                    },
        "pictures" : {
                        PICTURE-NAME: {
                                            "caption": CAPTION-DATA,
                                            "url": URL-DATA,
                                            "properties": PROPERTIES
                                    }
                    }
    }
    

    where:

    1. SUB-DIR-NUMBER is the total number of subcategories
    2. CATEGORY-NAME is the name of the category (e.g., "milano cathedral duomo milan milano italy italia")
    3. SUB-DIR-NAME is the name of the sub-folder (e.g., "0")
    4. PICTURE-NAME is the name of the jpg file located within the pictures folder
    5. CAPTION-DATA contains the caption, and URL-DATA contains the URL from which the image was scraped.
    6. PROPERTIES is a list of properties pre-computed for the image-caption pair (e.g. estimated language of caption).
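
The recursive layout above can be traversed with a short helper. The following is a minimal sketch, not code from this repository: the function name `iter_captions` is hypothetical, and it assumes that CAPTION-DATA is a plain string and that every sub-directory listed under "pairs" contains its own category.json.

```python
import json
from pathlib import Path
from typing import Iterator, Tuple


def iter_captions(root: Path) -> Iterator[Tuple[Path, str]]:
    """Yield (image_path, caption) pairs for `root` and all of its sub-categories."""
    meta_file = root / "category.json"
    if not meta_file.is_file():
        return  # Not a category directory; nothing to yield.
    meta = json.loads(meta_file.read_text(encoding="utf-8"))

    # Pictures stored at the current hierarchy level.
    for name, info in meta.get("pictures", {}).items():
        # Assumption: the caption is a plain string.
        yield root / "pictures" / name, info.get("caption", "")

    # "pairs" maps CATEGORY-NAME to SUB-DIR-NAME; recurse into each sub-directory.
    for sub_dir in meta.get("pairs", {}).values():
        yield from iter_captions(root / str(sub_dir))
```

Calling `list(iter_captions(Path(...)))` on a landmark folder would then collect every image-caption pair below it, whatever the depth of the tree.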
  2. Keypoint correspondences: We also provide keypoint correspondences between pixels of images from the same landmark: (982MB .zip file)

    Data Structure

     {
         "image_id" : {
                         "kp_id": (x, y),
                     }
     }
    

    where:

    1. image_id is the id of each image.
    2. kp_id is the id of keypoints, which is unique across the whole dataset.
    3. (x, y) is the location of the keypoint in this image.
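
Because kp_id is unique across the whole dataset, two images correspond wherever they share a kp_id. The sketch below illustrates this on an in-memory dictionary. The function names are hypothetical, and it assumes that keypoint ids are stored as JSON strings and coordinates as two-element lists once the file has been loaded with `json.load`.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float]
Correspondences = Dict[str, Dict[str, Sequence[float]]]


def index_by_keypoint(corr: Correspondences) -> Dict[str, List[Tuple[str, Point]]]:
    """Invert {image_id: {kp_id: (x, y)}} into {kp_id: [(image_id, (x, y)), ...]}."""
    tracks = defaultdict(list)
    for image_id, keypoints in corr.items():
        for kp_id, xy in keypoints.items():
            tracks[kp_id].append((image_id, (xy[0], xy[1])))
    return dict(tracks)


def shared_keypoints(corr: Correspondences, image_a: str, image_b: str) -> Dict[str, Tuple[Point, Point]]:
    """Return {kp_id: (xy_in_a, xy_in_b)} for keypoints that appear in both images."""
    kps_a = corr.get(image_a, {})
    kps_b = corr.get(image_b, {})
    common = kps_a.keys() & kps_b.keys()
    return {k: ((kps_a[k][0], kps_a[k][1]), (kps_b[k][0], kps_b[k][1])) for k in common}
```

The inverted index is convenient for finding all views of one physical point, while `shared_keypoints` gives the pixel-to-pixel matches for a single image pair.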
  3. COLMAP reconstructions: We provide the full 3D models used for computing keypoint correspondences: (1GB .zip file)

    To view these models, download and install COLMAP. The reconstructions are organized by landmarks. Each landmark folder contains all the reconstructions associated with that landmark. Each reconstruction contains 3 files:

    1. points3d.txt that contains one line of data for each 3D point associated with the reconstruction. The format for each point is: POINT3D_ID, X, Y, Z, R, G, B, ERROR, TRACK[] as (IMAGE_ID, POINT2D_IDX).
    2. images.txt that contains two lines of data for each image associated with the reconstruction. The format of the first line is: IMAGE_ID, QW, QX, QY, QZ, TX, TY, TZ, CAMERA_ID, NAME. The format of the second line is: POINTS2D[] as (X, Y, POINT3D_ID)
    3. cameras.txt that contains one line of data for each camera associated with the reconstruction according to the following format: CAMERA_ID, MODEL, WIDTH, HEIGHT, PARAMS[]

    Please refer to COLMAP's tutorial for further instructions on how to view these reconstructions.
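
For programmatic access without the COLMAP GUI, images.txt can be read directly. The following is a minimal sketch that assumes the standard COLMAP text export: fields are separated by whitespace, comment lines start with `#`, and an image with no 2D points has an empty second line. The names `ColmapImage` and `parse_images_txt` are hypothetical and not part of COLMAP or this repository.

```python
from typing import Dict, List, NamedTuple, Tuple


class ColmapImage(NamedTuple):
    image_id: int
    qvec: Tuple[float, ...]                   # QW, QX, QY, QZ
    tvec: Tuple[float, ...]                   # TX, TY, TZ
    camera_id: int
    name: str
    points2d: List[Tuple[float, float, int]]  # (X, Y, POINT3D_ID)


def parse_images_txt(text: str) -> Dict[int, ColmapImage]:
    """Parse the contents of images.txt into {IMAGE_ID: ColmapImage}."""
    # Drop comments but keep blank lines, since a blank line can be a valid
    # (empty) POINTS2D line for an image without observations.
    lines = [ln for ln in text.splitlines() if not ln.startswith("#")]
    images: Dict[int, ColmapImage] = {}
    for header, points in zip(lines[0::2], lines[1::2]):
        # maxsplit=9 keeps file names containing spaces in one piece.
        f = header.split(maxsplit=9)
        vals = points.split()
        pts = [
            (float(x), float(y), int(pid))
            for x, y, pid in zip(vals[0::3], vals[1::3], vals[2::3])
        ]
        image_id = int(f[0])
        images[image_id] = ColmapImage(
            image_id=image_id,
            qvec=tuple(float(v) for v in f[1:5]),
            tvec=tuple(float(v) for v in f[5:8]),
            camera_id=int(f[8]),
            name=f[9],
            points2d=pts,
        )
    return images
```

In COLMAP's export, a POINT3D_ID of -1 marks a 2D keypoint that was not triangulated, so such entries can be filtered out when only 3D-backed observations are needed.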

  4. Companion datasets for additional landmark categories: We provide download links for additional category types:

    Synagogues

    Images and captions (PENDING .zip file), correspondences (PENDING .zip file), reconstructions (PENDING .zip file)

    Mosques

    Images and captions (PENDING .zip file), correspondences (PENDING .zip file), reconstructions (PENDING .zip file)

Reproducing Results

  1. Minimum requirements. This project was originally developed with Python 3.6, PyTorch 1.0 and CUDA 9.0. Training requires at least one Titan X GPU (12 GB of memory).

  2. Set up your Python environment. Clone the repository and install the dependencies:

    conda create -n <environment_name> --file requirements.txt -c conda-forge/label/cf202003
    conda activate <environment_name>
    conda install scikit-learn=0.21
    pip install opencv-python
    
  3. Download the dataset. Download the data as detailed above, unzip it, and place it as follows: the images and textual descriptions in <project>/data/ and the correspondence file in <project>.

  4. Download pre-trained models. Download the initial weights (pre-trained on ImageNet) for the backbone model and place in <project>/models/weights/.

    Backbone | Initial Weights | Comments
    ResNet50 | resnet50-19c8e357.pth | PyTorch official model
  5. Train on the WikiScenes dataset. See instructions below. Note that the first run always takes longer for pre-processing. Some computations are cached afterwards.

Training, Inference and Evaluation

The directory launch contains template bash scripts for training, inference and evaluation.

Training. For each run, you need to specify two variables, EXP and RUN_ID. Running EXP=wiki RUN_ID=v01 ./launch/run_wikiscenes_resnet50.sh will create a directory ./logs/wikiscenes_corr/wiki/ with TensorBoard events and save snapshots in ./snapshots/wikiscenes_corr/wiki/v01.

Inference.

If you want to do inference with our pre-trained model, please make a directory and put the model there.

    mkdir -p ./snapshots/wikiscenes_corr/final/ours

Download our validation set, and unzip it.

    unzip val_seg.zip

Run sh ./launch/infer_val_wikiscenes.sh to predict masks. You can find the predicted masks in ./logs/masks.

If you want to evaluate your own models, you will also need to specify:

  • EXP and RUN_ID: the values you used for training;
  • OUTPUT_DIR: the path where the masks are saved;
  • SNAPSHOT: the model suffix, in the format e000Xs0.000.

Evaluation. To compute IoU of the masks, run sh ./launch/eval_seg.sh.
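
For reference, IoU (intersection over union) for a class is the number of pixels labelled with that class in both the prediction and the ground truth, divided by the number of pixels labelled with it in either. The sketch below is a generic pure-Python illustration of that definition on small label grids. It is not the logic of eval_seg.sh, and the function names are hypothetical.

```python
from typing import Dict, Sequence


def per_class_iou(
    pred: Sequence[Sequence[int]],
    target: Sequence[Sequence[int]],
    num_classes: int,
) -> Dict[int, float]:
    """Return {class: intersection / union} over two equally sized label masks."""
    inter = [0] * num_classes
    union = [0] * num_classes
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p == t:
                # The pixel counts once towards both intersection and union.
                inter[p] += 1
                union[p] += 1
            else:
                # The pixel extends the union of both classes involved.
                union[p] += 1
                union[t] += 1
    # Classes absent from both masks have an empty union and are skipped.
    return {c: inter[c] / union[c] for c in range(num_classes) if union[c] > 0}


def mean_iou(pred, target, num_classes: int) -> float:
    """Average the per-class IoU over the classes present in either mask."""
    ious = per_class_iou(pred, target, num_classes)
    return sum(ious.values()) / len(ious) if ious else 0.0
```

For example, with `pred = [[0, 1], [1, 1]]` and `target = [[0, 1], [0, 1]]`, class 0 has IoU 1/2 and class 1 has IoU 2/3.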

Pre-trained model

For testing, we provide our pre-trained ResNet50 model:

Backbone | Link
ResNet50 | model_enc_e024Xs-0.800.pth (157M)

Datasheet

We provide a datasheet for our dataset here.

License

The images in our dataset are provided by Wikimedia Commons under various free licenses. These licenses permit the use, study, derivation, and redistribution of these images—sometimes with restrictions, e.g. requiring attribution and with copyleft. We provide full license text and attribution for all images, make no modifications to any, and release these images under their original licenses. The associated captions are provided as a part of unstructured text in Wikimedia Commons, with rights to the original writers under the CC BY-SA 3.0 license. We modify these (as specified in our paper) and release such derivatives under the same license. We provide the rest of our dataset under a CC BY-NC-SA 4.0 license.

Citation

@inproceedings{Wu2021Towers,
 title={Towers of Babel: Combining Images, Language, and 3D Geometry for Learning Multimodal Vision},
 author={Wu, Xiaoshi and Averbuch-Elor, Hadar and Sun, Jin and Snavely, Noah},
 booktitle={ICCV},
 year={2021}
}

Acknowledgement

Our code is based on the implementation of Single-Stage Semantic Segmentation from Image Labels.
