X-modaler is a versatile and high-performance codebase for cross-modal analytics.

Overview

X-modaler

X-modaler is a versatile and high-performance codebase for cross-modal analytics. This codebase unifies comprehensive high-quality modules in state-of-the-art vision-language techniques, which are organized in a standardized and user-friendly fashion.

The original paper can be found here.

Installation

See installation instructions.

Requiremenets

  • Linux or macOS with Python ≥ 3.6
  • PyTorch ≥ 1.8 and torchvision that matches the PyTorch installation. Install them together at pytorch.org to make sure of this
  • fvcore
  • pytorch_transformers
  • jsonlines
  • pycocotools

Getting Started

See Getting Started with X-modaler

Training & Evaluation in Command Line

We provide a script in "train_net.py", that is made to train all the configs provided in X-modaler. You may want to use it as a reference to write your own training script.

To train a model(e.g., UpDown) with "train_net.py", first setup the corresponding datasets following datasets, then run:

# Teacher Force
python train_net.py --num-gpus 4 \
 	--config-file configs/image_caption/updown.yaml

# Reinforcement Learning
python train_net.py --num-gpus 4 \
 	--config-file configs/image_caption/updown_rl.yaml

Model Zoo and Baselines

A large set of baseline results and trained models are available here.

Image Captioning
Attention Show, attend and tell: Neural image caption generation with visual attention ICML 2015
LSTM-A3 Boosting image captioning with attributes ICCV 2017
Up-Down Bottom-up and top-down attention for image captioning and visual question answering CVPR 2018
GCN-LSTM Exploring visual relationship for image captioning ECCV 2018
Transformer Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning ACL 2018
Meshed-Memory Meshed-Memory Transformer for Image Captioning CVPR 2020
X-LAN X-Linear Attention Networks for Image Captioning CVPR 2020
Video Captioning
MP-LSTM Translating Videos to Natural Language Using Deep Recurrent Neural Networks NAACL HLT 2015
TA Describing Videos by Exploiting Temporal Structure ICCV 2015
Transformer Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning ACL 2018
TDConvED Temporal Deformable Convolutional Encoder-Decoder Networks for Video Captioning AAAI 2019
Vision-Language Pretraining
Uniter UNITER: UNiversal Image-TExt Representation Learning ECCV 2020
TDEN Scheduled Sampling in Vision-Language Pretraining with Decoupled Encoder-Decoder Network AAAI 2021

Image Captioning on MSCOCO (Cross-Entropy Loss)

Name Model [email protected] [email protected] [email protected] [email protected] METEOR ROUGE-L CIDEr-D SPICE
LSTM-A3 GoogleDrive 75.3 59.0 45.4 35.0 26.7 55.6 107.7 19.7
Attention GoogleDrive 76.4 60.6 46.9 36.1 27.6 56.6 113.0 20.4
Up-Down GoogleDrive 76.3 60.3 46.6 36.0 27.6 56.6 113.1 20.7
GCN-LSTM GoogleDrive 76.8 61.1 47.6 36.9 28.2 57.2 116.3 21.2
Transformer GoogleDrive 76.4 60.3 46.5 35.8 28.2 56.7 116.6 21.3
Meshed-Memory GoogleDrive 76.3 60.2 46.4 35.6 28.1 56.5 116.0 21.2
X-LAN GoogleDrive 77.5 61.9 48.3 37.5 28.6 57.6 120.7 21.9
TDEN GoogleDrive 75.5 59.4 45.7 34.9 28.7 56.7 116.3 22.0

Image Captioning on MSCOCO (CIDEr Score Optimization)

Name Model [email protected] [email protected] [email protected] [email protected] METEOR ROUGE-L CIDEr-D SPICE
LSTM-A3 GoogleDrive 77.9 61.5 46.7 35.0 27.1 56.3 117.0 20.5
Attention GoogleDrive 79.4 63.5 48.9 37.1 27.9 57.6 123.1 21.3
Up-Down GoogleDrive 80.1 64.3 49.7 37.7 28.0 58.0 124.7 21.5
GCN-LSTM GoogleDrive 80.2 64.7 50.3 38.5 28.5 58.4 127.2 22.1
Transformer GoogleDrive 80.5 65.4 51.1 39.2 29.1 58.7 130.0 23.0
Meshed-Memory GoogleDrive 80.7 65.5 51.4 39.6 29.2 58.9 131.1 22.9
X-LAN GoogleDrive 80.4 65.2 51.0 39.2 29.4 59.0 131.0 23.2
TDEN GoogleDrive 81.3 66.3 52.0 40.1 29.6 59.8 132.6 23.4

Video Captioning on MSVD

Name Model [email protected] [email protected] [email protected] [email protected] METEOR ROUGE-L CIDEr-D SPICE
MP-LSTM GoogleDrive 77.0 65.6 56.9 48.1 32.4 68.1 73.1 4.8
TA GoogleDrive 80.4 68.9 60.1 51.0 33.5 70.0 77.2 4.9
Transformer GoogleDrive 79.0 67.6 58.5 49.4 33.3 68.7 80.3 4.9
TDConvED GoogleDrive 81.6 70.4 61.3 51.7 34.1 70.4 77.8 5.0

Video Captioning on MSR-VTT

Name Model [email protected] [email protected] [email protected] [email protected] METEOR ROUGE-L CIDEr-D SPICE
MP-LSTM GoogleDrive 73.6 60.8 49.0 38.6 26.0 58.3 41.1 5.6
TA GoogleDrive 74.3 61.8 50.3 39.9 26.4 59.4 42.9 5.8
Transformer GoogleDrive 75.4 62.3 50.0 39.2 26.5 58.7 44.0 5.9
TDConvED GoogleDrive 76.4 62.3 49.9 38.9 26.3 59.0 40.7 5.7

Visual Question Answering

Name Model Overall Yes/No Number Other
Uniter GoogleDrive 70.1 86.8 53.7 59.6
TDEN GoogleDrive 71.9 88.3 54.3 62.0

Caption-based image retrieval on Flickr30k

Name Model R1 R5 R10
Uniter GoogleDrive 61.6 87.7 92.8
TDEN GoogleDrive 62.0 86.6 92.4

Visual commonsense reasoning

Name Model Q -> A QA -> R Q -> AR
Uniter GoogleDrive 73.0 75.3 55.4
TDEN GoogleDrive 75.0 76.5 57.7

License

X-modaler is released under the Apache License, Version 2.0.

Citing X-modaler

If you use X-modaler in your research, please use the following BibTeX entry.

@inproceedings{Xmodaler2021,
  author =       {Yehao Li, Yingwei Pan, Jingwen Chen, Ting Yao, and Tao Mei},
  title =        {X-modaler: A Versatile and High-performance Codebase for Cross-modal Analytics},
  booktitle =    {Proceedings of the 29th ACM international conference on Multimedia},
  year =         {2021}
}
YOLOX is a high-performance anchor-free YOLO, exceeding yolov3~v5 with ONNX, TensorRT, ncnn, and OpenVINO supported.

Introduction YOLOX is an anchor-free version of YOLO, with a simpler design but better performance! It aims to bridge the gap between research and ind

7.7k Jan 03, 2023
PyTorch code for the paper: FeatMatch: Feature-Based Augmentation for Semi-Supervised Learning

FeatMatch: Feature-Based Augmentation for Semi-Supervised Learning This is the PyTorch implementation of our paper: FeatMatch: Feature-Based Augmentat

43 Nov 19, 2022
Sequence-tagging using deep learning

Classification using Deep Learning Requirements PyTorch version = 1.9.1+cu111 Python version = 3.8.10 PyTorch-Lightning version = 1.4.9 Huggingface

Vineet Kumar 2 Dec 20, 2022
PyTorch DepthNet Training on Still Box dataset

DepthNet training on Still Box Project page This code can replicate the results of our paper that was published in UAVg-17. If you use this repo in yo

Clément Pinard 115 Nov 21, 2022
Understanding Convolutional Neural Networks from Theoretical Perspective via Volterra Convolution

nnvolterra Run Code Compile first: make compile Run all codes: make all Test xconv: make npxconv_test MNIST dataset needs to be downloaded, converted

1 May 24, 2022
This is just a funny project that we want to see AutoEncoder (AE) can actually work to enhance the features we want

Funny_muscle_enhancer :) 1.Discription: This is just a funny project that we want to see AutoEncoder (AE) can actually work on the some features. We w

Jing-Yao Chen (Jacob) 8 Oct 01, 2022
This repository implements Douzero's interface to IGCA.

douzero-interface-for-ICGA This repository implements Douzero's interface to ICGA. ./douzero: This directory stores Doudizhu AI projects. ./interface:

zhanggenjin 4 Aug 07, 2022
RP-GAN: Stable GAN Training with Random Projections

RP-GAN: Stable GAN Training with Random Projections This repository contains a reference implementation of the algorithm described in the paper: Behna

Ayan Chakrabarti 20 Sep 18, 2021
This is a library for training and applying sparse fine-tunings with torch and transformers.

This is a library for training and applying sparse fine-tunings with torch and transformers. Please refer to our paper Composable Sparse Fine-Tuning f

Cambridge Language Technology Lab 37 Dec 30, 2022
Code for the paper "JANUS: Parallel Tempered Genetic Algorithm Guided by Deep Neural Networks for Inverse Molecular Design"

JANUS: Parallel Tempered Genetic Algorithm Guided by Deep Neural Networks for Inverse Molecular Design This repository contains code for the paper: JA

Aspuru-Guzik group repo 55 Nov 29, 2022
RID-Noise: Towards Robust Inverse Design under Noisy Environments

This is code of RID-Noise. Reproduce RID-Noise Results Toy tasks Please refer to the notebook ridnoise.ipynb to view experiments on three toy tasks. B

Thyrix 2 Nov 23, 2022
[NeurIPS2021] Code Release of K-Net: Towards Unified Image Segmentation

K-Net: Towards Unified Image Segmentation Introduction This is an official release of the paper K-Net:Towards Unified Image Segmentation. K-Net will a

Wenwei Zhang 423 Jan 02, 2023
This folder contains the python code of UR5E's advanced forward kinematics model.

This folder contains the python code of UR5E's advanced forward kinematics model. By entering the angle of the joint of UR5e, the detailed coordinates of up to 48 points around the robot arm can be c

Qiang Wang 4 Sep 17, 2022
A Text Attention Network for Spatial Deformation Robust Scene Text Image Super-resolution (CVPR2022)

A Text Attention Network for Spatial Deformation Robust Scene Text Image Super-resolution (CVPR2022) https://arxiv.org/abs/2203.09388 Jianqi Ma, Zheto

MA Jianqi, shiki 104 Jan 05, 2023
Node for thenewboston digital currency network.

Project setup For project setup see INSTALL.rst Community Join the community to stay updated on the most recent developments, project roadmaps, and ra

thenewboston 27 Jul 08, 2022
Deep Learning Visuals contains 215 unique images divided in 23 categories

Deep Learning Visuals contains 215 unique images divided in 23 categories (some images may appear in more than one category). All the images were originally published in my book "Deep Learning with P

Daniel Voigt Godoy 1.3k Dec 28, 2022
Official PyTorch implementation of the paper: Improving Graph Neural Network Expressivity via Subgraph Isomorphism Counting.

Improving Graph Neural Network Expressivity via Subgraph Isomorphism Counting Official PyTorch implementation of the paper: Improving Graph Neural Net

Giorgos Bouritsas 58 Dec 31, 2022
Supervised Contrastive Learning for Product Matching

Contrastive Product Matching This repository contains the code and data download links to reproduce the experiments of the paper "Supervised Contrasti

Web-based Systems Group @ University of Mannheim 18 Dec 10, 2022
PerfFuzz: Automatically Generate Pathological Inputs for C/C++ programs

PerfFuzz Performance problems in software can arise unexpectedly when programs are provided with inputs that exhibit pathological behavior. But how ca

Caroline Lemieux 125 Nov 18, 2022
Code and data form the paper BERT Got a Date: Introducing Transformers to Temporal Tagging

BERT Got a Date: Introducing Transformers to Temporal Tagging Satya Almasian*, Dennis Aumiller*, and Michael Gertz Heidelberg University Contact us vi

54 Dec 04, 2022