MlTr: Multi-label Classification with Transformer

This is the official implementation of "MlTr: Multi-label Classification with Transformer".

Abstract

The task of multi-label image classification is to recognize all the object labels present in an image. Despite years of progress, small objects, similar objects, and objects with high conditional probability remain the main bottlenecks of previous convolutional neural network (CNN) based models, which are limited by the representational capacity of convolutional kernels. Recent vision transformer networks use the self-attention mechanism to extract features at pixel granularity, which expresses richer local semantic information but is insufficient for mining global spatial dependence. In this paper, we point out the three crucial problems that CNN-based methods encounter and explore the possibility of designing specific transformer modules to settle them. We put forward a Multi-label Transformer architecture (MlTr) constructed with window partitioning, in-window pixel attention, and cross-window attention, which particularly improves the performance of multi-label image classification tasks. The proposed MlTr shows state-of-the-art results on various prevalent multi-label datasets such as MS-COCO, Pascal-VOC, and NUS-WIDE with 88.5%, 95.8%, and 65.5% respectively.
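The window partitioning step named in the abstract can be illustrated with a minimal sketch. It is not taken from this repository: the function name `partition_windows`, the non-overlapping layout, and the window size of 2 are assumptions chosen for illustration, and the abstract does not state the actual window size or how in-window and cross-window attention are computed.

```python
from typing import List


def partition_windows(feature_map: List[List[int]], window: int) -> List[List[int]]:
    """Split an H x W grid into non-overlapping window x window tiles.

    Hypothetical helper for illustration only. Each returned tile is
    flattened in row-major order; in a windowed transformer, attention
    would typically be computed among the elements of one tile.
    """
    height = len(feature_map)
    width = len(feature_map[0])
    if height % window or width % window:
        raise ValueError("height and width must be divisible by the window size")

    windows = []
    # Walk the grid tile by tile, top-to-bottom then left-to-right.
    for top in range(0, height, window):
        for left in range(0, width, window):
            windows.append([
                feature_map[r][c]
                for r in range(top, top + window)
                for c in range(left, left + window)
            ])
    return windows


# A 4x4 grid whose entries are their own row-major index.
grid = [[r * 4 + c for c in range(4)] for r in range(4)]
windows = partition_windows(grid, window=2)
```

Here `windows[0]` holds the four positions of the top-left tile. In this sketch, in-window attention would relate positions inside one tile, while cross-window attention would relate the tiles to each other.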

Pretrained models (results on MS-COCO 2014)

name	resolution	mAP (%)	params (M)	model	log
mltr-s	224x224	81.9	33	coming soon	coming soon
mltr-m	384x384	86.8	62	coming soon	coming soon
mltr-l	384x384	88.5	108	coming soon	coming soon
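For readers unfamiliar with the metric in the table, the following is a minimal sketch of one common way to compute mean average precision (mAP) for multi-label predictions. It is an assumption about the general metric, not this repository's evaluation script: the function names are hypothetical, and details such as tie handling or skipping classes without positives may differ from the protocol used for the reported numbers.

```python
from typing import List, Sequence


def average_precision(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AP for one class: mean of the precision at the rank of each positive."""
    # Rank images from highest to lowest score for this class.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precisions = []
    for rank, idx in enumerate(order, start=1):
        if labels[idx] == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(scores: List[List[float]], labels: List[List[int]]) -> float:
    """mAP in percent. Rows are images, columns are classes."""
    num_classes = len(scores[0])
    aps = []
    for c in range(num_classes):
        class_labels = [row[c] for row in labels]
        if sum(class_labels) == 0:
            continue  # assumption: classes with no positive image are skipped
        class_scores = [row[c] for row in scores]
        aps.append(average_precision(class_scores, class_labels))
    return 100.0 * sum(aps) / len(aps) if aps else 0.0


# Toy example: 3 images, 2 classes (made-up scores, not model output).
toy_scores = [[0.9, 0.2], [0.8, 0.7], [0.1, 0.6]]
toy_labels = [[1, 0], [0, 1], [1, 1]]
toy_map = mean_average_precision(toy_scores, toy_labels)
```

In the toy example, the first class has a negative image ranked between its two positives, so its AP is below 1, while the second class is ranked perfectly.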

Citing this article

Please cite this article as:

@misc{cheng2021mltr,
      title={MlTr: Multi-label Classification with Transformer}, 
      author={Xing Cheng and Hezheng Lin and Xiangyu Wu and Fan Yang and Dong Shen and Zhongyuan Wang and Nian Shi and Honglin Liu},
      year={2021},
      eprint={2106.06195},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

Getting started

Please refer to get_started.

Owner

程星
