The code for our paper CrossFormer: A Versatile Vision Transformer Based on Cross-scale Attention.

Overview

CrossFormer

This repository is the code for our paper CrossFormer: A Versatile Vision Transformer Based on Cross-scale Attention.

Introduction

Existing vision transformers fail to build attention among objects/features of different scales (cross-scale attention), while such ability is very important to visual tasks. CrossFormer is a versatile vision transformer which solves this problem. Its core designs contain Cross-scale Embedding Layer (CEL), Long-Short Distance Attention (L/SDA), which work together to enable cross-scale attention.

CEL blends every input embedding with multiple-scale features. L/SDA split all embeddings into several groups, and the self-attention is only computed within each group (embeddings with the same color border belong to the same group.).

Further, we also propose a dynamic position bias (DPB) module, which makes the effective yet inflexible relative position bias apply to variable image size.

Now, experiments are done on four representative visual tasks, i.e., image classification, objection detection, and instance/semantic segmentation. Results show that CrossFormer outperforms existing vision transformers in these tasks, especially in dense prediction tasks (i.e., object detection and instance/semantic segmentation). We think it is because image classification only pays attention to one object and large-scale features, while dense prediction tasks rely more on cross-scale attention.

Prerequisites

  1. Libraries (Python3.6-based)
pip3 install numpy scipy Pillow pyyaml torch==1.7.0 torchvision==0.8.1 timm==0.3.2
  1. Dataset: ImageNet

  2. Requirements for detection/instance segmentation and semantic segmentation are listed here: detection/README.md or segmentation/README.md

Getting Started

Training

## There should be two directories under the path_to_imagenet: train and validation

## CrossFormer-T
python -u -m torch.distributed.launch --nproc_per_node 8 main.py --cfg configs/tiny_patch4_group7_224.yaml \
--batch-size 128 --data-path path_to_imagenet --output ./output

## CrossFormer-S
python -u -m torch.distributed.launch --nproc_per_node 8 main.py --cfg configs/small_patch4_group7_224.yaml \
--batch-size 128 --data-path path_to_imagenet --output ./output

## CrossFormer-B
python -u -m torch.distributed.launch --nproc_per_node 8 main.py --cfg configs/base_patch4_group7_224.yaml 
--batch-size 128 --data-path path_to_imagenet --output ./output

## CrossFormer-L
python -u -m torch.distributed.launch --nproc_per_node 8 main.py --cfg configs/large_patch4_group7_224.yaml \
--batch-size 128 --data-path path_to_imagenet --output ./output

Testing

## Take CrossFormer-T as an example
python -u -m torch.distributed.launch --nproc_per_node 1 main.py --cfg configs/tiny_patch4_group7_224.yaml \
--batch-size 128 --data-path path_to_imagenet --eval --resume path_to_crossformer-t.pth

Training scripts for objection detection: detection/README.md.

Training scripts for semantic segmentation: segmentation/README.md.

Results

Image Classification

Models trained on ImageNet-1K and evaluated on its validation set. The input image size is 224 x 224.

Architectures Params FLOPs Accuracy Models
ResNet-50 25.6M 4.1G 76.2% -
RegNetY-8G 39.0M 8.0G 81.7% -
CrossFormer-T 27.8M 2.9G 81.5% Google Drive/BaiduCloud, key: nkju
CrossFormer-S 30.7M 4.9G 82.5% Google Drive/BaiduCloud, key: fgqj
CrossFormer-B 52.0M 9.2G 83.4% Google Drive/BaiduCloud, key: 7md9
CrossFormer-L 92.0M 16.1G 84.0% TBD

More results compared with other vision transformers can be seen in the paper.

Objection Detection & Instance Segmentation

Models trained on COCO 2017. Backbones are initialized with weights pre-trained on ImageNet-1K.

Backbone Detection Head Learning Schedule Params FLOPs box AP mask AP
ResNet-101 RetinaNet 1x 56.7M 315.0G 38.5 -
CrossFormer-S RetinaNet 1x 40.8M 282.0G 44.4 -
CrossFormer-B RetinaNet 1x 62.1M 389.0G 46.2 -
ResNet-101 Mask-RCNN 1x 63.2M 336.0G 40.4 36.4
CrossFormer-S Mask-RCNN 1x 50.2M 301.0G 45.4 41.4
CrossFormer-B Mask-RCNN 1x 71.5M 407.9G 47.2 42.7

More results and pretrained models for objection detection: detection/README.md.

Semantic Segmentation

Models trained on ADE20K. Backbones are initialized with weights pre-trained on ImageNet-1K.

Backbone Segmentation Head Iterations Params FLOPs IOU MS IOU
CrossFormer-S FPN 80K 34.3M 209.8G 46.4 -
CrossFormer-B FPN 80K 55.6M 320.1G 48.0 -
CrossFormer-L FPN 80K 95.4M 482.7G 49.1 -
ResNet-101 UPerNet 160K 86.0M 1029.G 44.9 -
CrossFormer-S UPerNet 160K 62.3M 979.5G 47.6 48.4
CrossFormer-B UPerNet 160K 83.6M 1089.7G 49.7 50.6
CrossFormer-L UPerNet 160K 125.5M 1257.8G 50.4 51.4

MS IOU means IOU with multi-scale testing.

More results and pretrained models for semantic segmentation: segmentation/README.md.

Citing Us

@article{crossformer2021,
  title     = {CrossFormer: A Versatile Vision Transformer Based on Cross-scale Attention},
  author    = {Wenxiao Wang and Lu Yao and Long Chen and Deng Cai and Xiaofei He and Wei Liu},
  journal   = {CoRR},
  volume    = {abs/2108.00154},
  year      = {2021},
}

Acknowledgement

Part of the code of this repository refers to Swin Transformer.

Owner
cheerss
cheerss
Repository for MuSiQue: Multi-hop Questions via Single-hop Question Composition

🎵 MuSiQue: Multi-hop Questions via Single-hop Question Composition This is the repository for our paper "MuSiQue: Multi-hop Questions via Single-hop

21 Jan 02, 2023
A Data Annotation Tool for Semantic Segmentation, Object Detection and Lane Line Detection.(In Development Stage)

Data-Annotation-Tool How to Run this Tool? To run this software, follow the steps: git clone https://github.com/Autonomous-Car-Project/Data-Annotation

TiVRA AI 13 Aug 18, 2022
A list of multi-task learning papers and projects.

This page contains a list of papers on multi-task learning for computer vision. Please create a pull request if you wish to add anything. If you are interested, consider reading our recent survey pap

svandenh 297 Dec 17, 2022
"Projelerle Yapay Zeka Ve Bilgisayarlı Görü" Kitabımın projeleri

"Projelerle Yapay Zeka Ve Bilgisayarlı Görü" Kitabımın projeleri Bu Github Reposundaki tüm projeler; kaleme almış olduğum "Projelerle Yapay Zekâ ve Bi

Ümit Aksoylu 4 Aug 03, 2022
a pytorch implementation of auto-punctuation learned character by character

Learning Auto-Punctuation by Reading Engadget Articles Link to Other of my work 🌟 Deep Learning Notes: A collection of my notes going from basic mult

Ge Yang 137 Nov 09, 2022
Evolutionary Scale Modeling (esm): Pretrained language models for proteins

Evolutionary Scale Modeling This repository contains code and pre-trained weights for Transformer protein language models from Facebook AI Research, i

Meta Research 1.6k Jan 09, 2023
Example of semantic segmentation in Keras

keras-semantic-segmentation-example Example of semantic segmentation in Keras Single class example: Generated data: random ellipse with random color o

53 Mar 23, 2022
This code is a near-infrared spectrum modeling method based on PCA and pls

Nirs-Pls-Corn This code is a near-infrared spectrum modeling method based on PCA and pls 近红外光谱分析技术属于交叉领域,需要化学、计算机科学、生物科学等多领域的合作。为此,在(北邮邮电大学杨辉华老师团队)指导下

Fu Pengyou 6 Dec 17, 2022
Level Based Customer Segmentation

level_based_customer_segmentation Level Based Customer Segmentation Persona Veri Seti kullanılarak müşteri segmentasyonu yapılmıştır. KOLONLAR : PRICE

Buse Yıldırım 6 Dec 21, 2021
An AI made using artificial intelligence (AI) and machine learning algorithms (ML) .

DTech.AIML An AI made using artificial intelligence (AI) and machine learning algorithms (ML) . This is created by help of some members in my team and

1 Jan 06, 2022
Official PyTorch Implementation of Learning Self-Similarity in Space and Time as Generalized Motion for Video Action Recognition, ICCV 2021

Official PyTorch Implementation of Learning Self-Similarity in Space and Time as Generalized Motion for Video Action Recognition, ICCV 2021

26 Dec 07, 2022
Conditional Gradients For The Approximately Vanishing Ideal

Conditional Gradients For The Approximately Vanishing Ideal Code for the paper: Wirth, E., and Pokutta, S. (2022). Conditional Gradients for the Appro

IOL Lab @ ZIB 0 May 25, 2022
Awesome Long-Tailed Learning

Awesome Long-Tailed Learning This repo pays specially attention to the long-tailed distribution, where labels follow a long-tailed or power-law distri

Stomach_ache 284 Jan 06, 2023
An image classification app boilerplate to serve your deep learning models asap!

Image 🖼 Classification App Boilerplate Have you been puzzled by tons of videos, blogs and other resources on the internet and don't know where and ho

Smaranjit Ghose 27 Oct 06, 2022
Classification Modeling: Probability of Default

Credit Risk Modeling in Python Introduction: If you've ever applied for a credit card or loan, you know that financial firms process your information

Aktham Momani 2 Nov 07, 2022
Differentiable Quantum Chemistry (only Differentiable Density Functional Theory and Hartree Fock at the moment)

DQC: Differentiable Quantum Chemistry Differentiable quantum chemistry package. Currently only support differentiable density functional theory (DFT)

75 Dec 02, 2022
RuleBERT: Teaching Soft Rules to Pre-Trained Language Models

RuleBERT: Teaching Soft Rules to Pre-Trained Language Models (Paper) (Slides) (Video) RuleBERT is a pre-trained language model that has been fine-tune

16 Aug 24, 2022
Code for the SIGIR 2022 paper "Hybrid Transformer with Multi-level Fusion for Multimodal Knowledge Graph Completion"

MKGFormer Code for the SIGIR 2022 paper "Hybrid Transformer with Multi-level Fusion for Multimodal Knowledge Graph Completion" Model Architecture Illu

ZJUNLP 68 Dec 28, 2022
Portfolio asset allocation strategies: from Markowitz to RNNs

Portfolio asset allocation strategies: from Markowitz to RNNs Research project to explore different approaches for optimal portfolio allocation starti

Luigi Filippo Chiara 1 Feb 05, 2022
This toolkit provides codes to download and pre-process the SLUE datasets, train the baseline models, and evaluate SLUE tasks.

slue-toolkit We introduce Spoken Language Understanding Evaluation (SLUE) benchmark. This toolkit provides codes to download and pre-process the SLUE

ASAPP Research 39 Sep 21, 2022