Official implementation of "CAT: Cross Attention in Vision Transformer".

Last update: Dec 15, 2022

CAT: Cross Attention in Vision Transformer

This is the official implementation of "CAT: Cross Attention in Vision Transformer".

Abstract

Since Transformer has found widespread use in NLP, its potential in CV has been recognized and has inspired many new approaches. However, replacing word tokens with image patches after tokenizing the image requires a vast amount of computation (e.g., ViT), which bottlenecks model training and inference. In this paper, we propose a new attention mechanism for Transformer termed Cross Attention, which applies attention within each image patch, rather than over the whole image, to capture local information, and applies attention between image patches divided from single-channel feature maps to capture global information. Both operations require less computation than standard self-attention in Transformer. By alternating attention within patches and between patches, we implement cross attention that maintains performance at a lower computational cost, and we build a hierarchical network called Cross Attention Transformer (CAT) for other vision tasks. Our base model achieves state-of-the-art results on ImageNet-1K and improves the performance of other methods on COCO and ADE20K, illustrating that our network has the potential to serve as a general backbone.
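The claim that both operations are cheaper than standard self-attention can be checked with a rough operation count. The sketch below is a back-of-the-envelope estimate, not the complexity formula from the paper or code from this repository: it counts only the multiply-adds of the two attention matrix products (Q·Kᵀ and attention·V) and ignores the linear projections. The function names and the example sizes (a 56x56 feature map, 96 channels, 7x7 patches) are illustrative assumptions.

```python
def attention_matmul_cost(num_tokens: int, token_dim: int) -> int:
    """Multiply-adds for Q @ K^T plus attn @ V over one set of tokens."""
    return 2 * num_tokens * num_tokens * token_dim


def full_attention_cost(h: int, w: int, c: int) -> int:
    """Standard self-attention: every pixel is a token of dimension c."""
    return attention_matmul_cost(h * w, c)


def inner_patch_cost(h: int, w: int, c: int, n: int) -> int:
    """Attention inside each n x n patch: n*n tokens of dimension c per patch."""
    num_patches = (h // n) * (w // n)
    return num_patches * attention_matmul_cost(n * n, c)


def cross_patch_cost(h: int, w: int, c: int, n: int) -> int:
    """Attention between patches of each single-channel map:
    one token per patch, each flattened to dimension n*n, repeated per channel."""
    num_patches = (h // n) * (w // n)
    return c * attention_matmul_cost(num_patches, n * n)


if __name__ == "__main__":
    # Hypothetical sizes chosen for illustration only.
    h, w, c, n = 56, 56, 96, 7
    print("full       :", full_attention_cost(h, w, c))
    print("inner-patch:", inner_patch_cost(h, w, c, n))
    print("cross-patch:", cross_patch_cost(h, w, c, n))
```

Under these assumptions the inner-patch cost grows linearly with the number of pixels, and the cross-patch cost is the full-attention cost divided by the patch area, so alternating the two stays well below attending over the whole image.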

CAT achieves strong performance on COCO object detection (implemented with mmdetection) and ADE20K semantic segmentation (implemented with mmsegmentation).

Pretrained Models and Results on ImageNet-1K

name	resolution	acc@1	acc@5	#params	FLOPs	model	log
CAT-T	224x224	80.3	95.0	17M	2.8G	github	github
CAT-S^*	224x224	81.8	95.6	37M	5.9G	github	github
CAT-B	224x224	82.8	96.1	52M	8.9G	github	github
CAT-T-v2	224x224	81.7	95.5	36M	3.9G	Coming	Coming

Note: ^* indicates new version of model and log.

Models and Results on Object Detection (COCO 2017 val)

Backbone	Method	pretrain	Lr Schd	box mAP	mask mAP	#params	FLOPs	model	log
CAT-S	Mask R-CNN⁺	ImageNet-1K	1x	41.6	38.6	57M	295G	github	github
CAT-B	Mask R-CNN⁺	ImageNet-1K	1x	41.8	38.7	71M	356G	github	github
CAT-S	FCOS	ImageNet-1K	1x	40.0	-	45M	245G	github	github
CAT-B	FCOS	ImageNet-1K	1x	41.0	-	59M	303G	github	github
CAT-S	ATSS	ImageNet-1K	1x	42.0	-	45M	243G	github	github
CAT-B	ATSS	ImageNet-1K	1x	42.5	-	59M	303G	github	github
CAT-S	RetinaNet	ImageNet-1K	1x	40.1	-	47M	276G	github	github
CAT-B	RetinaNet	ImageNet-1K	1x	41.4	-	62M	337G	github	github
CAT-S	Cascade R-CNN	ImageNet-1K	1x	44.1	-	82M	270G	github	github
CAT-B	Cascade R-CNN	ImageNet-1K	1x	44.8	-	96M	330G	github	github
CAT-S	Cascade R-CNN⁺	ImageNet-1K	1x	45.2	-	82M	270G	github	github
CAT-B	Cascade R-CNN⁺	ImageNet-1K	1x	46.3	-	96M	330G	github	github

Note: ⁺ indicates multi-scale training.

Models and Results on Semantic Segmentation (ADE20K val)

Backbone	Method	pretrain	Crop Size	Lr Schd	mIoU	mIoU (ms+flip)	#params	FLOPs	model	log
CAT-S	Semantic FPN	ImageNet-1K	512x512	80K	40.6	42.1	41M	214G	github	github
CAT-B	Semantic FPN	ImageNet-1K	512x512	80K	42.2	43.6	55M	276G	github	github
CAT-S	Semantic FPN	ImageNet-1K	512x512	160K	42.2	42.8	41M	214G	github	github
CAT-B	Semantic FPN	ImageNet-1K	512x512	160K	43.2	44.9	55M	276G	github	github

Citing CAT

You can cite the paper as:

@article{lin2021cat,
  title={CAT: Cross Attention in Vision Transformer},
  author={Hezheng Lin and Xing Cheng and Xiangyu Wu and Fan Yang and Dong Shen and Zhongyuan Wang and Qing Song and Wei Yuan},
  journal={arXiv preprint arXiv:2106.05786},
  year={2021}
}

Getting Started

Please refer to get_started.

Acknowledgement

Our implementation is mainly based on Swin.
