Official implement of "CAT: Cross Attention in Vision Transformer".

Related tags

Deep LearningCAT
Overview

CAT: Cross Attention in Vision Transformer

This is official implement of "CAT: Cross Attention in Vision Transformer".

Abstract

Since Transformer has found widespread use in NLP, the potential of Transformer in CV has been realized and has inspired many new approaches. However, the computation required for replacing word tokens with image patches for Transformer after the tokenization of the image is vast(e.g., ViT), which bottlenecks model training and inference. In this paper, we propose a new attention mechanism in Transformer termed Cross Attention, which alternates attention inner the image patch instead of the whole image to capture local information and apply attention between image patches which are divided from single-channel feature maps to capture global information. Both operations have less computation than standard self-attention in Transformer. By alternately applying attention inner patch and between patches, we implement cross attention to maintain the performance with lower computational cost and build a hierarchical network called Cross Attention Transformer(CAT) for other vision tasks. Our base model achieves state-of-the-arts on ImageNet-1K, and improves the performance of other methods on COCO and ADE20K, illustrating that our network has the potential to serve as general backbones.

CAT achieves strong performance on COCO object detection(implemented with mmdectection) and ADE20K semantic segmentation(implemented with mmsegmantation).

architecture

Pretrained Models and Results on ImageNet-1K

name resolution [email protected] [email protected] #params FLOPs model log
CAT-T 224x224 80.3 95.0 17M 2.8G github github
CAT-S* 224x224 81.8 95.6 37M 5.9G github github
CAT-B 224x224 82.8 96.1 52M 8.9G github github
CAT-T-v2 224x224 81.7 95.5 36M 3.9G Coming Coming

Note: * indicates new version of model and log.

Models and Results on Object Detection (COCO 2017 val)

Backbone Method pretrain Lr Schd box mAP mask mAP #params FLOPs model log
CAT-S Mask R-CNN+ ImageNet-1K 1x 41.6 38.6 57M 295G github github
CAT-B Mask R-CNN+ ImageNet-1K 1x 41.8 38.7 71M 356G github github
CAT-S FCOS ImageNet-1K 1x 40.0 - 45M 245G github github
CAT-B FCOS ImageNet-1K 1x 41.0 - 59M 303G github github
CAT-S ATSS ImageNet-1K 1x 42.0 - 45M 243G github github
CAT-B ATSS ImageNet-1K 1x 42.5 - 59M 303G github github
CAT-S RetinaNet ImageNet-1K 1x 40.1 - 47M 276G github github
CAT-B RetinaNet ImageNet-1K 1x 41.4 - 62M 337G github github
CAT-S Cascade R-CNN ImageNet-1K 1x 44.1 - 82M 270G github github
CAT-B Cascade R-CNN ImageNet-1K 1x 44.8 - 96M 330G github github
CAT-S Cascade R-CNN+ ImageNet-1K 1x 45.2 - 82M 270G github github
CAT-B Cascade R-CNN+ ImageNet-1K 1x 46.3 - 96M 330G github github

Note: + indicates multi-scale training.

Models and Results on Semantic Segmentation (ADE20K val)

Backbone Method pretrain Crop Size Lr Schd mIoU mIoU (ms+flip) #params FLOPs model log
CAT-S Semantic FPN ImageNet-1K 512x512 80K 40.6 42.1 41M 214G github github
CAT-B Semantic FPN ImageNet-1K 512x512 80K 42.2 43.6 55M 276G github github
CAT-S Semantic FPN ImageNet-1K 512x512 160K 42.2 42.8 41M 214G github github
CAT-B Semantic FPN ImageNet-1K 512x512 160K 43.2 44.9 55M 276G github github

Citing CAT

You can cite the paper as:

@article{lin2021cat,
  title={CAT: Cross Attention in Vision Transformer},
  author={Hezheng Lin and Xing Cheng and Xiangyu Wu and Fan Yang and Dong Shen and Zhongyuan Wang and Qing Song and Wei Yuan},
  journal={arXiv preprint arXiv:2106.05786},
  year={2021}
}

Started

Please refer to get_started.

Acknowledgement

Our implementation is mainly based on Swin.

You might also like...
Implement A3C for Mujoco gym envs
Implement A3C for Mujoco gym envs

pytorch-a3c-mujoco Disclaimer: my implementation right now is unstable (you ca refer to the learning curve below), I'm not sure if it's my problems. A

Perfect implement. Model shared. x0.5 (Top1:60.646) and 1.0x (Top1:69.402).

Shufflenet-v2-Pytorch Introduction This is a Pytorch implementation of faceplusplus's ShuffleNet-v2. For details, please read the following papers:

implement of SwiftNet:Real-time Video Object Segmentation

SwiftNet The official PyTorch implementation of SwiftNet:Real-time Video Object Segmentation, which has been accepted by CVPR2021. Requirements Python

The implement of papar
The implement of papar "Enhanced Graph Learning for Collaborative Filtering via Mutual Information Maximization"

SIGIR2021-EGLN The implement of paper "Enhanced Graph Learning for Collaborative Filtering via Mutual Information Maximization" Neural graph based Col

a Pytorch easy re-implement of "YOLOX: Exceeding YOLO Series in 2021"

A pytorch easy re-implement of "YOLOX: Exceeding YOLO Series in 2021" 1. Notes This is a pytorch easy re-implement of "YOLOX: Exceeding YOLO Series in

PyTorch Implement of Context Encoders: Feature Learning by Inpainting
PyTorch Implement of Context Encoders: Feature Learning by Inpainting

Context Encoders: Feature Learning by Inpainting This is the Pytorch implement of CVPR 2016 paper on Context Encoders 1) Semantic Inpainting Demo Inst

Implement Decoupled Neural Interfaces using Synthetic Gradients in Pytorch
Implement Decoupled Neural Interfaces using Synthetic Gradients in Pytorch

disclaimer: this code is modified from pytorch-tutorial Image classification with synthetic gradient in Pytorch I implement the Decoupled Neural Inter

Demonstrates how to divide a DL model into multiple IR model files (division) and introduce a simplest way to implement a custom layer works with OpenVINO IR models.
Demonstrates how to divide a DL model into multiple IR model files (division) and introduce a simplest way to implement a custom layer works with OpenVINO IR models.

Demonstration of OpenVINO techniques - Model-division and a simplest-way to support custom layers Description: Model Optimizer in Intel(r) OpenVINO(tm

Implement some metaheuristics and cost functions
Implement some metaheuristics and cost functions

Metaheuristics This repot implement some metaheuristics and cost functions. Metaheuristics JAYA Implement Jaya optimizer without constraints. Cost fun

Releases(v1.0)
Pytorch implementation of Decoupled Spatial-Temporal Transformer for Video Inpainting

Decoupled Spatial-Temporal Transformer for Video Inpainting By Rui Liu, Hanming Deng, Yangyi Huang, Xiaoyu Shi, Lewei Lu, Wenxiu Sun, Xiaogang Wang, J

51 Dec 13, 2022
Garbage classification using structure data.

垃圾分类模型使用说明 1.包含以下数据文件 文件 描述 data/MaterialMapping.csv 物体以及其归类的信息 data/TestRecords 光谱原始测试数据 CSV 文件 data/TestRecordDesc.zip CSV 文件描述文件 data/Boundaries.cs

wenqi 1 Dec 10, 2021
RATE: Overcoming Noise and Sparsity of Textual Features in Real-Time Location Estimation (CIKM'17)

RATE: Overcoming Noise and Sparsity of Textual Features in Real-Time Location Estimation This is the implementation of RATE: Overcoming Noise and Spar

Yu Zhang 5 Feb 10, 2022
An Implementation of SiameseRPN with Feature Pyramid Networks

SiameseRPN with FPN This project is mainly based on HelloRicky123/Siamese-RPN. What I've done is just add a Feature Pyramid Network method to the orig

3 Apr 16, 2022
A Parameter-free Deep Embedded Clustering Method for Single-cell RNA-seq Data

A Parameter-free Deep Embedded Clustering Method for Single-cell RNA-seq Data Overview Clustering analysis is widely utilized in single-cell RNA-seque

AI-Biomed @NSCC-gz 3 May 08, 2022
Boundary IoU API (Beta version)

Boundary IoU API (Beta version) Bowen Cheng, Ross Girshick, Piotr Dollár, Alexander C. Berg, Alexander Kirillov [arXiv] [Project] [BibTeX] This API is

Bowen Cheng 177 Dec 29, 2022
AI Virtual Calculator: This is a simple virtual calculator based on Artificial intelligence.

AI Virtual Calculator: This is a simple virtual calculator that works with gestures using OpenCV. We will use our hand in the air to click on the calc

Md. Rakibul Islam 1 Jan 13, 2022
Designing a Practical Degradation Model for Deep Blind Image Super-Resolution (ICCV, 2021) (PyTorch) - We released the training code!

Designing a Practical Degradation Model for Deep Blind Image Super-Resolution Kai Zhang, Jingyun Liang, Luc Van Gool, Radu Timofte Computer Vision Lab

Kai Zhang 804 Jan 08, 2023
This repository contains the entire code for our work "Two-Timescale End-to-End Learning for Channel Acquisition and Hybrid Precoding"

Two-Timescale-DNN Two-Timescale End-to-End Learning for Channel Acquisition and Hybrid Precoding This repository contains the entire code for our work

QiyuHu 3 Mar 07, 2022
An excellent hash algorithm combining classical sponge structure and RNN.

SHA-RNN Recurrent Neural Network with Chaotic System for Hash Functions Anonymous Authors [摘要] 在这次作业中我们提出了一种新的 Hash Function —— SHA-RNN。其以海绵结构为基础,融合了混

Houde Qian 5 May 15, 2022
FIRA: Fine-Grained Graph-Based Code Change Representation for Automated Commit Message Generation

FIRA is a learning-based commit message generation approach, which first represents code changes via fine-grained graphs and then learns to generate commit messages automatically.

Van 21 Dec 30, 2022
[NeurIPS'20] Self-supervised Co-Training for Video Representation Learning. Tengda Han, Weidi Xie, Andrew Zisserman.

CoCLR: Self-supervised Co-Training for Video Representation Learning This repository contains the implementation of: InfoNCE (MoCo on videos) UberNCE

Tengda Han 271 Jan 02, 2023
PuppetGAN - Cross-Domain Feature Disentanglement and Manipulation just got way better! 🚀

Better Cross-Domain Feature Disentanglement and Manipulation with Improved PuppetGAN Quite cool... Right? Introduction This repo contains a TensorFlow

Giorgos Karantonis 5 Aug 25, 2022
Hydra: an Extensible Fuzzing Framework for Finding Semantic Bugs in File Systems

Hydra: An Extensible Fuzzing Framework for Finding Semantic Bugs in File Systems Paper Finding Semantic Bugs in File Systems with an Extensible Fuzzin

gts3.org (<a href=[email protected])"> 129 Dec 15, 2022
An OpenAI Gym environment for Super Mario Bros

gym-super-mario-bros An OpenAI Gym environment for Super Mario Bros. & Super Mario Bros. 2 (Lost Levels) on The Nintendo Entertainment System (NES) us

Andrew Stelmach 1 Jan 05, 2022
Adaptive Graph Convolution for Point Cloud Analysis

Adaptive Graph Convolution for Point Cloud Analysis This repository contains the implementation of AdaptConv for point cloud analysis. Adaptive Graph

64 Dec 21, 2022
PyTorch Implementation of Daft-Exprt: Robust Prosody Transfer Across Speakers for Expressive Speech Synthesis

PyTorch Implementation of Daft-Exprt: Robust Prosody Transfer Across Speakers for Expressive Speech Synthesis

Ubisoft 76 Dec 30, 2022
A PyTorch toolkit for 2D Human Pose Estimation.

PyTorch-Pose PyTorch-Pose is a PyTorch implementation of the general pipeline for 2D single human pose estimation. The aim is to provide the interface

Wei Yang 1.1k Dec 30, 2022
Multi-modal co-attention for drug-target interaction annotation and Its Application to SARS-CoV-2

CoaDTI Multi-modal co-attention for drug-target interaction annotation and Its Application to SARS-CoV-2 Abstract Environment The test was conducted i

Layne_Huang 7 Nov 14, 2022
One-Shot Neural Ensemble Architecture Search by Diversity-Guided Search Space Shrinking

One-Shot Neural Ensemble Architecture Search by Diversity-Guided Search Space Shrinking This is an official implementation for NEAS presented in CVPR

Multimedia Research 19 Sep 08, 2022