A curated list and survey of awesome Vision Transformers.

Overview
awesome-vit

English | 简体中文

A curated list and survey of awesome Vision Transformers.

You can use mind mapping software to open the mind mapping source file. You can also download the mind mapping HD pictures if you just want to browse them.

Contents

Survey

Only typical algorithms are listed in each category.

Image Classification

Chinese Blogs

Attention-based

image

Training Strategy

image

  • [DeiT] Training data-efficient image transformers & distillation through attention (ICML 2021-2020.12) [Paper]
  • [Token Labeling] All Tokens Matter: Token Labeling for Training Better Vision Transformers (2021.4) [Paper]
Model Improvements
Tokenization Module

image

Image to Token:

  • Non-overlapping Patch Embedding

    • [ViT] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (ICLR 2021-2020.10) [Paper]
    • [TNT] Transformer in Transformer (NeurIPS 2021-2021.3) [Paper]
    • [Swin Transformer] Swin Transformer: Hierarchical Vision Transformer using Shifted Windows (ICCV2021-2021.3) [Paper]
  • Overlapping Patch Embedding

    • [T2T-ViT] Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet (2021.1) [Paper]

    • [ResT] ResT: An Efficient Transformer for Visual Recognition (2021.5) [Paper]

    • [PVTv2] PVTv2: Improved Baselines with Pyramid Vision Transformer (2021.6) [Paper]

    • [ViTAE] ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias (2021.6) [Paper]

    • [PS-ViT] Vision Transformer with Progressive Sampling (2021.8) [Paper]

Token to Token:

  • Fixed sampling window tokenization
    • [ViT] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (ICLR 2021-2020.10) [Paper]
    • [Swin Transformer] Swin Transformer: Hierarchical Vision Transformer using Shifted Windows (ICCV2021-2021.3) [Paper]
  • Dynamic sampling tokenization
    • [PS-ViT] Vision Transformer with Progressive Sampling (2021.8) [Paper]
    • [TokenLearner] TokenLearner: What Can 8 Learned Tokens Do for Images and Videos? (2021.6) [Paper]
Position Encoding Module

image

Explicit position encoding:

  • Absolute position encoding
    • [Transformer] Attention is All You Need] (NIPS 2017-2017.06) [Paper]
    • [ViT] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (ICLR 2021-2020.10) [Paper]
    • [PVT] Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions (2021.2) [Paper]
  • Relative position encoding
    • [Swin Transformer] Swin Transformer: Hierarchical Vision Transformer using Shifted Windows (ICCV2021-2021.3) [Paper]
    • [Swin Transformer V2] Swin Transformer V2: Scaling Up Capacity and Resolution (2021.11) [Paper]
    • [Imporved MViT] Improved Multiscale Vision Transformers for Classification and Detection (2021.12) [Paper]

Implicit position encoding:

  • [CPVT] Conditional Positional Encodings for Vision Transformers (2021.2) [Paper]
  • [CSWin Transformer] CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows (2021.07) [Paper]
  • [PVTv2] PVTv2: Improved Baselines with Pyramid Vision Transformer (2021.6) [Paper]
  • [ResT] ResT: An Efficient Transformer for Visual Recognition (2021.5) [Paper]
Attention Module

image

Include only global attention:

  • Multi-Head attention module

    • [ViT] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (ICLR 2021-2020.10) [Paper]
  • Reduce global attention computation

    • [PVT] Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions (2021.2) [Paper]

    • [PVTv2] PVTv2: Improved Baselines with Pyramid Vision Transformer (2021.6) [Paper]

    • [Twins] Twins: Revisiting the Design of Spatial Attention in Vision Transformers (2021.4) [Paper]

    • [P2T] P2T: Pyramid Pooling Transformer for Scene Understanding (2021.6) [Paper]

    • [ResT] ResT: An Efficient Transformer for Visual Recognition (2021.5) [Paper]

    • [MViT] Multiscale Vision Transformers (2021.4) [Paper]

    • [Imporved MViT] Improved Multiscale Vision Transformers for Classification and Detection (2021.12) [Paper]

  • Generalized linear attention

    • [T2T-ViT] Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet (2021.1) [Paper]

Introduce extra local attention:

  • Local window mode

    • [Swin Transformer] Swin Transformer: Hierarchical Vision Transformer using Shifted Windows (ICCV2021-2021.3) [Paper]
    • [Swin Transformer V2] Swin Transformer V2: Scaling Up Capacity and Resolution (2021.11) [Paper]
    • [Imporved MViT] Improved Multiscale Vision Transformers for Classification and Detection (2021.12) [Paper]
    • [Twins] Twins: Revisiting the Design of Spatial Attention in Vision Transformers (2021.4) [Paper]
    • [GG-Transformer] Glance-and-Gaze Vision Transformer (2021.6) [Paper]
    • [Shuffle Transformer] Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer (2021.6) [Paper]
    • [MSG-Transformer] MSG-Transformer: Exchanging Local Spatial Information by Manipulating Messenger Tokens (2021.5) [Paper]
    • [CSWin Transformer] CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows (2021.07) [Paper]
  • Introduce convolutional local inductive bias

    • [ViTAE] ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias (2021.6) [Paper]
    • [ELSA] ELSA: Enhanced Local Self-Attention for Vision Transformer (2021.12) [Paper]
  • Sparse attention

    • [Sparse Transformer] Sparse Transformer: Concentrated Attention Through Explicit Selection [Paper]
FFN Module

image

Improve performance with Conv's local information extraction capability:

  • [LocalViT] LocalViT: Bringing Locality to Vision Transformers (2021.4) [Paper]
  • [CeiT] Incorporating Convolution Designs into Visual Transformers (2021.3) [Paper]
Normalization Module Location

image

  • Pre Normalization

    • [ViT] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (ICLR 2021-2020.10) [Paper]
    • [Swin Transformer] Swin Transformer: Hierarchical Vision Transformer using Shifted Windows (ICCV2021-2021.3) [Paper]
  • Post Normalization

    • [Swin Transformer V2] Swin Transformer V2: Scaling Up Capacity and Resolution (2021.11) [Paper]
Classification Prediction Head Module

image

  • Class Tokens

    • [ViT] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (ICLR 2021-2020.10) [Paper]
    • [CeiT] Incorporating Convolution Designs into Visual Transformers (2021.3) [Paper]
  • Avgerage Pooling

    • [Swin Transformer] Swin Transformer: Hierarchical Vision Transformer using Shifted Windows (ICCV2021-2021.3) [Paper]
    • [CPVT] Conditional Positional Encodings for Vision Transformers (2021.2) [Paper]
    • [ResT] ResT: An Efficient Transformer for Visual Recognition (2021.5) [Paper]
Others

image

(1) How to output multi-scale feature map

  • Patch merging

    • [PVT] Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions (2021.2) [Paper]
    • [Twins] Twins: Revisiting the Design of Spatial Attention in Vision Transformers (2021.4) [Paper]
    • [Swin Transformer] Swin Transformer: Hierarchical Vision Transformer using Shifted Windows (ICCV2021-2021.3) [Paper]
    • [ResT] ResT: An Efficient Transformer for Visual Recognition (2021.5) [Paper]
    • [CSWin Transformer] CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows (2021.07) [Paper]
    • [MetaFormer] MetaFormer is Actually What You Need for Vision (2021.11) [Paper]
  • Pooling attention

    • [MViT] Multiscale Vision Transformers (2021.4) [Paper][Imporved MViT]

    • [Imporved MViT] Improved Multiscale Vision Transformers for Classification and Detection (2021.12) [Paper]

  • Dilation convolution

    • [ViTAE] ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias (2021.6) [Paper]

(2) How to train a deeper Transformer

  • [Cait] Going deeper with Image Transformers (2021.3) [Paper]
  • [DeepViT] DeepViT: Towards Deeper Vision Transformer (2021.3) [Paper]

MLP-based

image

  • [MLP-Mixer] MLP-Mixer: An all-MLP Architecture for Vision (2021.5) [Paper]

  • [ResMLP] ResMLP: Feedforward networks for image classification with data-efficient training (CVPR2021-2021.5) [Paper]

  • [gMLP] Pay Attention to MLPs (2021.5) [Paper]

  • [CycleMLP] CycleMLP: A MLP-like Architecture for Dense Prediction (2021.7) [Paper]

ConvMixer-based

  • [ConvMixer] Patches Are All You Need [Paper]

General Architecture Analysis

image

  • Demystifying Local Vision Transformer: Sparse Connectivity, Weight Sharing, and Dynamic Weight (2021.6) [Paper]
  • A Battle of Network Structures: An Empirical Study of CNN, Transformer, and MLP (2021.8) [Paper]
  • [MetaFormer] MetaFormer is Actually What You Need for Vision (2021.11) [Paper]
  • [ConvNeXt] A ConvNet for the 2020s (2022.01) [Paper]

Others

Object Detection

Semantic Segmentation

back to top

Papers

Transformer Original Paper

  • [Transformer] Attention is All You Need] (NIPS 2017-2017.06) [Paper]

ViT Original Paper

  • [ViT] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (ICLR 2021-2020.10) [Paper]

Image Classification

2020

  • [DeiT] Training data-efficient image transformers & distillation through attention (ICML 2021-2020.12) [Paper]
  • [Sparse Transformer] Sparse Transformer: Concentrated Attention Through Explicit Selection [Paper]

2021

  • [T2T-ViT] Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet (2021.1) [Paper]

  • [PVT] Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions (2021.2) [Paper]

  • [CPVT] Conditional Positional Encodings for Vision Transformers (2021.2) [Paper]

  • [TNT] Transformer in Transformer (NeurIPS 2021-2021.3) [Paper]

  • [Cait] Going deeper with Image Transformers (2021.3) [Paper]

  • [DeepViT] DeepViT: Towards Deeper Vision Transformer (2021.3) [Paper]

  • [Swin Transformer] Swin Transformer: Hierarchical Vision Transformer using Shifted Windows (ICCV2021-2021.3) [Paper]

  • [CeiT] Incorporating Convolution Designs into Visual Transformers (2021.3) [Paper]

  • [LocalViT] LocalViT: Bringing Locality to Vision Transformers (2021.4) [Paper]

  • [MViT] Multiscale Vision Transformers (2021.4) [Paper]

  • [Twins] Twins: Revisiting the Design of Spatial Attention in Vision Transformers (2021.4) [Paper]

  • [Token Labeling] All Tokens Matter: Token Labeling for Training Better Vision Transformers (2021.4) [Paper]

  • [ResT] ResT: An Efficient Transformer for Visual Recognition (2021.5) [Paper]

  • [MLP-Mixer] MLP-Mixer: An all-MLP Architecture for Vision (2021.5) [Paper]

  • [ResMLP] ResMLP: Feedforward networks for image classification with data-efficient training (CVPR2021-2021.5) [Paper]

  • [gMLP] Pay Attention to MLPs (2021.5) [Paper]

  • [MSG-Transformer] MSG-Transformer: Exchanging Local Spatial Information by Manipulating Messenger Tokens (2021.5) [Paper]

  • [PVTv2] PVTv2: Improved Baselines with Pyramid Vision Transformer (2021.6) [Paper]

  • [TokenLearner] TokenLearner: What Can 8 Learned Tokens Do for Images and Videos? (2021.6) [Paper]

  • Demystifying Local Vision Transformer: Sparse Connectivity, Weight Sharing, and Dynamic Weight (2021.6) [Paper]

  • [P2T] P2T: Pyramid Pooling Transformer for Scene Understanding (2021.6) [Paper]

  • [GG-Transformer] Glance-and-Gaze Vision Transformer (2021.6) [Paper]

  • [Shuffle Transformer] Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer (2021.6) [Paper]

  • [ViTAE] ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias (2021.6) [Paper]

  • [CycleMLP] CycleMLP: A MLP-like Architecture for Dense Prediction (2021.7) [Paper]

  • [CSWin Transformer] CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows (2021.07) [Paper]

  • [PS-ViT] Vision Transformer with Progressive Sampling (2021.8) [Paper]

  • A Battle of Network Structures: An Empirical Study of CNN, Transformer, and MLP (2021.8) [Paper]

  • [Swin Transformer V2] Swin Transformer V2: Scaling Up Capacity and Resolution (2021.11) [Paper]

  • [MetaFormer] MetaFormer is Actually What You Need for Vision (2021.11) [Paper]

  • [Imporved MViT] Improved Multiscale Vision Transformers for Classification and Detection (2021.12) [Paper]

  • [ELSA] ELSA: Enhanced Local Self-Attention for Vision Transformer (2021.12) [Paper]

  • [ConvMixer] Patches Are All You Need [Paper]

2022

  • [ConvNeXt] A ConvNet for the 2020s (2022.01) [Paper]

Object Detection

Semantic Segmentation

back to top

Stay tuned and PRs are welcomed!

Owner
OpenMMLab
OpenMMLab
HyperaPy: An automatic hyperparameter optimization framework ⚡🚀

hyperpy HyperPy: An automatic hyperparameter optimization framework Description HyperPy: Library for automatic hyperparameter optimization. Build on t

Sergio Mora 7 Sep 06, 2022
Head and Neck Tumour Segmentation and Prediction of Patient Survival Project

Head-and-Neck-Tumour-Segmentation-and-Prediction-of-Patient-Survival Welcome to the Head and Neck Tumour Segmentation and Prediction of Patient Surviv

5 Oct 20, 2022
Pytorch implementation for ACMMM2021 paper "I2V-GAN: Unpaired Infrared-to-Visible Video Translation".

I2V-GAN This repository is the official Pytorch implementation for ACMMM2021 paper "I2V-GAN: Unpaired Infrared-to-Visible Video Translation". Traffic

69 Dec 31, 2022
Example scripts for the detection of lanes using the ultra fast lane detection model in Tensorflow Lite.

TFlite Ultra Fast Lane Detection Inference Example scripts for the detection of lanes using the ultra fast lane detection model in Tensorflow Lite. So

Ibai Gorordo 12 Aug 27, 2022
Official PyTorch Implementation of paper EAN: Event Adaptive Network for Efficient Action Recognition

Official PyTorch Implementation of paper EAN: Event Adaptive Network for Efficient Action Recognition

TianYuan 27 Nov 07, 2022
A lightweight tool to get an AI Infrastructure Stack up in minutes not days.

K3ai will take care of setup K8s for You, deploy the AI tool of your choice and even run your code on it.

k3ai 105 Dec 04, 2022
Code for NAACL 2021 full paper "Efficient Attentions for Long Document Summarization"

LongDocSum Code for NAACL 2021 paper "Efficient Attentions for Long Document Summarization" This repository contains data and models needed to reprodu

56 Jan 02, 2023
A Dataset of Python Challenges for AI Research

Python Programming Puzzles (P3) This repo contains a dataset of python programming puzzles which can be used to teach and evaluate an AI's programming

Microsoft 850 Dec 24, 2022
Ağ tarayıcı.Gönderdiği paketler ile ağa bağlı olan cihazların IP adreslerini gösterir.

NetScanner.py Ağ tarayıcı.Gönderdiği paketler ile ağa bağlı olan cihazların IP adreslerini gösterir. Linux'da Kullanımı: git clone https://github.com/

4 Aug 23, 2021
N-HiTS: Neural Hierarchical Interpolation for Time Series Forecasting

N-HiTS: Neural Hierarchical Interpolation for Time Series Forecasting Recent progress in neural forecasting instigated significant improvements in the

Cristian Challu 82 Jan 04, 2023
Code for “ACE-HGNN: Adaptive Curvature ExplorationHyperbolic Graph Neural Network”

ACE-HGNN: Adaptive Curvature Exploration Hyperbolic Graph Neural Network This repository is the implementation of ACE-HGNN in PyTorch. Environment pyt

9 Nov 28, 2022
CVAT is free, online, interactive video and image annotation tool for computer vision

Computer Vision Annotation Tool (CVAT) CVAT is free, online, interactive video and image annotation tool for computer vision. It is being used by our

OpenVINO Toolkit 8.6k Jan 04, 2023
Differentiable simulation for system identification and visuomotor control

gradsim gradSim: Differentiable simulation for system identification and visuomotor control gradSim is a unified differentiable rendering and multiphy

105 Dec 18, 2022
Implementation of ETSformer, state of the art time-series Transformer, in Pytorch

ETSformer - Pytorch Implementation of ETSformer, state of the art time-series Transformer, in Pytorch Install $ pip install etsformer-pytorch Usage im

Phil Wang 121 Dec 30, 2022
A code generator from ONNX to PyTorch code

onnx-pytorch Generating pytorch code from ONNX. Currently support onnx==1.9.0 and torch==1.8.1. Installation From PyPI pip install onnx-pytorch From

Wenhao Hu 94 Jan 06, 2023
Reinforcement learning models in ViZDoom environment

DoomNet DoomNet is a ViZDoom agent trained by reinforcement learning. The agent is a neural network that outputs a probability of actions given only p

Andrey Kolishchak 126 Dec 09, 2022
Stochastic Normalizing Flows

Stochastic Normalizing Flows We introduce stochasticity in Boltzmann-generating flows. Normalizing flows are exact-probability generative models that

AI4Science group, FU Berlin (Frank Noé and co-workers) 50 Dec 16, 2022
Official PyTorch Implementation of HELP: Hardware-adaptive Efficient Latency Prediction for NAS via Meta-Learning (NeurIPS 2021 Spotlight)

[NeurIPS 2021 Spotlight] HELP: Hardware-adaptive Efficient Latency Prediction for NAS via Meta-Learning [Paper] This is Official PyTorch implementatio

42 Nov 01, 2022
Bachelor's Thesis in Computer Science: Privacy-Preserving Federated Learning Applied to Decentralized Data

federated is the source code for the Bachelor's Thesis Privacy-Preserving Federated Learning Applied to Decentralized Data (Spring 2021, NTNU) Federat

Dilawar Mahmood 25 Nov 30, 2022
Unified unsupervised and semi-supervised domain adaptation network for cross-scenario face anti-spoofing, Pattern Recognition

USDAN The implementation of Unified unsupervised and semi-supervised domain adaptation network for cross-scenario face anti-spoofing, which is accepte

11 Nov 03, 2022