Official repository for "Intriguing Properties of Vision Transformers" (2021)

Overview

Intriguing Properties of Vision Transformers

Muzammal Naseer, Kanchana Ranasinghe, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan, & Ming-Hsuan Yang

Paper Link

Abstract: Vision transformers (ViT) have demonstrated impressive performance across various machine vision tasks. These models are based on multi-head self-attention mechanisms that can flexibly attend to a sequence of image patches to encode contextual cues. An important question is how such flexibility (in attending to image-wide context conditioned on a given patch) can facilitate handling nuisances in natural images, e.g., severe occlusions, domain shifts, spatial permutations, and adversarial and natural perturbations. We systematically study this question via an extensive set of experiments encompassing three ViT families and provide comparisons with a high-performing convolutional neural network (CNN). We show and analyze the following intriguing properties of ViTs: (a) Transformers are highly robust to severe occlusions, perturbations, and domain shifts, e.g., they retain as high as 60% top-1 accuracy on ImageNet even after randomly occluding 80% of the image content. (b) The robust performance under occlusion is not due to a bias towards local textures; in fact, ViTs are significantly less biased towards textures than CNNs. When properly trained to encode shape-based features, ViTs demonstrate shape recognition capability comparable to that of the human visual system, previously unmatched in the literature. (c) Using ViTs to encode shape representations leads to an interesting consequence: accurate semantic segmentation without pixel-level supervision. (d) Off-the-shelf features from a single ViT model can be combined to create a feature ensemble, leading to high accuracy across a range of classification datasets in both traditional and few-shot learning paradigms. We show that the effective features of ViTs are due to the flexible and dynamic receptive fields made possible by self-attention mechanisms. Our code will be publicly released.

Citation

@misc{naseer2021intriguing,
      title={Intriguing Properties of Vision Transformers}, 
      author={Muzammal Naseer and Kanchana Ranasinghe and Salman Khan and Munawar Hayat and Fahad Shahbaz Khan and Ming-Hsuan Yang},
      year={2021},
      eprint={2105.10497},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

We are in the process of cleaning our code. We will update this repo shortly. Here are the highlights of what to expect :)

  1. Pretrained ViT models trained on Stylized ImageNet (along with distilled versions). We will provide code to use these models for auto-segmentation.
  2. Training and evaluation code for our proposed off-the-shelf ensemble features.
  3. Code to evaluate any model on our proposed occlusion strategies (random, foreground, and background).
  4. Code to evaluate permutation invariance.
  5. Pretrained models to study the effect of varying patch sizes and positional encoding.
  6. Pretrained adversarial patches and code to evaluate them.
  7. Training on Stylized ImageNet.

Requirements

pip install -r requirements.txt
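
Note: as one of the comments below points out, the PyTorch 1.7 pinned in the requirements does not work with newer Python versions (see https://github.com/pytorch/pytorch/issues/47354). Use Python 3.8 when creating the environment; installing the requirements under the wrong Python version fails with an obscure error message.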

Shape Biased Models

Our shape-biased pretrained models can be downloaded from here. Code for evaluating their shape bias via auto-segmentation on the PASCAL VOC dataset can be found under scripts. Place the VOC devkit folder under data/voc or fix the dataset paths in the scripts as necessary.

Running segmentation evaluation on models:

./scripts/eval_segmentation.sh

Visualizing segmentation for images in a given folder:

./scripts/visualize_segmentation.sh
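
For intuition, below is a minimal sketch of attention-based auto-segmentation in the spirit of the paper. It assumes a DINO-style ViT that exposes get_last_selfattention(); the input file name and the 0.6 quantile threshold are hypothetical, and eval_segmentation.sh remains the authoritative pipeline.

import torch
from PIL import Image
from torchvision import transforms

# DINO ViT-S/16 exposes the attention of its last block directly.
model = torch.hub.load('facebookresearch/dino:main', 'dino_vits16')
model.eval()

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
])
img = preprocess(Image.open('example.jpg').convert('RGB')).unsqueeze(0)  # hypothetical file

with torch.no_grad():
    attn = model.get_last_selfattention(img)  # (1, heads, tokens, tokens)

patch = 16
side = 224 // patch  # 14x14 patch grid
# Attention from the CLS token to every image patch, averaged over heads.
cls_attn = attn[0, :, 0, 1:].mean(0).reshape(side, side)
# Threshold to a binary foreground mask (hypothetical 0.6 quantile cutoff).
mask = (cls_attn > cls_attn.quantile(0.6)).float()

Upsampling mask to the input resolution gives a coarse foreground segmentation obtained without any pixel-level labels.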

Off the Shelf Classification

Training code for the off-the-shelf classification experiment is in classify_metadataset.py. Seven datasets (aircraft, CUB, DTD, fungi, GTSRB, Places365, INAT) are available by default. Set the appropriate dataset directory in classify_md.sh by fixing DATA_PATH.

Run training and evaluation for a selected dataset (aircraft by default) using selected model (DeiT-T by default):

./scripts/classify_md.sh
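
The core idea behind the feature ensemble, sketched below, is to concatenate the class token from several blocks of a single frozen ViT and train only a linear classifier on top. The sketch assumes a timm-style DeiT (attribute names such as blocks, cls_token, and pos_embed), and the block indices and class count are illustrative; classify_metadataset.py is the authoritative implementation.

import torch
import timm

model = timm.create_model('deit_tiny_patch16_224', pretrained=True)
model.eval()

def block_cls_tokens(model, x, block_ids=(8, 9, 10, 11)):
    # Run the ViT manually and collect the class token after selected blocks.
    x = model.patch_embed(x)
    cls = model.cls_token.expand(x.shape[0], -1, -1)
    x = model.pos_drop(torch.cat((cls, x), dim=1) + model.pos_embed)
    feats = []
    for i, blk in enumerate(model.blocks):
        x = blk(x)
        if i in block_ids:
            # Class token of block i (final norm reused here for simplicity).
            feats.append(model.norm(x)[:, 0])
    return torch.cat(feats, dim=-1)  # concatenated ensemble feature

images = torch.randn(4, 3, 224, 224)  # dummy batch
with torch.no_grad():
    feats = block_cls_tokens(model, images)
classifier = torch.nn.Linear(feats.shape[-1], 100)  # 100 classes, illustrative
logits = classifier(feats)

Only the linear classifier is trained; the transformer stays frozen, which is what makes the features "off the shelf".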

Occlusion Evaluation

Evaluation on the ImageNet val set (change the path in the script) for our proposed occlusion techniques:

./scripts/evaluate_occlusion.sh
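
As a rough illustration of the random occlusion strategy (PatchDrop), the hypothetical helper below zeroes out a fraction of 16x16 patches before classification; the foreground and background variants additionally require saliency information, and evaluate_occlusion.sh is the authoritative implementation.

import torch

def random_patch_drop(images, drop_ratio=0.5, patch=16):
    # Set a random subset of patches to zero. images: (B, C, H, W).
    b, c, h, w = images.shape
    gh, gw = h // patch, w // patch
    n_drop = int(drop_ratio * gh * gw)
    out = images.clone()
    for i in range(b):
        for p in torch.randperm(gh * gw)[:n_drop].tolist():
            row, col = divmod(p, gw)
            out[i, :, row*patch:(row+1)*patch, col*patch:(col+1)*patch] = 0
    return out

# Occlude 80% of the image content, as in the paper's most extreme setting.
occluded = random_patch_drop(torch.randn(2, 3, 224, 224), drop_ratio=0.8)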

Permutation Invariance Evaluation

Evaluation on the ImageNet val set (change the path in the script) for the shuffle operation:

./scripts/evaluate_shuffle.sh
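
The shuffle operation destroys the spatial structure of an image while preserving patch content. A minimal sketch, with shuffle_patches as a hypothetical helper (the grid sizes used in the script may differ):

import torch

def shuffle_patches(images, grid=4):
    # Randomly permute a grid x grid patch decomposition of each image.
    b, c, h, w = images.shape
    ph, pw = h // grid, w // grid
    # (B, C, H, W) -> (B, grid*grid, C, ph, pw)
    patches = images.reshape(b, c, grid, ph, grid, pw)
    patches = patches.permute(0, 2, 4, 1, 3, 5).reshape(b, grid * grid, c, ph, pw)
    # One shared random permutation of the patch positions.
    patches = patches[:, torch.randperm(grid * grid)]
    # Reassemble the shuffled patches back into an image.
    patches = patches.reshape(b, grid, grid, c, ph, pw).permute(0, 3, 1, 4, 2, 5)
    return patches.reshape(b, c, h, w)

shuffled = shuffle_patches(torch.randn(2, 3, 224, 224), grid=4)  # 16 shuffled patches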

Varying Patch Sizes and Positional Encoding

Pretrained models to study the effect of varying patch sizes and positional encoding:

DeiT-T Model    Top-1 (%)   Top-5 (%)   Pretrained
No Pos. Enc.    68.3        89.0        Link
Patch 22        68.7        89.0        Link
Patch 28        65.2        86.7        Link
Patch 32        63.1        85.3        Link
Patch 38        55.2        78.8        Link
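
The checkpoints are regular PyTorch files. A minimal loading sketch, assuming (not guaranteed) that each checkpoint is either a bare state dict or wrapped as {'model': ...}; the matching model definitions live in vit_models/deit.py.

import torch

def load_checkpoint(model, path):
    # Load a downloaded checkpoint, unwrapping a {'model': ...} wrapper if present.
    ckpt = torch.load(path, map_location='cpu')
    state_dict = ckpt.get('model', ckpt) if isinstance(ckpt, dict) else ckpt
    model.load_state_dict(state_dict)
    return model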

References

Code is borrowed from the DeiT and DINO repositories.

Comments
  • Question about links of pretrained models

    Hi! First of all, thanks to the authors for the exciting work! I noticed that the checkpoint link of the pretrained 'deit_tiny_distilled_patch16_224' in vit_models/deit.py is different from that of the shape-biased model DeiT-T-SIN (distilled) given in README.md. I thought deit_tiny_distilled_patch16_224 had the same definition as DeiT-T-SIN (distilled). Do they differ in model architecture or training procedure?

    opened by ZhouqyCH 3
  • Two questions on your paper

    Hi. This is heonjin.

    Firstly, big thanks to you for your well-written and precise paper! I have two questions about it.

    1. Please take a look at Figure 9. In the 'no positional encoding' experiment, there is a peak at shuffle size 196 for "DeiT-T-no-pos". Why is there a peak? I also wonder why the accuracy of "DeiT-T-no-pos" decreases from shuffle size 0 to 64.

    2. In Figure 14, on the Aircraft (few-shot) and Flower (few-shot) datasets, the CNN performs better than DeiT. Could you explain why?

    Thanks in advance.

    opened by hihunjin 2
  • Attention maps DINO Patchdrop

    Hi, thanks for the amazing paper.

    My question is about which patches are dropped from the image with the DINO model. In the code in evaluate.py, line 132 sets head_number = 1. I want to understand why this number was chosen (the other params used to index the attention maps seem to make sense). Wouldn't averaging the attention maps across heads give better segmentation?

    Thanks,

    Ravi

    opened by rraju1 1
  • Support CPU when visualizing segmentations

    Most of the code to visualize segmentation is ready for GPU and CPU, but I bumped into this one place where there is a hard-coded .cuda() call. I changed it to .to(device) to support CPU.

    opened by cgarbin 0
  • Expand the instructions to install the PASCAL VOC dataset

    I inspected the code to understand the expected directory structure. This note in the README may help other users put the dataset in the right place from the start.

    opened by cgarbin 0
  • Add note to use Python 3.8 because of PyTorch 1.7

    PyTorch 1.7 requires Python 3.8. Refer to the discussion in https://github.com/pytorch/pytorch/issues/47354.

    I suggest adding this note to the README to help reproduce the environment, because running pip install -r requirements.txt with the wrong version of Python gives an obscure error message.

    opened by cgarbin 0
  • Amazing work, but can it work on DETR?

    The ViT family shows strong robustness to RandomDrop and domain shift. I'm working on object detection these days; DETR is an end-to-end object detection method that adopts the Transformer encoder-decoder, but the backbone I use is ResNet-50, and it still exhibits the properties your paper mentions. I want to ask two questions: (1) Do these intriguing properties come from the encoder/decoder part? (2) What's the difference between distribution shift and domain shift? (I saw "distribution shift" for the first time in your paper.)

    opened by 1184125805 0