So-ViT: Mind Visual Tokens for Vision Transformer

Last update: Nov 24, 2022

Introduction

This repository contains the PyTorch source code and the models trained on the ImageNet-1K dataset for the following paper:

@article{So-ViT,
    author = {Jiangtao Xie and Ruiren Zeng and Qilong Wang and Ziqi Zhou and Peihua Li},
    title = {So-ViT: Mind Visual Tokens for Vision Transformer},
    journal = {arXiv preprint arXiv:2104.10935},
    year = {2021}
}

The Vision Transformer (ViT) depends heavily on pretraining with ultra-large-scale datasets (e.g. ImageNet-21K or JFT-300M) to achieve high performance, and it significantly underperforms on ImageNet-1K when trained from scratch. We propose a novel So-ViT model to address this problem by carefully considering the role of visual tokens.

First, in the classification head, the ViT exploits only the class token and entirely neglects the rich semantic information inherent in high-level visual tokens. We therefore propose a new classification paradigm in which the second-order, cross-covariance pooling of visual tokens is combined with the class token for final classification. In addition, we propose a fast singular value power normalization to improve the second-order pooling.
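As a rough illustration of the idea above, the following NumPy sketch computes a cross-covariance matrix between two linear projections of the visual tokens and then applies power normalization to its singular values through an exact SVD. It is not the repository's code: the projection matrices `w1` and `w2`, the feature sizes, the exponent `alpha = 0.5`, and the function names are all assumptions made for illustration, and the fast approximation proposed in the paper is not reproduced here.

```python
import numpy as np


def cross_covariance_pooling(tokens, w1, w2):
    """Cross-covariance of two linear projections of the visual tokens.

    tokens: (n, d) array of n visual tokens (class token excluded).
    w1, w2: (d, p) and (d, q) projection matrices (hypothetical here).
    Returns a (p, q) matrix.
    """
    x = tokens @ w1
    y = tokens @ w2
    # Center each projected feature over the token dimension.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    return x.T @ y / tokens.shape[0]


def singular_value_power_norm(mat, alpha=0.5):
    """Raise every singular value of `mat` to the power `alpha` (exact SVD)."""
    u, s, vt = np.linalg.svd(mat, full_matrices=False)
    # Scale the columns of u by s**alpha, then recompose the matrix.
    return (u * s ** alpha) @ vt


rng = np.random.default_rng(0)
tokens = rng.standard_normal((196, 64))   # e.g. a 14 x 14 grid of 64-d tokens
w1 = rng.standard_normal((64, 16))        # assumed projection sizes
w2 = rng.standard_normal((64, 16))

pooled = singular_value_power_norm(cross_covariance_pooling(tokens, w1, w2))
feature = pooled.reshape(-1)              # flattened second-order feature
print(feature.shape)                      # (256,)
```

With `alpha = 0.5` every singular value is replaced by its square root, which compresses the largest singular values relative to the smaller ones while leaving the singular vectors unchanged.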

Second, the ViT embeds visual tokens with a naïve method, a single linear projection of fixed-size image patches, which lacks the ability to model translation equivariance and locality. To alleviate this problem, we develop a lightweight, hierarchical module based on off-the-shelf convolutions for visual token embedding.
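To make the contrast concrete, the sketch below counts the visual tokens produced by a single projection of 16x16 patches and by a small stack of strided convolutions, using the standard output-size formula for convolution layers. The layer configuration (kernel sizes, strides, padding) is hypothetical and chosen only for illustration; it is not taken from the So-ViT code.

```python
def conv_out_size(size, kernel, stride, padding):
    """Spatial output size of a convolution or pooling layer (floor mode)."""
    return (size + 2 * padding - kernel) // stride + 1


def token_grid(size, layers):
    """Apply a sequence of (kernel, stride, padding) layers to a side length."""
    for kernel, stride, padding in layers:
        size = conv_out_size(size, kernel, stride, padding)
    return size


# ViT-style embedding: one projection of non-overlapping 16x16 patches.
patch_embed = [(16, 16, 0)]

# Hypothetical hierarchical embedding built from overlapping convolutions.
conv_embed = [(7, 2, 3), (3, 2, 1), (3, 2, 1), (3, 2, 1)]

for name, layers in [("patch", patch_embed), ("conv", conv_embed)]:
    side = token_grid(224, layers)
    print(name, side, side * side)
```

For a 224x224 input both configurations end with a 14 x 14 grid (196 tokens). The difference is that each convolution kernel is larger than its stride, so neighbouring tokens are computed from overlapping image regions instead of disjoint patches.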

Classification results

Classification results (single crop 224x224, %) on ImageNet-1K validation set

Network	Top-1 accuracy, paper reported (%)	Top-1 accuracy, upgrade (%)	Pre-trained model (GoogleDrive)	Pre-trained model (BaiduCloud)
So-ViT-7	76.2	76.8	Coming soon	Coming soon
So-ViT-10	77.9	78.7	Coming soon	Coming soon
So-ViT-14	81.8	82.3	Coming soon	Coming soon
So-ViT-19	82.4	82.8	Coming soon	Coming soon
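For reference, top-1 accuracy is the percentage of validation images whose highest-scoring class matches the ground-truth label. A minimal sketch in plain Python follows; the function name and the toy scores are illustrative and not part of the repository.

```python
def top1_accuracy(scores, labels):
    """Percentage of samples whose arg-max score equals the label.

    scores: list of per-class score lists, one per image.
    labels: list of integer class indices.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    correct = 0
    for row, label in zip(scores, labels):
        predicted = max(range(len(row)), key=row.__getitem__)
        correct += int(predicted == label)
    return 100.0 * correct / len(labels)


scores = [
    [0.1, 0.7, 0.2],   # predicted class 1
    [0.8, 0.1, 0.1],   # predicted class 0
    [0.3, 0.3, 0.4],   # predicted class 2
    [0.2, 0.5, 0.3],   # predicted class 1
]
labels = [1, 0, 1, 1]
print(top1_accuracy(scores, labels))  # 75.0
```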

Installation and Usage

Install PyTorch (>=1.6.0)
Install timm (==0.3.4)
Install thop: pip install thop
Clone the repository: git clone https://github.com/jiangtaoxie/So-ViT
Prepare the dataset as follows:

.
├── train
│   ├── class1
│   │   ├── class1_001.jpg
│   │   ├── class1_002.jpg
│   │   └── ...
│   ├── class2
│   ├── class3
│   ├── ...
│   ├── ...
│   └── classN
└── val
    ├── class1
    │   ├── class1_001.jpg
    │   ├── class1_002.jpg
    │   └── ...
    ├── class2
    ├── class3
    ├── ...
    ├── ...
    └── classN
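
Before launching a run, it can help to confirm that the directory matches this layout. The following standard-library sketch counts the images per class in `train` and `val` and reports classes that appear in only one split. The function names and the accepted file extensions are assumptions, not part of the repository.

```python
from pathlib import Path

IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}  # assumed set of image suffixes


def count_images(split_dir):
    """Map each class folder in `split_dir` to its number of image files."""
    counts = {}
    for class_dir in sorted(Path(split_dir).iterdir()):
        if class_dir.is_dir():
            counts[class_dir.name] = sum(
                1
                for f in class_dir.iterdir()
                if f.is_file() and f.suffix.lower() in IMAGE_EXTENSIONS
            )
    return counts


def check_layout(root):
    """Return per-split counts and the classes found in only one split."""
    train = count_images(Path(root) / "train")
    val = count_images(Path(root) / "val")
    mismatched = sorted(set(train) ^ set(val))
    return train, val, mismatched
```

Calling `check_layout` on the dataset root returns the two per-class count dictionaries and a list of mismatched class names; an empty list means `train` and `val` contain the same class folders.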

For training from scratch:

sh model_name.sh  # model_name = {So_vit_7/10/14/19}

Acknowledgment

pytorch: https://github.com/pytorch/pytorch

timm: https://github.com/rwightman/pytorch-image-models

T2T-ViT: https://github.com/yitu-opensource/T2T-ViT

Contact

If you have any questions or suggestions, please contact me

[email protected]

Owner

Jiangtao Xie
