SPT_LSA_ViT - Implementation of Vision Transformer for Small-Size Datasets

Last update: Jan 01, 2023

Overview

Vision Transformer for Small-Size Datasets

Seung Hoon Lee, Seunghyun Lee, and Byung Cheol Song | Paper (arXiv:2112.13492)

Inha University

Abstract

Recently, the Vision Transformer (ViT), which applies the transformer structure to the image classification task, has outperformed convolutional neural networks. However, the high performance of the ViT results from pre-training on a large dataset such as JFT-300M, and this dependence on large datasets is attributed to its weak locality inductive bias. This paper proposes Shifted Patch Tokenization (SPT) and Locality Self-Attention (LSA), which effectively address the lack of locality inductive bias and enable a ViT to learn from scratch even on small-size datasets. Moreover, SPT and LSA are generic and effective add-on modules that are easily applicable to various ViTs. Experimental results show that when both SPT and LSA were applied to the ViTs, performance improved by an average of 2.96% on Tiny-ImageNet, a representative small-size dataset. In particular, Swin Transformer achieved a performance improvement of 4.08% thanks to the proposed SPT and LSA.

Method

Shifted Patch Tokenization
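
As a rough illustration of the idea behind SPT as the paper describes it: the input image is shifted by half a patch in four diagonal directions, the shifted copies are concatenated with the original along the channel axis, and the result is split into patches, which the paper then layer-normalizes and linearly projects into tokens. The sketch below uses plain Python lists for a single-channel image, assumes the image size is divisible by the patch size, and stops at the flattened patches. The names `shift` and `shifted_patches` are hypothetical and are not taken from the repository, which works on PyTorch tensors.

```python
def shift(image, dy, dx):
    """Shift a 2D image by (dy, dx) pixels, filling vacated pixels with 0."""
    h, w = len(image), len(image[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            sy, sx = y - dy, x - dx  # source pixel that lands on (y, x)
            if 0 <= sy < h and 0 <= sx < w:
                out[y][x] = image[sy][sx]
    return out


def shifted_patches(image, patch_size):
    """Return flattened patches of the image stacked with four diagonally
    shifted copies (shift distance = half the patch size)."""
    s = patch_size // 2
    # Original plus left-up, right-up, left-down, right-down shifts.
    channels = [image] + [shift(image, dy, dx)
                          for dy, dx in ((-s, -s), (-s, s), (s, -s), (s, s))]
    h, w = len(image), len(image[0])
    patches = []
    for py in range(0, h, patch_size):
        for px in range(0, w, patch_size):
            # One patch vector = the same window taken from all 5 channels.
            patches.append([ch[y][x]
                            for ch in channels
                            for y in range(py, py + patch_size)
                            for x in range(px, px + patch_size)])
    return patches


image = [[r * 4 + c for c in range(4)] for r in range(4)]  # toy 4x4 image
patches = shifted_patches(image, patch_size=2)
print(len(patches), len(patches[0]))  # 4 patches, each 5 * 2 * 2 = 20 values
```

Each patch vector is five times longer than in plain patch tokenization, so every token carries information from a wider neighborhood of pixels.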

Locality Self-Attention
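
LSA, as described in the paper, changes standard self-attention in two ways: the diagonal of the similarity matrix is masked so that a token does not attend to itself, and the fixed scaling factor is replaced by a learnable temperature. The following is a minimal single-head sketch in plain Python, not the repository's PyTorch module. The function name `locality_self_attention` is hypothetical, `temperature` is passed as an ordinary float where a trained model would use a learned parameter, and at least two tokens are assumed.

```python
import math


def locality_self_attention(q, k, v, temperature):
    """Single-head attention with diagonal masking and temperature scaling.

    q, k, v: lists of n >= 2 token vectors (lists of floats).
    temperature: positive float standing in for the learnable temperature.
    """
    n = len(q)
    out = []
    for i in range(n):
        # Similarity of token i to every token, scaled by the temperature.
        scores = [sum(a * b for a, b in zip(q[i], k[j])) / temperature
                  for j in range(n)]
        scores[i] = -math.inf  # diagonal masking: drop the self-token relation
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]  # exp(-inf) == 0.0
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted sum of the value vectors.
        out.append([sum(w * v[j][d] for j, w in enumerate(weights))
                    for d in range(len(v[0]))])
    return out


two = [[1.0, 0.0], [0.0, 1.0]]
# With only two tokens, each token can attend only to the other one.
print(locality_self_attention(two, two, two, temperature=1.0))
# [[0.0, 1.0], [1.0, 0.0]]
```

A smaller temperature sharpens the attention distribution, while the masking forces each token to draw on its neighbors rather than on itself.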

Model Performance

Small-Size Dataset Classification

Model	FLOPs	CIFAR10	CIFAR100	SVHN	Tiny-ImageNet
ViT	189.8	93.58	73.81	97.82	57.07
SL-ViT	199.2	94.53	76.92	97.79	61.07
T2T	643.0	95.30	77.00	97.90	60.57
SL-T2T	671.4	95.57	77.36	97.91	61.83
CaiT	613.8	94.91	76.89	98.13	64.37
SL-CaiT	623.3	95.81	80.32	98.28	67.18
PiT	279.2	94.24	74.99	97.83	60.25
SL-PiT	322.9	95.88	79.00	97.93	62.91
Swin	242.3	94.46	76.87	97.72	60.87
SL-Swin	284.9	95.93	79.99	97.92	64.95
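
The improvement figures quoted in the abstract can be checked against the Tiny-ImageNet column of this table. The short script below assumes that "improvement" means the absolute difference in accuracy between each baseline and its SL- counterpart; under that assumption it reproduces both the 2.96 average and the 4.08 gain for Swin.

```python
# Tiny-ImageNet accuracy (%) from the table: (baseline, with SPT and LSA).
tiny_imagenet = {
    "ViT": (57.07, 61.07),
    "T2T": (60.57, 61.83),
    "CaiT": (64.37, 67.18),
    "PiT": (60.25, 62.91),
    "Swin": (60.87, 64.95),
}

# Absolute accuracy gain per model and the mean over all five models.
gains = {name: sl - base for name, (base, sl) in tiny_imagenet.items()}
average_gain = sum(gains.values()) / len(gains)

for name, gain in gains.items():
    print(f"{name}: +{gain:.2f}")
print(f"average: +{average_gain:.2f}")
```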

Accuracy-Throughput Graph

How to train models

Pure ViT

python main.py --model vit

SL-Swin

python main.py --model swin --is_LSA --is_SPT
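
The two commands above differ only in the model name and in whether the `--is_LSA` and `--is_SPT` flags are present. To launch several variants from a script, a small helper along the following lines could assemble the command strings. `build_command` is a hypothetical name and not part of the repository, and only the `vit` and `swin` values of `--model` are confirmed by this README.

```python
import shlex


def build_command(model, use_lsa=False, use_spt=False):
    """Assemble a training command using the flags shown in this README."""
    args = ["python", "main.py", "--model", model]
    if use_lsa:
        args.append("--is_LSA")
    if use_spt:
        args.append("--is_SPT")
    return shlex.join(args)


print(build_command("vit"))                               # pure ViT
print(build_command("swin", use_lsa=True, use_spt=True))  # SL-Swin
```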

Citation

@misc{lee2021vision,
      title={Vision Transformer for Small-Size Datasets}, 
      author={Seung Hoon Lee and Seunghyun Lee and Byung Cheol Song},
      year={2021},
      eprint={2112.13492},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

Owner

Lee SeungHoon
