The official implementation for "FQ-ViT: Fully Quantized Vision Transformer without Retraining".

Overview

FQ-ViT [arXiv]

This repo contains the official implementation of "FQ-ViT: Fully Quantized Vision Transformer without Retraining".

Table of Contents

  • Introduction
  • Getting Started
  • Results on ImageNet
  • Citation

Introduction

Transformer-based architectures have achieved competitive performance in various computer vision tasks. Compared with CNNs, Transformers usually have more parameters and higher computational costs, which makes them challenging to deploy on resource-constrained hardware devices.

Most existing quantization approaches are designed and tested on CNNs and lack proper handling of Transformer-specific modules. Previous work found significant accuracy degradation when quantizing the LayerNorm and Softmax of Transformer-based architectures, and therefore left these two modules unquantized in floating point. We revisit these two exclusive modules of Vision Transformers and discover the reasons for the degradation. In this work, we propose FQ-ViT, the first fully quantized Vision Transformer, which contains two dedicated modules: Powers-of-Two Scale (PTS) and Log-Int-Softmax (LIS).

LayerNorm quantized with Powers-of-Two Scale (PTS)

The two figures below show that inter-channel variation is far more serious in Vision Transformers than in CNNs, which leads to unacceptable quantization errors with layer-wise quantization.

Taking advantage of both layer-wise and channel-wise quantization, we propose PTS for LayerNorm quantization. The core idea of PTS is to equip different channels with different powers-of-two scale factors, rather than different quantization scales, so that the per-channel rescaling can be carried out with hardware-friendly bit shifts.
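
Below is a minimal sketch of the PTS idea, assuming a simple symmetric MinMax-style clipping rule; the names (`pts_quantize`, `s`, `alpha`) are ours for illustration and are not the repository's API.

```python
# Sketch of Powers-of-Two Scale (PTS): one shared layer-wise scale s plus a
# per-channel power-of-two exponent alpha[c], so the effective scale of channel c
# is s * 2**alpha[c]. On integer hardware the 2**alpha factor becomes a bit shift.
import torch

def pts_quantize(x, n_bits=8):
    """Quantize x of shape [N, C] with a shared scale and per-channel 2**alpha."""
    qmax = 2 ** (n_bits - 1) - 1
    ch_max = x.abs().amax(dim=0).clamp(min=1e-8)                        # per-channel range [C]
    s = ch_max.min() / qmax                                             # layer-wise base scale
    alpha = torch.round(torch.log2(ch_max / (qmax * s))).clamp(min=0)   # per-channel exponent [C]
    scale = s * 2.0 ** alpha                                            # effective per-channel scale
    x_int = torch.clamp(torch.round(x / scale), -qmax - 1, qmax)
    return x_int, x_int * scale                                         # integer codes, dequantized x
```

Because `alpha` only takes small integer values, the per-channel rescaling is just a shift applied on top of the shared layer-wise scale, which is why PTS keeps the efficiency of layer-wise quantization while absorbing the inter-channel variation.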

Softmax quantized with Log-Int-Softmax (LIS)

The storage and computation of the attention map are known to be a bottleneck for Transformer structures, so we want to quantize it to an extremely low bit-width (e.g., 4-bit). However, directly applying 4-bit uniform quantization causes severe accuracy degradation. We observe that the output of Softmax is concentrated at fairly small values, while only a few outliers take larger values close to 1. As the visualization in the paper shows, Log2 quantization preserves more quantization bins than uniform quantization for the densely distributed small-value interval.
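
The contrast in bin allocation can be reproduced with a few lines. This is only an illustration with names of our choosing, not code from this repo.

```python
# Uniform vs. Log2 quantization of softmax outputs at 4 bits: Log2 spends its
# bins on the densely populated small values, while uniform spends most of them
# on the nearly empty region close to 1.
import torch

def uniform_quant(p, n_bits=4):
    qmax = 2 ** n_bits - 1
    return torch.round(p * qmax) / qmax                       # softmax outputs lie in [0, 1]

def log2_quant(p, n_bits=4):
    qmax = 2 ** n_bits - 1
    q = torch.clamp(torch.round(-torch.log2(p.clamp(min=2.0 ** -qmax))), 0, qmax)
    return 2.0 ** (-q)                                         # representable values are powers of two

p = torch.softmax(torch.randn(1, 8, 16, 16), dim=-1)           # a toy attention map
print((p - uniform_quant(p)).abs().mean().item())              # mean error of uniform 4-bit
print((p - log2_quant(p)).abs().mean().item())                 # mean error of Log2 4-bit
```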

Combining Log2 quantization with i-exp, the polynomial approximation of the exponential function presented by I-BERT, we propose LIS, an integer-only, faster, and lower-memory Softmax.
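
A compact sketch of the LIS computation is given below, written in floating point for readability; the actual method performs these steps with integer arithmetic by folding the quantization scale into the constants. The function names (`i_exp`, `log_int_softmax`) are ours, not the repository's API.

```python
# Log-Int-Softmax sketch: I-BERT's i-exp polynomial approximates exp(x) for
# x <= 0, and the normalized result is stored as a 4-bit Log2 code.
import torch

LN2 = 0.6931471805599453

def i_exp(x):
    """Second-order polynomial approximation of exp(x) for x <= 0 (I-BERT's i-exp)."""
    z = torch.floor(-x / LN2)                     # non-negative integer part
    r = x + z * LN2                               # remainder in (-ln2, 0]
    poly = 0.3585 * (r + 1.353) ** 2 + 0.344      # polynomial fit of exp on (-ln2, 0]
    return poly * 2.0 ** (-z)

def log_int_softmax(scores, n_bits=4):
    scores = scores - scores.amax(dim=-1, keepdim=True)        # shift so that x <= 0
    num = i_exp(scores)
    p = num / num.sum(dim=-1, keepdim=True)
    qmax = 2 ** n_bits - 1
    q = torch.clamp(torch.round(-torch.log2(p.clamp(min=2.0 ** -qmax))), 0, qmax)
    return 2.0 ** (-q)                             # dequantized 4-bit attention map
```

Since the 4-bit codes are exponents, the paper notes that the subsequent multiplication of the attention map with the value matrix can likewise be replaced by bit shifts, which is where the speed and memory savings come from.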

The whole process is visualized as follows.

Getting Started

Install

  • Clone this repo.
git clone https://github.com/linyang-zhh/FQ-ViT.git
cd FQ-ViT
  • Create a conda virtual environment and activate it.
conda create -n fq-vit python=3.7 -y
conda activate fq-vit
  • Install PyTorch and torchvision, e.g.,
conda install pytorch=1.7.1 torchvision cudatoolkit=10.1 -c pytorch

Data preparation

You should download the standard ImageNet dataset and organize it as follows:

├── imagenet
│   ├── train
│   ├── val

Run

Example: Evaluate quantized DeiT-S with MinMax quantizer and our proposed PTS and LIS

python test_quant.py deit_small <YOUR_DATA_DIR> --quant --pts --lis --quant-method minmax
  • deit_small: model architecture, which can be replaced by deit_tiny, deit_base, vit_base, vit_large, swin_tiny, swin_small and swin_base.

  • --quant: whether to quantize the model.

  • --pts: whether to use Powers-of-Two Scale Integer LayerNorm.

  • --lis: whether to use Log-Int-Softmax (LIS).

  • --quant-method: quantization method for activations, chosen from minmax, ema, percentile and omse.

Results on ImageNet

This paper employs several common post-training quantization strategies together with our methods, including MinMax, EMA, Percentile and OMSE; a minimal sketch of how each picks its clipping values follows the list.

  • MinMax uses the minimum and maximum values of all the data as the clipping values;

  • EMA builds on MinMax and uses an exponential moving average to smooth the minimum and maximum values across different mini-batches;

  • Percentile assumes that the values follow a normal distribution and clips at a given percentile. In this paper, we use the 1e-5 percentile, because the 1e-4 percentile commonly used for CNNs performs poorly on Vision Transformers;

  • OMSE determines the clipping values by minimizing the mean squared quantization error.
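
The sketch below illustrates these four clipping rules under a simple per-tensor calibration setting; the helper names are ours and none of this is the repository's code.

```python
# Illustrative clipping-value selection for the four calibration methods.
import torch

def minmax_clip(x):
    return x.min(), x.max()

def ema_clip(x, prev=None, momentum=0.9):
    # exponential moving average of the per-batch min/max
    lo, hi = x.min(), x.max()
    if prev is None:
        return lo, hi
    return (momentum * prev[0] + (1 - momentum) * lo,
            momentum * prev[1] + (1 - momentum) * hi)

def percentile_clip(x, p=1e-5):
    # drop the p fraction of extreme values on each side before clipping
    flat, _ = x.flatten().float().sort()
    k = int(p * flat.numel())
    return flat[k], flat[flat.numel() - 1 - k]

def omse_clip(x, n_bits=8, n_steps=100):
    # search the symmetric threshold that minimizes the quantization MSE
    qmax = 2 ** (n_bits - 1) - 1
    best_hi, best_err = None, float("inf")
    for i in range(1, n_steps + 1):
        hi = x.abs().max() * i / n_steps
        scale = hi / qmax
        xq = torch.clamp(torch.round(x / scale), -qmax - 1, qmax) * scale
        err = torch.mean((x - xq) ** 2)
        if err < best_err:
            best_hi, best_err = hi, err
    return -best_hi, best_hi
```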

The following results (top-1 accuracy, %) are evaluated on ImageNet.

| Method | W/A/Attn Bits | ViT-B | ViT-L | DeiT-T | DeiT-S | DeiT-B | Swin-T | Swin-S | Swin-B |
|---|---|---|---|---|---|---|---|---|---|
| Full Precision | 32/32/32 | 84.53 | 85.81 | 72.21 | 79.85 | 81.85 | 81.35 | 83.20 | 83.60 |
| MinMax | 8/8/8 | 23.64 | 3.37 | 70.94 | 75.05 | 78.02 | 64.38 | 74.37 | 25.58 |
| MinMax w/ PTS | 8/8/8 | 83.31 | 85.03 | 71.61 | 79.17 | 81.20 | 80.51 | 82.71 | 82.97 |
| MinMax w/ PTS, LIS | 8/8/4 | 82.68 | 84.89 | 71.07 | 78.40 | 80.85 | 80.04 | 82.47 | 82.38 |
| EMA | 8/8/8 | 30.30 | 3.53 | 71.17 | 75.71 | 78.82 | 70.81 | 75.05 | 28.00 |
| EMA w/ PTS | 8/8/8 | 83.49 | 85.10 | 71.66 | 79.09 | 81.43 | 80.52 | 82.81 | 83.01 |
| EMA w/ PTS, LIS | 8/8/4 | 82.57 | 85.08 | 70.91 | 78.53 | 80.90 | 80.02 | 82.56 | 82.43 |
| Percentile | 8/8/8 | 46.69 | 5.85 | 71.47 | 76.57 | 78.37 | 78.78 | 78.12 | 40.93 |
| Percentile w/ PTS | 8/8/8 | 80.86 | 85.24 | 71.74 | 78.99 | 80.30 | 80.80 | 82.85 | 83.10 |
| Percentile w/ PTS, LIS | 8/8/4 | 80.22 | 85.17 | 71.23 | 78.30 | 80.02 | 80.46 | 82.67 | 82.79 |
| OMSE | 8/8/8 | 73.39 | 11.32 | 71.30 | 75.03 | 79.57 | 79.30 | 78.96 | 48.55 |
| OMSE w/ PTS | 8/8/8 | 82.73 | 85.27 | 71.64 | 78.96 | 81.25 | 80.64 | 82.87 | 83.07 |
| OMSE w/ PTS, LIS | 8/8/4 | 82.37 | 85.16 | 70.87 | 78.42 | 80.90 | 80.41 | 82.57 | 82.45 |

Citation

If you find this repo useful in your research, please consider citing the following paper:

@misc{lin2021fqvit,
    title={FQ-ViT: Fully Quantized Vision Transformer without Retraining}, 
    author={Yang Lin and Tianyu Zhang and Peiqin Sun and Zheng Li and Shuchang Zhou},
    year={2021},
    eprint={2111.13824},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}