"Exploring Vision Transformers for Fine-grained Classification" at CVPRW FGVC8

Last update: Dec 06, 2022

Overview

FGVC8

The paper "Exploring Vision Transformers for Fine-grained Classification" was presented at CVPR 2021, at the Eighth Workshop on Fine-Grained Visual Categorization (FGVC8), on June 25th.

Abstract

Existing computer vision research in categorization struggles with fine-grained attribute recognition due to the inherently high intra-class variance and low inter-class variance. State-of-the-art (SOTA) methods tackle this challenge by locating the most informative image regions and relying on them to classify the complete image. The most recent work, Vision Transformer (ViT), shows strong performance in both traditional and fine-grained classification tasks.

In this work, we propose a multi-stage ViT framework for fine-grained image classification tasks, which localizes the informative image regions using the inherent multi-head self-attention mechanism, without requiring architectural changes. We also introduce attention-guided augmentations for improving the model's capabilities.
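The idea of localizing informative regions from self-attention alone could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' released code: it assumes attention rollout as the way to aggregate layers, a simple mean-based threshold, and the hypothetical helper names `attention_rollout` and `attention_bbox`. The paper should be consulted for the actual procedure.

```python
import numpy as np


def attention_rollout(attn):
    """Combine per-layer attention into one CLS-to-patch saliency vector.

    attn: array of shape (layers, heads, tokens, tokens) whose rows sum to 1;
    token 0 is assumed to be the class token. Returns shape (tokens - 1,).
    """
    tokens = attn.shape[-1]
    rollout = np.eye(tokens)
    for layer in attn:
        a = layer.mean(axis=0)               # average over heads
        a = 0.5 * a + 0.5 * np.eye(tokens)   # account for the residual path
        a = a / a.sum(axis=-1, keepdims=True)
        rollout = a @ rollout
    return rollout[0, 1:]                    # CLS row, patch columns only


def attention_bbox(saliency, grid, patch, ratio=1.0):
    """Pixel box (x0, y0, x1, y1) around patches with above-average attention.

    A patch is kept when its saliency exceeds ratio * mean saliency
    (an assumed, illustrative threshold). Falls back to the full image
    when no patch passes the threshold.
    """
    amap = saliency.reshape(grid, grid)
    ys, xs = np.nonzero(amap > ratio * amap.mean())
    if ys.size == 0:
        return (0, 0, grid * patch, grid * patch)
    return (int(xs.min()) * patch, int(ys.min()) * patch,
            (int(xs.max()) + 1) * patch, (int(ys.max()) + 1) * patch)
```

In a multi-stage setup, such a box could be used to crop the input for a second pass through the same network, which is one way to localize without architectural changes. How the stages and the attention-guided augmentations are actually defined is not stated in this README.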

We demonstrate the value of our approach by experimenting with four popular fine-grained benchmarks: CUB-200-2011, Stanford Cars, Stanford Dogs, and FGVC7 Plant Pathology. We also illustrate our model's interpretability via qualitative results.

Instructions

Upcoming

Citation

If you find our results interesting, or if you use our code or ideas, please consider citing our work:

@misc{conde2021exploring,
      title={Exploring Vision Transformers for Fine-grained Classification}, 
      author={Marcos V. Conde and Kerem Turgutlu},
      year={2021},
      eprint={2106.10587},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}



Owner

Marcos V. Conde
