Official code for "Focal Self-attention for Local-Global Interactions in Vision Transformers"

Overview

Focal Transformer

PWC PWC PWC PWC PWC PWC

This is the official implementation of our Focal Transformer -- "Focal Self-attention for Local-Global Interactions in Vision Transformers", by Jianwei Yang, Chunyuan Li, Pengchuan Zhang, Xiyang Dai, Bin Xiao, Lu Yuan and Jianfeng Gao.

Introduction

focal-transformer-teaser

Our Focal Transfomer introduced a new self-attention mechanism called focal self-attention for vision transformers. In this new mechanism, each token attends the closest surrounding tokens at fine granularity but the tokens far away at coarse granularity, and thus can capture both short- and long-range visual dependencies efficiently and effectively.

With our Focal Transformers, we achieved superior performance over the state-of-the-art vision Transformers on a range of public benchmarks. In particular, our Focal Transformer models with a moderate size of 51.1M and a larger size of 89.8M achieve 83.6 and 84.0 Top-1 accuracy, respectively, on ImageNet classification at 224x224 resolution. Using Focal Transformers as the backbones, we obtain consistent and substantial improvements over the current state-of-the-art methods for 6 different object detection methods trained with standard 1x and 3x schedules. Our largest Focal Transformer yields 58.7/58.9 box mAPs and 50.9/51.3 mask mAPs on COCO mini-val/test-dev, and 55.4 mIoU on ADE20K for semantic segmentation.

Benchmarking

Image Classification on ImageNet-1K

Model Pretrain Use Conv Resolution [email protected] [email protected] #params FLOPs Checkpoint Config
Focal-T IN-1K No 224 82.2 95.9 28.9M 4.9G download yaml
Focal-T IN-1K Yes 224 82.7 96.1 30.8M 4.9G download yaml
Focal-S IN-1K No 224 83.6 96.2 51.1M 9.4G download yaml
Focal-B IN-1K No 224 84.0 96.5 89.8M 16.4G download yaml

Object Detection and Instance Segmentation on COCO

Mask R-CNN

Backbone Pretrain Lr Schd #params FLOPs box mAP mask mAP
Focal-T ImageNet-1K 1x 49M 291G 44.8 41.0
Focal-T ImageNet-1K 3x 49M 291G 47.2 42.7
Focal-S ImageNet-1K 1x 71M 401G 47.4 42.8
Focal-S ImageNet-1K 3x 71M 401G 48.8 43.8
Focal-B ImageNet-1K 1x 110M 533G 47.8 43.2
Focal-B ImageNet-1K 3x 110M 533G 49.0 43.7

RetinaNet

Backbone Pretrain Lr Schd #params FLOPs box mAP
Focal-T ImageNet-1K 1x 39M 265G 43.7
Focal-T ImageNet-1K 3x 39M 265G 45.5
Focal-S ImageNet-1K 1x 62M 367G 45.6
Focal-S ImageNet-1K 3x 62M 367G 47.3
Focal-B ImageNet-1K 1x 101M 514G 46.3
Focal-B ImageNet-1K 3x 101M 514G 46.9

Other detection methods

Backbone Pretrain Method Lr Schd #params FLOPs box mAP
Focal-T ImageNet-1K Cascade Mask R-CNN 3x 87M 770G 51.5
Focal-T ImageNet-1K ATSS 3x 37M 239G 49.5
Focal-T ImageNet-1K RepPointsV2 3x 45M 491G 51.2
Focal-T ImageNet-1K Sparse R-CNN 3x 111M 196G 49.0

Semantic Segmentation on ADE20K

Backbone Pretrain Method Resolution Iters #params FLOPs mIoU mIoU (MS)
Focal-T ImageNet-1K UPerNet 512x512 160k 62M 998G 45.8 47.0
Focal-S ImageNet-1K UPerNet 512x512 160k 85M 1130G 48.0 50.0
Focal-B ImageNet-1K UPerNet 512x512 160k 126M 1354G 49.0 50.5
Focal-L ImageNet-22K UPerNet 640x640 160k 240M 3376G 54.0 55.4

Getting Started

Citation

If you find this repo useful to your project, please consider to cite it with following bib:

@misc{yang2021focal,
    title={Focal Self-attention for Local-Global Interactions in Vision Transformers}, 
    author={Jianwei Yang and Chunyuan Li and Pengchuan Zhang and Xiyang Dai and Bin Xiao and Lu Yuan and Jianfeng Gao},
    year={2021},
    eprint={2107.00641},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}

Acknowledgement

Our codebase is built based on Swin-Transformer. We thank the authors for the nicely organized code!

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party's policies.

Owner
Microsoft
Open source projects and samples from Microsoft
Microsoft
CLOOB training (JAX) and inference (JAX and PyTorch)

cloob-training Pretrained models There are two pretrained CLOOB models in this repo at the moment, a 16 epoch and a 32 epoch ViT-B/16 checkpoint train

Katherine Crowson 64 Nov 27, 2022
eXPeditious Data Transfer

xpdt: eXPeditious Data Transfer About xpdt is (yet another) language for defining data-types and generating code for serializing and deserializing the

Gianni Tedesco 3 Jan 06, 2022
Research on controller area network Intrusion Detection Systems

Group members information Member 1: Lixue Liang Member 2: Yuet Lee Chan Member 3: Xinruo Zhang Member 4: Yifei Han User Manual Generate Attack Packets

Roche 4 Aug 30, 2022
这是一个yolox-keras的源码,可以用于训练自己的模型。

YOLOX:You Only Look Once目标检测模型在Keras当中的实现 目录 性能情况 Performance 实现的内容 Achievement 所需环境 Environment 小技巧的设置 TricksSet 文件下载 Download 训练步骤 How2train 预测步骤 Ho

Bubbliiiing 64 Nov 10, 2022
Person Re-identification

Person Re-identification Final project of Computer Vision Table of content Person Re-identification Table of content Students: Proposed method Dataset

Nguyễn Hoàng Quân 4 Jun 17, 2021
PyTorch and Tensorflow functional model definitions

functional-zoo Model definitions and pretrained weights for PyTorch and Tensorflow PyTorch, unlike lua torch, has autograd in it's core, so using modu

Sergey Zagoruyko 590 Dec 22, 2022
Wind Speed Prediction using LSTMs in PyTorch

Implementation of Deep-Forecast using PyTorch Deep Forecast: Deep Learning-based Spatio-Temporal Forecasting Adapted from original implementation Setu

Onur Kaplan 151 Dec 14, 2022
Minimal diffusion models - Minimal code and simple experiments to play with Denoising Diffusion Probabilistic Models (DDPMs)

Minimal code and simple experiments to play with Denoising Diffusion Probabilist

Rithesh Kumar 16 Oct 06, 2022
Platform-agnostic AI Framework 🔥

🇬🇧 TensorLayerX is a multi-backend AI framework, which can run on almost all operation systems and AI hardwares, and support hybrid-framework progra

TensorLayer Community 171 Jan 06, 2023
IA for recognising Traffic Signs using Keras [Tensorflow]

Traffic Signs Recognition ⚠️ 🚦 Fundamentals of Intelligent Systems Introduction 📄 Development of a neural network capable of recognizing nine differ

Sebastián Fernández García 2 Dec 19, 2022
Oriented Object Detection: Oriented RepPoints + Swin Transformer/ReResNet

Oriented RepPoints for Aerial Object Detection The code for the implementation of “Oriented RepPoints + Swin Transformer/ReResNet”. Introduction Based

96 Dec 13, 2022
JFB: Jacobian-Free Backpropagation for Implicit Models

JFB: Jacobian-Free Backpropagation for Implicit Models

Typal Research 28 Dec 11, 2022
A toolkit for document-level event extraction, containing some SOTA model implementations

❤️ A Toolkit for Document-level Event Extraction with & without Triggers Hi, there 👋 . Thanks for your stay in this repo. This project aims at buildi

Tong Zhu(朱桐) 159 Dec 22, 2022
A collection of resources on GAN Inversion.

This repo is a collection of resources on GAN inversion, as a supplement for our survey

Implementation of "GNNAutoScale: Scalable and Expressive Graph Neural Networks via Historical Embeddings" in PyTorch

PyGAS: Auto-Scaling GNNs in PyG PyGAS is the practical realization of our G NN A uto S cale (GAS) framework, which scales arbitrary message-passing GN

Matthias Fey 139 Dec 25, 2022
Py-FEAT: Python Facial Expression Analysis Toolbox

Py-FEAT is a suite for facial expressions (FEX) research written in Python. This package includes tools to detect faces, extract emotional facial expressions (e.g., happiness, sadness, anger), facial

Computational Social Affective Neuroscience Laboratory 147 Jan 06, 2023
Compositional Sketch Search

Compositional Sketch Search Official repository for ICIP 2021 Paper: Compositional Sketch Search Requirements Install and activate conda environment c

Alexander Black 8 Sep 06, 2021
Probabilistic Programming and Statistical Inference in PyTorch

PtStat Probabilistic Programming and Statistical Inference in PyTorch. Introduction This project is being developed during my time at Cogent Labs. The

Stefano Peluchetti 109 Nov 26, 2022
Small utility to demangle Nim symbols in callgrind files

nim_callgrind A small utility to demangle Nim symbols from callgrind files. Usage Run your (Nim) program with something like this: valgrind --tool=cal

kraptor 3 Feb 15, 2022
GMFlow: Learning Optical Flow via Global Matching

GMFlow GMFlow: Learning Optical Flow via Global Matching Authors: Haofei Xu, Jing Zhang, Jianfei Cai, Hamid Rezatofighi, Dacheng Tao We streamline the

Haofei Xu 298 Jan 04, 2023