PyTorch implementation of Masked Autoencoders Are Scalable Vision Learners for self-supervised ViT.

Last update: Oct 30, 2022

MAE for Self-supervised ViT

Introduction

This is an unofficial PyTorch implementation of Masked Autoencoders Are Scalable Vision Learners for self-supervised ViT.

This repo is mainly based on moco-v3, pytorch-image-models, and BEiT.

TODO

Main Results

The following results are based on ImageNet-1k self-supervised pre-training, followed by ImageNet-1k supervised training for linear evaluation or end-to-end fine-tuning.

ViT-Base

pretrain epochs	with pixel-norm	linear acc	fine-tuning acc
100	False	--	75.58 [1]
100	True	--	77.19
800	True	--	--

On 8 NVIDIA GeForce RTX 3090 GPUs, pre-training for 100 epochs takes about 9 hours, and a batch size of 4096 needs about 24 GB of GPU memory.

[1] Fine-tuned for 50 epochs.

ViT-Large

pretrain epochs	with pixel-norm	linear acc	fine-tuning acc
100	False	--	--
100	True	--	--

On 8 NVIDIA A40 GPUs, pre-training for 100 epochs takes about 34 hours, and a batch size of 4096 needs about xx GB of GPU memory.

Usage: Preparation

The code has been tested with CUDA 11.4 and PyTorch 1.8.2.

Notes:

The batch size specified by -b is the total batch size across all GPUs from all nodes.
The learning rate specified by --lr is the base lr for a batch size of 256. It is adjusted by the linear lr scaling rule (lr = base lr × total batch size / 256).
This repo supports only multi-GPU DistributedDataParallel training; single-GPU and DataParallel training are not supported. The code is adapted to the multi-node setting and uses automatic mixed precision for pre-training by default.
Only pre-training and fine-tuning have been tested.
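
The batch-size and learning-rate notes above can be sketched as follows. This is a minimal illustration and not code from the repo: the function names are hypothetical, the base lr of 1.5e-4 is an assumed value chosen only for the example, and the even per-GPU split is an assumption about how the total batch size is divided.

```python
def scaled_lr(base_lr: float, total_batch_size: int, reference_batch: int = 256) -> float:
    """Linear lr scaling rule: the lr grows in proportion to the total batch size."""
    return base_lr * total_batch_size / reference_batch


def per_gpu_batch_size(total_batch_size: int, num_nodes: int, gpus_per_node: int) -> int:
    """Split the total batch size (-b) evenly over every GPU on every node (assumed)."""
    world = num_nodes * gpus_per_node
    if total_batch_size % world != 0:
        raise ValueError("total batch size must be divisible by the number of GPUs")
    return total_batch_size // world


# Assumed base lr; 4096 is the ViT-Base pre-training batch size quoted in this README.
effective_lr = scaled_lr(1.5e-4, 4096)       # 16x the base lr
samples_per_gpu = per_gpu_batch_size(4096, num_nodes=1, gpus_per_node=8)
print(effective_lr, samples_per_gpu)
```

With one 8-GPU node and -b 4096, each GPU would process 512 samples per step under this assumption.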

Usage: Self-supervised Pre-Training

Below are examples of MAE pre-training.

ViT-Base with 1-node (8-GPU, NVIDIA GeForce RTX 3090) training, batch 4096

python main_mae.py \
  -c cfgs/ViT-B16_ImageNet1K_pretrain.yaml \
  --multiprocessing-distributed --world-size 1 --rank 0 \
  [your imagenet-folder with train and val folders]

sh train_mae.sh

ViT-Large with 1-node (8-GPU, NVIDIA A40) pre-training, batch 2048

python main_mae.py \
  -c cfgs/ViT-L16_ImageNet1K_pretrain.yaml \
  --multiprocessing-distributed --world-size 1 --rank 0 \
  [your imagenet-folder with train and val folders]

Usage: End-to-End Fine-tuning ViT

Below are examples of MAE fine-tuning.

ViT-Base with 1-node (8-GPU, NVIDIA GeForce RTX 3090) training, batch 1024

python main_fintune.py \
  -c cfgs/ViT-B16_ImageNet1K_finetune.yaml \
  --multiprocessing-distributed --world-size 1 --rank 0 \
  [your imagenet-folder with train and val folders]

ViT-Large with 2-node (16-GPU, 8 NVIDIA GeForce RTX 3090 + 8 NVIDIA A40) training, batch 512

python main_fintune.py \
  -c cfgs/ViT-B16_ImageNet1K_finetune.yaml \
  --multiprocessing-distributed --world-size 2 --rank 0 \
  [your imagenet-folder with train and val folders]

On another node, run the same command with --rank 1.
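
The relationship between --world-size, --rank, and the per-GPU processes can be sketched as below. This assumes the launcher follows the convention used by moco-v3 and the PyTorch ImageNet example, where --world-size counts nodes, --rank is the node index, and one process is spawned per GPU. The helper name is hypothetical and this repo's launcher may differ.

```python
def global_rank(node_rank: int, gpus_per_node: int, local_gpu: int) -> int:
    """Global process rank under the assumed one-process-per-GPU convention."""
    return node_rank * gpus_per_node + local_gpu


num_nodes = 2        # value passed as --world-size
gpus_per_node = 8    # GPUs visible on each node

# Total number of DistributedDataParallel processes across both nodes.
total_processes = num_nodes * gpus_per_node

for node_rank in range(num_nodes):  # value passed as --rank on each node
    ranks = [global_rank(node_rank, gpus_per_node, gpu) for gpu in range(gpus_per_node)]
    print(f"node --rank {node_rank}: global ranks {ranks[0]}..{ranks[-1]}")
```

Under this convention the node started with --rank 0 hosts global ranks 0 to 7 and the node started with --rank 1 hosts global ranks 8 to 15.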

Note:

We use --resume rather than the --finetune option from the DeiT repo, because --finetune trains in eval mode. When loading the pre-trained model, change the call to model_without_ddp.load_state_dict(checkpoint['model'], strict=False) so that keys that do not match the fine-tuning model are ignored.
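
A minimal sketch of what strict=False does when loading a checkpoint. The two toy modules are hypothetical stand-ins (an encoder plus decoder for pre-training, an encoder plus head for fine-tuning) and are not the repo's model classes; only load_state_dict and its strict argument are the real PyTorch API.

```python
import torch
from torch import nn


class ToyPretrainModel(nn.Module):
    """Stand-in for a pre-training model: shared encoder plus a decoder."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(4, 4)
        self.decoder = nn.Linear(4, 4)


class ToyFinetuneModel(nn.Module):
    """Stand-in for a fine-tuning model: shared encoder plus a new head."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(4, 4)
        self.head = nn.Linear(4, 2)


checkpoint = {"model": ToyPretrainModel().state_dict()}
model_without_ddp = ToyFinetuneModel()

# strict=False copies the matching keys and reports the rest instead of raising.
result = model_without_ddp.load_state_dict(checkpoint["model"], strict=False)
print("missing from checkpoint:", result.missing_keys)
print("unused checkpoint keys:", result.unexpected_keys)
```

Checking the two reported lists is a cheap way to confirm that only the expected keys (here the head and the decoder) were skipped.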

[TODO] Usage: Linear Classification

By default, we use momentum-SGD and a batch size of 1024 for linear classification on frozen features/weights. This can be done with a single 8-GPU node.

python main_lincls.py \
  -a [architecture] --lr [learning rate] \
  --dist-url 'tcp://localhost:10001' \
  --multiprocessing-distributed --world-size 1 --rank 0 \
  --pretrained [your checkpoint path]/[your checkpoint file].pth.tar \
  [your imagenet-folder with train and val folders]
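
Linear classification on frozen features can be sketched as follows. This is an illustration under stated assumptions and not the contents of main_lincls.py: the tiny model, the parameter names backbone and head, and the lr of 0.1 are all hypothetical.

```python
import torch
from torch import nn

# Hypothetical stand-in for a pre-trained ViT: a backbone followed by a linear head.
model = nn.Sequential()
model.add_module("backbone", nn.Linear(16, 8))
model.add_module("head", nn.Linear(8, 4))

# Freeze every parameter except those of the classification head.
for name, param in model.named_parameters():
    param.requires_grad = name.startswith("head.")

# Momentum-SGD over the trainable (head) parameters only; lr is an assumed value.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=0.1, momentum=0.9, weight_decay=0.0)

# One dummy step: gradients reach the head but not the frozen backbone.
inputs = torch.randn(2, 16)
loss = model(inputs).sum()
loss.backward()
optimizer.step()
```

Passing only the head parameters to the optimizer keeps the backbone weights fixed.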

License

This project is under the CC-BY-NC 4.0 license. See LICENSE for details.

Citation

If you use the code in this repo, please cite the original paper and this repo:

@Article{he2021mae,
  author  = {Kaiming He* and Xinlei Chen* and Saining Xie and Yanghao Li and Piotr Doll{\'a}r and Ross Girshick},
  title   = {Masked Autoencoders Are Scalable Vision Learners},
  journal = {arXiv preprint arXiv:2111.06377},
  year    = {2021},
}

@misc{yang2021maepriv,
  author       = {Lu Yang* and Pu Cao* and Yang Nie and Qing Song},
  title        = {MAE-priv},
  howpublished = {\url{https://github.com/BUPT-PRIV/MAE-priv}},
  year         = {2021},
}
