Unofficial PyTorch implementation of Masked Autoencoders Are Scalable Vision Learners

Last update: Jan 04, 2023

Overview

This repository is built upon BEiT, thanks very much!

Currently, only the pretraining process is implemented according to the paper, and we cannot guarantee that the performance reported in the paper can be reproduced.

Difference

Shuffle and unshuffle operations don't seem to be directly available in PyTorch, so we use another method to realize this process:

For shuffle, we use BEiT's method of randomly generating a mask map (14x14), where mask=0 means the token is kept and mask=1 means the token is dropped (it does not participate in the encoder computation). All visible tokens (mask=0) are then fed into the encoder network.
For unshuffle, we take the position embeddings of all masked tokens according to the mask map, add the shared mask token to them, concatenate the result with the visible tokens (from the encoder), and feed everything into the decoder network for reconstruction.
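
The two steps above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: the function names (random_mask, select_visible, assemble_decoder_input) are hypothetical, the mask is stored as a boolean tensor (True corresponds to mask=1, i.e. dropped), and adding position embeddings to the visible tokens before the decoder is an assumption.

```python
import torch


def random_mask(batch_size, num_patches, mask_ratio=0.75):
    """Return a bool mask of shape (B, N): True = dropped, False = kept."""
    num_masked = int(num_patches * mask_ratio)
    noise = torch.rand(batch_size, num_patches)
    # Mask the patches with the smallest random scores in each row.
    ids = noise.argsort(dim=1)[:, :num_masked]
    mask = torch.zeros(batch_size, num_patches, dtype=torch.bool)
    mask[torch.arange(batch_size).unsqueeze(1), ids] = True
    return mask


def select_visible(tokens, mask):
    """'Shuffle' step: keep only visible tokens, (B, N, C) -> (B, N_vis, C)."""
    B, _, C = tokens.shape
    # Every row masks the same number of patches, so the reshape is valid.
    return tokens[~mask].reshape(B, -1, C)


def assemble_decoder_input(encoded_vis, mask, pos_embed, mask_token):
    """'Unshuffle' step: build the decoder input from visible and mask tokens.

    encoded_vis: (B, N_vis, C), pos_embed: (1, N, C), mask_token: (1, 1, C).
    """
    B, _, C = encoded_vis.shape
    pos = pos_embed.expand(B, -1, -1)
    pos_vis = pos[~mask].reshape(B, -1, C)   # positions of kept patches
    pos_mask = pos[mask].reshape(B, -1, C)   # positions of dropped patches
    # Assumption: visible tokens also receive their position embeddings here.
    return torch.cat([encoded_vis + pos_vis, mask_token + pos_mask], dim=1)


# Example with a 14x14 mask map (196 patches) and a toy embedding size.
B, N, C = 2, 14 * 14, 8
mask = random_mask(B, N, mask_ratio=0.75)
tokens = torch.randn(B, N, C)
visible = select_visible(tokens, mask)  # shape (2, 49, 8)
decoder_input = assemble_decoder_input(
    visible, mask, torch.zeros(1, N, C), torch.zeros(1, 1, C)
)  # shape (2, 196, 8)
```

Note that in this sketch the decoder input is ordered as visible tokens first and mask tokens second, not in the original patch order; the position embeddings carry the location information.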

TODO

implement the finetune process
reuse the model in modeling_pretrain.py
calculate the normalized pixels target
add the cls token in the encoder
...
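
For reference, the normalized pixels target mentioned in the list above could look something like the sketch below, following the paper's description of normalizing each patch by its own mean and standard deviation. This is not implemented in the repository; the helper names (patchify, normalized_pixel_target) and the eps value are assumptions for illustration.

```python
import torch


def patchify(images, patch_size=16):
    """Split images into flattened patches: (B, C, H, W) -> (B, N, p*p*C)."""
    B, C, H, W = images.shape
    h, w = H // patch_size, W // patch_size
    x = images.reshape(B, C, h, patch_size, w, patch_size)
    x = x.permute(0, 2, 4, 3, 5, 1)  # (B, h, w, p, p, C)
    return x.reshape(B, h * w, patch_size * patch_size * C)


def normalized_pixel_target(images, patch_size=16, eps=1e-6):
    """Normalize every patch by the mean and variance of its own pixels."""
    patches = patchify(images, patch_size)
    mean = patches.mean(dim=-1, keepdim=True)
    var = patches.var(dim=-1, keepdim=True)
    return (patches - mean) / (var + eps).sqrt()


# Example: 224x224 RGB images give 196 patches of 16*16*3 = 768 values each.
images = torch.randn(2, 3, 224, 224)
target = normalized_pixel_target(images)  # shape (2, 196, 768)
```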

Setup

pip install -r requirements.txt

Run

# Set the path to save checkpoints
OUTPUT_DIR='output/'
# path to imagenet-1k train set
DATA_PATH='../ImageNet_ILSVRC2012/train'


OMP_NUM_THREADS=1 python -m torch.distributed.launch --nproc_per_node=8 run_mae_pretraining.py \
        --data_path ${DATA_PATH} \
        --mask_ratio 0.75 \
        --model pretrain_mae_base_patch16_224 \
        --batch_size 128 \
        --opt_betas 0.9 0.95 \
        --warmup_epochs 40 \
        --epochs 1600 \
        --output_dir ${OUTPUT_DIR}

Note: pretraining results are not available yet and will be added later.

Owner

Zhiliang Peng
