Keras attention models, including botnet, CoaT, CoAtNet, CMT, cotnet, halonet, resnest, resnext, resnetd, volo, mlp-mixer, resmlp, gmlp, levit

Overview

Keras_cv_attention_models


Usage

Basic Usage

  • Currently working on: CMT and CoAtNet training.
  • Install as pip package:
    pip install -U keras-cv-attention-models
    # Or
    pip install -U git+https://github.com/leondgarse/keras_cv_attention_models
    Refer to each sub-directory for detailed usage.
  • Basic model prediction
    from keras_cv_attention_models import volo
    mm = volo.VOLO_d1(pretrained="imagenet")
    
    """ Run predict """
    import tensorflow as tf
    from tensorflow import keras
    from skimage.data import chelsea
    img = chelsea() # Chelsea the cat
    imm = keras.applications.imagenet_utils.preprocess_input(img, mode='torch')
    pred = mm(tf.expand_dims(tf.image.resize(imm, mm.input_shape[1:3]), 0)).numpy()
    pred = tf.nn.softmax(pred).numpy()  # If classifier activation is not softmax
    print(keras.applications.imagenet_utils.decode_predictions(pred)[0])
    # [('n02124075', 'Egyptian_cat', 0.9692954),
    #  ('n02123045', 'tabby', 0.020203391),
    #  ('n02123159', 'tiger_cat', 0.006867502),
    #  ('n02127052', 'lynx', 0.00017674894),
    #  ('n02123597', 'Siamese_cat', 4.9493494e-05)]
  • Exclude model top layers by setting num_classes=0
    from keras_cv_attention_models import resnest
    mm = resnest.ResNest50(num_classes=0)
    print(mm.output_shape)
    # (None, 7, 7, 2048)
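    The headless model then works as a feature extractor. Below is a minimal sketch of attaching a new classification head, using only the standard Keras functional API; the 10-class head is purely illustrative:
    from tensorflow import keras
    inputs = keras.layers.Input([224, 224, 3])
    features = mm(inputs)  # (None, 7, 7, 2048) feature map from the headless ResNest50
    nn = keras.layers.GlobalAveragePooling2D()(features)
    outputs = keras.layers.Dense(10, activation="softmax")(nn)  # hypothetical 10-class head
    new_model = keras.models.Model(inputs, outputs)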

Layers

  • attention_layers is an __init__.py only; it imports the core layers defined in the model architectures, like RelativePositionalEmbedding from botnet or outlook_attention from volo.
import tensorflow as tf
from keras_cv_attention_models import attention_layers
aa = attention_layers.RelativePositionalEmbedding()
print(f"{aa(tf.ones([1, 4, 14, 16, 256])).shape = }")
# aa(tf.ones([1, 4, 14, 16, 256])).shape = TensorShape([1, 4, 14, 16, 14, 16])
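Since attention_layers is only an __init__.py collecting these classes, the same layer can also be imported from the module that defines it. A small sketch, assuming botnet re-exports RelativePositionalEmbedding at module level:
import tensorflow as tf
from keras_cv_attention_models import botnet
bb = botnet.RelativePositionalEmbedding()
print(f"{bb(tf.ones([1, 4, 14, 16, 256])).shape = }")
# Expected to match the attention_layers result above: TensorShape([1, 4, 14, 16, 14, 16])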

Model surgery

  • model_surgery includes functions used to change model parameters after the model is built.
from tensorflow import keras
from keras_cv_attention_models import model_surgery
# Replace all ReLU with PReLU
mm = model_surgery.replace_ReLU(keras.applications.ResNet50(), target_activation='PReLU')
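Continuing the example above, a quick sanity check of the result, using only standard Keras introspection and assuming the replacement inserts keras.layers.PReLU instances (an assumption, not documented here):
print(sum(isinstance(ll, keras.layers.PReLU) for ll in mm.layers))  # count of PReLU activations after the surgery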

AotNet

  • Keras AotNet is just a ResNet / ResNetV2-like framework that exposes parameters like attn_types, se_ratio and others, which are used to apply different types of attention layers.
    # Mixing se, outlook, halo, mhsa and cot_attention, 21M parameters
    # 50 is just a picked number that is larger than the corresponding `num_block`
    from keras_cv_attention_models import aotnet
    attn_types = [None, "outlook", ["mhsa", "halo"] * 50, "cot"]
    se_ratio = [0.25, 0, 0, 0]
    mm = aotnet.AotNet50V2(attn_types=attn_types, se_ratio=se_ratio, deep_stem=True, strides=1)
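    A model mixed this way has no pretrained weights, so it is typically compiled and trained from scratch; a minimal sketch with hypothetical datasets:
    mm.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["acc"])
    print(f"{mm.count_params() = }")  # roughly the 21M parameters noted above
    # mm.fit(train_dataset, validation_data=val_dataset, epochs=10)  # train_dataset / val_dataset are hypothetical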

ResNetD

| Model | Params | Image resolution | Top1 Acc | Download |
| --- | --- | --- | --- | --- |
| ResNet50D | 25.58M | 224 | 80.530 | resnet50d.h5 |
| ResNet101D | 44.57M | 224 | 83.022 | resnet101d.h5 |
| ResNet152D | 60.21M | 224 | 83.680 | resnet152d.h5 |
| ResNet200D | 64.69M | 224 | 83.962 | resnet200d.h5 |

ResNeXt

| Model | Params | Image resolution | Top1 Acc | Download |
| --- | --- | --- | --- | --- |
| ResNeXt50 (32x4d) | 25M | 224 | 79.768 | resnext50_imagenet.h5 |
| - SWSL | 25M | 224 | 82.182 | resnext50_swsl.h5 |
| ResNeXt50D (32x4d + deep) | 25M | 224 | 79.676 | resnext50d_imagenet.h5 |
| ResNeXt101 (32x4d) | 42M | 224 | 80.334 | resnext101_imagenet.h5 |
| - SWSL | 42M | 224 | 83.230 | resnext101_swsl.h5 |
| ResNeXt101W (32x8d) | 89M | 224 | 79.308 | resnext101_imagenet.h5 |
| - SWSL | 89M | 224 | 84.284 | resnext101w_swsl.h5 |

ResNetQ

| Model | Params | Image resolution | Top1 Acc | Download |
| --- | --- | --- | --- | --- |
| ResNet51Q | 35.7M | 224 | 82.36 | resnet51q.h5 |

BotNet

| Model | Params | Image resolution | Top1 Acc | Download |
| --- | --- | --- | --- | --- |
| botnet50 | 21M | 224 | 77.604 | botnet50_imagenet.h5 |
| botnet101 | 41M | 224 | | |
| botnet152 | 56M | 224 | | |

VOLO

| Model | Params | Image resolution | Top1 Acc | Download |
| --- | --- | --- | --- | --- |
| volo_d1 | 27M | 224 | 84.2 | volo_d1_224.h5 |
| volo_d1 ↑384 | 27M | 384 | 85.2 | volo_d1_384.h5 |
| volo_d2 | 59M | 224 | 85.2 | volo_d2_224.h5 |
| volo_d2 ↑384 | 59M | 384 | 86.0 | volo_d2_384.h5 |
| volo_d3 | 86M | 224 | 85.4 | volo_d3_224.h5 |
| volo_d3 ↑448 | 86M | 448 | 86.3 | volo_d3_448.h5 |
| volo_d4 | 193M | 224 | 85.7 | volo_d4_224.h5 |
| volo_d4 ↑448 | 193M | 448 | 86.8 | volo_d4_448.h5 |
| volo_d5 | 296M | 224 | 86.1 | volo_d5_224.h5 |
| volo_d5 ↑448 | 296M | 448 | 87.0 | volo_d5_448.h5 |
| volo_d5 ↑512 | 296M | 512 | 87.1 | volo_d5_512.h5 |
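The ↑384 / ↑448 / ↑512 rows correspond to the same architectures at larger input resolutions. A hedged sketch of loading one of them, assuming the constructors accept an input_shape argument the same way other model constructors on this page do:
from keras_cv_attention_models import volo
mm = volo.VOLO_d2(input_shape=(384, 384, 3), pretrained="imagenet")
print(mm.input_shape)  # (None, 384, 384, 3)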

ResNeSt

| Model | Params | Image resolution | Top1 Acc | Download |
| --- | --- | --- | --- | --- |
| resnest50 | 28M | 224 | 81.03 | resnest50.h5 |
| resnest101 | 49M | 256 | 82.83 | resnest101.h5 |
| resnest200 | 71M | 320 | 83.84 | resnest200.h5 |
| resnest269 | 111M | 416 | 84.54 | resnest269.h5 |

HaloNet

| Model | Params | Image resolution | Top1 Acc |
| --- | --- | --- | --- |
| HaloNetH0 | 6.6M | 256 | 77.9 |
| HaloNetH1 | 9.1M | 256 | 79.9 |
| HaloNetH2 | 10.3M | 256 | 80.4 |
| HaloNetH3 | 12.5M | 320 | 81.9 |
| HaloNetH4 | 19.5M | 384 | 83.3 |
| - 21k | 19.5M | 384 | 85.5 |
| HaloNetH5 | 31.6M | 448 | 84.0 |
| HaloNetH6 | 44.3M | 512 | 84.4 |
| HaloNetH7 | 67.9M | 600 | 84.9 |

CoTNet

| Model | Params | Image resolution | FLOPs | Top1 Acc | Download |
| --- | --- | --- | --- | --- | --- |
| CoTNet-50 | 22.2M | 224 | 3.3 | 81.3 | cotnet50_224.h5 |
| CoTNeXt-50 | 30.1M | 224 | 4.3 | 82.1 | |
| SE-CoTNetD-50 | 23.1M | 224 | 4.1 | 81.6 | se_cotnetd50_224.h5 |
| CoTNet-101 | 38.3M | 224 | 6.1 | 82.8 | cotnet101_224.h5 |
| CoTNeXt-101 | 53.4M | 224 | 8.2 | 83.2 | |
| SE-CoTNetD-101 | 40.9M | 224 | 8.5 | 83.2 | se_cotnetd101_224.h5 |
| SE-CoTNetD-152 | 55.8M | 224 | 17.0 | 84.0 | se_cotnetd152_224.h5 |
| SE-CoTNetD-152 | 55.8M | 320 | 26.5 | 84.6 | se_cotnetd152_320.h5 |

CoAtNet

| Model | Params | Image resolution | Top1 Acc |
| --- | --- | --- | --- |
| CoAtNet-0 | 25M | 224 | 81.6 |
| CoAtNet-1 | 42M | 224 | 83.3 |
| CoAtNet-2 | 75M | 224 | 84.1 |
| CoAtNet-2, ImageNet-21k pretrain | 75M | 224 | 87.1 |
| CoAtNet-3 | 168M | 224 | 84.5 |
| CoAtNet-3, ImageNet-21k pretrain | 168M | 224 | 87.6 |
| CoAtNet-3, ImageNet-21k pretrain | 168M | 512 | 87.9 |
| CoAtNet-4, ImageNet-21k pretrain | 275M | 512 | 88.1 |
| CoAtNet-4, ImageNet-21K + PT-RA-E150 | 275M | 512 | 88.56 |

CMT

| Model | Params | Image resolution | Top1 Acc |
| --- | --- | --- | --- |
| CMTTiny | 9.5M | 160 | 79.2 |
| CMTXS | 15.2M | 192 | 81.8 |
| CMTSmall | 25.1M | 224 | 83.5 |
| CMTBig | 45.7M | 256 | 84.5 |

CoaT

| Model | Params | Image resolution | Top1 Acc | Download |
| --- | --- | --- | --- | --- |
| CoaTLiteTiny | 5.7M | 224 | 77.5 | coat_lite_tiny_imagenet.h5 |
| CoaTLiteMini | 11M | 224 | 79.1 | coat_lite_mini_imagenet.h5 |
| CoaTLiteSmall | 20M | 224 | 81.9 | coat_lite_small_imagenet.h5 |
| CoaTTiny | 5.5M | 224 | 78.3 | coat_tiny_imagenet.h5 |
| CoaTMini | 10M | 224 | 81.0 | coat_mini_imagenet.h5 |

MLP mixer

| Model | Params | Top1 Acc | ImageNet | Imagenet21k | ImageNet SAM |
| --- | --- | --- | --- | --- | --- |
| MLPMixerS32 | 19.1M | 68.70 | | | |
| MLPMixerS16 | 18.5M | 73.83 | | | |
| MLPMixerB32 | 60.3M | 75.53 | | | b32_imagenet_sam.h5 |
| MLPMixerB16 | 59.9M | 80.00 | b16_imagenet.h5 | b16_imagenet21k.h5 | b16_imagenet_sam.h5 |
| MLPMixerL32 | 206.9M | 80.67 | | | |
| MLPMixerL16 | 208.2M | 84.82 | l16_imagenet.h5 | l16_imagenet21k.h5 | |
| - input 448 | 208.2M | 86.78 | | | |
| MLPMixerH14 | 432.3M | 86.32 | | | |
| - input 448 | 432.3M | 87.94 | | | |

ResMLP

| Model | Params | Image resolution | Top1 Acc | ImageNet |
| --- | --- | --- | --- | --- |
| ResMLP12 | 15M | 224 | 77.8 | resmlp12_imagenet.h5 |
| ResMLP24 | 30M | 224 | 80.8 | resmlp24_imagenet.h5 |
| ResMLP36 | 116M | 224 | 81.1 | resmlp36_imagenet.h5 |
| ResMLP_B24 | 129M | 224 | 83.6 | resmlp_b24_imagenet.h5 |
| - imagenet22k | 129M | 224 | 84.4 | resmlp_b24_imagenet22k.h5 |

GMLP

| Model | Params | Image resolution | Top1 Acc | ImageNet |
| --- | --- | --- | --- | --- |
| GMLPTiny16 | 6M | 224 | 72.3 | |
| GMLPS16 | 20M | 224 | 79.6 | gmlp_s16_imagenet.h5 |
| GMLPB16 | 73M | 224 | 81.6 | |

LeViT

| Model | Params | Image resolution | Top1 Acc | ImageNet |
| --- | --- | --- | --- | --- |
| LeViT128S | 7.8M | 224 | 76.6 | levit128s_imagenet.h5 |
| LeViT128 | 9.2M | 224 | 78.6 | levit128_imagenet.h5 |
| LeViT192 | 11M | 224 | 80.0 | levit192_imagenet.h5 |
| LeViT256 | 19M | 224 | 81.6 | levit256_imagenet.h5 |
| LeViT384 | 39M | 224 | 82.6 | levit384_imagenet.h5 |

Other implemented keras models


Comments
  • TPU support for VOLO


    While trying VOLO with TPU I'm getting this error; any idea how to resolve it?

    InvalidArgumentError: 9 root error(s) found.
      (0) Invalid argument: {{function_node __inference_train_function_137027}} Compilation failure: Detected unsupported operations when trying to compile graph cluster_train_function_5876961707884240013[] on XLA_TPU_JIT: ExtractImagePatches (No registered 'ExtractImagePatches' OpKernel for XLA_TPU_JIT devices compatible with node {{node gradient_tape/model/unfold_matmul_fold_3/ExtractImagePatches}}
    	 (OpKernel was found, but attributes didn't match) Requested Attributes: T=DT_INT64, _xla_inferred_shapes=[[1,?,?,9]], ksizes=[1, 3, 3, 1], padding="VALID", rates=[1, 1, 1, 1], strides=[1, 2, 2, 1], _device="/device:TPU_REPLICATED_CORE"){{node gradient_tape/model/unfold_matmul_fold_3/ExtractImagePatches}}One approach is to outside compile the unsupported ops to run on CPUs by enabling soft placement `tf.config.set_soft_device_placement(True)`. This has a potential performance penalty.
    	TPU compilation failed
    	 [[tpu_compile_succeeded_assert/_17543318848583046929/_5]]
    	 [[tpu_compile_succeeded_assert/_17543318848583046929/_5/_127]]
      (1) Invalid argument: {{function_node __inference_train_function_137027}} Compilation failure: Detected unsupported operations when trying to compile graph cluster_train_function_5876961707884240013[] on XLA_TPU_JIT: ExtractImagePatches (No registered 'ExtractImagePatches' OpKernel for XLA_TPU_JIT devices compatible with node {{node gradient_tape/model/unfold_matmul_fold_3/ExtractImagePatches}}
    	 (OpKernel was found, but attributes didn't match) Requested Attributes: T=DT_INT64, _xla_inferred_shapes=[[1,?,?,9]], ksizes=[1, 3, 3, 1], padding="VALID", rates=[1, 1, 1, 1], strides=[1, 2, 2, 1], _device="/device:TPU_REPLICATED_CORE"){{node gradient_tape/model/unfold_matmul_fold_3/ExtractImagePatches}}One approach is to outside compile the unsupported ops to run on CPUs by enabling soft placement `tf.config.set_soft_device_placement(True)`. This has a potential performance penalty.
    	TPU compilation failed
    	 [[tpu_compile_succeeded_assert/_17543318848583046929/_5]]
    	 [[tpu_compile_succeeded_assert/_17543318848583046929/_5/_103]]
      (2) Invalid argument: {{function_node __inference_train_function_137027}} Compilation failure: Detected unsupported operations when trying to compile graph cluster_train_function_5876961707884240013[] on XLA_TPU_JIT: ExtractImagePatches (No registered 'ExtractImagePatches' OpKernel for XLA_TPU_JIT devices compatible with node {{node gradient_tape/model/unfold_matmul_fold_3/ExtractImagePatches}}
    	 (OpKernel was found, but attributes didn't match) Requested Attributes: T=DT_INT64, _xla_inferred_shapes=[[1,?,?,9]], ksizes=[1, 3, 3, 1], padding="VALID", rates=[1, 1, 1, 1], strides=[1, 2, 2, 1], _device="/device:TPU_REPLICATED_CORE"){{node gradient_tape/model/unfold_matmul_fold_3/ExtractImagePatches}}One approach is to outside compile the unsupported ops to run on CPUs by enabling soft placement `tf.config.set_soft_device_placement(True)`. This has a potential performance penalty.
    	TPU compilation failed
    	 [[tpu_compile_succeeded_assert/_17543318848583046929 ... [truncated]
    
    enhancement 
    opened by awsaf49 14
  • Use YoloR with swin transformer as backbone.


    @leondgarse I am trying to run inference using yolor with a swin backbone but am getting the following results. What could be the issue?

    from keras_cv_attention_models import yolor
    from keras_cv_attention_models import swin_transformer_v2

    bb = swin_transformer_v2.SwinTransformerV2Small_window16(input_shape=(256, 256, 3), num_classes=1000)
    model = yolor.YOLOR(backbone=bb)

    from keras_cv_attention_models import test_images
    imm = test_images.dog_cat()
    preds = model(model.preprocess_input(imm))
    bboxs, labels, confidences = model.decode_predictions(preds)[0]

    from keras_cv_attention_models.coco import data
    data.show_image_with_bboxes(imm, bboxs, labels, confidences)
    

    (resulting output image attached)

    opened by farazBhatti 10
  • MobileViT


    Tried to run the MobileViT_S model with input shape (256, 256, 3) and got the following error:

    UnimplementedError Traceback (most recent call last) in () 2 3 history = model.fit(get_training_dataset_with_oversample(repeat_dataset=True, oversample=True), steps_per_epoch=STEPS_PER_EPOCH, epochs=EPOCHS, ----> 4 validation_data=get_validation_dataset(), validation_steps=VALIDATION_STEPS) 5

    1 frames /usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/ops.py in _numpy(self) 1189 return self._numpy_internal() 1190 except core._NotOkStatusException as e: # pylint: disable=protected-access -> 1191 raise core._status_to_exception(e) from None # pylint: disable=protected-access 1192 1193 @property

    UnimplementedError: 9 root error(s) found. (0) UNIMPLEMENTED: {{function_node __inference_train_function_1032011}} Dynamic input dimension to reshape that is both splitted and combined is not supported %dynamic-reshape.13585 = f32[<=32,16,4,2304]{3,2,1,0} dynamic-reshape(f32[<=1024,2,16,144]{3,1,2,0} %transpose.13551, s32[] %divide.13584, s32[] %reshape.13571, s32[] %reshape.13574, s32[] %reshape.13577), metadata={op_type="Reshape" op_name="while/body/_1/while/mobilevit_s/tf.reshape_1/Reshape"} [[{{function_node while_body_1010992}}{{node while/TPUReplicateMetadata}}]] (1) UNIMPLEMENTED: {{function_node __inference_train_function_1032011}} Dynamic input dimension to reshape that is both splitted and combined is not supported %dynamic-reshape.13585 = f32[<=32,16,4,2304]{3,2,1,0} dynamic-reshape(f32[<=1024,2,16,144]{3,1,2,0} %transpose.13551, s32[] %divide.13584, s32[] %reshape.13571, s32[] %reshape.13574, s32[] %reshape.13577), metadata={op_type="Reshape" op_name="while/body/_1/while/mobilevit_s/tf.reshape_1/Reshape"} [[{{function_node while_body_1010992}}{{node while/TPUReplicateMetadata}}]] [[while/body/_1/while/strided_slice_35/_445]] (2) UNIMPLEMENTED: {{function_node __inference_train_function_1032011}} Dynamic input dimension to reshape that is both splitted and combined is not supported %dynamic-reshape.13585 = f32[<=32,16,4,2304]{3,2,1,0} dynamic-reshape(f32[<=1024,2,16,144]{3,1,2,0} %transpose.13551, s32[] %divide.13584, s32[] %reshape.13571, s32[] %reshape.13574, s32[] %reshape.13577), metadata={op_type="Reshape" op_name="while/body/_1/while/mobilevit_s/tf.reshape_1/Reshape"} [[{{function_node while_body_1010992}}{{node while/TPUReplicateMetadata}}]] [[while/body/_1/while/strided_slice_23/_381]] (3) UNIMPLEMENTED: {{function_node __inference_train_function_1032011}} Dynamic input dimension to reshape that is both splitted and combined is not supported %dynamic-reshape.13585 = f32[<=32,16,4,2304]{3,2,1,0} dynamic-reshape(f32[<=1024,2,16,144]{3,1,2,0} %transpose.13551, s32[] %divide.13584, s32[] %reshape.13571, s32[] %reshape.13574, s32[] %reshape.13577), metadata={op_type="Reshape" op_name="while/body/_1/while/mobilevit_s/tf.reshape_1/Reshape"} [[{{function_node while_body_1010992}}{{node while/TPUReplicateMetadata}}]] [[while/body/_1/while/Pad_8/_407]] (4) UNIMPLEMENTED: {{function_node __inference_train_function_1032011}} Dynamic input dimension to reshape that is both splitted and combined is not supported %dynamic-reshape.13585 = f32[<=32,16,4,2304]{3,2,1,0} dynamic-reshape(f32[<=1024,2,16,144]{3,1,2,0} %transpose.13551, s32[] %divide.13584, s32[] %reshape.13571, s32[] %reshape.13574, s32[] %reshape.13577), metadata={op_type="Reshape" op_name="while/body/_1/while/mobilevit_s/tf.reshape_1/Reshape"} [[{{function_node while_body_1010992}}{{node while/TPUReplicateMetadata}}]] [[while/body/_1/while/Maximum_2/y/_341]] (5) UNIMPLEMENTED: {{function_node __inference_train_function_1032011}} Dynamic input dimension to reshape that is both splitted and combined is not supported %dynamic-reshape.13585 = f3 ... [truncated]

    bug good first issue 
    opened by KyloRen1 10
  • [General Questions] Rough estimates for training time for pre-training CoAtNet?


    Hi, 👋 Thanks for such an amazing library and for taking the time to implement so many parts of the CoAtNet paper!

    In your CoAtNet README, you mentioned you use TPU accelerators. Could you provide a ballpark for the amount of time it took for you to train the biggest models and the corresponding accelerators? I have a task for which I wish to use scaled-up models, but I'd have to pre-train on Imagenet first because of low data amount (<5-10M) and squeeze out maximum accuracy from fine-tuning.

    I assume there might've been a few bottlenecks also, perhaps data? 🤔 If you could describe your setup, it would be very helpful to my experiments!

    Sorry for bothering you with minor questions, and again thank you for all your work!

    opened by neel04 9
  • Visualize saliency map with the attention models


    It would be great if some functional code could be included for plotting attention maps using the attention models. Such functionality has been provided for the vision transformer models at https://github.com/faustomorales/vit-keras. Thanks, and looking forward to it.

    enhancement good first issue 
    opened by sivaramakrishnan-rajaraman 9
  • How to save models ?


    @leondgarse I want to save the models in saved_model format. How do I do that? When I attempt it, it shows me the error

    WARNING:tensorflow:Compiled the loaded model, but the compiled metrics have yet to be built. `model.compile_metrics` will be empty until you train or evaluate the model.
    

    What can be the solution for this?

    Code:

    import os
    from keras_cv_attention_models import mobilevit
    pretrained = '/content/mobilevit_xxs_imagenet.h5'
    model = mobilevit.MobileViT_XXS(pretrained=pretrained)
    model.save('mobilevit_xxs_imagenet1k')
    
    opened by sayannath 7
  • The order of height and width seems wrong in `tf.meshgrid(range(height), range(width))`


    In line 44 of beit.py, you use tf.meshgrid(range(height), range(width)), but it should be tf.meshgrid(range(width), range(height)), shouldn't it?

    When I ran the code from line 44 to line 52 with height=3 and width=4, it gave this output:

    [[17 16 15 10  9  8  3  2  1 -4 -5 -6]
     [18 17 16 11 10  9  4  3  2 -3 -4 -5]
     [19 18 17 12 11 10  5  4  3 -2 -3 -4]
     [24 23 22 17 16 15 10  9  8  3  2  1]
     [25 24 23 18 17 16 11 10  9  4  3  2]
     [26 25 24 19 18 17 12 11 10  5  4  3]
     [31 30 29 24 23 22 17 16 15 10  9  8]
     [32 31 30 25 24 23 18 17 16 11 10  9]
     [33 32 31 26 25 24 19 18 17 12 11 10]
     [38 37 36 31 30 29 24 23 22 17 16 15]
     [39 38 37 32 31 30 25 24 23 18 17 16]
     [40 39 38 33 32 31 26 25 24 19 18 17]], shape=(12, 12), dtype=int32)
    

    which seems incorrect.

    Of course, this is not a problem if you assume height == width, but I think tf.meshgrid(range(width), range(height)) is more readable and can potentially prevent bugs if height != width is supported in the future.

    bug enhancement 
    opened by xskxzr 6
  • Training of YoloXS Model on Coco dataset


    Hi, I am currently reproducing the COCO training on the YoloXS model with the line below:

    python leondgarse/coco_train_script.py --det_header yolox.YOLOXS --data_name coco/2014 --batch_size 16

    After training for 30 epochs, I am getting poor results, as shown below:

    # Show result
    from keras_cv_attention_models.coco import data
    data.show_image_with_bboxes(imm, bboxs, labels, confidences, num_classes=80)
    

    (attached result image)

    Have I configured anything wrongly? Or is there any suggestion on what I could change? Thanks!

    opened by ThePaperFish 5
  • Update for EdgeNeXt


    I reproduced EdgeNeXt based on torch and your project. Is there any mistake in this code? Why can't it show all layer details? It looks like some layers are missing in `summary()`.

    import tensorflow as tf
    from tensorflow import keras
    from keras_cv_attention_models.common_layers import (
        layer_norm, activation_by_name
    )
    from tensorflow.keras import initializers
    from keras_cv_attention_models.attention_layers import (
        conv2d_no_bias,
        drop_block,
    )
    import math
    
    BATCH_NORM_DECAY = 0.9
    BATCH_NORM_EPSILON = 1e-5
    TF_BATCH_NORM_EPSILON = 0.001
    LAYER_NORM_EPSILON = 1e-5
    
    
    @tf.keras.utils.register_keras_serializable(package="EdgeNeXt")
    class PositionalEncodingFourier(keras.layers.Layer):
        def __init__(self, hidden_dim=32, dim=768, temperature=10000):
            super(PositionalEncodingFourier, self).__init__()
            self.token_projection = tf.keras.layers.Conv2D(dim, kernel_size=1)
            self.scale = 2 * math.pi
            self.temperature = temperature
            self.hidden_dim = hidden_dim
            self.dim = dim
            self.eps = 1e-6
    
        def __call__(self, B, H, W, *args, **kwargs):
            mask_tf = tf.zeros([B, H, W])
            not_mask_tf = 1 - mask_tf
            y_embed_tf = tf.cumsum(not_mask_tf, axis=1)
            x_embed_tf = tf.cumsum(not_mask_tf, axis=2)
            y_embed_tf = y_embed_tf / (y_embed_tf[:, -1:, :] + self.eps) * self.scale  # 2 * math.pi
            x_embed_tf = x_embed_tf / (x_embed_tf[:, :, -1:] + self.eps) * self.scale  # 2 * math.pi
            dim_t_tf = tf.range(self.hidden_dim, dtype=tf.float32)
            dim_t_tf = self.temperature ** (2 * (dim_t_tf // 2) / self.hidden_dim)
            pos_x_tf = x_embed_tf[:, :, :, None] / dim_t_tf
            pos_y_tf = y_embed_tf[:, :, :, None] / dim_t_tf
            pos_x_tf = tf.reshape(tf.stack([tf.math.sin(pos_x_tf[:, :, :, 0::2]),
                                            tf.math.cos(pos_x_tf[:, :, :, 1::2])], axis=4),
                                  shape=[B, H, W, self.hidden_dim])
            pos_y_tf = tf.reshape(tf.stack([tf.math.sin(pos_y_tf[:, :, :, 0::2]),
                                            tf.math.cos(pos_y_tf[:, :, :, 1::2])], axis=4),
                                  shape=[B, H, W, self.hidden_dim])
            pos_tf = tf.concat([pos_y_tf, pos_x_tf], axis=-1)
            pos_tf = self.token_projection(pos_tf)
    
            return pos_tf
    
        def get_config(self):
            base_config = super().get_config()
            base_config.update({"token_projection": self.token_projection, "scale": self.scale,
                                "temperature": self.temperature, "hidden_dim": self.hidden_dim,
                                "dim": self.dim, "eps": self.eps})
            return base_config
    
    
    def EdgeNeXt(input_shape=(256, 256, 3), depths=[3, 3, 9, 3], dims=[24, 48, 88, 168],
                 global_block=[0, 0, 0, 3], global_block_type=['None', 'None', 'None', 'SDTA'],
                 drop_path_rate=1, layer_scale_init_value=1e-6, head_init_scale=1., expan_ratio=4,
                 kernel_sizes=[7, 7, 7, 7], heads=[8, 8, 8, 8], use_pos_embd_xca=[False, False, False, False],
                 use_pos_embd_global=False, d2_scales=[2, 3, 4, 5], epsilon=1e-6, model_name='EdgeNeXt'):
        inputs = keras.layers.Input(input_shape, batch_size=2)
    
        nn = conv2d_no_bias(inputs, dims[0], kernel_size=4, strides=4, padding="valid", name="stem_")
        nn = layer_norm(nn, epsilon=epsilon, name='stem_')
    
        drop_connect_rates = tf.linspace(0, stop=drop_path_rate, num=int(
            sum(depths)))  # drop_connect_rates_split(num_blocks, start=0.0, end=drop_connect_rate)
        cur = 0
        for i in range(4):
            for j in range(depths[i]):
                if j > depths[i] - global_block[i] - 1:
                    if global_block_type[i] == 'SDTA':
                        SDTA_encoder(dim=dims[i], drop_path=drop_connect_rates[cur + j],
                                     expan_ratio=expan_ratio, scales=d2_scales[i],
                                     use_pos_emb=use_pos_embd_xca[i], num_heads=heads[i], name='stage_'+str(i)+'_SDTA_encoder_'+str(j))(nn)
                    else:
                        raise NotImplementedError
                else:
                    if i != 0 and j == 0:
                        nn = layer_norm(nn, epsilon=epsilon, name='stage_' + str(i) + '_')
                        nn = conv2d_no_bias(nn, dims[i], kernel_size=2, strides=2, padding="valid",
                                            name='stage_' + str(i) + '_')
    
                    Conv_Encoder(dim=dims[i], drop_path=drop_connect_rates[cur + j],
                                 layer_scale_init_value=layer_scale_init_value,
                                 expan_ratio=expan_ratio, kernel_size=kernel_sizes[i], name='stage_'+str(i)+'_Conv_Encoder_'+str(j) + '_')(nn)  # drop_connect_rates[cur + j]
    
        model = keras.models.Model(inputs, nn, name=model_name)
        return model
    
    
    @tf.keras.utils.register_keras_serializable(package="EdgeNeXt")
    class Conv_Encoder(keras.layers.Layer):
        def __init__(self, dim, drop_path=0., layer_scale_init_value=1e-6, expan_ratio=4, kernel_size=7, epsilon=1e-6,
                     name=''):
    
            super(Conv_Encoder, self).__init__()
            self.encoder_name = name
            self.gamma = tf.Variable(layer_scale_init_value * tf.ones(dim), trainable=True,
                                     name=name + 'gamma') if layer_scale_init_value > 0 else None
            self.drop_path = drop_path
            self.dim = dim
            self.expan_ratio = expan_ratio
            self.kernel_size = kernel_size
            self.epsilon = epsilon
    
        def __call__(self, x, *args, **kwargs):
            inputs = x
            x = keras.layers.Conv2D(self.dim, kernel_size=self.kernel_size, padding="SAME", name=self.encoder_name +'Conv2D')(x)
            x = layer_norm(x, epsilon=self.epsilon, name=self.encoder_name)
            x = keras.layers.Dense(self.expan_ratio * self.dim)(x)
            x = activation_by_name(x, activation="gelu")
            x = keras.layers.Dense(self.dim)(x)
            if self.gamma is not None:
                x = self.gamma * x
    
            x = inputs + drop_block(x, drop_rate=0.)
    
            return x
    
        def get_config(self):
            base_config = super().get_config()
            base_config.update({"gamma": self.gamma, "drop_path": self.drop_path,
                                "dim": self.dim, "expan_ratio": self.expan_ratio,
                                "kernel_size": self.kernel_size})
            return base_config
    
    
    @tf.keras.utils.register_keras_serializable(package="EdgeNeXt")
    class SDTA_encoder(keras.layers.Layer):
        def __init__(self, dim, drop_path=0., layer_scale_init_value=1e-6, expan_ratio=4,
                     use_pos_emb=True, num_heads=8, qkv_bias=True, attn_drop=0., drop=0., scales=1, zero_gamma=False,
                     activation='gelu', use_bias=False, name='sdf'):
            super(SDTA_encoder, self).__init__()
            self.expan_ratio = expan_ratio
            self.width = max(int(math.ceil(dim / scales)), int(math.floor(dim // scales)))
            self.width_list = [self.width] * (scales - 1)
            self.width_list.append(dim - self.width * (scales - 1))
            self.dim = dim
            self.scales = scales
            if scales == 1:
                self.nums = 1
            else:
                self.nums = scales - 1
            self.pos_embd = None
            if use_pos_emb:
                self.pos_embd = PositionalEncodingFourier(dim=dim)
            self.xca = XCA(dim, num_heads=num_heads, qkv_bias=qkv_bias, attn_drop=attn_drop, proj_drop=drop)
            self.gamma_xca = tf.Variable(layer_scale_init_value * tf.ones(dim), trainable=True,
                                         name=name + 'gamma') if layer_scale_init_value > 0 else None
            self.gamma = tf.Variable(layer_scale_init_value * tf.ones(dim), trainable=True,
                                     name=name + 'gamma') if layer_scale_init_value > 0 else None
            self.drop_rate = drop_path
            self.drop_path = keras.layers.Dropout(drop_path)
            gamma_initializer = tf.zeros_initializer() if zero_gamma else tf.ones_initializer()
            self.norm = keras.layers.LayerNormalization(epsilon=LAYER_NORM_EPSILON, gamma_initializer=gamma_initializer,
                                                        name=name and name + "ln")
            self.norm_xca = keras.layers.LayerNormalization(epsilon=LAYER_NORM_EPSILON, gamma_initializer=gamma_initializer,
                                                            name=name and name + "norm_xca")
            self.activation = activation
            self.use_bias = use_bias
    
        def get_config(self):
            base_config = super().get_config()
            base_config.update({"width": self.width, "dim": self.dim,
                                "nums": self.nums, "pos_embd": self.pos_embd,
                                "xca": self.xca, "gamma_xca": self.gamma_xca,
                                "gamma": self.gamma, "norm": self.norm,
                                "activation": self.activation, "use_bias": self.use_bias,
                                })
            return base_config
    
        def __call__(self, inputs, *args, **kwargs):
            x = inputs
            spx = tf.split(inputs, self.width_list, axis=-1)
            for i in range(self.nums):
                if i == 0:
                    sp = spx[i]
                else:
                    sp = sp + spx[i]
                sp = keras.layers.Conv2D(self.width, kernel_size=3, padding='SAME')(sp)  # , groups=self.width
                if i == 0:
                    out = sp
                else:
                    out = tf.concat([out, sp], -1)
            inputs = tf.concat([out, spx[self.nums]], -1)
    
            # XCA
            B, H, W, C = inputs.shape
            inputs = tf.reshape(inputs, (-1, H * W, C))  # tf.transpose(), perm=[0, 2, 1])
    
            if self.pos_embd:
                pos_encoding = tf.reshape(self.pos_embd(B, H, W), (-1, H * W, C))
                inputs += pos_encoding
    
            if self.gamma_xca is not None:
                inputs = self.gamma_xca * inputs
            input_xca = self.gamma_xca * self.xca(self.norm_xca(inputs))
            inputs = inputs + drop_block(input_xca, drop_rate=self.drop_rate, name="SDTA_encoder_")
            inputs = tf.reshape(inputs, (-1, H, W, C))
    
            # Inverted Bottleneck
            inputs = self.norm(inputs)
            inputs = keras.layers.Conv2D(self.expan_ratio * self.dim, kernel_size=1, use_bias=self.use_bias)(inputs)
            inputs = activation_by_name(inputs, activation=self.activation)
            inputs = keras.layers.Conv2D(self.dim, kernel_size=1, use_bias=self.use_bias)(inputs)
            if self.gamma is not None:
                inputs = self.gamma * inputs
    
            x = x + self.drop_path(inputs)
            return x
    
    
    @tf.keras.utils.register_keras_serializable(package="EdgeNeXt")
    class XCA(keras.layers.Layer):
        def __init__(self, dim, num_heads=8, qkv_bias=False, attn_drop=0., proj_drop=0., name=""):
            super(XCA, self).__init__()
            self.num_heads = num_heads
            self.temperature = tf.Variable(tf.ones(num_heads, 1, 1), trainable=True, name=name + 'gamma')
    
            self.qkv = keras.layers.Dense(dim * 3, use_bias=qkv_bias)
            self.attn_drop = keras.layers.Dropout(attn_drop)
            self.k_ini = initializers.GlorotUniform()
            self.b_ini = initializers.Zeros()
            self.proj = keras.layers.Dense(dim, name="out",
                                           kernel_initializer=self.k_ini, bias_initializer=self.b_ini)
            self.proj_drop = keras.layers.Dropout(proj_drop)
    
        def __call__(self, inputs, training=None, *args, **kwargs):
            input_shape = inputs.shape
            qkv = self.qkv(inputs)
            qkv = tf.reshape(qkv, (input_shape[0], input_shape[1], 3,
                                   self.num_heads,
                                   input_shape[2] // self.num_heads))  # [batch, hh * ww, 3, num_heads, dims_per_head]
            qkv = tf.transpose(qkv, perm=[2, 0, 3, 4, 1])  # [3, batch, num_heads, dims_per_head, hh * ww]
            query, key, value = tf.split(qkv, 3, axis=0)  # [batch, num_heads, dims_per_head, hh * ww]
    
            norm_query, norm_key = tf.nn.l2_normalize(tf.squeeze(query), axis=-1, epsilon=1e-6), \
                                   tf.nn.l2_normalize(tf.squeeze(key), axis=-1, epsilon=1e-6)
            attn = tf.matmul(norm_query, norm_key, transpose_b=True)
            attn = tf.transpose(tf.transpose(attn, perm=[0, 2, 3, 1]) * self.temperature, perm=[0, 3, 2, 1])
    
            attn = tf.nn.softmax(attn, axis=-1)
            attn = self.attn_drop(attn, training=training)  # [batch, num_heads, hh * ww, hh * ww]
    
            x = tf.matmul(attn, value)  # [batch, num_heads, hh * ww, dims_per_head]
            x = tf.reshape(x, [input_shape[0], input_shape[1], input_shape[2]])
    
            x = self.proj(x)
            x = self.proj_drop(x)
    
            return x
    
        def get_config(self):
            base_config = super().get_config()
            base_config.update({"num_heads": self.num_heads, "temperature": self.temperature,
                                "qkv": self.qkv, "attn_drop": self.attn_drop,
                                "proj": self.proj, "proj_drop": self.proj_drop})
            return base_config
    
    
    def edgenext_xx_small(pretrained=False, **kwargs):
        # 1.33M & 260.58M @ 256 resolution
        # 71.23% Top-1 accuracy
        # No AA, Color Jitter=0.4, No Mixup & Cutmix, DropPath=0.0, BS=4096, lr=0.006, multi-scale-sampler
        # Jetson FPS=51.66 versus 47.67 for MobileViT_XXS
        # For A100: FPS @ BS=1: 212.13 & @ BS=256: 7042.06 versus FPS @ BS=1: 96.68 & @ BS=256: 4624.71 for MobileViT_XXS
        model = EdgeNeXt(depths=[2, 2, 6, 2], dims=[24, 48, 88, 168], expan_ratio=4,
                         global_block=[0, 1, 1, 1],
                         global_block_type=['None', 'SDTA', 'SDTA', 'SDTA'],
                         use_pos_embd_xca=[False, True, False, False],
                         kernel_sizes=[3, 5, 7, 9],
                         heads=[4, 4, 4, 4],
                         d2_scales=[2, 2, 3, 4],
                         **kwargs)
    
        return model
    
    
    def edgenext_x_small(pretrained=False, **kwargs):
        # 2.34M & 538.0M @ 256 resolution
        # 75.00% Top-1 accuracy
        # No AA, No Mixup & Cutmix, DropPath=0.0, BS=4096, lr=0.006, multi-scale-sampler
        # Jetson FPS=31.61 versus 28.49 for MobileViT_XS
        # For A100: FPS @ BS=1: 179.55 & @ BS=256: 4404.95 versus FPS @ BS=1: 94.55 & @ BS=256: 2361.53 for MobileViT_XS
        model = EdgeNeXt(depths=[3, 3, 9, 3], dims=[32, 64, 100, 192], expan_ratio=4,
                         global_block=[0, 1, 1, 1],
                         global_block_type=['None', 'SDTA', 'SDTA', 'SDTA'],
                         use_pos_embd_xca=[False, True, False, False],
                         kernel_sizes=[3, 5, 7, 9],
                         heads=[4, 4, 4, 4],
                         d2_scales=[2, 2, 3, 4],
                         **kwargs)
    
        return model
    
    
    def edgenext_small(pretrained=False, **kwargs):
        # 5.59M & 1260.59M @ 256 resolution
        # 79.43% Top-1 accuracy
        # AA=True, No Mixup & Cutmix, DropPath=0.1, BS=4096, lr=0.006, multi-scale-sampler
        # Jetson FPS=20.47 versus 18.86 for MobileViT_S
        # For A100: FPS @ BS=1: 172.33 & @ BS=256: 3010.25 versus FPS @ BS=1: 93.84 & @ BS=256: 1785.92 for MobileViT_S
        model = EdgeNeXt(depths=[3, 3, 9, 3], dims=[48, 96, 160, 304], expan_ratio=4,
                         global_block=[0, 1, 1, 1],
                         global_block_type=['None', 'SDTA', 'SDTA', 'SDTA'],
                         use_pos_embd_xca=[False, True, False, False],
                         kernel_sizes=[3, 5, 7, 9],
                         d2_scales=[2, 2, 3, 4],
                         **kwargs)
    
        return model
    
    
    if __name__ == '__main__':
        model = edgenext_small()
        model.summary()
        # from download_and_load import keras_reload_from_torch_model
        # keras_reload_from_torch_model(
        #     'D:\GitHub\EdgeNeXt\edgenext_small.pth',
        #     keras_model=model,
        #     # tail_align_dict=tail_align_dict,
        #     # full_name_align_dict=full_name_align_dict,
        #     # additional_transfer=additional_transfer,
        #     input_shape=(256, 256),
        #     do_convert=True,
        #     save_name="adaface_ir101_webface4m.h5",
        # )
    
    
    
    opened by whalefa1I 5
  • custom layer issue at tflite conversion


    Hi, thanks for the good references.

    I have implemented MobileViT with your package and tried to convert the trained model into tflite format. There, I met an error saying:

    Unknown layer: Addons>GroupNormalization. Please ensure this object is passed to the `custom_objects` argument. See https://www.tensorflow.org/guide/keras/save_and_serialize#registering_the_custom_object for details
    

    I tried to add the custom layer name as a parameter when loading the model, but I am still facing the issue.

    model = tf.keras.models.load_model('./checkpoints/model_best.h5', custom_objects={'AttentionLayer': AttentionLayer})
    

    Is there any way to solve this?

    Thanks,

    bug good first issue 
    opened by mhyeonsoo 4
  • coat.CoaTMini(input_shape=(200, 240, 1) ...) error: Dimensions must be equal, but are 730 and 677 for ...


    Hi,

    I tried to train a model with:

        model = coat.CoaTMini(input_shape=(200, 240, 1), num_classes=240, pretrained=None)
    

    but the model cannot be built; it errors out with:

    ValueError: Exception encountered when calling layer "tf.__operators__.add" (type TFOpLambda).
    
    Dimensions must be equal, but are 730 and 677 for '{{node tf.__operators__.add/AddV2}} = AddV2[T=DT_FLOAT](Placeholder, Placeholder_1)' with input shapes: [?,730,216], [?,677,216].
    
    Call arguments received:
      • x=tf.Tensor(shape=(None, 730, 216), dtype=float32)
      • y=tf.Tensor(shape=(None, 677, 216), dtype=float32)
      • name=None
    

    I just wonder, is something wrong with coat?

    Thanks.

    bug enhancement 
    opened by mw66 4
  • Can you provide the code for converting pytorch weights to tf?


    Hi. Can you provide the code for converting pytorch weights to tf, such as for beit? I wanted to try the effect of beitv2's pre-training weights. Thanks!

    opened by 131404060321 1
  • tflite conversion - GPU/XNNPACK fails


    Hi! Thanks for the great repo! I have converted the EfficientFormer model to tflite. However, applying either the XNNPACK or the GPU delegate fails.

    GPU delegate created.
    INFO: Initialized TensorFlow Lite runtime.
    INFO: Created TensorFlow Lite delegate for GPU.
    Failed to apply GPU delegate. Benchmarking failed.

    XNNPACK delegate created.
    INFO: Initialized TensorFlow Lite runtime.
    INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
    Failed to apply XNNPACK delegate. Benchmarking failed.

    Do you know what could be the issue? I'm using the latest tensorflow version for the conversion.

    opened by macsmy 3