ConvNeXt
2022-04-21 09:52:00 【Relearn CS】
![ConvNeXt overview figure](/img/3d/672253d56404f3d6ea148adfd01918.png)
Note: the notes below are a simplified summary based on the original paper and this blog post.

Macro Design
- Changing stage ratio: in ResNet, conv4_x (stage3) usually stacks the most blocks; in ResNet50 the number of blocks per stage is (3, 4, 6, 3), a ratio of roughly 1:1:2:1. Swin Transformer gives stage3 a higher proportion, so the authors adjust the stacking numbers from (3, 4, 6, 3) to (3, 3, 9, 3), which keeps FLOPs similar to Swin-T. After the adjustment, accuracy rises from 78.8% to 79.4%.
- Changing stem to "Patchify": the downsampling stem in ResNet consists of a 7x7 convolution with stride 2 followed by stride-2 max pooling. Transformers usually downsample with a convolution whose stride equals its kernel_size, which amounts to mapping each patch to a single output pixel, so the downsampling factor equals kernel_size. After the replacement, accuracy improves from 79.4% to 79.5%, and FLOPs also decrease (see the sketch below).
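Below is a minimal PyTorch sketch (not the paper's code; channel counts and names are illustrative) contrasting the ResNet-style stem with the patchify stem, a single convolution whose stride equals its kernel size:

```python
import torch
import torch.nn as nn

# ResNet-style stem: 7x7 stride-2 conv + stride-2 max pooling (4x downsampling overall)
resnet_stem = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False),
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
)

# "Patchify" stem: one conv with stride == kernel_size, so every 4x4 patch
# becomes a single output pixel (4x downsampling in one step)
patchify_stem = nn.Conv2d(3, 96, kernel_size=4, stride=4)

x = torch.randn(1, 3, 224, 224)
print(resnet_stem(x).shape)    # torch.Size([1, 64, 56, 56])
print(patchify_stem(x).shape)  # torch.Size([1, 96, 56, 56])
```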
ResNeXt-ify
- ResNeXt uses grouped convolution to achieve a good balance between FLOPs and accuracy. ConvNeXt goes further and directly uses the depthwise convolution proposed in MobileNet to replace the 3x3 convolution in the ResNet block. The authors also increase the number of channels from 64 to 96, and accuracy reaches 80.5% (see the sketch below).
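A minimal sketch of the swap, assuming PyTorch and the 96-channel width mentioned above; setting `groups` equal to the channel count turns a standard convolution into a depthwise one:

```python
import torch.nn as nn

dim = 96  # the channel width used in the text

# standard dense 3x3 convolution: every filter sees all input channels
standard_conv = nn.Conv2d(dim, dim, kernel_size=3, padding=1)

# depthwise 3x3 convolution: groups == channels, one filter per channel
depthwise_conv = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)

# parameter counts illustrate the FLOPs saving: roughly 96*96*3*3 vs. 96*1*3*3 weights
print(sum(p.numel() for p in standard_conv.parameters()))
print(sum(p.numel() for p in depthwise_conv.parameters()))
```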
Inverted Bottleneck
- The authors observe that the MLP module in a Transformer block is similar to the inverted bottleneck in MobileNetV2: both are thin at the two ends and thick in the middle. ConvNeXt therefore changes the block structure from (a) in the figure to (c), where (b) is the inverted bottleneck used in MobileNetV2. With the inverted bottleneck, accuracy improves from 80.5% to 80.6% on the smaller model and from 81.9% to 82.6% on larger models, while FLOPs decrease slightly (a sketch of the change follows).
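The following sketch (PyTorch, illustrative dimensions, not the official implementation) contrasts the bottleneck ordering (a) with the inverted bottleneck (b): channels are squeezed in the middle of (a) but expanded 4x in the middle of (b), mirroring the Transformer MLP:

```python
import torch.nn as nn

dim = 96

# (a) bottleneck ordering: the middle of the block is the narrow part
bottleneck = nn.Sequential(
    nn.Conv2d(4 * dim, dim, kernel_size=1),                     # 1x1 reduce, 384 -> 96
    nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim),  # depthwise 3x3 at 96
    nn.Conv2d(dim, 4 * dim, kernel_size=1),                     # 1x1 restore, 96 -> 384
)

# (b) inverted bottleneck: the middle of the block is the wide part,
# like the 4x expansion in a Transformer MLP
inverted_bottleneck = nn.Sequential(
    nn.Conv2d(dim, 4 * dim, kernel_size=1),                                 # 1x1 expand, 96 -> 384
    nn.Conv2d(4 * dim, 4 * dim, kernel_size=3, padding=1, groups=4 * dim),  # depthwise 3x3 at 384
    nn.Conv2d(4 * dim, dim, kernel_size=1),                                 # 1x1 project, 384 -> 96
)
```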

Large kernel size
Transformers generally perform global self-attention (e.g., Vision Transformer), and even Swin Transformer uses a 7x7 window size. Mainstream convolutional networks, however, mostly use 3x3 kernels, since VGG showed that stacking multiple 3x3 convolution layers can replace a larger convolution layer, and 3x3 convolutions are also optimized more efficiently in practice.
- Moving up depthwise conv: change 1x1 conv → depthwise conv → 1x1 conv into depthwise conv → 1x1 conv → 1x1 conv. Accuracy drops to 79.9%, and FLOPs also decrease.
- Increasing the kernel size: change the depthwise convolution kernel to 7x7. The authors also tried other kernel sizes (3, 5, 7, 9, 11) and found that accuracy saturates at 7. Accuracy grows from 79.9% (3x3) to 80.6% (7x7); see the sketch below.
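Combining the two points above, a minimal sketch (PyTorch, illustrative width of 96) of structure (c), with the depthwise conv moved to the front and its kernel enlarged to 7x7:

```python
import torch.nn as nn

dim = 96

# structure (c): the depthwise conv moves to the front and its kernel grows to 7x7
large_kernel_block = nn.Sequential(
    nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim),  # 7x7 depthwise, moved up
    nn.Conv2d(dim, 4 * dim, kernel_size=1),                     # 1x1 expand
    nn.Conv2d(4 * dim, dim, kernel_size=1),                     # 1x1 project
)
```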
Micro Design
This part focuses on smaller differences, such as activation functions and normalization.
- ReLU→GELU: replacing the ReLU activation function with GELU produces no significant change in accuracy.
- Fewer activation functions: convolutional networks usually append an activation function after each convolution or fully connected layer, whereas in the Transformer MLP only the first fully connected layer is followed by GELU. The authors likewise reduce the activations in ConvNeXt to depthwise conv → 1x1 conv + GELU → 1x1 conv, and accuracy improves from 80.6% to 81.3%.
- Fewer normalization layers: the authors keep normalization only after the 7x7 depthwise conv in ConvNeXt; accuracy increases by 0.1%, to 81.4%.
- Substituting BN with LN: replacing batch norm with layer norm increases accuracy by another 0.1%, to 81.5%.
- Separate downsampling layers: in the original ResNet, downsampling in stage2-stage4 is done by setting the stride of the 3x3 convolution on the main branch to 2 and the stride of the 1x1 convolution on the shortcut branch to 2. Swin Transformer instead uses a separate patch merging layer. The ConvNeXt authors likewise use a separate downsampling layer, consisting of a LayerNorm followed by a convolution with kernel size 2 and stride 2. This finally lifts accuracy to 82.0% (a sketch of the resulting block and downsampling layer follows this list).
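Pulling these micro-design choices together, here is a minimal, self-contained sketch (PyTorch; class and helper names such as `LayerNorm2d` are illustrative, not the official ConvNeXt code) of one block with a single LayerNorm and a single GELU, plus a separate downsampling layer made of LayerNorm and a 2x2, stride-2 convolution:

```python
import torch
import torch.nn as nn

class LayerNorm2d(nn.Module):
    """LayerNorm over the channel dimension of an NCHW tensor."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.norm = nn.LayerNorm(dim, eps=eps)

    def forward(self, x):
        x = x.permute(0, 2, 3, 1)      # NCHW -> NHWC
        x = self.norm(x)
        return x.permute(0, 3, 1, 2)   # NHWC -> NCHW

class Block(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.norm = LayerNorm2d(dim)             # the only normalization in the block
        self.pwconv1 = nn.Conv2d(dim, 4 * dim, kernel_size=1)
        self.act = nn.GELU()                     # the only activation in the block
        self.pwconv2 = nn.Conv2d(4 * dim, dim, kernel_size=1)

    def forward(self, x):
        shortcut = x
        x = self.dwconv(x)
        x = self.norm(x)
        x = self.act(self.pwconv1(x))
        x = self.pwconv2(x)
        return shortcut + x                      # residual connection

def downsample_layer(in_dim, out_dim):
    # separate downsampling between stages: LayerNorm + 2x2, stride-2 conv
    return nn.Sequential(
        LayerNorm2d(in_dim),
        nn.Conv2d(in_dim, out_dim, kernel_size=2, stride=2),
    )

x = torch.randn(1, 96, 56, 56)
print(Block(96)(x).shape)                  # torch.Size([1, 96, 56, 56])
print(downsample_layer(96, 192)(x).shape)  # torch.Size([1, 192, 28, 28])
```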
Copyright notice
This article was written by [Relearn CS]. Please include a link to the original when reposting. Thanks.
https://yzsam.com/2022/04/202204210944460933.html