[PyTorch Image Classification] ResNet Network Structure
2022-04-21 18:15:00 【Stephen-Chen】
1. Introduction
ResNet was proposed by Kaiming He et al. in 2015. That year it took first place in the ImageNet classification task, first place in the detection task, first place in COCO detection, and first place in COCO segmentation, an impressive sweep. The original paper is Deep Residual Learning for Image Recognition.
What happens as the network gets deeper?

- Vanishing gradients: the gradients propagated back through the network become less and less correlated, approaching white noise, so the gradient updates become close to random perturbations.
- Degradation: performance is poor on both the training set and the test set. Note that this is not overfitting (good performance on the training set, poor performance on the test set).

As the comparison in the paper shows, beyond a certain point the deeper plain network actually has the higher error rate.
2. Network innovation
Residual block

The paper introduces a deep residual learning framework: instead of having each stack of layers directly fit the desired underlying mapping, the stacked layers fit a residual mapping. Compared with directly optimizing the underlying mapping H(x), optimizing the residual mapping is easier.
- Note: the addition is performed first, and the result then passes through the ReLU activation function.
- Note: the output feature matrices of the main branch and the shortcut branch must have the same shape so that they can be added element-wise. Recall that GoogLeNet, by contrast, concatenates along the depth dimension.
- The output of the max-pooling downsampling is [56, 56, 64], which is exactly the input shape required by the solid-line residual structure.
How should we understand the residual?

Suppose the mapping we want to learn is H(x). We turn this into learning the residual mapping F(x), where F(x) = H(x) - x. A residual is the difference between an observed value and an estimated value: here H(x) is the observation and x is the estimate (the feature map output by the previous layer). We usually call x the identity mapping (the skip connection) and F(x) the residual function.

The problem then becomes learning H(x) = F(x) + x. You may wonder why we bother going through F(x) to obtain H(x) instead of learning H(x) directly, since neural networks are powerful enough to fit almost any function. Consider an ordinary convolutional network, which learns H(x) = F(x) directly. Suppose that at some depth the network has already reached its optimal state, i.e., the error rate is as low as it gets; deepening the network further causes degradation (the error rate rises). Updating the weights of the extra layers then becomes awkward: ideally those layers should keep the network in its optimal state. If the input and output feature sizes are unchanged, the best an extra layer can do is learn an identity mapping, passing its input through untouched so that everything after it behaves exactly like the shallower network. But that is hard to learn: for a 3×3 convolution, the identity mapping corresponds to one very specific kernel (a 1 in the center, 0 everywhere else), and gradient descent struggles to reach it, especially when the initial weights are far away.

A residual network solves this problem nicely. Again suppose the current depth already minimizes the error rate. If we keep deepening a ResNet, then to keep the extra layers in the optimal state we only need F(x) = 0: since x is already the optimal output, making the block's output H(x) = x requires nothing more than F(x) = 0. That is easy to achieve; as long as the convolution weights are small enough, the result of the multiply-accumulate is close to 0. Of course this is the idealized case; in practice x is rarely exactly optimal, but it can get arbitrarily close, and with the residual formulation only small updates to the weights inside F(x) are needed, rather than fighting to learn an identity mapping as an ordinary convolutional layer would.
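To make the formulation concrete, here is a minimal sketch (the stacked layers `f` standing in for F(x) are a hypothetical example): the block only has to learn F(x), and if its weights shrink toward zero the whole block degenerates to the identity mapping H(x) = x.

import torch
from torch import nn

# Hypothetical stacked weight layers F(x).
f = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=3, padding=1, bias=False),
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
    nn.Conv2d(64, 64, kernel_size=3, padding=1, bias=False),
    nn.BatchNorm2d(64),
)

x = torch.randn(1, 64, 56, 56)
h = torch.relu(f(x) + x)   # H(x) = F(x) + x, followed by ReLU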
Note: if the output of the residual mapping F(x) and the skip connection x have different dimensions, the two cannot be added directly; x must first be brought up to the same dimension. There are two ways to do this (a small sketch follows the list):

- zero padding
- a 1×1 convolution
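The sketch below illustrates both options with hypothetical channel counts (the Bottleneck class that follows uses the 1×1-convolution option):

import torch
from torch import nn
import torch.nn.functional as F

x = torch.randn(1, 64, 56, 56)      # shortcut branch x, 64 channels
res = torch.randn(1, 128, 56, 56)   # residual branch output F(x), 128 channels

# Option 1: zero-pad the missing channels of the shortcut.
# F.pad pads from the last dimension backwards: (W, W, H, H, C, C).
x_pad = F.pad(x, (0, 0, 0, 0, 0, 128 - 64))
out_a = res + x_pad

# Option 2: project the shortcut with a 1x1 convolution.
proj = nn.Conv2d(64, 128, kernel_size=1, bias=False)
out_b = res + proj(x)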
Implementation of the Bottleneck class
import torch
from torch import nn


class Bottleneck(nn.Module):
    # In a bottleneck residual block the output channel count is 4x the
    # bottleneck width, e.g. 64 -> 256 or 256 -> 1024.
    def __init__(self, in_dim, out_dim, stride=1):
        super(Bottleneck, self).__init__()
        # The stacked layers consist of 1*1, 3*3 and 1*1 convolutions, each
        # followed by a BN layer.
        self.bottleneck = nn.Sequential(
            nn.Conv2d(in_channels=in_dim, out_channels=in_dim, kernel_size=1, bias=False),
            nn.BatchNorm2d(in_dim),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels=in_dim, out_channels=in_dim, kernel_size=3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(in_dim),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels=in_dim, out_channels=out_dim, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_dim)
        )
        self.relu = nn.ReLU(inplace=True)
        # The downsample branch consists of a 1*1 convolution followed by a
        # BN layer. It changes the channel count (and, via stride, the spatial
        # size) of the identity mapping to match the stacked layers so that
        # the two can be added.
        self.downsample = nn.Sequential(
            nn.Conv2d(in_channels=in_dim, out_channels=out_dim, kernel_size=1, stride=stride, bias=False),
            nn.BatchNorm2d(out_dim)
        )

    def forward(self, x):
        out = self.bottleneck(x)
        identity = self.downsample(x)
        # Add the identity mapping to the output of the stacked layers, then
        # apply ReLU.
        out += identity
        out = self.relu(out)
        return out


bottleneck_1 = Bottleneck(64, 256)
print(bottleneck_1)
input = torch.randn(1, 64, 56, 56)
out = bottleneck_1(input)
print(out.shape)   # torch.Size([1, 256, 56, 56])
Batch Normalization layer

For background, see the earlier post: Deep Learning Theory - BN Layer.
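As a quick reminder (a minimal sketch, not taken from the linked post): ResNet places a BN layer after every convolution, which is why the convolutions in this article are created with bias=False; BN's own learnable shift makes the convolution bias redundant.

import torch
from torch import nn

# The Conv-BN-ReLU pattern used throughout ResNet.
conv_bn_relu = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False),
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
)

x = torch.randn(1, 3, 224, 224)
print(conv_bn_relu(x).shape)   # torch.Size([1, 64, 112, 112])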
Transfer learning

Advantages:

- Training reaches good results quickly.
- Good results can be obtained even when the dataset is small.

How to understand it: the shallow features learned by the earlier layers of a pretrained network act as a general-purpose recognition capability that transfers to new tasks.

Note: when using someone else's pretrained weights, pay attention to using the same preprocessing they used.
Common transfer-learning approach: after loading the pretrained weights, it is generally recommended to train all of the parameters.
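A minimal sketch of this strategy using torchvision's pretrained ResNet-34 (the 5-class head and the learning rate are arbitrary examples; newer torchvision versions use the weights= argument, older ones use pretrained=True):

import torch
from torch import nn
from torchvision import models

# Load an ImageNet-pretrained ResNet-34.
net = models.resnet34(weights=models.ResNet34_Weights.IMAGENET1K_V1)

# Replace the classification head for a new 5-class task, then fine-tune
# all parameters, as recommended above.
net.fc = nn.Linear(net.fc.in_features, 5)
optimizer = torch.optim.Adam(net.parameters(), lr=1e-4)

# Remember to preprocess inputs the same way the pretrained weights were
# trained (ImageNet mean/std normalization).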
3. Network architecture
For the residual structures that need to perform down-sampling (the first residual structure of conv_3, conv_4 and conv_5), the paper uses the following form (the original paper discusses several options; described here is the one the authors finally adopted): the 3×3 convolution on the main branch uses stride = 2 to perform the down-sampling, and the 1×1 convolution on the dashed-line shortcut also uses stride = 2.
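A quick sanity check of the down-sampling (a minimal sketch with hypothetical channel counts): both the stride-2 3×3 convolution on the main branch and the stride-2 1×1 convolution on the dashed shortcut halve the spatial size, so the two outputs can still be added.

import torch
from torch import nn

x = torch.randn(1, 64, 56, 56)

main = nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1, bias=False)
shortcut = nn.Conv2d(64, 128, kernel_size=1, stride=2, bias=False)

print(main(x).shape)       # torch.Size([1, 128, 28, 28])
print(shortcut(x).shape)   # torch.Size([1, 128, 28, 28])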
4. Code implementation
class ResNet(nn.Module):
    def __init__(self, block, blocks_num, num_class=1000, include_top=True, groups=1):
        super(ResNet, self).__init__()
        self.include_top = include_top
        self.in_channel = 64
        self.groups = groups
        self.conv1 = nn.Conv2d(3, self.in_channel, kernel_size=7, stride=2, padding=3, bias=False)
        self.bn1 = nn.BatchNorm2d(self.in_channel)
        self.relu = nn.ReLU(inplace=True)
        # padding=1 so that a 224x224 input gives the 56x56 feature map
        # required by the first residual stage.
        self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
        self.layer1 = self._make_layer(block, 64, blocks_num[0])
        self.layer2 = self._make_layer(block, 128, blocks_num[1], stride=2)
        self.layer3 = self._make_layer(block, 256, blocks_num[2], stride=2)
        self.layer4 = self._make_layer(block, 512, blocks_num[3], stride=2)
        if self.include_top:
            self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
            self.fc = nn.Linear(512 * block.expansion, num_class)

    def _make_layer(self, block, channel, block_num, stride=1) -> nn.Sequential:
        # The first block of a stage may need a downsample branch (the
        # dashed shortcut): a 1*1 convolution plus BN that matches both the
        # spatial size (via stride) and the channel count (channel * expansion).
        downsample = None
        if stride != 1 or self.in_channel != channel * block.expansion:
            downsample = nn.Sequential(
                nn.Conv2d(self.in_channel, channel * block.expansion, kernel_size=1, stride=stride, bias=False),
                nn.BatchNorm2d(channel * block.expansion)
            )
        layers = []
        layers.append(block(self.in_channel, channel, downsample=downsample, stride=stride))
        self.in_channel = channel * block.expansion
        # The remaining blocks of the stage are solid-line residual structures.
        for _ in range(1, block_num):
            layers.append(block(self.in_channel, channel))
        return nn.Sequential(*layers)

    def forward(self, x):
        x = self.conv1(x)
        x = self.bn1(x)
        x = self.relu(x)
        x = self.maxpool(x)
        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.layer4(x)
        if self.include_top:
            x = self.avgpool(x)
            x = torch.flatten(x, 1)
            x = self.fc(x)
        return x


# The block classes passed in are assumed to follow the standard interface:
# an `expansion` class attribute and an
# __init__(in_channel, channel, downsample=None, stride=1) constructor
# (the standalone Bottleneck demo above uses a simplified signature).
def resNet34(num_class=1000, include_top=True) -> ResNet:
    return ResNet(BasicBlock, [3, 4, 6, 3], num_class=num_class, include_top=include_top)


def resNet50(num_class=1000, include_top=True):
    return ResNet(Bottleneck, [3, 4, 6, 3], num_class=num_class, include_top=include_top)


def resNet101(num_class=1000, include_top=True):
    return ResNet(Bottleneck, [3, 4, 23, 3], num_class=num_class, include_top=include_top)
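The factory functions reference a BasicBlock class that is not defined in this post. Below is a minimal sketch of the standard two-3×3-convolution design it usually refers to (an assumption based on the common torchvision-style implementation, not code from the original post), followed by a quick shape check:

class BasicBlock(nn.Module):
    # Standard two-layer residual block used by ResNet-18/34; the output
    # channel count equals `channel`, so expansion = 1.
    expansion = 1

    def __init__(self, in_channel, channel, downsample=None, stride=1):
        super(BasicBlock, self).__init__()
        self.conv1 = nn.Conv2d(in_channel, channel, kernel_size=3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channel)
        self.relu = nn.ReLU(inplace=True)
        self.conv2 = nn.Conv2d(channel, channel, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channel)
        self.downsample = downsample

    def forward(self, x):
        identity = x
        # Project the shortcut when the spatial size or channel count changes.
        if self.downsample is not None:
            identity = self.downsample(x)
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out += identity
        return self.relu(out)


net = resNet34(num_class=1000)
print(net(torch.randn(1, 3, 224, 224)).shape)   # torch.Size([1, 1000])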
5. Summary

Learning nested function classes is the ideal situation for training neural networks. In a deep neural network, it is easier to have an additional layer learn the identity function (although this is an extreme case).

The residual mapping makes it easier to learn the identity function, for example by driving the parameters in the weight layers toward zero.

Residual blocks make it possible to train effective deep neural networks: the input can propagate forward more quickly through the residual connections between layers.

The residual network (ResNet) has had a profound influence on the design of subsequent deep neural networks.
Copyright notice
This article was written by [Stephen-Chen]. Please include a link to the original when reposting. Thanks.
https://yzsam.com/2022/04/202204211810506017.html