Learning Notes 7: Deep Neural Network Optimization
2022-04-23 10:39:00 【When can I be as powerful as a big man】
Batch normalization (BatchNormalization)
Standardization of inputs (shallow models)
After processing, each feature has mean 0 and standard deviation 1 across all samples in the dataset.
Standardizing the input data makes the distribution of each feature similar.
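As a hedged illustration, standardizing each feature (column) of a data matrix to mean 0 and standard deviation 1 might look like this (the matrix X here is made up; note that torch.std uses the unbiased estimator by default):

import torch

# Hypothetical data: 4 samples, 3 features on very different scales
X = torch.tensor([[1.0, 20.0, 300.0],
                  [2.0, 10.0, 100.0],
                  [3.0, 30.0, 200.0],
                  [4.0, 40.0, 400.0]])
X_std = (X - X.mean(dim=0)) / X.std(dim=0)  # standardize each feature
print(X_std.mean(dim=0))  # ~0 for every feature
print(X_std.std(dim=0))   # ~1 for every feature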
Batch normalization (deep models)
Batch normalization uses the mean and standard deviation computed on a mini-batch to continually adjust the intermediate outputs of the neural network, so that the intermediate output values of every layer in the network are more stable.
1. Batch normalization for the fully connected layer
Location: between the affine transformation and the activation function in the fully connected layer.
Fully connected layer:
$$\boldsymbol{x} = \boldsymbol{W}\boldsymbol{u} + \boldsymbol{b}, \qquad \text{output} = \phi(\boldsymbol{x})$$
Batch normalization:
$$\text{output} = \phi(\text{BN}(\boldsymbol{x}))$$
$$\boldsymbol{y}^{(i)} = \text{BN}(\boldsymbol{x}^{(i)})$$
$$\boldsymbol{\mu}_\mathcal{B} \leftarrow \frac{1}{m}\sum_{i=1}^{m} \boldsymbol{x}^{(i)},$$
$$\boldsymbol{\sigma}_\mathcal{B}^2 \leftarrow \frac{1}{m}\sum_{i=1}^{m} \left(\boldsymbol{x}^{(i)} - \boldsymbol{\mu}_\mathcal{B}\right)^2,$$
$$\hat{\boldsymbol{x}}^{(i)} \leftarrow \frac{\boldsymbol{x}^{(i)} - \boldsymbol{\mu}_\mathcal{B}}{\sqrt{\boldsymbol{\sigma}_\mathcal{B}^2 + \epsilon}},$$
Here $\epsilon > 0$ is a very small constant that ensures the denominator is greater than 0.
$$\boldsymbol{y}^{(i)} \leftarrow \boldsymbol{\gamma} \odot \hat{\boldsymbol{x}}^{(i)} + \boldsymbol{\beta}.$$
This introduces learnable parameters: the scale (stretch) parameter $\boldsymbol{\gamma}$ and the shift (offset) parameter $\boldsymbol{\beta}$. If $\boldsymbol{\gamma} = \sqrt{\boldsymbol{\sigma}_\mathcal{B}^2 + \epsilon}$ and $\boldsymbol{\beta} = \boldsymbol{\mu}_\mathcal{B}$, batch normalization has no effect: it exactly recovers the original input $\boldsymbol{x}^{(i)}$.
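A minimal numeric sketch of these equations in PyTorch (the batch below is made up for illustration; eps matches the 1e-5 used in the implementation later in this note):

import torch

x = torch.randn(8, 4)                     # a mini-batch: m = 8 samples, 4 features
mu = x.mean(dim=0)                        # per-feature mean over the batch
var = ((x - mu) ** 2).mean(dim=0)         # per-feature (biased) variance
eps = 1e-5
x_hat = (x - mu) / torch.sqrt(var + eps)  # normalized: per-feature mean ~0, std ~1
gamma, beta = torch.ones(4), torch.zeros(4)
y = gamma * x_hat + beta                  # scale and shift (here an identity map)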
2. Batch normalization for the convolutional layer
Location: after the convolution computation and before applying the activation function.
If the convolution outputs multiple channels, the output of each channel is batch-normalized separately, and each channel has its own scale and shift parameters.
Computation: for a single channel with batch size m and a convolution output of size p×q, batch normalization is applied to the m×p×q elements of that channel at the same time, using the same mean and variance (see the sketch below).
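A hedged sketch of the per-channel computation on a 4-D input of shape (m, C, p, q); torch.mean over a tuple of dims is equivalent to the chained means used in the implementation below:

import torch

X = torch.randn(2, 3, 4, 4)  # m = 2, C = 3 channels, p = q = 4
# Mean and variance per channel, over the m*p*q elements of each channel
mean = X.mean(dim=(0, 2, 3), keepdim=True)                 # shape (1, 3, 1, 1)
var = ((X - mean) ** 2).mean(dim=(0, 2, 3), keepdim=True)  # shape (1, 3, 1, 1)
X_hat = (X - mean) / torch.sqrt(var + 1e-5)                # broadcasts back to X's shape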
3. Batch normalization at prediction time
Training: the mean and variance are computed batch by batch, once per mini-batch.
Prediction: moving averages accumulated during training are used to estimate the sample mean and variance of the whole training dataset.
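Concretely, the moving averages are updated once per training batch and then reused at prediction time; a minimal sketch (momentum = 0.9 matches the implementation below):

# After computing mean/var on the current training batch:
moving_mean = momentum * moving_mean + (1.0 - momentum) * mean
moving_var = momentum * moving_var + (1.0 - momentum) * var
# At prediction time, normalize with the accumulated estimates:
X_hat = (X - moving_mean) / torch.sqrt(moving_var + eps)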
Implementation from scratch
import time
import torch
from torch import nn, optim
import torch.nn.functional as F
import torchvision
import sys
sys.path.append("/home/kesci/input/")
import d2lzh1981 as d2l
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
def batch_norm(is_training, X, gamma, beta, moving_mean, moving_var, eps, momentum):
    # Decide whether the current mode is training mode or prediction mode
    if not is_training:
        # In prediction mode, directly use the mean and variance obtained
        # from the moving averages passed in
        X_hat = (X - moving_mean) / torch.sqrt(moving_var + eps)
    else:
        assert len(X.shape) in (2, 4)
        if len(X.shape) == 2:
            # Fully connected layer: compute the mean and variance
            # per feature, over the batch dimension
            mean = X.mean(dim=0)
            var = ((X - mean) ** 2).mean(dim=0)
        else:
            # 2-D convolutional layer: compute the mean and variance per
            # channel (axis=1). Keep X's number of dimensions so that the
            # broadcast operation can be done later
            mean = X.mean(dim=0, keepdim=True).mean(dim=2, keepdim=True).mean(dim=3, keepdim=True)
            var = ((X - mean) ** 2).mean(dim=0, keepdim=True).mean(dim=2, keepdim=True).mean(dim=3, keepdim=True)
        # In training mode, standardize with the mean and variance of the current batch
        X_hat = (X - mean) / torch.sqrt(var + eps)
        # Update the moving averages of the mean and variance
        moving_mean = momentum * moving_mean + (1.0 - momentum) * mean
        moving_var = momentum * moving_var + (1.0 - momentum) * var
    Y = gamma * X_hat + beta  # Scale and shift
    return Y, moving_mean, moving_var
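A quick sanity check of this function on a hypothetical 2-D batch (not part of the original note):

X = torch.randn(8, 4)
gamma, beta = torch.ones(4), torch.zeros(4)
mm, mv = torch.zeros(4), torch.zeros(4)
Y, mm, mv = batch_norm(True, X, gamma, beta, mm, mv, eps=1e-5, momentum=0.9)
print(Y.mean(dim=0))  # per-feature mean close to 0
print(mm, mv)         # moving averages nudged toward the batch statistics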
class BatchNorm(nn.Module):
    def __init__(self, num_features, num_dims):
        super(BatchNorm, self).__init__()
        if num_dims == 2:
            shape = (1, num_features)  # Number of outputs of the fully connected layer
        else:
            shape = (1, num_features, 1, 1)  # Number of channels
        # Scale and shift parameters, which participate in gradient
        # computation and iteration; initialized to 1 and 0 respectively
        self.gamma = nn.Parameter(torch.ones(shape))
        self.beta = nn.Parameter(torch.zeros(shape))
        # Variables that do not participate in gradient computation and
        # iteration; both initialized to 0 in main memory
        self.moving_mean = torch.zeros(shape)
        self.moving_var = torch.zeros(shape)

    def forward(self, X):
        # If X is not in main memory, copy moving_mean and moving_var
        # to the device (e.g. GPU memory) where X lives
        if self.moving_mean.device != X.device:
            self.moving_mean = self.moving_mean.to(X.device)
            self.moving_var = self.moving_var.to(X.device)
        # Save the updated moving_mean and moving_var. A Module instance's
        # training attribute defaults to True; calling .eval() sets it to False
        Y, self.moving_mean, self.moving_var = batch_norm(self.training,
            X, self.gamma, self.beta, self.moving_mean,
            self.moving_var, eps=1e-5, momentum=0.9)
        return Y
net = nn.Sequential(
    nn.Conv2d(1, 6, 5),  # in_channels, out_channels, kernel_size
    BatchNorm(6, num_dims=4),
    nn.Sigmoid(),
    nn.MaxPool2d(2, 2),  # kernel_size, stride
    nn.Conv2d(6, 16, 5),
    BatchNorm(16, num_dims=4),
    nn.Sigmoid(),
    nn.MaxPool2d(2, 2),
    d2l.FlattenLayer(),
    nn.Linear(16*4*4, 120),
    BatchNorm(120, num_dims=2),
    nn.Sigmoid(),
    nn.Linear(120, 84),
    BatchNorm(84, num_dims=2),
    nn.Sigmoid(),
    nn.Linear(84, 10)
)
print(net)
# batch_size = 256
# Use a smaller batch size when training on CPU
batch_size = 16
def load_data_fashion_mnist(batch_size, resize=None, root='/home/kesci/input/FashionMNIST2065'):
    """Download the Fashion-MNIST dataset and then load it into memory."""
    trans = []
    if resize:
        trans.append(torchvision.transforms.Resize(size=resize))
    trans.append(torchvision.transforms.ToTensor())
    transform = torchvision.transforms.Compose(trans)
    mnist_train = torchvision.datasets.FashionMNIST(root=root, train=True, download=True, transform=transform)
    mnist_test = torchvision.datasets.FashionMNIST(root=root, train=False, download=True, transform=transform)
    train_iter = torch.utils.data.DataLoader(mnist_train, batch_size=batch_size, shuffle=True, num_workers=2)
    test_iter = torch.utils.data.DataLoader(mnist_test, batch_size=batch_size, shuffle=False, num_workers=2)
    return train_iter, test_iter
train_iter, test_iter = load_data_fashion_mnist(batch_size)
lr, num_epochs = 0.001, 5
optimizer = torch.optim.Adam(net.parameters(), lr=lr)
d2l.train_ch5(net, train_iter, test_iter, batch_size, optimizer, device, num_epochs)
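For comparison, a sketch of the same network written with PyTorch's built-in nn.BatchNorm2d and nn.BatchNorm1d layers, which maintain their own running statistics and learnable scale/shift parameters (assuming the same d2l.FlattenLayer helper as above):

net = nn.Sequential(
    nn.Conv2d(1, 6, 5),
    nn.BatchNorm2d(6),
    nn.Sigmoid(),
    nn.MaxPool2d(2, 2),
    nn.Conv2d(6, 16, 5),
    nn.BatchNorm2d(16),
    nn.Sigmoid(),
    nn.MaxPool2d(2, 2),
    d2l.FlattenLayer(),
    nn.Linear(16*4*4, 120),
    nn.BatchNorm1d(120),
    nn.Sigmoid(),
    nn.Linear(120, 84),
    nn.BatchNorm1d(84),
    nn.Sigmoid(),
    nn.Linear(84, 10)
)

It can be trained with the same d2l.train_ch5 call as above.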
Copyright notice
This article was written by [When can I be as powerful as a big man]. Please include the original link when reposting. Thank you.
https://yzsam.com/2022/04/202204230619103837.html