Pytorch learning record (V): back propagation + gradient-based optimizers (SGD, Adagrad, RMSprop, Adam)
2022-04-23 05:54:00 【Zuo Xiaotian ^ o^】
Back propagation algorithm
The chain rule
Finding partial derivatives

Back propagation

Example: the sigmoid function
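The formula images for this example are not reproduced here. As a rough, hedged sketch (not part of the original post), the snippet below lets autograd apply the chain rule to y = sigmoid(w * x + b) and compares the result with the hand-derived derivative sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z)):

import torch

# y = sigmoid(w * x + b); by the chain rule, dy/dw = y * (1 - y) * x and dy/db = y * (1 - y)
x = torch.tensor(1.5)
w = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(-1.0, requires_grad=True)

y = torch.sigmoid(w * x + b)
y.backward()  # back propagation: autograd applies the chain rule automatically

print(w.grad.item(), y.item() * (1 - y.item()) * x.item())  # the two values agree
print(b.grad.item(), y.item() * (1 - y.item()))             # so do these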




Back propagation is at the core of optimization in deep learning, because every gradient-based optimization algorithm needs to compute the gradient of each parameter.
Variants of the gradient-based optimization algorithms
A detailed explanation of each optimizer's parameters: https://www.cnblogs.com/sddai/p/14627785.html
1. Gradient descent method
Plain gradient descent computes the gradient on the whole training set and updates every parameter along the negative gradient: parameter = parameter - learning_rate * gradient.
2. SGD: stochastic gradient descent
Each update computes the gradient on a single batch of data instead of on the whole data set.

Update rule: parameter = parameter - learning_rate * parameter_gradient
The code is as follows:
def sgd_update(parameters, lr):
    for param in parameters:
        param.data = param.data - lr * param.grad.data
The full code:
import numpy as np
import torch
from torchvision.datasets import MNIST
from torch.utils.data import DataLoader
from torch import nn
from torch.autograd import Variable
import time
import matplotlib.pyplot as plt
import os
os.environ['KMP_DUPLICATE_LIB_OK'] = 'TRUE'
# The batch_size = 1 case
# Define the data preprocessing function
def data_tf(x):
    x = np.array(x, dtype='float32') / 255  # scale the data to [0, 1]
    x = (x - 0.5) / 0.5                     # standardize
    x = x.reshape((-1,))                    # flatten
    x = torch.from_numpy(x)                 # convert to a Tensor
    return x
train_set = MNIST('./data', train=True, transform=data_tf, download=True)
test_set = MNIST('./data', train=False, transform=data_tf, download=True)
# Define the loss function
criterion = nn.CrossEntropyLoss()
# Define the gradient descent update
# Formula: parameter = parameter - learning_rate * gradient
# Takes the network parameters and the learning rate, and updates the parameters in place
def sgd_update(parameters, lr):
    for param in parameters:
        param.data = param.data - lr * param.grad.data
# Define the training loader
train_data = DataLoader(train_set, batch_size=1, shuffle=True)
# Use Sequential to define a 3-layer neural network
net = nn.Sequential(
    nn.Linear(784, 200),
    nn.ReLU(),
    nn.Linear(200, 10)
)
# Start training
losses1 = []          # container for the recorded losses
idx = 0               # iteration counter
start = time.time()   # start timing
for e in range(5):
    train_loss = 0    # training loss for this epoch
    for im, label in train_data:
        # Wrap the data in Variable
        im = Variable(im)
        label = Variable(label)
        # Forward pass
        out = net(im)
        loss = criterion(out, label)
        # Backward pass
        net.zero_grad()                    # clear the gradients
        loss.backward()                    # back propagation
        sgd_update(net.parameters(), 1e-2) # gradient descent with learning rate 0.01
        # Record the loss
        train_loss += loss.item()
        if idx % 30 == 0:
            losses1.append(loss.item())
        idx += 1
    print('epoch: {}, Train Loss: {:.6f}'.format(e, train_loss / len(train_data)))
end = time.time()
print('Time used: {:.5f} s'.format(end - start))
# Plot the loss curve
x_axis = np.linspace(0, 5, len(losses1), endpoint=True)
plt.semilogy(x_axis, losses1, label='batch_size=1')
plt.legend(loc='best')
plt.show()

Changing batch_size to 64:
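As a sketch (assuming the same script as above), only the DataLoader line changes; the rest of the training loop stays the same:

train_data = DataLoader(train_set, batch_size=64, shuffle=True)  # 64 samples per update instead of 1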

If the learning rate is set too high, the loss jumps back and forth and cannot decrease properly, so in practice a relatively small learning rate is usually used.
PyTorch's built-in version is optimizer = torch.optim.SGD(net.parameters(), lr).
Its full signature is:
class torch.optim.SGD(params, lr=<required>, momentum=0, dampening=0, weight_decay=0, nesterov=False)
params (iterable) – iterable of parameters to optimize, or a dict defining parameter groups
lr (float) – learning rate
momentum (float, optional) – momentum factor (default: 0)
weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)
dampening (float, optional) – dampening for momentum (default: 0)
nesterov (bool, optional) – enables Nesterov momentum (default: False)
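For illustration only (the argument values below are arbitrary and not from the original post), the optional arguments are passed like this, with net being the network defined above:

optimizer = torch.optim.SGD(net.parameters(), lr=1e-2, momentum=0.9,
                            weight_decay=1e-4, nesterov=True)  # illustrative values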
3. Momentum
Stochastic gradient descent with an added momentum term.
Gradient descent can be pictured as sliding down a very flat funnel: the gradient is large in the vertical direction and small in the horizontal direction. The learning rate therefore cannot be set too large, or the parameters would be updated too far in the vertical direction; but such a small learning rate makes the updates in the horizontal direction too slow, so convergence becomes very slow. Momentum accumulates past gradients to damp the vertical oscillation while speeding up the horizontal progress; a rough sketch of one common formulation follows.
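The formula images are not reproduced here. As a rough sketch in the same manual style as sgd_update above (the vs buffers are assumed to be zero-initialized tensors with the same shapes as the parameters; gamma is the momentum coefficient, e.g. 0.9):

def sgd_momentum(parameters, vs, lr, gamma):
    # vs: one zero-initialized velocity tensor per parameter
    for param, v in zip(parameters, vs):
        v[:] = gamma * v + lr * param.grad.data  # exponentially decaying accumulation of gradients
        param.data = param.data - v              # step along the accumulated velocity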


In PyTorch: torch.optim.SGD(net.parameters(), lr=1e-2, momentum=0.9)  # add momentum
import numpy as np
import torch
from torchvision.datasets import MNIST # Import pytorch Built in mnist data
from torch.utils.data import DataLoader
from torch import nn
from torch.autograd import Variable
import time
import matplotlib.pyplot as plt
import os
os.environ['KMP_DUPLICATE_LIB_OK'] = 'TRUE'
def data_tf(x):
    x = np.array(x, dtype='float32') / 255
    x = (x - 0.5) / 0.5   # standardize; this technique is discussed later
    x = x.reshape((-1,))  # flatten
    x = torch.from_numpy(x)
    return x

train_set = MNIST('./data', train=True, transform=data_tf, download=True)  # load the data set with the transform declared above
test_set = MNIST('./data', train=False, transform=data_tf, download=True)
# Define the loss function
criterion = nn.CrossEntropyLoss()
train_data = DataLoader(train_set, batch_size=64, shuffle=True)
# Define the network model
net = nn.Sequential(
    nn.Linear(784, 200),
    nn.ReLU(),
    nn.Linear(200, 10)
)
optimizer = torch.optim.SGD(net.parameters(), lr=1e-2, momentum=0.9)
losses = []
idx = 0
start = time.time()
for e in range(5):
    train_loss = 0
    for im, label in train_data:
        im = Variable(im)
        label = Variable(label)
        # Forward pass
        out = net(im)
        loss = criterion(out, label)
        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # Record the loss
        train_loss += loss.item()
        if idx % 30 == 0:  # record every 30 steps
            losses.append(loss.item())
        idx += 1
    print('epoch: {}, Train Loss: {:.6f}'.format(e, train_loss / len(train_data)))
end = time.time()
print('Time used: {:.5f} s'.format(end - start))
x_axis = np.linspace(0, 5, len(losses), endpoint=True)
plt.semilogy(x_axis, losses, label='momentum: 0.9')
plt.legend(loc='best')
plt.show()
4. Adagrad: adaptive learning rate

Adagrad's idea is simple: every time we update the parameters on a batch of data, we compute the gradient of each parameter. For each parameter we initialize a variable s to 0 and, at every step, add the square of that parameter's gradient to s. When the parameter is then updated, its learning rate becomes lr / sqrt(s + eps), where eps is a small constant for numerical stability (this is exactly what the sgd_adagrad function below implements).



Defining our own Adagrad update function:
def sgd_adagrad(parameters, sqrs, lr):
    eps = 1e-10
    for param, sqr in zip(parameters, sqrs):
        sqr[:] = sqr + param.grad.data ** 2                 # accumulate the squared gradients
        div = lr / torch.sqrt(sqr + eps) * param.grad.data  # per-parameter scaled step
        param.data = param.data - div
The built-in PyTorch version: optimizer = torch.optim.Adagrad(net.parameters(), lr=1e-2)
import numpy as np
import torch
from torchvision.datasets import MNIST # Import pytorch Built in mnist data
from torch.utils.data import DataLoader
from torch import nn
from torch.autograd import Variable
import time
import matplotlib.pyplot as plt
import os
os.environ['KMP_DUPLICATE_LIB_OK'] = 'TRUE'
def data_tf(x):
    x = np.array(x, dtype='float32') / 255
    x = (x - 0.5) / 0.5   # standardize; this technique is discussed later
    x = x.reshape((-1,))  # flatten
    x = torch.from_numpy(x)
    return x

train_set = MNIST('./data', train=True, transform=data_tf, download=True)  # load the data set with the transform declared above
test_set = MNIST('./data', train=False, transform=data_tf, download=True)
# Define the loss function
criterion = nn.CrossEntropyLoss()
train_data = DataLoader(train_set, batch_size=64, shuffle=True)
# Use Sequential to define a 3-layer neural network
net = nn.Sequential(
    nn.Linear(784, 200),
    nn.ReLU(),
    nn.Linear(200, 10),
)
optimizer = torch.optim.Adagrad(net.parameters(), lr=1e-2)
# Start training
start = time.time()  # start timing
for e in range(5):
    train_loss = 0
    for im, label in train_data:
        im = Variable(im)
        label = Variable(label)
        # Forward pass
        out = net(im)
        loss = criterion(out, label)
        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # Record the loss
        train_loss += loss.item()
    print('epoch: {}, Train Loss: {:.6f}'.format(e, train_loss / len(train_data)))
end = time.time()  # stop timing
print('Time used: {:.5f} s'.format(end - start))
5. RMSprop: an improved adaptive learning rate method
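The formula image is not reproduced here. As a rough sketch in the same manual style as sgd_adagrad above (the sqrs buffers are assumed to be zero-initialized; alpha is the smoothing constant):

def rmsprop_update(parameters, sqrs, lr, alpha):
    eps = 1e-10
    for param, sqr in zip(parameters, sqrs):
        # Moving average of the squared gradients instead of Adagrad's ever-growing sum
        sqr[:] = alpha * sqr + (1 - alpha) * param.grad.data ** 2
        div = lr / torch.sqrt(sqr + eps) * param.grad.data
        param.data = param.data - div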

class torch.optim.RMSprop(params, lr=0.01, alpha=0.99, eps=1e-08, weight_decay=0, momentum=0, centered=False)
Implements the RMSprop algorithm.
Proposed by G. Hinton in his course; the centered version first appeared in Generating Sequences With Recurrent Neural Networks.
Parameters:
params (iterable) – iterable of parameters to optimize, or a dict defining parameter groups
lr (float, optional) – learning rate (default: 1e-2)
momentum (float, optional) – momentum factor (default: 0)
alpha (float, optional) – smoothing constant (default: 0.99)
eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-8)
centered (bool, optional) – if True, compute the centered RMSProp, in which the gradient is normalized by an estimate of its variance
weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)
import numpy as np
import torch
from torchvision.datasets import MNIST # Import pytorch Built in mnist data
from torch.utils.data import DataLoader
from torch import nn
from torch.autograd import Variable
import time
import matplotlib.pyplot as plt
import os
os.environ['KMP_DUPLICATE_LIB_OK'] = 'TRUE'
def data_tf(x):
    x = np.array(x, dtype='float32') / 255
    x = (x - 0.5) / 0.5   # standardize; this technique is discussed later
    x = x.reshape((-1,))  # flatten
    x = torch.from_numpy(x)
    return x

train_set = MNIST('./data', train=True, transform=data_tf, download=True)  # load the data set with the transform declared above
test_set = MNIST('./data', train=False, transform=data_tf, download=True)
# Define the loss function
criterion = nn.CrossEntropyLoss()
train_data = DataLoader(train_set, batch_size=64, shuffle=True)
# Use Sequential to define a 3-layer neural network
net = nn.Sequential(
    nn.Linear(784, 200),
    nn.ReLU(),
    nn.Linear(200, 10),
)
optimizer = torch.optim.RMSprop(net.parameters(), lr=1e-3, alpha=0.9)
# Start training
start = time.time()  # start timing
for e in range(5):
    train_loss = 0
    for im, label in train_data:
        im = Variable(im)
        label = Variable(label)
        # Forward pass
        out = net(im)
        loss = criterion(out, label)
        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # Record the loss
        train_loss += loss.item()
    print('epoch: {}, Train Loss: {:.6f}'.format(e, train_loss / len(train_data)))
end = time.time()  # stop timing
print('Time used: {:.5f} s'.format(end - start))
6. Adam
RMSprop combined with momentum.
It usually gives better results than RMSprop.
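The update formula image is not reproduced here. As a rough sketch in the same manual style as above (vs and sqrs are zero-initialized first- and second-moment buffers, t is the update count starting at 1, and beta1/beta2 take their common default values):

def adam_update(parameters, vs, sqrs, lr, t, beta1=0.9, beta2=0.999):
    eps = 1e-8
    for param, v, sqr in zip(parameters, vs, sqrs):
        v[:] = beta1 * v + (1 - beta1) * param.grad.data           # momentum term
        sqr[:] = beta2 * sqr + (1 - beta2) * param.grad.data ** 2  # RMSprop term
        v_hat = v / (1 - beta1 ** t)                               # bias correction
        sqr_hat = sqr / (1 - beta2 ** t)
        param.data = param.data - lr * v_hat / (torch.sqrt(sqr_hat) + eps)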

In PyTorch: torch.optim.Adam()
import numpy as np
import torch
from torchvision.datasets import MNIST # Import pytorch Built in mnist data
from torch.utils.data import DataLoader
from torch import nn
from torch.autograd import Variable
import time
import matplotlib.pyplot as plt
import os
os.environ['KMP_DUPLICATE_LIB_OK'] = 'TRUE'
def data_tf(x):
    x = np.array(x, dtype='float32') / 255
    x = (x - 0.5) / 0.5   # standardize; this technique is discussed later
    x = x.reshape((-1,))  # flatten
    x = torch.from_numpy(x)
    return x

train_set = MNIST('./data', train=True, transform=data_tf, download=True)  # load the data set with the transform declared above
test_set = MNIST('./data', train=False, transform=data_tf, download=True)
# Define the loss function
criterion = nn.CrossEntropyLoss()
train_data = DataLoader(train_set, batch_size=64, shuffle=True)
# Use Sequential to define a 3-layer neural network
net = nn.Sequential(
    nn.Linear(784, 200),
    nn.ReLU(),
    nn.Linear(200, 10),
)
optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)
# Start training
start = time.time()  # start timing
for e in range(5):
    train_loss = 0
    for im, label in train_data:
        im = Variable(im)
        label = Variable(label)
        # Forward pass
        out = net(im)
        loss = criterion(out, label)
        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # Record the loss
        train_loss += loss.item()
    print('epoch: {}, Train Loss: {:.6f}'.format(e, train_loss / len(train_data)))
end = time.time()  # stop timing
print('Time used: {:.5f} s'.format(end - start))
Adam can serve as the default optimization algorithm and often achieves good results; SGD + momentum is also well worth trying.
7. Adadelta
In PyTorch: torch.optim.Adadelta(net.parameters(), rho=0.9)
class torch.optim.Adadelta(params, lr=1.0, rho=0.9, eps=1e-06, weight_decay=0)
Implements the Adadelta algorithm.
It was proposed in ADADELTA: An Adaptive Learning Rate Method.
Parameters:
params (iterable) – iterable of parameters to optimize, or a dict defining parameter groups
rho (float, optional) – coefficient used for computing the running average of squared gradients (default: 0.9)
eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-6)
lr (float, optional) – coefficient that scales delta before it is applied to the parameters (default: 1.0)
weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)
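A minimal, self-contained usage sketch (the tiny linear model and random batch below are placeholders, not part of the original post; in the examples above you would pass the MNIST network's net.parameters() instead):

import torch
from torch import nn

model = nn.Linear(784, 10)  # placeholder model
optimizer = torch.optim.Adadelta(model.parameters(), rho=0.9)

x = torch.randn(64, 784)                  # a random placeholder batch
y = torch.randint(0, 10, (64,))
loss = nn.CrossEntropyLoss()(model(x), y)

optimizer.zero_grad()
loss.backward()
optimizer.step()                          # one Adadelta update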
Copyright notice
This article was written by [Zuo Xiaotian ^ o^]. Please include a link to the original when reposting. Thanks.
https://yzsam.com/2022/04/202204230543244175.html