PyTorch 20. Pytorch tips (continuously updated)
2022-04-23 07:29:00 【DCGJ666】
View the output details of each layer of the model
from torchsummary import summary
summary(your_model, input_size=(channels, H, W))
Set input_size according to the input size of your own network.
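For example, a minimal sketch using a torchvision ResNet-18 on the CPU (the model and input size here are arbitrary and only for illustration):
from torchvision import models
from torchsummary import summary

model = models.resnet18()                               # arbitrary example model
summary(model, input_size=(3, 224, 224), device='cpu')  # prints each layer's output shape and parameter count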
Gradient clipping
import torch.nn as nn
outputs = model(data)
loss = loss_fn(outputs, target)
optimizer.zero_grad()
loss.backward()
nn.utils.clip_grad_norm_(model.parameters(), max_norm=20, norm_type=2)
optimizer.step()
Parameters of nn.utils.clip_grad_norm_:
- parameters: an iterable of Tensors whose gradients will be normalized
- max_norm: the maximum allowed norm of the gradients
- norm_type: the type of norm to use; the default is 2
Expanding the dimensions of a single image
Using view()
import cv2
import torch
image = cv2.imread(img_path)
image = torch.tensor(image)
print(image.size())
img = image.view(1, *image.size())
print(img.size())
Using np.newaxis
import cv2
import numpy as np
image = cv2.imread(img_path)
print(image.shape)
img = image[np.newaxis, :, :, :]
print(img.shape)
Using unsqueeze()
import cv2
import torch
image = cv2.imread(img_path)
image = torch.tensor(image)
print(image.size())
img = image.unsqueeze(dim=0)
print(img.size())
One-hot encoding
In PyTorch, the cross-entropy loss function handles the labels internally, so you do not need to convert them to one-hot manually; when using MSE, however, you do need to convert the labels to one-hot yourself.
import torch.nn.functional as F
import torch
tensor = torch.arange(0, 5) % 3  # tensor([0, 1, 2, 0, 1])
one_hot = F.one_hot(tensor)
# Output :
# tensor([[1, 0, 0],
# [0, 1, 0],
# [0, 0, 1],
# [1, 0, 0],
# [0, 1, 0]])
F.one_hot infers the number of classes from the data and generates the corresponding one-hot encoding; we can also specify the number of classes explicitly:
tensor = torch.arange(0, 5) % 3 # tensor([0, 1, 2, 0, 1])
one_hot = F.one_hot(tensor, num_classes=5)
# Output :
# tensor([[1, 0, 0, 0, 0],
# [0, 1, 0, 0, 0],
# [0, 0, 1, 0, 0],
# [1, 0, 0, 0, 0],
# [0, 1, 0, 0, 0]])
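A small sketch of the point above (the logits and labels below are made up): CrossEntropyLoss takes integer class labels directly, while MSELoss needs one-hot float targets.
import torch
import torch.nn as nn
import torch.nn.functional as F

logits = torch.randn(4, 3)                 # arbitrary predictions: 4 samples, 3 classes
labels = torch.tensor([0, 2, 1, 0])        # integer class labels

ce = nn.CrossEntropyLoss()(logits, labels)                             # class indices are fine as-is
mse = nn.MSELoss()(logits, F.one_hot(labels, num_classes=3).float())   # MSE needs one-hot float targets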
Preventing GPU memory blow-up during model validation
There is no need to compute gradients when validating the model, so turn autograd off. This speeds things up and saves GPU memory; if you leave it on, the GPU memory may blow up.
with torch.no_grad():
    # code that runs the model for prediction
pass
As for why torch.cuda.empty_cache() is used: during PyTorch training, more and more unused temporary variables may accumulate, leading to out-of-memory errors.
PyTorch's caching allocator pre-allocates a certain amount of GPU memory; even if the tensors do not actually use all of it, that memory cannot be used by other applications. The allocation is first triggered by a CUDA memory access.
torch.cuda.empty_cache() releases the unused cached memory currently held by the caching allocator so that it can be used by other GPU applications. Note that this call does not release the memory occupied by live tensors.
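A small sketch illustrating the difference (requires a CUDA device; the tensor sizes are arbitrary):
import torch

x = torch.randn(1024, 1024, device='cuda')    # a live tensor: its memory is NOT released below
y = torch.randn(4096, 4096, device='cuda')
del y                                          # y's memory returns to the caching allocator, not to the driver
print(torch.cuda.memory_allocated(), torch.cuda.memory_reserved())

torch.cuda.empty_cache()                       # hand the unused cached blocks back to the driver
print(torch.cuda.memory_allocated(), torch.cuda.memory_reserved())  # reserved shrinks; allocated (x) stays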
Monitoring tools
sudo apt-get install htop      # install htop for monitoring CPU/memory
htop -d 1                      # -d sets the update delay in tenths of a second (here 0.1 s)
watch -n 0.1 nvidia-smi        # monitor GPU memory, refreshing every 0.1 s
Pytorch-Memory-Utils can be used to monitor GPU memory usage.
GPU memory usage
GPU memory usage = model parameters + intermediate variables produced during computation
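As a rough illustration of the first term, the parameter part can be estimated directly (a sketch; the model below is arbitrary and float32 storage, 4 bytes per element, is assumed):
import torch.nn as nn

model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 10))   # arbitrary example model
num_params = sum(p.numel() for p in model.parameters())
print(num_params, 'parameters,', num_params * 4 / 1024 ** 2, 'MB in float32')  # 4 bytes per float32 element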
Ways to reduce GPU memory usage:
- Use in-place operations where possible (a small sketch of this and the next point follows this list)
- Use del to free intermediate variables as soon as they are no longer needed
- Reduce batch_size, avoid fully connected layers, and use downsampling more
- Each iteration introduces some temporary variables, which can make training slower and slower (roughly linearly); calling torch.cuda.empty_cache() periodically solves this problem
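A minimal sketch of the first two points (the layer sizes and the batch are arbitrary):
import torch
import torch.nn as nn

layer = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1),
    nn.ReLU(inplace=True),    # in-place activation: overwrites the conv output instead of allocating a new tensor
)

x = torch.randn(8, 3, 224, 224)   # arbitrary batch
features = layer(x)
del x                             # drop the reference to the input as soon as it is no longer needed
loss = features.mean()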
Freezing the parameters of some layers
When loading a pre-trained model, we sometimes want to freeze the first few layers so that their parameters do not change during training.
First we need to know the name of each layer, which can be printed with the following code:
net = Network()  # get the custom network structure
for name, value in net.named_parameters():
    print('name: {0},\t grad: {1}'.format(name, value.requires_grad))
Suppose the first few layers look like this:
name: cnn.VGG_16.convolution1_1.weight, grad: True
name: cnn.VGG_16.convolution1_1.bias, grad: True
name: cnn.VGG_16.convolution1_2.weight, grad: True
name: cnn.VGG_16.convolution1_2.bias, grad: True
name: cnn.VGG_16.convolution2_1.weight, grad: True
name: cnn.VGG_16.convolution2_1.bias, grad: True
name: cnn.VGG_16.convolution2_2.weight, grad: True
name: cnn.VGG_16.convolution2_2.bias, grad: True
The trailing True means that the layer's parameters are trainable. Next, define a list of the parameters to freeze:
no_grad = [
    'cnn.VGG_16.convolution1_1.weight',
    'cnn.VGG_16.convolution1_1.bias',
    'cnn.VGG_16.convolution1_2.weight',
    'cnn.VGG_16.convolution1_2.bias'
]
Freeze them as follows:
net = Net.CTPN()  # get the network structure
for name, value in net.named_parameters():
    if name in no_grad:
        value.requires_grad = False
    else:
        value.requires_grad = True
Finally, when defining the optimizer, only pass in the parameters whose requires_grad is True, so that only those layers are updated:
optimizer = optim.Adam(filter(lambda p: p.requires_grad, net.parameters()), lr=0.01)
Explicitly specify model.train() and model.eval()
Our models often contain submodules whose behavior differs between training and testing, such as the drop probability in dropout and the γ and β (and running statistics) in Batch Normalization. We therefore need to explicitly mark which phase we are in; in PyTorch this is done with model.train() and model.eval(). (Note that BN's running_mean and similar buffers are not nn.Parameter, so they cannot be frozen with requires_grad; you have to call .eval() on the BN module.)
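A minimal sketch of the usual pattern (the model and inputs below are arbitrary placeholders):
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 32), nn.BatchNorm1d(32), nn.Dropout(0.5), nn.Linear(32, 2))

model.train()                          # training phase: dropout is active, BN updates its running statistics
out = model(torch.randn(16, 10))

model.eval()                           # evaluation phase: dropout is off, BN uses its stored running statistics
with torch.no_grad():
    out = model(torch.randn(16, 10))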
Using different learning rates for different layers
Take the following model as an example:
net = Network()  # get the custom network structure
for name, value in net.named_parameters():
    print('name: {}'.format(name))
# Output :
# name: cnn.VGG_16.convolution1_1.weight
# name: cnn.VGG_16.convolution1_1.bias
# name: cnn.VGG_16.convolution1_2.weight
# name: cnn.VGG_16.convolution1_2.bias
# name: cnn.VGG_16.convolution2_1.weight
# name: cnn.VGG_16.convolution2_1.bias
# name: cnn.VGG_16.convolution2_2.weight
# name: cnn.VGG_16.convolution2_2.bias
To set different learning rates for convolution1 and convolution2, first separate their parameters into two lists:
conv1_params = []
conv2_params = []
for name, params in net.named_parameters():
    if "convolution1" in name:
        conv1_params += [params]
    else:
        conv2_params += [params]

# Then do the following in the optimizer:
optimizer = optim.Adam(
    [
        {"params": conv1_params, 'lr': 0.01},
        {"params": conv2_params, 'lr': 0.001},
    ],
    weight_decay=1e-3
)
We split the model's parameters into two parts and put them in a list; each part corresponds to one dictionary above, and each dictionary sets its own learning rate. Options shared by both parts, such as weight_decay above, are placed outside the list as global settings.
You can also set a global learning rate outside the list: if a per-group learning rate is set inside a dictionary it is used, otherwise the global learning rate outside the list applies.
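Continuing the example above, a small sketch of that second approach:
# conv1_params keeps its own lr (0.01); conv2_params has no per-group lr, so it falls back to the global lr (0.001)
optimizer = optim.Adam(
    [
        {"params": conv1_params, "lr": 0.01},
        {"params": conv2_params},
    ],
    lr=0.001,
    weight_decay=1e-3,
)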
Using retain_graph
When back-propagating a loss in PyTorch, we call out.backward(); back-propagating the loss produces the gradients of the loss with respect to the learnable parameters. The signature of .backward() is:
backward(gradient=None, retain_graph=None, create_graph=False)
The parameter of interest here is retain_graph: if it is False or None, the computation graph that was built is released after back-propagation; if it is True, the graph is kept.
But the gradients have already been computed, so why keep the graph? Here is an example: in a generative adversarial network (GAN) you need to train one module, say the generator, and then the discriminator, so the whole network has two or more losses:
G_loss = ...
D_loss = ...

opt.zero_grad()                      # zero all gradients
D_loss.backward(retain_graph=True)   # keep the graph structure for later use
opt.step()                           # update parameters; only D is updated, since only D's gradients are non-zero

opt.zero_grad()                      # zero all gradients
G_loss.backward(retain_graph=False)  # do not keep the graph; it can now be released
                                     # the graph is rebuilt by the forward pass of the next iteration
opt.step()                           # update parameters; only G is updated, since only G's gradients are non-zero
In this way, multiple losses in the network can be trained step by step.
Copyright notice
This article was written by [DCGJ666]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/04/202204230611343622.html