2022-08-09 10:28:00 PaperWeekly


Author | serendipity

Affiliation | Tongji University

Research interests | person search, 3D human pose estimation



It may be by design, but this bug still lurks in a great many people's code. Even Karpathy, Tesla's director of AI, was bitten by it and tweeted about it.


In fact, that tweet was prompted by a recent bug: the DataLoader workers were not seeded correctly, so batches of data were accidentally repeated throughout training.

An issue [1] was filed in the PyTorch repo in February 2018, but it was not fixed until April 2021. The problem only appears in versions before PyTorch 1.9, yet its reach was wide, including the official PyTorch tutorial [2], OpenAI's code [3], and NVIDIA's code [4].


The hidden bug in PyTorch's DataLoader

The standard way to load, preprocess, and augment data in PyTorch is to subclass torch.utils.data.Dataset and override its __getitem__ method. For data augmentation such as random cropping and image flipping, __getitem__ usually uses NumPy to generate random numbers. The dataset is then passed to a DataLoader to create batches. Data preprocessing is often the bottleneck of network training, so data sometimes has to be loaded in parallel, which is done by setting the DataLoader's num_workers parameter.

Let us reproduce the bug with a short piece of code. The PyTorch version should be < 1.9; I used 1.6 in my experiments.

import numpy as np
from torch.utils.data import Dataset, DataLoader

class RandomDataset(Dataset):
    def __getitem__(self, index):
        return np.random.randint(0, 1000, 3)

    def __len__(self):
        return 8

dataset = RandomDataset()
dataloader = DataLoader(dataset, batch_size=2, num_workers=2)
for batch in dataloader:
    print(batch)

tensor([[116, 760, 679],   # batch 1, returned by worker 0
        [754, 897, 764]])
tensor([[116, 760, 679],   # batch 2, returned by worker 1
        [754, 897, 764]])

tensor([[866, 919, 441],   # batch 3, returned by worker 0
        [ 20, 727, 680]])
tensor([[866, 919, 441],   # batch 4, returned by worker 1
        [ 20, 727, 680]])

To our surprise, every worker process returns exactly the same random numbers!
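To check your own pipeline for the same symptom, one option is to look for batches that are element-wise identical. The helper below is a hypothetical sketch (find_duplicate_batches is not a PyTorch or NumPy function); it assumes each batch can be converted with np.asarray, which holds for nested lists, NumPy arrays, and CPU tensors.

```python
from collections import defaultdict

import numpy as np


def find_duplicate_batches(batches):
    """Return groups of positions whose batches are element-wise identical."""
    seen = defaultdict(list)
    for position, batch in enumerate(batches):
        array = np.asarray(batch)
        # Shape, dtype and raw bytes together identify the batch contents.
        key = (array.shape, array.dtype.str, array.tobytes())
        seen[key].append(position)
    return [positions for positions in seen.values() if len(positions) > 1]


# The four batches printed above: 0 and 1 match, 2 and 3 match.
batches = [
    [[116, 760, 679], [754, 897, 764]],
    [[116, 760, 679], [754, 897, 764]],
    [[866, 919, 441], [20, 727, 680]],
    [[866, 919, 441], [20, 727, 680]],
]
duplicates = find_duplicate_batches(batches)
```

For truly random augmentation this list should be empty; any non-empty result is worth investigating.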



PyTorch uses the fork [5] method to create multiple child processes that load data in parallel. This means each child process inherits all of the parent's resources, including the state of NumPy's random number generator.
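The inheritance can be observed without PyTorch at all. The sketch below is a minimal illustration using only multiprocessing and NumPy; the function names are made up for this example, and it assumes a platform where the fork start method is available (so not Windows).

```python
import multiprocessing as mp

import numpy as np


def _child(result_queue):
    # Draw from the global NumPy RNG state copied from the parent at fork time.
    result_queue.put(np.random.randint(0, 1000, 3).tolist())


def draws_from_forked_children(num_children=2):
    ctx = mp.get_context("fork")  # assumption: fork is supported on this OS
    result_queue = ctx.Queue()
    children = [ctx.Process(target=_child, args=(result_queue,))
                for _ in range(num_children)]
    for child in children:
        child.start()
    draws = [result_queue.get() for _ in children]
    for child in children:
        child.join()
    return draws
```

Because the parent never draws a number itself, every child starts from the same copied state and returns the same three values.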



Note: the spawn method builds a child process from scratch, so it does not inherit the parent's random state. torch.multiprocessing uses fork by default on Unix systems such as Linux, and spawn by default on macOS and Windows, so the problem only shows up on Unix. Of course, you can also force fork on macOS; Windows does not provide fork at all.

The DataLoader constructor has an optional parameter, worker_init_fn. Each child process calls this function before it starts loading data. We can set NumPy's seed inside worker_init_fn, for example:

def worker_init_fn(worker_id):
    # np.random.get_state(): the current NumPy random state, inherited from the main process
    # worker_id is the id of the child process; with num_workers=2 the two ids are 0 and 1
    # adding worker_id gives every child process a different random seed
    np.random.seed(np.random.get_state()[1][0] + worker_id)

dataset = RandomDataset()
dataloader = DataLoader(dataset, batch_size=2, num_workers=2, worker_init_fn=worker_init_fn)

for batch in dataloader:
    print(batch)

As expected, the values in every batch are now different.

tensor([[282,   4, 785],
        [ 35, 581, 521]])
tensor([[684,  17,  95],
        [774, 794, 420]])

tensor([[180, 413,  50],
        [894, 318, 729]])
tensor([[530, 594, 116],
        [636, 468, 264]])

But wait, what happens if we iterate over several epochs?

for epoch in range(3):
    print(f"epoch: {epoch}")
    for batch in dataloader:
        print(batch)

We find that although things are back to normal within a single epoch, the repetition returns across epochs.

epoch: 0
tensor([[282,   4, 785],
        [ 35, 581, 521]])
tensor([[684,  17,  95],
        [774, 794, 420]])
tensor([[939, 988,  37],
        [983, 933, 821]])
tensor([[832,  50, 453],
        [ 37, 322, 981]])
epoch: 1
tensor([[282,   4, 785],
        [ 35, 581, 521]])
tensor([[684,  17,  95],
        [774, 794, 420]])
tensor([[939, 988,  37],
        [983, 933, 821]])
tensor([[832,  50, 453],
        [ 37, 322, 981]])
epoch: 2
tensor([[282,   4, 785],
        [ 35, 581, 521]])
tensor([[684,  17,  95],
        [774, 794, 420]])
tensor([[939, 988,  37],
        [983, 933, 821]])
tensor([[832,  50, 453],
        [ 37, 322, 981]])

This is because, by default, every child process is killed at the end of an epoch and all of its resources are lost. When a new epoch starts, the random state in the main process has not changed, and it is used to initialize each child process again, so the children get exactly the same random seeds as in the previous epoch.
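The effect can be imitated in a single process. The following is a simplified simulation, not the DataLoader's actual code: parent_seed_word stands in for np.random.get_state()[1][0] in the main process, and all function names are illustrative.

```python
import numpy as np


def worker_draws(parent_seed_word, worker_id):
    # Same rule as the worker_init_fn above: seed = parent value + worker_id.
    rng = np.random.RandomState(parent_seed_word + worker_id)
    return rng.randint(0, 1000, 3).tolist()


def simulate_epochs(parent_seed_word=1234, num_workers=2, num_epochs=3):
    # The parent never draws a NumPy number, so parent_seed_word stays the
    # same for every epoch and the workers are re-seeded identically.
    return [[worker_draws(parent_seed_word, w) for w in range(num_workers)]
            for _ in range(num_epochs)]


epochs = simulate_epochs()
```

Each epoch in epochs holds different values per worker, but the epochs themselves are identical to one another.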

So we need a seed that changes with the epoch number, for example: np.random.get_state()[1][0] + epoch + worker_id.

Such a seed is hard to implement in practice, because worker_init_fn does not know which epoch is currently running. But torch.initial_seed() meets our need.

import torch

def seed_worker(worker_id):
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)

In fact, this is the approach officially recommended by PyTorch [6].

Readers who do not want to dig deeper can stop here: whenever you create a DataLoader, set worker_init_fn to the seed_worker function above. If you want to understand the principle behind it, read the next section, which walks through the DataLoader source code.
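Put together, a complete usage might look like the sketch below. It follows the pattern in the PyTorch reproducibility notes [6]; seeding Python's random module and passing a seeded torch.Generator come from those notes rather than from the text above, and collect_epochs is a helper invented for this example.

```python
import random

import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset


class RandomDataset(Dataset):
    def __getitem__(self, index):
        return np.random.randint(0, 1000, 3)

    def __len__(self):
        return 8


def seed_worker(worker_id):
    # torch.initial_seed() is base_seed + worker_id inside a worker process.
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)


def collect_epochs(num_epochs=2):
    generator = torch.Generator()
    generator.manual_seed(0)  # makes the sequence of base seeds reproducible
    loader = DataLoader(RandomDataset(), batch_size=2, num_workers=2,
                        worker_init_fn=seed_worker, generator=generator)
    return [[batch.tolist() for batch in loader] for _ in range(num_epochs)]
```

With this setup the batches should differ both within an epoch and from one epoch to the next, while the run as a whole stays repeatable because the generator is seeded.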



We first need to understand how a multi-process DataLoader works.

1. Instantiate DataLoader(dataset, num_workers=2) in the main process.

2. Create two multiprocessing.Queue [7] objects that tell the two child processes which data each is responsible for fetching. Suppose Queue1 = [0, 2] and Queue2 = [1, 3]: the first child process fetches items 0 and 2, and the second fetches items 1 and 3. When the user asks for the item at index, the main process first checks which child is idle; if the second child is idle, it puts index into Queue2. Then create a result_queue [8] to hold the data the children read, in the format (index, dataset[index]).

3. At the start of each epoch, two main things happen. a) A random seed [9], base_seed, is generated. b) Two child processes are created with fork [10]. In each child process, the seeds of torch and random are set to base_seed + worker_id. Each child then keeps polling its own queue; when an index arrives, it fetches dataset[index] from the dataset and saves the result to result_queue.
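The three steps above can be imitated with the standard library alone. This is a toy model under several assumptions, not the DataLoader's real implementation: indices are handed out round-robin instead of to whichever worker is idle, a None sentinel stops the workers, only Python's random module is seeded, and the names worker_loop and run_epoch are invented here. It also assumes the fork start method is available.

```python
import multiprocessing as mp
import random


def worker_loop(dataset, index_queue, result_queue, base_seed, worker_id):
    # Per-worker seeding as described in step 3 (stdlib random only here).
    random.seed(base_seed + worker_id)
    while True:
        index = index_queue.get()
        if index is None:  # sentinel: no more work for this worker
            break
        result_queue.put((index, dataset[index]))


def run_epoch(dataset, num_workers=2):
    ctx = mp.get_context("fork")  # assumption: fork is supported on this OS
    base_seed = random.getrandbits(63)  # a fresh base seed for every epoch
    index_queues = [ctx.Queue() for _ in range(num_workers)]
    result_queue = ctx.Queue()
    workers = [
        ctx.Process(target=worker_loop,
                    args=(dataset, index_queues[w], result_queue, base_seed, w))
        for w in range(num_workers)
    ]
    for process in workers:
        process.start()
    # Round-robin assignment: worker 0 gets indices 0, 2, ...; worker 1 gets 1, 3, ...
    for index in range(len(dataset)):
        index_queues[index % num_workers].put(index)
    results = dict(result_queue.get() for _ in range(len(dataset)))
    for queue in index_queues:
        queue.put(None)
    for process in workers:
        process.join()
    return [results[i] for i in range(len(dataset))]
```

Calling run_epoch([10, 20, 30, 40]) returns the items in index order even though two processes fetched them, because each result carries its index.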

Running torch.initial_seed() in a child process returns torch's random seed, namely base_seed + worker_id. Because the main process generates a new base_seed at the start of every epoch, base_seed is a random number that changes from epoch to epoch. In addition, torch.initial_seed() returns a long int, while NumPy only accepts a uint seed in [0, 2**32 - 1], so we take it modulo 2**32.
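The need for the modulo is easy to verify with NumPy alone. In this small sketch, to_numpy_seed is a hypothetical helper and big_seed is an arbitrary stand-in for a value returned by torch.initial_seed().

```python
import numpy as np


def to_numpy_seed(seed):
    # Fold an arbitrary non-negative integer into NumPy's accepted range.
    return seed % 2**32


big_seed = 2**40 + 7  # stand-in for a 64-bit seed

try:
    np.random.seed(big_seed)
    rejected = False
except ValueError:
    # The legacy seeding API refuses values outside [0, 2**32 - 1].
    rejected = True

np.random.seed(to_numpy_seed(big_seed))  # accepted after the modulo
```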

If we generate random numbers with torch or random instead of numpy, we do not need to worry about this problem, because PyTorch has already seeded torch and random with base_seed + worker_id.
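As a quick check of this claim, the earlier dataset can be rewritten to draw from torch's generator and run with no worker_init_fn at all. TorchRandomDataset and collect_torch_epochs are names made up for this sketch.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class TorchRandomDataset(Dataset):
    def __getitem__(self, index):
        # torch's RNG is already seeded with base_seed + worker_id in each worker.
        return torch.randint(0, 1000, (3,))

    def __len__(self):
        return 8


def collect_torch_epochs(num_epochs=2):
    # No worker_init_fn is passed on purpose.
    loader = DataLoader(TorchRandomDataset(), batch_size=2, num_workers=2)
    return [[batch.tolist() for batch in loader] for _ in range(num_epochs)]
```

The batches collected this way should not repeat across workers or across epochs.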

To sum up, the bug requires both of the following conditions:

  • PyTorch version < 1.9

  • NumPy random numbers are used in the Dataset's __getitem__ method
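The first condition can be checked programmatically. The helper below is a hypothetical sketch, not a PyTorch API; it assumes the version string has the usual "major.minor.patch" shape, optionally followed by a local suffix such as "+cu101".

```python
def needs_numpy_worker_seeding(torch_version: str) -> bool:
    """Return True for PyTorch versions before 1.9."""
    core = torch_version.split("+")[0]  # drop a local suffix like "+cu101"
    major, minor = (int(part) for part in core.split(".")[:2])
    return (major, minor) < (1, 9)
```

In a real project the argument would typically be torch.__version__. Setting worker_init_fn explicitly is harmless on newer versions as well, so the check is mainly useful for auditing old environments.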


Some alternative solutions:

  • pytorch-image-models [11]

    def seed_worker(worker_id):
        worker_info = torch.utils.data.get_worker_info()
        # worker_info.seed == torch.initial_seed()
        np.random.seed(worker_info.seed % 2**32)
  • @晚星 [12]

    def seed_worker(worker_id):
        # draw a fresh seed from OS entropy, independent of the inherited state
        seed = np.random.default_rng().integers(low=0, high=2**32, size=1)
        np.random.seed(seed)
  • @ggggnui [13]

    class WorkerInit:
        def __init__(self, global_step):
            self.global_step = global_step
        def worker_init_fn(self, worker_id):
            np.random.seed(self.global_step + worker_id)
        def update_global_step(self, global_step):
            self.global_step = global_step
    worker_init = WorkerInit(0)
    dataloader = DataLoader(dataset, batch_size=2, num_workers=2,
                            worker_init_fn=worker_init.worker_init_fn)
    for epoch in range(3):
        for batch in dataloader:
            print(batch)
        # Note: len(dataloader) must be >= num_workers, otherwise seeds will repeat
        worker_init.update_global_step((epoch + 1) * len(dataloader))


In-text links & references


[1] https://github.com/pytorch/pytorch/issues/5059

[2] https://github.com/pytorch/tutorials/blob/af754cbdaf5f6b0d66a7c5cd07ab97b349f3dd9b/beginner_source/data_loading_tutorial.py#L270-L271

[3] https://github.com/openai/ebm_code_release/blob/18898a24ee24dcd75c41ac3e228b9db79e53237c/data.py#L465-L470

[4] https://github.com/NVlabs/Deep_Object_Pose/blob/11bbc3b8545e099b35901a13f549ddddacd7dd1f/scripts/train.py#L518-L521

[5] https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods

[6] https://pytorch.org/docs/stable/notes/randomness.html#dataloader

[7] https://github.com/pytorch/pytorch/blob/bc3d892c20ee8cf6c765742481526f307e20312a/torch/utils/data/dataloader.py#L897

[8] https://github.com/pytorch/pytorch/blob/bc3d892c20ee8cf6c765742481526f307e20312a/torch/utils/data/dataloader.py#L888

[9] https://github.com/pytorch/pytorch/blob/bc3d892c20ee8cf6c765742481526f307e20312a/torch/utils/data/dataloader.py#L495

[10] https://github.com/pytorch/pytorch/blob/bc3d892c20ee8cf6c765742481526f307e20312a/torch/utils/data/dataloader.py#L901

[11] https://github.com/rwightman/pytorch-image-models/blob/e4360e6125bb0bb4279785810c8eb33b40af3ebd/timm/data/loader.py#L149

[12] https://www.zhihu.com/people/wan-xing-13

[13] https://www.zhihu.com/people/ggggnui

[14] https://tanelp.github.io/posts/a-bug-that-plagues-thousands-of-open-source-ml-projects/

[15] https://github.com/pytorch/pytorch/pull/56488






