Distributed DataLoader For Pytorch Based On Ray

Overview

Dpex——用户无感知分布式数据预处理组件

一、前言

随着GPU与CPU的算力差距越来越大以及模型训练时的预处理Pipeline变得越来越复杂,CPU部分的数据预处理已经逐渐成为了模型训练的瓶颈所在,这导致单机的GPU配置的提升并不能带来期望的线性加速。预处理性能瓶颈的本质在于每个GPU能够使用的CPU算力受限, 为了解决这个问题NVIDIA提出了scale up的方案——GPU数据预处理库DALI,Tensorflow给出了scale out的方案——分布式数据预处理组件DataService,而在这里我们给出Pytorch生态中的scale out方案——分布式数据预处理组件Dpex。

二、架构介绍(介绍Pytorch DataLoader本身的架构以及DistDataLoader的架构)

Dpex的采用了和Pytorch的DataLoader同样的架构设计并借助Ray将数据预处理任务调度至其他机器节点进行计算。

三、使用示例

不仅在设计上,Dpex的实现上也完全兼容Pytorch的DataLoader。当并行数据预处理时,若设置distribute_modeTrueDpexDataLoader使用 _RayDataLoaderIter实现分布式数据预处理,当设置为FalseDpexDataLoader退回到使用Pytorch本身的_MultiProcessingDataLoaderIter 实现并行数据预处理与加载。在Pytorch训练中使用Dpex非常的简单,只需要将使用到Pytorch的DataLoader的地方替换为Dpex中的DpexDataLoader即可,当你的训练机器本身为Ray集群中的一个节点时,设置 distribute_mode=True可以启用分布式数据预处理。在下面我们给出单卡训练,使用DataParallel进行多卡训练以及使用DDP进行多卡训练时使用Dpex的示例,具体可参考测试文件。
class DpexDataLoader(torch.utils.data.DataLoader): def init(self, dataset: Dataset[T_co], distribute_mode: Optional[bool] = False, head_address="auto", batch_size: Optional[int] = 1, shuffle: bool = False, sampler: Optional[Sampler[int]] = None, batch_sampler: Optional[Sampler[Sequence[int]]] = None, num_workers: int = 0, collate_fn: Optional[_collate_fn_t] = None, pin_memory: bool = False, drop_last: bool = False, timeout: float = 0, worker_init_fn: Optional[_worker_init_fn_t] = None, multiprocessing_context=None, generator=None, *, prefetch_factor: int = 2):

3.1 单卡训练

如下我们给出单卡训练时使用DpexDataLoader的示例代码,具体代码细节参见测试代码文件.

from torchvision import datasets
from torchvision.transforms import ToTensor
from Dpex import dataloader

training_data = datasets.FashionMNIST(
    root="data",
    train=True,
    download=True,
    transform=ToTensor()
)
# use DpexDataLoader
train_loader = dataloader.DpexDataLoader(training_data, distribute_mode=True, num_workers=10, batch_size=100, shuffle=True)

for epoch in range(3):
    for index, (image, label) in enumerate(train_loader):
       # your train process
       pass

3.2 基于DataParallel的多卡训练

如下我们给出使用DataParallel并行训练时使用DpexDataLoader的示例代码,具体代码细节参见测试代码文件.

import torch
import torch.nn as nn
from torch.autograd import Variable
from torch.utils.data import Dataset
from Dpex import dataloader

class MyOwnDataset(Dataset):
     pass
     
# use DpexDataLoader
data_loader = dataloader.DpexDataLoader(dataset=RandomDataset(input_size, data_size),
                                        distribute_mode=True, batch_size=batch_size, shuffle=True, num_workers=10)

class Model(nn.Module):
    pass
    
model = Model()

if torch.cuda.is_available():
    model.cuda()

if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)

for data in data_loader:
   # train your own model
   pass

3.3 基于DDP的多卡训练

如下我们给出使用DDP并行训练时使用DpexDataLoader的示例代码,具体代码细节参见测试代码文件.

import torch
import torch.nn as nn
from torch.utils.data import Dataset
from Dpex.dataloader import DpexDataLoader
from torch.utils.data.distributed import DistributedSampler

# start command: CUDA_VISIBLE_DEVICES=1,6,7 python -m torch.distributed.launch --nproc_per_node=2 pytorch_ddp.py
# 1) 初始化
torch.distributed.init_process_group(backend="nccl")

input_size = 5
output_size = 2
batch_size = 1
data_size = 90000

# 2) 配置每个进程的gpu
local_rank = torch.distributed.get_rank()
torch.cuda.set_device(local_rank)
device = torch.device("cuda", local_rank)

class RandomDataset(Dataset):
    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len

dataset = RandomDataset(input_size, data_size)
# 3)使用DistributedSampler
rand_loader = DpexDataLoader(dataset=dataset, distribute_mode=True, batch_size=batch_size, sampler=DistributedSampler(dataset), num_workers=10)

class Model(nn.Module):
    def __init__(self, input_size, output_size):
        super(Model, self).__init__()
        self.fc = nn.Linear(input_size, output_size)

    def forward(self, input):
        output = self.fc(input)
        print("  In Model: input size", input.size(),
              "output size", output.size())
        return output

model = Model(input_size, output_size)

# 4) 封装之前要把模型移到对应的gpu
model.to(device)

if torch.cuda.device_count() > 1:
    print("Let's use", torch.cuda.device_count(), "GPUs!")
    # 5) 封装
    model = torch.nn.parallel.DistributedDataParallel(model,
                                                      device_ids=[local_rank],
                                                      output_device=local_rank)

for data in rand_loader:
    if torch.cuda.is_available():
        input_var = data
    else:
        input_var = data

    output = model(input_var)
    print("Outside: input size", input_var.size(), "output_size", output.size())

四、Benchmark

在接下来的Benchamark中我们核心展示两个部分的内容:

  • DpexDataLoader对于模型训练精度的影响
  • DpexDataLoader对于模型训练速度的影响

Dpex只是将单机数据预处理水平扩展到了多机以借助更多的算力来加速数据预处理而不改变数据本身的加载和与处理方式,所以本身对模型的精度不会有负面影响。对于数据预处理较重的情况

4.1 模型精度Benchmark

我们在FashionMNIST数据集上进行模型训练精度的Benchmark实验,具体代码细节见测试文件

Accuracy(%) Loss GPU Settings DpexDataLoader(distributed_mode=?) Epoch Learning rate Batch size
90.65 0.137 Single GPU True 40 0.001 100
91.09 0.112 Single GPU False 40 0.001 100
90.67 0.016 DataParallel True 40 0.001 100
90.32 0.008 DataParallel False 40 0.001 100
88.98 0.034 DDP True 40 0.001 100
89.84 0.030 DDP False 40 0.001 100

4.2 训练速度Benchmark

五、环境依赖:

Dpex借助Ray完成任务的跨机调度,所以若希望使用分布式数据预处理首先需要将你的训练机器构建成Ray的集群。Ray的集群构建细节具体参考Ray的相关文档

Owner
Dalong
I am now a master student in Tsinghua University and i major in software engeneering. Currently i am foucing on high performance graph learning system.
Dalong
yolov5 deepsort 行人 车辆 跟踪 检测 计数

yolov5 deepsort 行人 车辆 跟踪 检测 计数 实现了 出/入 分别计数。 默认是 南/北 方向检测,若要检测不同位置和方向,可在 main.py 文件第13行和21行,修改2个polygon的点。 默认检测类别:行人、自行车、小汽车、摩托车、公交车、卡车。 检测类别可在 detect

554 Dec 30, 2022
Official Tensorflow implementation of "M-LSD: Towards Light-weight and Real-time Line Segment Detection"

M-LSD: Towards Light-weight and Real-time Line Segment Detection Official Tensorflow implementation of "M-LSD: Towards Light-weight and Real-time Line

NAVER/LINE Vision 357 Jan 04, 2023
PoolFormer: MetaFormer is Actually What You Need for Vision

PoolFormer: MetaFormer is Actually What You Need for Vision (arXiv) This is a PyTorch implementation of PoolFormer proposed by our paper "MetaFormer i

Sea AI Lab 1k Dec 30, 2022
Mall-Customers-Segmentation - Customer Segmentation Using K-Means Clustering

Overview Customer Segmentation is one the most important applications of unsupervised learning. Using clustering techniques, companies can identify th

NelakurthiSudheer 2 Jan 03, 2022
Official Pytorch and JAX implementation of "Efficient-VDVAE: Less is more"

The Official Pytorch and JAX implementation of "Efficient-VDVAE: Less is more" Arxiv preprint Louay Hazami   ·   Rayhane Mama   ·   Ragavan Thurairatn

Rayhane Mama 144 Dec 23, 2022
Unofficial Implementation of MLP-Mixer, gMLP, resMLP, Vision Permutator, S2MLPv2, RaftMLP, ConvMLP, ConvMixer in Jittor and PyTorch.

Unofficial Implementation of MLP-Mixer, gMLP, resMLP, Vision Permutator, S2MLPv2, RaftMLP, ConvMLP, ConvMixer in Jittor and PyTorch! Now, Rearrange and Reduce in einops.layers.jittor are support!!

130 Jan 08, 2023
implementation of the paper "MarginGAN: Adversarial Training in Semi-Supervised Learning"

MarginGAN This repository is the implementation of the paper "MarginGAN: Adversarial Training in Semi-Supervised Learning". 1."preliminary" is the imp

Van 7 Dec 23, 2022
This repository is for the preprint "A generative nonparametric Bayesian model for whole genomes"

BEAR Overview This repository contains code associated with the preprint A generative nonparametric Bayesian model for whole genomes (2021), which pro

Debora Marks Lab 10 Sep 18, 2022
[AAAI-2022] Official implementations of MCL: Mutual Contrastive Learning for Visual Representation Learning

Mutual Contrastive Learning for Visual Representation Learning This project provides source code for our Mutual Contrastive Learning for Visual Repres

winycg 48 Jan 02, 2023
Official PyTorch implementation of "The Center of Attention: Center-Keypoint Grouping via Attention for Multi-Person Pose Estimation" (ICCV 21).

CenterGroup This the official implementation of our ICCV 2021 paper The Center of Attention: Center-Keypoint Grouping via Attention for Multi-Person P

Dynamic Vision and Learning Group 43 Dec 25, 2022
Combine Tacotron2 and Hifi GAN to generate speech from text

EndToEndTextToSpeech Combine Tacotron2 and Hifi GAN to generate speech from text Download weights Hifi GAN - hifi_gan/checkpoint/ : pretrain 2.5M ste

Phạm Quốc Huy 1 Dec 18, 2021
STEAL - Learning Semantic Boundaries from Noisy Annotations (CVPR 2019)

STEAL This is the official inference code for: Devil Is in the Edges: Learning Semantic Boundaries from Noisy Annotations David Acuna, Amlan Kar, Sanj

469 Dec 26, 2022
Official repository for the paper, MidiBERT-Piano: Large-scale Pre-training for Symbolic Music Understanding.

MidiBERT-Piano Authors: Yi-Hui (Sophia) Chou, I-Chun (Bronwin) Chen Introduction This is the official repository for the paper, MidiBERT-Piano: Large-

137 Dec 15, 2022
Решения, подсказки, тесты и утилиты для тренировки по алгоритмам от Яндекса.

Решения и подсказки к тренировке по алгоритмам от Яндекса Что есть внутри Решения с подсказками и комментариями; рекомендую сначала смотреть md файл п

Yankovsky Andrey 50 Dec 26, 2022
Simple image captioning model - CLIP prefix captioning.

Simple image captioning model - CLIP prefix captioning.

688 Jan 04, 2023
PyTorch implementation of CVPR'18 - Perturbative Neural Networks

This is an attempt to reproduce results in Perturbative Neural Networks paper. See original repo for details.

Michael Klachko 57 May 14, 2021
Codebase for INVASE: Instance-wise Variable Selection - 2019 ICLR

Codebase for "INVASE: Instance-wise Variable Selection" Authors: Jinsung Yoon, James Jordon, Mihaela van der Schaar Paper: Jinsung Yoon, James Jordon,

Jinsung Yoon 50 Nov 11, 2022
StyleTransfer - Open source style transfer project, based on VGG19

StyleTransfer - Open source style transfer project, based on VGG19

Patrick martins de lima 9 Dec 13, 2021
A collection of scripts I developed for personal and working projects.

A collection of scripts I developed for personal and working projects Table of contents Introduction Repository diagram structure List of scripts pyth

Gianluca Bianco 109 Dec 26, 2022
SLIDE : In Defense of Smart Algorithms over Hardware Acceleration for Large-Scale Deep Learning Systems

The SLIDE package contains the source code for reproducing the main experiments in this paper. Dataset The Datasets can be downloaded in Amazon-

Intel Labs 72 Dec 16, 2022