A high-performance distributed deep learning system targeting large-scale and automated distributed training.

Overview

HETU

Documentation | Examples

Hetu is a high-performance distributed deep learning system targeting trillions of parameters DL model training, developed by DAIR Lab at Peking University. It takes account of both high availability in industry and innovation in academia, which has a number of advanced characteristics:

  • Applicability. DL model definition with standard dataflow graph; many basic CPU and GPU operators; efficient implementation of more than plenty of DL models and at least popular 10 ML algorithms.

  • Efficiency. Achieve at least 30% speedup compared to TensorFlow on DNN, CNN, RNN benchmarks.

  • Flexibility. Supporting various parallel training protocols and distributed communication architectures, such as Data/Model/Pipeline parallel; Parameter server & AllReduce.

  • Scalability. Deployment on more than 100 computation nodes; Training giant models with trillions of model parameters, e.g., Criteo Kaggle, Open Graph Benchmark

  • Agility. Automatically ML pipeline: feature engineering, model selection, hyperparameter search.

We welcome everyone interested in machine learning or graph computing to contribute codes, create issues or pull requests. Please refer to Contribution Guide for more details.

Installation

  1. Clone the repository.

  2. Prepare the environment. We use Anaconda to manage packages. The following command create the conda environment to be used:conda env create -f environment.yml. Please prepare Cuda toolkit and CuDNN in advance.

  3. We use CMake to compile Hetu. Please copy the example configuration for compilation by cp cmake/config.example.cmake cmake/config.cmake. Users can modify the configuration file to enable/disable the compilation of each module. For advanced users (who not using the provided conda environment), the prerequisites for different modules in Hetu is listed in appendix.

# modify paths and configurations in cmake/config.cmake

# generate Makefile
mkdir build && cd build && cmake ..

# compile
# make all
make -j 8
# make hetu, version is specified in cmake/config.cmake
make hetu -j 8
# make allreduce module
make allreduce -j 8
# make ps module
make ps -j 8
# make geometric module
make geometric -j 8
# make hetu-cache module
make hetu_cache -j 8
  1. Prepare environment for running. Edit the hetu.exp file and set the environment path for python and the path for executable mpirun if necessary (for advanced users not using the provided conda environment). Then execute the command source hetu.exp .

Usage

Train logistic regression on gpu:

bash examples/cnn/scripts/hetu_1gpu.sh logreg MNIST

Train a 3-layer mlp on gpu:

bash examples/cnn/scripts/hetu_1gpu.sh mlp CIFAR10

Train a 3-layer cnn with gpu:

bash examples/cnn/scripts/hetu_1gpu.sh cnn_3_layers MNIST

Train a 3-layer mlp with allreduce on 8 gpus (use mpirun):

bash examples/cnn/scripts/hetu_8gpu.sh mlp CIFAR10

Train a 3-layer mlp with PS on 1 server and 2 workers:

# in the script we launch the scheduler and server, and two workers
bash examples/cnn/scripts/hetu_2gpu_ps.sh mlp CIFAR10

More Examples

Please refer to examples directory, which contains CNN, NLP, CTR, GNN training scripts. For distributed training, please refer to CTR and GNN tasks.

Community

License

The entire codebase is under license

Papers

  1. Xupeng Miao, Lingxiao Ma, Zhi Yang, Yingxia Shao, Bin Cui, Lele Yu, Jiawei Jiang. CuWide: Towards Efficient Flow-based Training for Sparse Wide Models on GPUs. TKDE 2021, ICDE 2021
  2. Xupeng Miao, Xiaonan Nie, Yingxia Shao, Zhi Yang, Jiawei Jiang, Lingxiao Ma, Bin Cui. Heterogeneity-Aware Distributed Machine Learning Training via Partial Reduce. SIGMOD 2021
  3. Xupeng Miao, Hailin Zhang, Yining Shi, Xiaonan Nie, Zhi Yang, Yangyu Tao, Bin Cui. HET: Scaling out Huge Embedding Model Training via Cache-enabled Distributed Framework. VLDB 2022, ChinaSys 2021 Winter.
  4. coming soon

Cite

If you use Hetu in a scientific publication, we would appreciate citations to the following paper:

 @inproceedings{vldb/het22,
   title = {HET: Scaling out Huge Embedding Model Training via Cache-enabled Distributed Framework},
   author = {Xupeng Miao and
         Hailin Zhang and
         Yining Shi and
             Xiaonan Nie and
             Zhi Yang and
             Yangyu Tao and
             Bin Cui},
   journal = {Proc. {VLDB} Endow.},
   year = {2022},
   url  = {https://doi.org/10.14778/3489496.3489511},
   doi  = {10.14778/3489496.3489511},
 }

Acknowledgements

We learned and borrowed insights from a few open source projects including TinyFlow, autodist, tf.distribute and Angel.

Appendix

The prerequisites for different modules in Hetu is listed as follows:

"*" means you should prepare by yourself, while others support auto-download

Hetu: OpenMP(*), CMake(*)
Hetu (version mkl): MKL 1.6.1
Hetu (version gpu): CUDA 10.1(*), CUDNN 7.5(*)
Hetu (version all): both

Hetu-AllReduce: MPI 3.1, NCCL 2.8(*), this module needs GPU version

Hetu-PS: Protobuf(*), ZeroMQ 4.3.2

Hetu-Geometric: Pybind11(*), Metis(*)

Hetu-Cache: Pybind11(*), this module needs PS module

##################################################################
Tips for preparing the prerequisites

Preparing CUDA, CUDNN, NCCL(NCCl is already in conda environment):
1. download from https://developer.nvidia.com
2. install
3. modify paths in cmake/config.cmake if necessary

Preparing OpenMP:
Your just need to ensure your compiler support openmp.

Preparing CMake, Protobuf, Pybind11, Metis:
Install by anaconda: 
conda install cmake=3.18 libprotobuf pybind11=2.6.0 metis

Preparing OpenMPI (not necessary):
install by anaconda: `conda install -c conda-forge openmpi=4.0.3`
or
1. download from https://download.open-mpi.org/release/open-mpi/v4.0/openmpi-4.0.3.tar.gz
2. build openmpi by `./configure /path/to/build && make -j8 && make install`
3. modify MPI_HOME to /path/to/build in cmake/config.cmake

Preparing MKL (not necessary):
install by anaconda: `conda install -c conda-forge onednn`
or
1. download from https://github.com/intel/mkl-dnn/archive/v1.6.1.tar.gz
2. build mkl by `mkdir /path/to/build && cd /path/to/build && cmake /path/to/root && make -j8` 
3. modify MKL_ROOT to /path/to/root and MKL_BUILD to /path/to/build in cmake/config.cmake 

Preparing ZeroMQ (not necessary):
install by anaconda: `conda install -c anaconda zeromq=4.3.2`
or
1. download from https://github.com/zeromq/libzmq/releases/download/v4.3.2/zeromq-4.3.2.zip
2. build zeromq by 'mkdir /path/to/build && cd /path/to/build && cmake /path/to/root && make -j8`
3. modify ZMQ_ROOT to /path/to/build in cmake/config.cmake
Owner
DAIR Lab
Data and Intelligence Research (DAIR) Lab @ Peking University
DAIR Lab
Keras implementation of Real-Time Semantic Segmentation on High-Resolution Images

Keras-ICNet [paper] Keras implementation of Real-Time Semantic Segmentation on High-Resolution Images. Training in progress! Requisites Python 3.6.3 K

Aitor Ruano 87 Dec 16, 2022
CS50x-AI - Artificial Intelligence with Python from Harvard University

CS50x-AI Artificial Intelligence with Python from Harvard University 📖 Table of

Hosein Damavandi 6 Aug 22, 2022
Elucidating Robust Learning with Uncertainty-Aware Corruption Pattern Estimation

Elucidating Robust Learning with Uncertainty-Aware Corruption Pattern Estimation Introduction 📋 Official implementation of Explainable Robust Learnin

JeongEun Park 6 Apr 19, 2022
Network Pruning That Matters: A Case Study on Retraining Variants (ICLR 2021)

Network Pruning That Matters: A Case Study on Retraining Variants (ICLR 2021)

Duong H. Le 18 Jun 13, 2022
LightNet++: Boosted Light-weighted Networks for Real-time Semantic Segmentation

LightNet++ !!!New Repo.!!! ⇒ EfficientNet.PyTorch: Concise, Modular, Human-friendly PyTorch implementation of EfficientNet with Pre-trained Weights !!

linksense 237 Jan 05, 2023
PyTorch code for ICLR 2021 paper Unbiased Teacher for Semi-Supervised Object Detection

Unbiased Teacher for Semi-Supervised Object Detection This is the PyTorch implementation of our paper: Unbiased Teacher for Semi-Supervised Object Detection

Facebook Research 366 Dec 28, 2022
This is Official implementation for "Pose-guided Feature Disentangling for Occluded Person Re-Identification Based on Transformer" in AAAI2022

PFD:Pose-guided Feature Disentangling for Occluded Person Re-identification based on Transformer This repo is the official implementation of "Pose-gui

Tao Wang 93 Dec 18, 2022
Code for PackNet: Adding Multiple Tasks to a Single Network by Iterative Pruning

PackNet: https://arxiv.org/abs/1711.05769 Pretrained models are available here: https://uofi.box.com/s/zap2p03tnst9dfisad4u0sfupc0y1fxt Datasets in Py

Arun Mallya 216 Jan 05, 2023
official Pytorch implementation of ICCV 2021 paper FuseFormer: Fusing Fine-Grained Information in Transformers for Video Inpainting.

FuseFormer: Fusing Fine-Grained Information in Transformers for Video Inpainting By Rui Liu, Hanming Deng, Yangyi Huang, Xiaoyu Shi, Lewei Lu, Wenxiu

77 Dec 27, 2022
This solves the autonomous driving issue which is supported by deep learning technology. Given a video, it splits into images and predicts the angle of turning for each frame.

Self Driving Car An autonomous car (also known as a driverless car, self-driving car, and robotic car) is a vehicle that is capable of sensing its env

Sagor Saha 4 Sep 04, 2021
Unoffical implementation about Image Super-Resolution via Iterative Refinement by Pytorch

Image Super-Resolution via Iterative Refinement Paper | Project Brief This is a unoffical implementation about Image Super-Resolution via Iterative Re

LiangWei Jiang 2.5k Jan 02, 2023
OpenCVのGrabCut()を利用したセマンティックセグメンテーション向けアノテーションツール(Annotation tool using GrabCut() of OpenCV. It can be used to create datasets for semantic segmentation.)

[Japanese/English] GrabCut-Annotation-Tool GrabCut-Annotation-Tool.mp4 OpenCVのGrabCut()を利用したアノテーションツールです。 セマンティックセグメンテーション向けのデータセット作成にご使用いただけます。 ※Grab

KazuhitoTakahashi 30 Nov 18, 2022
This is a model made out of Neural Network specifically a Convolutional Neural Network model

This is a model made out of Neural Network specifically a Convolutional Neural Network model. This was done with a pre-built dataset from the tensorflow and keras packages. There are other alternativ

9 Oct 18, 2022
Rasterize with the least efforts for researchers.

utils3d Rasterize and do image-based 3D transforms with the least efforts for researchers. Based on numpy and OpenGL. It could be helpful when you wan

Ruicheng Wang 8 Dec 15, 2022
public repo for ESTER dataset and modeling (EMNLP'21)

Project / Paper Introduction This is the project repo for our EMNLP'21 paper: https://arxiv.org/abs/2104.08350 Here, we provide brief descriptions of

PlusLab 19 Oct 27, 2022
Public repository containing materials used for Feed Forward (FF) Neural Networks article.

Art041_NN_Feed_Forward Public repository containing materials used for Feed Forward (FF) Neural Networks article. -- Illustration of a very simple Fee

SolClover 2 Dec 29, 2021
PyTorch implementation of "MLP-Mixer: An all-MLP Architecture for Vision" Tolstikhin et al. (2021)

mlp-mixer-pytorch PyTorch implementation of "MLP-Mixer: An all-MLP Architecture for Vision" Tolstikhin et al. (2021) Usage import torch from mlp_mixer

isaac 27 Jul 09, 2022
Baselines for TrajNet++

TrajNet++ : The Trajectory Forecasting Framework PyTorch implementation of Human Trajectory Forecasting in Crowds: A Deep Learning Perspective TrajNet

VITA lab at EPFL 183 Jan 05, 2023
Few-Shot-Intent-Detection includes popular challenging intent detection datasets with/without OOS queries and state-of-the-art baselines and results.

Few-Shot-Intent-Detection Few-Shot-Intent-Detection is a repository designed for few-shot intent detection with/without Out-of-Scope (OOS) intents. It

Jian-Guo Zhang 73 Dec 26, 2022
[BMVC'21] Official PyTorch Implementation of Grounded Situation Recognition with Transformers

Grounded Situation Recognition with Transformers Paper | Model Checkpoint This is the official PyTorch implementation of Grounded Situation Recognitio

Junhyeong Cho 18 Jul 19, 2022