A high-performance distributed deep learning system targeting large-scale and automated distributed training.

Overview

HETU

Documentation | Examples

Hetu is a high-performance distributed deep learning system for training DL models with trillions of parameters, developed by the DAIR Lab at Peking University. It balances the high availability required in industry with innovation from academia, and has a number of advanced characteristics:

  • Applicability. DL model definition with a standard dataflow graph; many basic CPU and GPU operators; efficient implementations of many DL models and at least 10 popular ML algorithms.

  • Efficiency. Achieves at least a 30% speedup compared to TensorFlow on DNN, CNN, and RNN benchmarks.

  • Flexibility. Supports various parallel training protocols and distributed communication architectures, such as data/model/pipeline parallelism and parameter server (PS) & AllReduce.

  • Scalability. Deployment on more than 100 computation nodes; training giant models with trillions of model parameters, e.g., on Criteo Kaggle and Open Graph Benchmark.

  • Agility. Automated ML pipeline: feature engineering, model selection, hyperparameter search.

We welcome everyone interested in machine learning or graph computing to contribute code, create issues, or open pull requests. Please refer to the Contribution Guide for more details.

Installation

  1. Clone the repository.

  2. Prepare the environment. We use Anaconda to manage packages. The following command creates the conda environment: conda env create -f environment.yml. Please prepare the CUDA toolkit and cuDNN in advance.

  3. We use CMake to compile Hetu. Copy the example configuration for compilation with cp cmake/config.example.cmake cmake/config.cmake. Users can modify the configuration file to enable or disable the compilation of each module. For advanced users (who are not using the provided conda environment), the prerequisites for the different modules in Hetu are listed in the Appendix.

# modify paths and configurations in cmake/config.cmake

# generate Makefile
mkdir build && cd build && cmake ..

# compile
# make all
make -j 8
# make hetu, version is specified in cmake/config.cmake
make hetu -j 8
# make allreduce module
make allreduce -j 8
# make ps module
make ps -j 8
# make geometric module
make geometric -j 8
# make hetu-cache module
make hetu_cache -j 8

  4. Prepare the environment for running. Edit the hetu.exp file and set the environment path for Python and the path to the mpirun executable if necessary (for advanced users not using the provided conda environment). Then execute the command source hetu.exp.
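Before building or running, it can help to confirm that the tools these steps rely on are actually reachable. The following is a minimal sketch, not part of Hetu: the function name missing_tools is hypothetical, and the list of tools checked is an assumption drawn from the steps above (mpirun is only needed for the AllReduce examples).

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Hetu): print every command from the
# argument list that cannot be found on PATH, one per line, and return
# non-zero if at least one is missing.
missing_tools() {
    local status=0
    local tool
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            status=1
        fi
    done
    return "$status"
}

# Example: run this after `source hetu.exp`.
# The tool list is an assumption based on the installation steps.
if missing="$(missing_tools python cmake make mpirun)"; then
    echo "all tools found"
else
    echo "missing:" $missing
fi
```

If a tool is reported as missing, revisit the paths set in hetu.exp or the conda environment before compiling.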

Usage

Train logistic regression on a GPU:

bash examples/cnn/scripts/hetu_1gpu.sh logreg MNIST

Train a 3-layer MLP on a GPU:

bash examples/cnn/scripts/hetu_1gpu.sh mlp CIFAR10

Train a 3-layer CNN on a GPU:

bash examples/cnn/scripts/hetu_1gpu.sh cnn_3_layers MNIST

Train a 3-layer MLP with AllReduce on 8 GPUs (uses mpirun):

bash examples/cnn/scripts/hetu_8gpu.sh mlp CIFAR10

Train a 3-layer MLP with PS on 1 server and 2 workers:

# in the script we launch the scheduler and server, and two workers
bash examples/cnn/scripts/hetu_2gpu_ps.sh mlp CIFAR10
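
The commands above all follow the same pattern: a launch script, a model name, and a dataset name. As a rough sketch, a small wrapper can make that pattern explicit. The function hetu_cmd and its mode names (1gpu, allreduce, ps) are hypothetical and not part of Hetu; only the script paths come from the examples above. It prints the command rather than running it.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper (not part of Hetu): map a short mode name to one of
# the launch scripts shown above and print the resulting command line.
hetu_cmd() {
    local mode="$1" model="$2" dataset="$3"
    local script
    case "$mode" in
        1gpu)      script="hetu_1gpu.sh" ;;     # single GPU
        allreduce) script="hetu_8gpu.sh" ;;     # AllReduce on 8 GPUs (mpirun)
        ps)        script="hetu_2gpu_ps.sh" ;;  # PS: 1 server, 2 workers
        *) echo "unknown mode: $mode" >&2; return 1 ;;
    esac
    echo "bash examples/cnn/scripts/$script $model $dataset"
}

# Dry run: print the commands instead of executing them.
hetu_cmd 1gpu logreg MNIST
hetu_cmd allreduce mlp CIFAR10
hetu_cmd ps mlp CIFAR10
```

To actually launch a run from the repository root, the printed line can be passed to the shell, for example with eval "$(hetu_cmd 1gpu mlp CIFAR10)".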

More Examples

Please refer to the examples directory, which contains CNN, NLP, CTR, and GNN training scripts. For distributed training, please refer to the CTR and GNN tasks.

Community

License

The entire codebase is under license

Papers

  1. Xupeng Miao, Lingxiao Ma, Zhi Yang, Yingxia Shao, Bin Cui, Lele Yu, Jiawei Jiang. CuWide: Towards Efficient Flow-based Training for Sparse Wide Models on GPUs. TKDE 2021, ICDE 2021
  2. Xupeng Miao, Xiaonan Nie, Yingxia Shao, Zhi Yang, Jiawei Jiang, Lingxiao Ma, Bin Cui. Heterogeneity-Aware Distributed Machine Learning Training via Partial Reduce. SIGMOD 2021
  3. Xupeng Miao, Hailin Zhang, Yining Shi, Xiaonan Nie, Zhi Yang, Yangyu Tao, Bin Cui. HET: Scaling out Huge Embedding Model Training via Cache-enabled Distributed Framework. VLDB 2022, ChinaSys 2021 Winter
  4. coming soon

Cite

If you use Hetu in a scientific publication, we would appreciate citations to the following paper:

 @inproceedings{vldb/het22,
   title = {HET: Scaling out Huge Embedding Model Training via Cache-enabled Distributed Framework},
   author = {Xupeng Miao and
             Hailin Zhang and
             Yining Shi and
             Xiaonan Nie and
             Zhi Yang and
             Yangyu Tao and
             Bin Cui},
   journal = {Proc. {VLDB} Endow.},
   year = {2022},
   url  = {https://doi.org/10.14778/3489496.3489511},
   doi  = {10.14778/3489496.3489511},
 }

Acknowledgements

We learned and borrowed insights from a few open source projects including TinyFlow, autodist, tf.distribute and Angel.

Appendix

The prerequisites for the different modules in Hetu are listed as follows:

"*" means you must prepare the dependency yourself; the others are downloaded automatically.

Hetu: OpenMP(*), CMake(*)
Hetu (version mkl): MKL 1.6.1
Hetu (version gpu): CUDA 10.1(*), CUDNN 7.5(*)
Hetu (version all): both

Hetu-AllReduce: MPI 3.1, NCCL 2.8(*), this module needs GPU version

Hetu-PS: Protobuf(*), ZeroMQ 4.3.2

Hetu-Geometric: Pybind11(*), Metis(*)

Hetu-Cache: Pybind11(*), this module needs PS module
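
The module list above implies a build dependency: Hetu-Cache needs the PS module. A minimal sketch of turning a requested set of modules into make targets might look like the following. The function build_targets is hypothetical and not part of Hetu; the target names come from the compile commands in the Installation section, and always listing hetu first is an assumption of this sketch rather than a documented requirement.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Hetu): expand requested modules into an
# ordered, de-duplicated list of make targets.
# Encoded dependency (from the list above): hetu_cache needs ps.
# Assumption: the core hetu target is always built first.
build_targets() {
    local targets="hetu" module dep deps
    for module in "$@"; do
        case "$module" in
            hetu)                   deps="" ;;
            allreduce|ps|geometric) deps="$module" ;;
            hetu_cache)             deps="ps hetu_cache" ;;
            *) echo "unknown module: $module" >&2; return 1 ;;
        esac
        for dep in $deps; do
            case " $targets " in
                *" $dep "*) ;;                    # already scheduled
                *) targets="$targets $dep" ;;
            esac
        done
    done
    echo "$targets"
}

# Dry run: print the make commands for building the cache module.
for target in $(build_targets hetu_cache); do
    echo "make $target -j 8"
done
```

Note that the AllReduce module additionally needs the GPU version of Hetu, which is selected in cmake/config.cmake and is not checked by this sketch.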

##################################################################
Tips for preparing the prerequisites

Preparing CUDA, CUDNN, NCCL (NCCL is already in the conda environment):
1. download from https://developer.nvidia.com
2. install
3. modify paths in cmake/config.cmake if necessary

Preparing OpenMP:
You just need to ensure that your compiler supports OpenMP.

Preparing CMake, Protobuf, Pybind11, Metis:
Install by anaconda: 
conda install cmake=3.18 libprotobuf pybind11=2.6.0 metis

Preparing OpenMPI (not necessary):
install by anaconda: `conda install -c conda-forge openmpi=4.0.3`
or
1. download from https://download.open-mpi.org/release/open-mpi/v4.0/openmpi-4.0.3.tar.gz
2. build openmpi by `./configure --prefix=/path/to/build && make -j8 && make install`
3. modify MPI_HOME to /path/to/build in cmake/config.cmake

Preparing MKL (not necessary):
install by anaconda: `conda install -c conda-forge onednn`
or
1. download from https://github.com/intel/mkl-dnn/archive/v1.6.1.tar.gz
2. build mkl by `mkdir /path/to/build && cd /path/to/build && cmake /path/to/root && make -j8` 
3. modify MKL_ROOT to /path/to/root and MKL_BUILD to /path/to/build in cmake/config.cmake 

Preparing ZeroMQ (not necessary):
install by anaconda: `conda install -c anaconda zeromq=4.3.2`
or
1. download from https://github.com/zeromq/libzmq/releases/download/v4.3.2/zeromq-4.3.2.zip
2. build zeromq by `mkdir /path/to/build && cd /path/to/build && cmake /path/to/root && make -j8`
3. modify ZMQ_ROOT to /path/to/build in cmake/config.cmake