Sky Computing: Accelerating Geo-distributed Computing in Federated Learning

Introduction

Sky Computing is a load-balanced model-parallelism framework for federated learning. It adaptively allocates model layers to devices based on their hardware specifications. Sky Computing outperforms the baseline method by 55% in training time when training a 160-layer BERT on a 64-node cluster. Our paper is available at https://arxiv.org/abs/2202.11836
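
To make the idea of hardware-aware layer allocation concrete, here is a minimal sketch in Python. It is not the allocation algorithm used by Sky Computing; it only illustrates splitting a stack of layers into contiguous blocks whose sizes are proportional to a per-device score. The function name `allocate_layers` and the `device_scores` input are hypothetical and chosen for illustration.

```python
def allocate_layers(num_layers, device_scores):
    """Split num_layers into contiguous blocks, one per device.

    device_scores is a hypothetical list of relative speeds (higher = faster).
    Returns a list of (start, end) layer index ranges, end exclusive.
    """
    total = sum(device_scores)
    # Ideal (fractional) share of layers for each device.
    ideal = [num_layers * score / total for score in device_scores]
    counts = [int(share) for share in ideal]

    # Hand out the layers lost to rounding, largest fractional part first.
    leftover = num_layers - sum(counts)
    by_remainder = sorted(
        range(len(ideal)), key=lambda i: ideal[i] - counts[i], reverse=True
    )
    for i in by_remainder[:leftover]:
        counts[i] += 1

    # Turn per-device counts into contiguous layer ranges.
    ranges, start = [], 0
    for count in counts:
        ranges.append((start, start + count))
        start += count
    return ranges


# Four devices, the last one four times as fast as the first two.
layer_ranges = allocate_layers(160, [1, 1, 2, 4])
```

With equal scores the same function yields an even split, so a sketch like this covers both the balanced and the uniform case.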

The term "sky computing" was first introduced by Dr. Katarzyna Keahey et al. to describe a cross-cloud compute pattern [1]. Prof. Stoica and Prof. Shenker later generalized the term to geo-distributed computing [2]. Our project is based on their definition.

Installation

git clone git@github.com:hpcaitech/SkyComputing.git
python -m pip install -r requirements.txt
cd ./scaelum
python -m pip install -v -e .

Experiment (using BERT)

To benchmark Sky Computing, we prepared a demo that you can run on your cluster to train BERT.

Prepare BERT model

Bidirectional Encoder Representations from Transformers (BERT) is one of the state-of-the-art deep learning models for natural language processing. We use BERT in this experiment to run a simple benchmark.

cd $PROJECT
mkdir -p BERT/model && cd BERT/model 
wget https://storage.googleapis.com/bert_models/2019_05_30/wwm_uncased_L-24_H-1024_A-16.zip
unzip wwm_uncased_L-24_H-1024_A-16.zip

Prepare GLUE MNLI dataset

The General Language Understanding Evaluation (GLUE) benchmark is a collection of resources for training, evaluating, and analyzing natural language understanding systems. Multi-Genre Natural Language Inference (MNLI), one of the GLUE tasks, is a crowd-sourced collection of 433k sentence pairs annotated with textual entailment information.

cd $PROJECT
mkdir -p BERT/data && cd BERT/data
wget https://gist.githubusercontent.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e/raw/1502038877f6a88c225a34450793fbc3ea87eaba/download_glue_data.py
python download_glue_data.py --data_dir ./glue_data --tasks MNLI
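
To show what "sentence pairs annotated with textual entailment information" looks like in practice, the sketch below reads premise, hypothesis, and label triples from a tab-separated file. The column names `sentence1`, `sentence2`, and `gold_label` are assumptions about the downloaded MNLI files and should be checked against the header of your copy; `read_pairs` is a hypothetical helper, not part of this repository. The demo runs on an in-memory sample rather than the real dataset.

```python
import csv
import io


def read_pairs(tsv_file, premise_col="sentence1",
               hypothesis_col="sentence2", label_col="gold_label"):
    """Return (premise, hypothesis, label) tuples from a TSV file object.

    Column names are assumptions; adjust them to match your file's header.
    """
    # QUOTE_NONE keeps quotation marks inside sentences from confusing the parser.
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [
        (row[premise_col], row[hypothesis_col], row[label_col])
        for row in reader
    ]


# Tiny in-memory sample standing in for a real MNLI file.
sample = io.StringIO(
    "sentence1\tsentence2\tgold_label\n"
    "A man is sleeping.\tA person rests.\tentailment\n"
)
pairs = read_pairs(sample)
```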

Configuration

To run dllb in your cluster, you need to write a config file that contains the information required for training, e.g. the model layers and relevant environment variables. We have provided a well-commented example; the most important options are:

import os

# path to your project, read from the PROJECT environment variable
PROJECT = os.getenv("PROJECT")

# allocation type; valid values are "even", "optimal" and "dynamic"
ALLOCATE_TYPE = "even"

# number of nodes (including the central server)
CORE_NUM = 4
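
Because the config is an ordinary Python file, it can be checked before a job is submitted. The following is a minimal sketch of such a check, not the loader used by this project: `load_config` is a hypothetical helper that executes the file with the standard-library `runpy.run_path` and validates the three options shown above. The demo writes a throwaway config to a temporary directory.

```python
import os
import runpy
import tempfile

# Values taken from the comment in the example config.
VALID_ALLOCATE_TYPES = {"even", "optimal", "dynamic"}


def load_config(path):
    """Execute a Python config file and return its validated options."""
    cfg = runpy.run_path(path)  # dict of the file's top-level names
    if cfg.get("ALLOCATE_TYPE") not in VALID_ALLOCATE_TYPES:
        raise ValueError(f"invalid ALLOCATE_TYPE: {cfg.get('ALLOCATE_TYPE')!r}")
    core_num = cfg.get("CORE_NUM")
    if not isinstance(core_num, int) or core_num < 1:
        raise ValueError(f"CORE_NUM must be a positive integer, got {core_num!r}")
    return {key: cfg.get(key) for key in ("PROJECT", "ALLOCATE_TYPE", "CORE_NUM")}


# Demo: write a small config file and load it.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "config.py")
    with open(demo_path, "w") as f:
        f.write(
            "import os\n"
            'PROJECT = os.getenv("PROJECT")\n'
            'ALLOCATE_TYPE = "even"\n'
            "CORE_NUM = 4\n"
        )
    demo_config = load_config(demo_path)
```

Failing early on a misspelled allocation type is cheaper than discovering it after the job has waited in the queue.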

Run scripts

Slurm is an open-source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters. We used a Slurm script to run our experiment.

#!/bin/sh

#SBATCH --job-name=gpu16   # Job name
#SBATCH -o gpu16.o%j       # Name of stdout output file
#SBATCH -e gpu16.e%j       # Name of stderr output file
#SBATCH -N 16              # Number of nodes
#SBATCH -n 16              # Total number of tasks (one per node here)
#SBATCH --time=02:00:00    # Run time (hh:mm:ss)

# run
python ./ip_addr.py > "./HOST"
srun python ./launch.py -c "./experiment/config.py"
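
With the script above, `srun` starts one copy of the Python process per task, and Slurm exposes each task's identity through environment variables. The sketch below shows how a process could read them; it is an illustration only and does not describe what `launch.py` actually does. `slurm_task_info` is a hypothetical helper, while `SLURM_PROCID`, `SLURM_NTASKS`, and `SLURMD_NODENAME` are standard Slurm variables.

```python
import os


def slurm_task_info(env=None):
    """Return (rank, world_size, node_name) for the current Slurm task.

    Falls back to a single-process setup when run outside Slurm.
    """
    env = os.environ if env is None else env
    rank = int(env.get("SLURM_PROCID", 0))        # index of this task
    world_size = int(env.get("SLURM_NTASKS", 1))  # total tasks in the job step
    node_name = env.get("SLURMD_NODENAME", "localhost")
    return rank, world_size, node_name


# Simulated environment of task 3 in a 16-task job.
info = slurm_task_info(
    {"SLURM_PROCID": "3", "SLURM_NTASKS": "16", "SLURMD_NODENAME": "node03"}
)
```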

Citation

@misc{zhu2022sky,
      title={Sky Computing: Accelerating Geo-distributed Computing in Federated Learning}, 
      author={Jie Zhu and Shenggui Li and Yang You},
      year={2022},
      eprint={2202.11836},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}

References

@article{keahey2009sky,
  title={Sky computing},
  author={Keahey, Katarzyna and Tsugawa, Mauricio and Matsunaga, Andrea and Fortes, Jose},
  journal={IEEE Internet Computing},
  volume={13},
  number={5},
  pages={43--51},
  year={2009},
  publisher={IEEE}
}
@inproceedings{stoica2021cloud,
  title={From cloud computing to sky computing},
  author={Stoica, Ion and Shenker, Scott},
  booktitle={Proceedings of the Workshop on Hot Topics in Operating Systems},
  pages={26--32},
  year={2021}
}