Secure Distributed Training at Scale

Related tags

Deep Learningbtard
Overview

Secure Distributed Training at Scale

This repository contains the implementation of experiments from the paper

"Secure Distributed Training at Scale"

Eduard Gorbunov*, Alexander Borzunov*, Michael Diskin, Max Ryabinin

[PDF] arxiv.org

Overview

The code is organized as follows:

  • ./resnet is a setup for training ResNet18 on CIFAR-10 with simulated byzantine attackers
  • ./albert runs distributed training of ALBERT-large with byzantine attacks using cloud instances

ResNet18

This setup uses torch.distributed for parallelism.

Requirements
  • Python >= 3.7 (we recommend Anaconda python 3.8)
  • Dependencies: pip install jupyter torch>=1.6.0 torchvision>=0.7.0 tensorboard
  • A machine with at least 16GB RAM and either a GPU with >24GB memory or 3 GPUs with at least 10GB memory each.
  • We tested the code on Ubuntu Server 18.04, it should work with all major linux distros. For Windows, we recommend using Docker (e.g. via Kitematic).

Running experiments: please open ./resnet/RunExperiments.ipynb and follow the instructions in that notebook. The learning curves will be available in Tensorboard logs: tensorboard --logdir btard/resnet.

ALBERT

This setup spawns distributed nodes that collectively train ALBERT-large on wikitext103. It uses a version of the hivemind library modified so that some peers may be programmed to become Byzantine and perform various types of attacks on the training process.

Requirements
  • The experiments are optimized for 16 instances each with a single T4 GPU.

    • For your convenience, we provide a cost-optimized AWS starter notebook that can run experiments (see below)
    • While it can be simulated with a single node, doing so will require additional tuning depending on the number and type of GPUs available.
  • If running manually, please install the core library on each machine:

    • The code requires python >= 3.7 (we recommend Anaconda python 3.8)
    • Install the library: cd ./albert/hivemind/ && pip install -e .
    • If successful, it should become available as import hivemind

Running experiments: For your convenience, we provide a unified script that runs a distributed ALBERT experiment in the AWS cloud ./albert/experiments/RunExperiments.ipynb using preemptible T4 instances. The learning curves will be posted to the Wandb project specified during the notebook setup.

Expected cloud costs: a training experiment with 16 hosts takes up approximately $60 per day for g4dn.xlarge and $90 per day for g4dn.2xlarge instances. One can expect a full training experiment to converge in ≈3 days. Once the model is trained, one can restart training from intermediate checkpoints and simulate attacks. One attack episode takes up 4-5 hours depending on cloud availability.

Owner
Yandex Research
Yandex Research
Official PyTorch implementation of our AAAI22 paper: TransMEF: A Transformer-Based Multi-Exposure Image Fusion Framework via Self-Supervised Multi-Task Learning. Code will be available soon.

Official-PyTorch-Implementation-of-TransMEF Official PyTorch implementation of our AAAI22 paper: TransMEF: A Transformer-Based Multi-Exposure Image Fu

117 Dec 27, 2022
[3DV 2020] PeeledHuman: Robust Shape Representation for Textured 3D Human Body Reconstruction

PeeledHuman: Robust Shape Representation for Textured 3D Human Body Reconstruction International Conference on 3D Vision, 2020 Sai Sagar Jinka1, Rohan

Rohan Chacko 39 Oct 12, 2022
The VeriNet toolkit for verification of neural networks

VeriNet The VeriNet toolkit is a state-of-the-art sound and complete symbolic interval propagation based toolkit for verification of neural networks.

9 Dec 21, 2022
Fully Adaptive Bayesian Algorithm for Data Analysis (FABADA) is a new approach of noise reduction methods. In this repository is shown the package developed for this new method based on \citepaper.

Fully Adaptive Bayesian Algorithm for Data Analysis FABADA FABADA is a novel non-parametric noise reduction technique which arise from the point of vi

18 Oct 20, 2022
ML-PersonalWork - Big assignment PersonalWork in Machine Learning, 2021 autumn BUAA.

ML-PersonalWork - Big assignment PersonalWork in Machine Learning, 2021 autumn BUAA.

Snapdragon Lee 2 Dec 16, 2022
Simple API for UCI Machine Learning Dataset Repository (search, download, analyze)

A simple API for working with University of California, Irvine (UCI) Machine Learning (ML) repository Table of Contents Introduction About Page of the

Tirthajyoti Sarkar 223 Dec 05, 2022
A simple approach to emable dense segmentation with ViT.

Vision Transformer Segmentation Network This implementation of ViT in pytorch uses a super simple and straight-forward way of generating an output of

HReynaud 5 Jan 03, 2023
a dnn ai project to classify which food people are eating on audio recordings

Deep Learning - EAT Challenge About This project is part of an AI challenge of the DeepLearning course 2021 at the University of Augsburg. The objecti

Marco Tröster 1 Oct 24, 2021
Codes of the paper Deformable Butterfly: A Highly Structured and Sparse Linear Transform.

Deformable Butterfly: A Highly Structured and Sparse Linear Transform DeBut Advantages DeBut generalizes the square power of two butterfly factor matr

Rui LIN 8 Jun 10, 2022
Pytorch Lightning code guideline for conferences

Deep learning project seed Use this seed to start new deep learning / ML projects. Built in setup.py Built in requirements Examples with MNIST Badges

Pytorch Lightning 1k Jan 02, 2023
Repo for the paper "DiLBERT: Cheap Embeddings for Disease Related Medical NLP"

DiLBERT Repo for the paper "DiLBERT: Cheap Embeddings for Disease Related Medical NLP" Pretrained Model The pretrained model presented in the paper is

Kevin Roitero 2 Dec 15, 2022
robomimic: A Modular Framework for Robot Learning from Demonstration

robomimic [Homepage]   [Documentation]   [Study Paper]   [Study Website]   [ARISE Initiative] Latest Updates [08/09/2021] v0.1.0: Initial code and pap

ARISE Initiative 178 Jan 05, 2023
Mememoji - A facial expression classification system that recognizes 6 basic emotions: happy, sad, surprise, fear, anger and neutral.

a project built with deep convolutional neural network and ❤️ Table of Contents Motivation The Database The Model 3.1 Input Layer 3.2 Convolutional La

Jostine Ho 761 Dec 05, 2022
UAV-Networks-Routing is a Python simulator for experimenting routing algorithms and mac protocols on unmanned aerial vehicle networks.

UAV-Networks Simulator - Autonomous Networking - A.A. 20/21 UAV-Networks-Routing is a Python simulator for experimenting routing algorithms and mac pr

0 Nov 13, 2021
Programming with Neural Surrogates of Programs

Programming with Neural Surrogates of Programs

0 Dec 12, 2021
CRISCE: Automatically Generating Critical Driving Scenarios From Car Accident Sketches

CRISCE: Automatically Generating Critical Driving Scenarios From Car Accident Sketches This document describes how to install and use CRISCE (CRItical

Chair of Software Engineering II, Uni Passau 2 Feb 09, 2022
Code for the paper "Query Embedding on Hyper-relational Knowledge Graphs"

Query Embedding on Hyper-Relational Knowledge Graphs This repository contains the code used for the experiments in the paper Query Embedding on Hyper-

DimitrisAlivas 19 Jul 26, 2022
[Machine Learning Engineer Basic Guide] 부스트캠프 AI Tech - Product Serving 자료

Boostcamp-AI-Tech-Product-Serving 부스트캠프 AI Tech - Product Serving 자료 Repository 구조 part1(MLOps 개론, Model Serving, 머신러닝 프로젝트 라이프 사이클은 별도의 코드가 없으며, part

Sung Yun Byeon 269 Dec 21, 2022
offical implement of our Lifelong Person Re-Identification via Adaptive Knowledge Accumulation in CVPR2021

LifelongReID Offical implementation of our Lifelong Person Re-Identification via Adaptive Knowledge Accumulation in CVPR2021 by Nan Pu, Wei Chen, Yu L

PeterPu 76 Dec 08, 2022
A rule-based log analyzer & filter

Flog 一个根据规则集来处理文本日志的工具。 前言 在日常开发过程中,由于缺乏必要的日志规范,导致很多人乱打一通,一个日志文件夹解压缩后往往有几十万行。 日志泛滥会导致信息密度骤减,给排查问题带来了不小的麻烦。 以前都是用grep之类的工具先挑选出有用的,再逐条进行排查,费时费力。在忍无可忍之后决

上山打老虎 9 Jun 23, 2022