OpenDILab RL Kubernetes Custom Resource and Operator Lib

Last update: Dec 29, 2022

Overview

DI Orchestrator

DI Orchestrator is designed to manage DI (Decision Intelligence) jobs using Kubernetes Custom Resource and Operator.

Prerequisites

A well-prepared Kubernetes cluster. Follow the official instructions to create a Kubernetes cluster, or create a local Kubernetes node with kind or minikube.
Cert-manager. To install it on Kubernetes, refer to the cert-manager docs, or install it with the following command.

kubectl create -f ./config/certmanager/cert-manager.yaml

Install DI Orchestrator

DI Orchestrator consists of two components: di-operator and di-server. Install di-operator and di-server with the following command.

kubectl create -f ./config/di-manager.yaml

di-operator and di-server will be installed in the di-system namespace.

$ kubectl get pod -n di-system
NAME                               READY   STATUS    RESTARTS   AGE
di-operator-57cc65d5c9-5vnvn   1/1     Running   0          59s
di-server-7b86ff8df4-jfgmp     1/1     Running   0          59s

Install global components of DIJob defined in AggregatorConfig:

kubectl create -f config/samples/agconfig.yaml -n di-system

Submit DIJob

# submit DIJob
$ kubectl create -f config/samples/dijob-cartpole.yaml

# list pods; the coordinator is created by di-operator
# a few seconds later, collectors and learners are created by di-server
$ kubectl get pod

# get logs of coordinator
$ kubectl logs cartpole-dqn-coordinator

User Guide

Refer to the user-guide. For the Chinese version, refer to the Chinese manual (中文手册).

Contributing

Refer to the developer-guide.

Comments

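Add cluster information inside pods
When a job is submitted as dijob replicas, every pod should be able to see the host information of the whole replica set and its own start order. The following environment variables are to be added:

FQDNs of all pods in the replica, sorted by start order

FQDN of the current pod

Ordinal index of the current pod

DI-engine will establish the corresponding network connections based on these variables, so the attach-to generation logic can be removed from di-orchestrator.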
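As a rough illustration of how a process inside a pod might consume the three pieces of information listed above, here is a minimal Go sketch. The issue text does not give the variable names, so EXAMPLE_NODES, EXAMPLE_NODE_NAME and EXAMPLE_RANK are hypothetical placeholders, and the comma-separated encoding of the pod list is likewise an assumption.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// ClusterInfo holds the three values the issue asks to expose to each pod.
type ClusterInfo struct {
	Nodes []string // FQDNs of all pods in the replica, in start order
	Self  string   // FQDN of the current pod
	Rank  int      // ordinal index of the current pod
}

// parseClusterInfo reads the values through a lookup function such as os.Getenv.
// The variable names are hypothetical; they are not taken from di-orchestrator.
func parseClusterInfo(getenv func(string) string) (ClusterInfo, error) {
	raw := getenv("EXAMPLE_NODES")
	if raw == "" {
		return ClusterInfo{}, fmt.Errorf("EXAMPLE_NODES is not set")
	}
	nodes := strings.Split(raw, ",") // assumed encoding: comma-separated FQDNs

	rank, err := strconv.Atoi(getenv("EXAMPLE_RANK"))
	if err != nil {
		return ClusterInfo{}, fmt.Errorf("invalid EXAMPLE_RANK: %w", err)
	}
	if rank < 0 || rank >= len(nodes) {
		return ClusterInfo{}, fmt.Errorf("rank %d out of range for %d nodes", rank, len(nodes))
	}

	// Sanity check: the entry at our rank should be our own FQDN.
	self := getenv("EXAMPLE_NODE_NAME")
	if nodes[rank] != self {
		return ClusterInfo{}, fmt.Errorf("node at rank %d is %q, expected %q", rank, nodes[rank], self)
	}
	return ClusterInfo{Nodes: nodes, Self: self, Rank: rank}, nil
}

func main() {
	// Fake environment for demonstration; in a pod this would be os.Getenv.
	env := map[string]string{
		"EXAMPLE_NODES":     "job-learner-0.job.default.svc,job-collector-0.job.default.svc",
		"EXAMPLE_NODE_NAME": "job-collector-0.job.default.svc",
		"EXAMPLE_RANK":      "1",
	}
	info, err := parseClusterInfo(func(k string) string { return env[k] })
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("rank:", info.Rank, "self:", info.Self, "peers:", len(info.Nodes)-1)
}
```

Passing the lookup function as a parameter keeps the parsing logic testable without touching the real process environment.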
enhancement
opened by sailxjx 3

add tasks to dijob spec

1. goal

A dijob defines only one pod template, so we cannot specify different commands or resources for different components of di-engine such as collector, learner and evaluator. We therefore need a more general way to define the dijob custom resource.

2. design

Inspired by VolcanoJob, we define spec.tasks to describe the different components of di-engine. spec.tasks is a list, which allows us to define multiple tasks. Each task can set task.type to one of collector, learner, evaluator or none. none is the default value and marks a general task.

After the change, a dijob can be defined as follows:

apiVersion: diengine.opendilab.org/v2alpha1
kind: DIJob
metadata:
  name: job-with-tasks
spec:
  priority: "normal"  # job priority, which is a reserved field for allocator
  backoffLimit: 0  # restart count
  cleanPodPolicy: "Running"  # the policy to clean pods after job completion
  preemptible: false  # whether the job is preemptible
  minReplicas: 2
  maxReplicas: 5
  tasks:
  - replicas: 1
    name: "learner"
    type: learner
    template:
      metadata:
        name: di
      spec:
        containers:
        - image: registry.sensetime.com/xlab/ding:nightly
          imagePullPolicy: IfNotPresent
          name: pydi
          env:
          - name: NCCL_DEBUG
            value: "INFO"
          command: ["/bin/bash", "-c",]
          args: 
          - |
            ditask --label learner xxx
          resources:
            requests:
              cpu: "1"
              nvidia.com/gpu: 1
        restartPolicy: Never
  - replicas: 1
    name: "evaluator"
    type: evaluator
    template:
      metadata:
        name: di
      spec:
        containers:
        - image: registry.sensetime.com/xlab/ding:nightly
          imagePullPolicy: IfNotPresent
          name: pydi
          env:
          - name: NCCL_DEBUG
            value: "INFO"
          command: ["/bin/bash", "-c",]
          args: 
          - |
            ditask --label evaluator xxx
        restartPolicy: Never
  - replicas: 2
    name: "collector"
    type: collector
    template:
      metadata:
        name: di
      spec:
        containers:
        - image: registry.sensetime.com/xlab/ding:nightly
          imagePullPolicy: IfNotPresent
          name: pydi
          env:
          - name: NCCL_DEBUG
            value: "INFO"
          command: ["/bin/bash", "-c",]
          args: 
          - |
            ditask --label collector xxx
        restartPolicy: Never
status:
  conditions:
  - lastTransitionTime: "2022-05-26T07:25:11Z"
    lastUpdateTime: "2022-05-26T07:25:11Z"
    message: job created.
    reason: JobPending
    status: "False"
    type: Pending
  - lastTransitionTime: "2022-05-26T07:25:11Z"
    lastUpdateTime: "2022-05-26T07:25:11Z"
    message: job is starting since all pods are created.
    reason: JobStarting
    status: "False"
    type: Starting
  phase: Starting
  profilings: {}
  readyReplicas: 0
  replicas: 4
  taskStatus:
    learner:
      Pending: 1
    evaluator:
      Pending: 1
    collector:
      Pending: 2
  reschedules: 0
  restarts: 0

task definition:

type Task struct {
	Name string `json:"name,omitempty"`

	Type TaskType `json:"type,omitempty"`

	Replicas int32 `json:"replicas,omitempty"`

	Template corev1.PodTemplateSpec `json:"template,omitempty"`
}

type TaskType string

const (
	TaskTypeLearner TaskType = "learner"

	TaskTypeCollector TaskType = "collector"

	TaskTypeEvaluator TaskType = "evaluator"

	TaskTypeNone TaskType = "none"
)
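
The defaulting rule described in the design (an unset task.type means none) can be sketched in Go as below. This is an illustration under stated assumptions, not the validator shipped with di-orchestrator: the Task struct is a trimmed copy without Template so the snippet needs no Kubernetes dependency, normalizeTasks is a hypothetical helper, and the unique-name check is an assumption motivated by status.taskStatus being keyed by task.name.

```go
package main

import "fmt"

type TaskType string

const (
	TaskTypeLearner   TaskType = "learner"
	TaskTypeCollector TaskType = "collector"
	TaskTypeEvaluator TaskType = "evaluator"
	TaskTypeNone      TaskType = "none"
)

// Task is a trimmed copy of the struct above (Template omitted).
type Task struct {
	Name     string
	Type     TaskType
	Replicas int32
}

// normalizeTasks (hypothetical helper) defaults an empty type to "none",
// rejects unknown types and duplicate names, and sums the replicas.
func normalizeTasks(tasks []Task) ([]Task, int32, error) {
	seen := make(map[string]bool, len(tasks))
	out := make([]Task, 0, len(tasks))
	var total int32
	for _, t := range tasks {
		if t.Type == "" {
			t.Type = TaskTypeNone // default: a general task
		}
		switch t.Type {
		case TaskTypeLearner, TaskTypeCollector, TaskTypeEvaluator, TaskTypeNone:
		default:
			return nil, 0, fmt.Errorf("task %q: unknown type %q", t.Name, t.Type)
		}
		if seen[t.Name] {
			return nil, 0, fmt.Errorf("duplicate task name %q", t.Name)
		}
		seen[t.Name] = true
		total += t.Replicas
		out = append(out, t)
	}
	return out, total, nil
}

func main() {
	// Same task layout as the job-with-tasks example above.
	tasks := []Task{
		{Name: "learner", Type: TaskTypeLearner, Replicas: 1},
		{Name: "evaluator", Type: TaskTypeEvaluator, Replicas: 1},
		{Name: "collector", Type: TaskTypeCollector, Replicas: 2},
	}
	_, total, err := normalizeTasks(tasks)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("total replicas:", total)
}
```

For the example job, the replica counts 1 + 1 + 2 add up to 4, which matches status.replicas in the YAML above.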

status.taskStatus definition:

type DIJobStatus struct {
  // Phase defines the observed phase of the job
  // +kubebuilder:default=Pending
  Phase Phase `json:"phase,omitempty"`

  // ...
  
  // TaskStatus maps task.name to the pod phase counts of that task
  TaskStatus map[string]TaskStatus `json:"taskStatus,omitempty"`

  // ...
}

// count of different pod phases
type TaskStatus map[corev1.PodPhase]int32
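
To make the shape of status.taskStatus concrete, the following Go sketch counts pod phases per task name. It is a simplified illustration: PodPhase is a local stand-in for corev1.PodPhase, and the Pod struct and buildTaskStatus function are hypothetical names rather than the operator's actual code.

```go
package main

import "fmt"

// PodPhase is a local stand-in for corev1.PodPhase.
type PodPhase string

const (
	PodPending PodPhase = "Pending"
	PodRunning PodPhase = "Running"
)

// TaskStatus counts pods per phase, as in the definition above.
type TaskStatus map[PodPhase]int32

// Pod is a hypothetical, minimal view of a pod: the task it belongs to and its phase.
type Pod struct {
	TaskName string
	Phase    PodPhase
}

// buildTaskStatus groups pods by task name and counts how many are in each phase.
func buildTaskStatus(pods []Pod) map[string]TaskStatus {
	statuses := make(map[string]TaskStatus)
	for _, p := range pods {
		if statuses[p.TaskName] == nil {
			statuses[p.TaskName] = make(TaskStatus)
		}
		statuses[p.TaskName][p.Phase]++
	}
	return statuses
}

func main() {
	// Four pending pods, matching the example status above.
	pods := []Pod{
		{TaskName: "learner", Phase: PodPending},
		{TaskName: "evaluator", Phase: PodPending},
		{TaskName: "collector", Phase: PodPending},
		{TaskName: "collector", Phase: PodPending},
	}
	for name, st := range buildTaskStatus(pods) {
		fmt.Println(name, "pending:", st[PodPending])
	}
}
```

Grouping by task name rather than task type keeps two tasks of the same type apart in the status map.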

enhancement

opened by konnase 1

new version for di-engine new architecture
release notes

features

v1.0.0 for DI-engine new architecture

remove webhook

manage commands with cobra

refactor orchestrator architecture inspired from adaptdl

use gin to rewrite di-server

update di-server http interface

enhancement
opened by konnase 1
v0.2.0
[x] split webhook and operator

[x] add dockerfile.dev

[x] update CleanPolicyALL to CleanPolicyAll

[x] remove k8s service related operations from server, and operator is responsible for managing services

[x] add e2e test

enhancement
opened by konnase 1
refactor job spec
refactor job spec definition and add spec.tasks to support multi tasks #20

add DI_RANK to pod env and remove engineFields in job.spec #16

add e2e test

add validator to validate the correctness of dijob spec

change job.phase to Pending when job replicas scaled to 0

implement a processor to process di-server requests

refactor project structure

enhancement
opened by konnase 0
Release/v1.0
release notes

features

v1.0.0 for DI-engine new architecture

remove webhook

manage commands with cobra

refactor orchestrator architecture inspired from adaptdl

use gin to rewrite di-server

update di-server http interface

enhancement
opened by konnase 0
fix: job failed to submit when collector/learner is missing

Job submission failed when collector/learner was missing, because the webhook created an empty dijob and the Go builder added default values to some fields of collector/learner, which resulted in an invalid type error. Solved by making coordinator/collector/learner pointers.
bug

opened by konnase 0
Feat/job create event
add event handler for dijob, and mark job as Created when job submitted

mark collector and learner as optional, only coordinator is required(https://github.com/opendilab/DI-orchestrator/pull/13/commits/653e64af01ec7752b08d4bf8381738d566fca224)

mark job Failed when the submitted job is incorrect (https://github.com/opendilab/DI-orchestrator/pull/13/commits/bea840a5eee3508be18b53b325168a5647daff94); this is hard to test because the client-go reflector decodes DIJob strictly, so there is no chance to handle the DIJob add event when an incorrect job is submitted

version -> v0.2.1

enhancement
opened by konnase 0
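Some questions about allocation

1. In the current allocator logic, the initial allocation for a non-preemptible job only uses minreplicas to modify the replicas property. Is the node on which the job's pods are deployed then decided entirely by K8S? Also, the initial allocation for non-preemptible jobs in allocator.go of the Release1.13 code does not seem to be written yet.
2. What exactly does it mean for a job to be preemptible? Is it equivalent to whether the job can be scheduled?
3. The Allocate and Optimize methods of the FitPolicy scheduling policy are not implemented either. When will this part be added?
4. Many places in the documentation do not match the latest code. For example, the DIJob.Spec.Group property has already been removed from the code, and the job.spec.minreplicas property mentioned in the documentation is not in the code either, but in JobInfo. Could the documentation be updated? Thanks!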

opened by RZ-Q 3

Releases(v1.1.3)

v1.1.3(Aug 22, 2022)
bug fixes

judge which task a pod belongs to according to task name instead of task type (https://github.com/opendilab/DI-orchestrator/pull/27)

v1.1.2(Jul 21, 2022)
bug fixes

global cmd flag error (https://github.com/opendilab/DI-orchestrator/pull/23)

wrong pod subdomain (https://github.com/opendilab/DI-orchestrator/pull/24)

incorrect global rank retrieval (https://github.com/opendilab/DI-orchestrator/pull/25)

v1.1.1(Jul 4, 2022)
update status replicas and task status

add volumes to job spec

update status CompletionTimestamp when job completed

see details in https://github.com/opendilab/DI-orchestrator/pull/22
v1.1.0(Jun 30, 2022)
refactor job spec definition and add spec.tasks to support multi tasks #20

add DI_RANK to pod env and remove engineFields in job.spec #16

add e2e test

add validator to validate the correctness of dijob spec

change job.phase to Pending when job replicas scaled to 0

implement a processor to process di-server requests

refactor project structure

see details in https://github.com/opendilab/DI-orchestrator/pull/21
v1.0.0(Mar 23, 2022)
features

remove webhook

manage commands with cobra

refactor orchestrator architecture inspired from adaptdl

use gin to rewrite di-server

update di-server http interface see https://github.com/opendilab/DI-orchestrator/pull/18

v0.2.2(Dec 15, 2021)
bug fix

resolve bug that job failed to submit when collector/learner missed (https://github.com/opendilab/DI-orchestrator/pull/14)

v0.2.1(Oct 12, 2021)
feature

add event handler for dijob, and mark job as Created when job submitted(https://github.com/opendilab/DI-orchestrator/pull/13)

mark collector and learner as optional, only coordinator is required(https://github.com/opendilab/DI-orchestrator/pull/13/commits/653e64af01ec7752b08d4bf8381738d566fca224)

mark job Failed when the submitted job is incorrect (https://github.com/opendilab/DI-orchestrator/pull/13/commits/bea840a5eee3508be18b53b325168a5647daff94); this is hard to test because the client-go reflector decodes DIJob strictly, so there is no chance to handle the DIJob add event when an incorrect job is submitted

v0.2.0(Sep 28, 2021)
change orchestrator image repository

version -> v0.2.0

v0.2.0-rc.0(Sep 6, 2021)
split webhook and operator

add dockerfile.dev

update CleanPolicyALL to CleanPolicyAll

remove k8s service related operations from server, and operator is responsible for managing services

add e2e test

v0.1.0(Jul 8, 2021)
Features

Define DIJob CRD to support DI jobs' submission

Define AggregatorConfig CRD to support aggregator definition

Add webhook to validate DIJob submission

Provide http service for DI jobs to request for DI modules

Docs to introduce DI-orchestrator architecture


Owner

OpenDILab

Open sourced Decision Intelligence (DI)
