OpenDILab RL Kubernetes Custom Resource and Operator Lib

Overview

DI Orchestrator

DI Orchestrator is designed to manage DI (Decision Intelligence) jobs using Kubernetes Custom Resource and Operator.

Prerequisites

  • A well-prepared kubernetes cluster. Follow the instructions to create a kubernetes cluster, or create a local kubernetes node referring to kind or minikube
  • Cert-manager. Installation on kubernetes please refer to cert-manager docs. Or you can install it by the following command.
kubectl create -f ./config/certmanager/cert-manager.yaml

Install DI Orchestrator

DI Orchestrator consists of two components: di-operator and di-server. Install di-operator and di-server with the following command.

kubectl create -f ./config/di-manager.yaml

di-operator and di-server will be installed in di-system namespace.

$ kubectl get pod -n di-system
NAME                               READY   STATUS    RESTARTS   AGE
di-operator-57cc65d5c9-5vnvn   1/1     Running   0          59s
di-server-7b86ff8df4-jfgmp     1/1     Running   0          59s

Install global components of DIJob defined in AggregatorConfig:

kubectl create -f config/samples/agconfig.yaml -n di-system

Submit DIJob

# submit DIJob
$ kubectl create -f config/samples/dijob-cartpole.yaml

# get pod and you will see coordinator is created by di-operator
# a few seconds later, you will see collectors and learners created by di-server
$ kubectl get pod

# get logs of coordinator
$ kubectl logs cartpole-dqn-coordinator

User Guide

Refers to user-guide. For Chinese version, please refer to 中文手册

Contributing

Refers to developer-guide.

Contact us throw [email protected]

Comments
  • 在 Pod 内增加集群信息

    在 Pod 内增加集群信息

    希望以 dijob replica 方式提交时,每个 pod 都能见到整个 replica 的 host 信息和自己的启动顺序,增加以下几个环境变量:

    1. replica 中所有 pod 的 FQDN,依据启动顺序排序
    2. 当前 pod 的 FQDN
    3. 当前 pod 的顺序编号

    DI-engine 中会根据这些变量实现对应的网络连接,attach-to 的生成逻辑可以从 di-orchestrator 中移除

    enhancement 
    opened by sailxjx 3
  • add tasks to dijob spec

    add tasks to dijob spec

    1. goal

    There is only one pod template defined in a dijob, which results in that we can not define different commands or resources for different componets of di-engine such as collector, learner and evaluator. So we are supposed to find a more general way to define a custom resource of dijob.

    2. design *

    Inspired by VolcanoJob, we define the spec.tasks to describe different componets of di-engine. spec.tasks is a list, which allows us to define multiple tasks. We can specify different task.type to label the task as one of collector, learner, evaluator and none. none means the task is a general task, which is the default value.

    After change, the dijob can be defined as follow:

    apiVersion: diengine.opendilab.org/v2alpha1
    kind: DIJob
    metadata:
      name: job-with-tasks
    spec:
      priority: "normal"  # job priority, which is a reserved field for allocator
      backoffLimit: 0  # restart count
      cleanPodPolicy: "Running"  # the policy to clean pods after job completion
      preemptible: false  # job is preemtible or not
      minReplicas: 2  
      maxReplicas: 5
      tasks:
      - replicas: 1
        name: "learner"
        type: learner
        template:
          metadata:
            name: di
          spec:
            containers:
            - image: registry.sensetime.com/xlab/ding:nightly
              imagePullPolicy: IfNotPresent
              name: pydi
              env:
              - name: NCCL_DEBUG
                value: "INFO"
              command: ["/bin/bash", "-c",]
              args: 
              - |
                ditask --label learner xxx
              resources:
                requests:
                  cpu: "1"
                  nvidia.com/gpu: 1
            restartPolicy: Never
      - replicas: 1
        name: "evaluator"
        type: evaluator
        template:
          metadata:
            name: di
          spec:
            containers:
            - image: registry.sensetime.com/xlab/ding:nightly
              imagePullPolicy: IfNotPresent
              name: pydi
              env:
              - name: NCCL_DEBUG
                value: "INFO"
              command: ["/bin/bash", "-c",]
              args: 
              - |
                ditask --label evaluator xxx
            restartPolicy: Never
      - replicas: 2
        name: "collector"
        type: collector
        template:
          metadata:
            name: di
          spec:
            containers:
            - image: registry.sensetime.com/xlab/ding:nightly
              imagePullPolicy: IfNotPresent
              name: pydi
              env:
              - name: NCCL_DEBUG
                value: "INFO"
              command: ["/bin/bash", "-c",]
              args: 
              - |
                ditask --label collector xxx
            restartPolicy: Never
    status:
      conditions:
      - lastTransitionTime: "2022-05-26T07:25:11Z"
        lastUpdateTime: "2022-05-26T07:25:11Z"
        message: job created.
        reason: JobPending
        status: "False"
        type: Pending
      - lastTransitionTime: "2022-05-26T07:25:11Z"
        lastUpdateTime: "2022-05-26T07:25:11Z"
        message: job is starting since all pods are created.
        reason: JobStarting
        status: "False"
        type: Starting
      phase: Starting
      profilings: {}
      readyReplicas: 0
      replicas: 4
      taskStatus:
        learner:
          Pending: 1
        evaluator:
          Pending: 1
        collector:
          Pending: 2
      reschedules: 0
      restarts: 0
    

    task definition:

    type Task struct {
    	Name string `json:"name,omitempty"`
    
    	Type TaskType `json:"type,omitempty"`
    
    	Replicas int32 `json:"replicas,omitempty"`
    
    	Template corev1.PodTemplateSpec `json:"template,omitempty"`
    }
    
    type TaskType string
    
    const (
    	TaskTypeLearner TaskType = "learner"
    
    	TaskTypeCollector TaskType = "collector"
    
    	TaskTypeEvaluator TaskType = "evaluator"
    
    	TaskTypeNone TaskType = "none"
    )
    
    

    status.taskStatus definition:

    type DIJobStatus struct {
      // Phase defines the observed phase of the job
      // +kubebuilder:default=Pending
      Phase Phase `json:"phase,omitempty"`
    
      // ...
      
      // map for different task statuses. key: task.name, value: TaskStatus
      TaskStatus map[string]TaskStatus
    
      // ...
    }
    
    // count of different pod phases
    type TaskStatus map[corev1.PodPhase]int32
    
    enhancement 
    opened by konnase 1
  • new version for di-engine new architecture

    new version for di-engine new architecture

    release notes

    features

    • v1.0.0 for DI-engine new architecture
    • remove webhook
    • manage commands with cobra
    • refactor orchestrator architecture inspired from adaptdl
    • use gin to rewrite di-server
    • update di-server http interface
    enhancement 
    opened by konnase 1
  • v0.2.0

    v0.2.0

    • [x] split webhook and operator
    • [x] add dockerfile.dev
    • [x] update CleanPolicyALL to CleanPolicyAll
    • [x] remove k8s service related operations from server, and operator is responsible for managing services
    • [x] add e2e test
    enhancement 
    opened by konnase 1
  • refactor job spec

    refactor job spec

    • refactor job spec definition and add spec.tasks to support multi tasks #20
    • add DI_RANK to pod env and remove engineFields in job.spec #16
    • add e2e test
    • add validator to validate the correctness of dijob spec
    • change job.phase to Pending when job replicas scaled to 0
    • implement a processor to process di-server requests
    • refactor project structure
    enhancement 
    opened by konnase 0
  • Release/v1.0

    Release/v1.0

    release notes

    features

    • v1.0.0 for DI-engine new architecture
    • remove webhook
    • manage commands with cobra
    • refactor orchestrator architecture inspired from adaptdl
    • use gin to rewrite di-server
    • update di-server http interface
    enhancement 
    opened by konnase 0
  • fix: job failed submit when collector/learner missed

    fix: job failed submit when collector/learner missed

    job failed submit when collector/learner missed because webhook create an empty dijob, and golang builder add some default value to some feilds of collector/learner, which result in invalid type error. solved by make coordinator/collector/learner as pointers.

    bug 
    opened by konnase 0
  • Feat/job create event

    Feat/job create event

    • add event handler for dijob, and mark job as Created when job submitted
    • mark collector and learner as optional, only coordinator is required(https://github.com/opendilab/DI-orchestrator/pull/13/commits/653e64af01ec7752b08d4bf8381738d566fca224)
    • mark job Failed when the submitted job is incorrect(https://github.com/opendilab/DI-orchestrator/pull/13/commits/bea840a5eee3508be18b53b325168a5647daff94), but it's hard to test since client-go reflector decodes DIJob strictly, we have no chance to handle DIJob add event when incorrect job submitted
    • version -> v0.2.1
    enhancement 
    opened by konnase 0
  • allocate的一些问题

    allocate的一些问题

    1.目前的allocator的逻辑,对于不可被抢占的job的初始分配,仅利用minreplicas修改replicas属性,那job的pods部署到哪个节点是完全由K8S决定吗?而且Release1.13代码的allocator.go中对不可被抢占job的初始分配部分貌似还没有写。 2.job是否可以被抢占的含义具体是什么?和是否能被调度是不是等价的? 3.调度策略的FitPolicy的Allocate和Optimize方法也没有进行实现,这部分内容什么时候可以补充? 4.文档中存在许多与最新代码不符合的地方,比如DIJob.Spec.Group属性在代码中已经被移除,文档中提到的job.spec.minreplicas属性代码中也没有,而是在JobInfo中。可以更新一下文档吗? 感谢!

    opened by RZ-Q 3
Releases(v1.1.3)
  • v1.1.3(Aug 22, 2022)

  • v1.1.2(Jul 21, 2022)

    bugs fix

    • global cmd flag error(https://github.com/opendilab/DI-orchestrator/pull/23)
    • wrong pod subdomain(https://github.com/opendilab/DI-orchestrator/pull/24)
    • incorrect to get global rank(https://github.com/opendilab/DI-orchestrator/pull/25)
    Source code(tar.gz)
    Source code(zip)
    di-manager.yaml(445.36 KB)
  • v1.1.1(Jul 4, 2022)

  • v1.1.0(Jun 30, 2022)

    • refactor job spec definition and add spec.tasks to support multi tasks #20
    • add DI_RANK to pod env and remove engineFields in job.spec #16
    • add e2e test
    • add validator to validate the correctness of dijob spec
    • change job.phase to Pending when job replicas scaled to 0
    • implement a processor to process di-server requests
    • refactor project structure

    see details in https://github.com/opendilab/DI-orchestrator/pull/21

    Source code(tar.gz)
    Source code(zip)
    di-manager.yaml(374.01 KB)
  • v1.0.0(Mar 23, 2022)

  • v0.2.2(Dec 15, 2021)

  • v0.2.1(Oct 12, 2021)

    feature

    • add event handler for dijob, and mark job as Created when job submitted(https://github.com/opendilab/DI-orchestrator/pull/13)
    • mark collector and learner as optional, only coordinator is required(https://github.com/opendilab/DI-orchestrator/pull/13/commits/653e64af01ec7752b08d4bf8381738d566fca224)
    • mark job Failed when the submitted job is incorrect(https://github.com/opendilab/DI-orchestrator/pull/13/commits/bea840a5eee3508be18b53b325168a5647daff94), but it's hard to test since client-go reflector decodes DIJob strictly, we have no chance to handle DIJob add event when incorrect job submitted
    Source code(tar.gz)
    Source code(zip)
    di-manager.yaml(1.38 MB)
  • v0.2.0(Sep 28, 2021)

  • v0.2.0-rc.0(Sep 6, 2021)

    • split webhook and operator
    • add dockerfile.dev
    • update CleanPolicyALL to CleanPolicyAll
    • remove k8s service related operations from server, and operator is responsible for managing services
    • add e2e test
    Source code(tar.gz)
    Source code(zip)
  • v0.1.0(Jul 8, 2021)

    Features

    • Define DIJob CRD to support DI jobs' submission
    • Define AggregatorConfig CRD to support aggregator definition
    • Add webhook to validate DIJob submission
    • Provide http service for DI jobs to request for DI modules
    • Docs to introduce DI-orchestrator architecture
    Source code(tar.gz)
    Source code(zip)
Owner
OpenDILab
Open sourced Decision Intelligence (DI)
OpenDILab
Traffic4D: Single View Reconstruction of Repetitious Activity Using Longitudinal Self-Supervision

Traffic4D: Single View Reconstruction of Repetitious Activity Using Longitudinal Self-Supervision Project | PDF | Poster Fangyu Li, N. Dinesh Reddy, X

25 Dec 21, 2022
Tensorflow-seq2seq-tutorials - Dynamic seq2seq in TensorFlow, step by step

seq2seq with TensorFlow Collection of unfinished tutorials. May be good for educational purposes. 1 - simple sequence-to-sequence model with dynamic u

Matvey Ezhov 1k Dec 17, 2022
A Joint Video and Image Encoder for End-to-End Retrieval

Frozen️ in Time ❄️ ️️️️ ⏳ A Joint Video and Image Encoder for End-to-End Retrieval project page | arXiv | webvid-data Repository containing the code,

225 Dec 25, 2022
How the Deep Q-learning method works and discuss the new ideas that makes the algorithm work

Deep Q-Learning Recommend papers The first step is to read and understand the method that you will implement. It was first introduced in a 2013 paper

1 Jan 25, 2022
Unconstrained Text Detection with Box Supervisionand Dynamic Self-Training

SelfText Beyond Polygon: Unconstrained Text Detection with Box Supervisionand Dynamic Self-Training Introduction This is a PyTorch implementation of "

weijiawu 34 Nov 09, 2022
Differentiable Wavetable Synthesis

Differentiable Wavetable Synthesis

4 Feb 11, 2022
Ratatoskr: Worcester Tech's conference scheduling system

Ratatoskr: Worcester Tech's conference scheduling system In Norse mythology, Ratatoskr is a squirrel who runs up and down the world tree Yggdrasil to

4 Dec 22, 2022
Python implementation of Wu et al (2018)'s registration fusion

reg-fusion Projection of a central sulcus probability map using the RF-ANTs approach (right hemisphere shown). This is a Python implementation of Wu e

Dan Gale 26 Nov 12, 2021
Mercury: easily convert Python notebook to web app and share with others

Mercury Share your Python notebooks with others Easily convert your Python notebooks into interactive web apps by adding parameters in YAML. Simply ad

MLJAR 2.2k Dec 27, 2022
Implement the Pareto Optimizer and pcgrad to make a self-adaptive loss for multi-task

multi-task_losses_optimizer Implement the Pareto Optimizer and pcgrad to make a self-adaptive loss for multi-task 已经实验过了,不会有cuda out of memory情况 ##Par

14 Dec 25, 2022
This is the source code of the 1st place solution for segmentation task (with Dice 90.32%) in 2021 CCF BDCI challenge.

1st place solution in CCF BDCI 2021 ULSEG challenge This is the source code of the 1st place solution for ultrasound image angioma segmentation task (

Chenxu Peng 30 Nov 22, 2022
A micro-game "flappy bird".

1-o-flappy A micro-game "flappy bird". Gameplays The game will be installed at /usr/bin . The name of it is "1-o-flappy". You can type "1-o-flappy" to

1 Nov 06, 2021
MinHash, LSH, LSH Forest, Weighted MinHash, HyperLogLog, HyperLogLog++, LSH Ensemble

datasketch: Big Data Looks Small datasketch gives you probabilistic data structures that can process and search very large amount of data super fast,

Eric Zhu 1.9k Jan 07, 2023
NAS-FCOS: Fast Neural Architecture Search for Object Detection (CVPR 2020)

NAS-FCOS: Fast Neural Architecture Search for Object Detection This project hosts the train and inference code with pretrained model for implementing

Ning Wang 180 Dec 06, 2022
TEA: A Sequential Recommendation Framework via Temporally Evolving Aggregations

TEA: A Sequential Recommendation Framework via Temporally Evolving Aggregations Requirements python 3.6 torch 1.9 numpy 1.19 Quick Start The experimen

DMIRLAB 4 Oct 16, 2022
Keras implementation of Deeplab v3+ with pretrained weights

Keras implementation of Deeplabv3+ This repo is not longer maintained. I won't respond to issues but will merge PR DeepLab is a state-of-art deep lear

1.3k Dec 07, 2022
GNPy: Optical Route Planning and DWDM Network Optimization

GNPy is an open-source, community-developed library for building route planning and optimization tools in real-world mesh optical networks

Telecom Infra Project 140 Dec 19, 2022
This code is the implementation of the paper "Coherence-Based Distributed Document Representation Learning for Scientific Documents".

Introduction This code is the implementation of the paper "Coherence-Based Distributed Document Representation Learning for Scientific Documents". If

tsc 0 Jan 11, 2022
Neural style in TensorFlow! 🎨

neural-style An implementation of neural style in TensorFlow. This implementation is a lot simpler than a lot of the other ones out there, thanks to T

Anish Athalye 5.5k Dec 29, 2022
Repo for the paper "DiLBERT: Cheap Embeddings for Disease Related Medical NLP"

DiLBERT Repo for the paper "DiLBERT: Cheap Embeddings for Disease Related Medical NLP" Pretrained Model The pretrained model presented in the paper is

Kevin Roitero 2 Dec 15, 2022