
OneFlow learning notes: from Functor to OpExprInterpreter

2022-04-23 08:41:00 | OneFlow deep learning framework


Written by | Moon Step

Updated by | Zhao Luyang

The previous article, 《OneFlow Learning Notes: Analysis of the Python-to-C++ Call Process》, traced the Python code down to the Functor layer. This article starts from the Functor layer and continues downward, ending at OpExprInterpreter.

1

Functor review

The Functor layer is part of OneFlow's infrastructure: it provides a unified entry point for op operations for both the Python side and the C++ side. This was analyzed in detail in 《Analysis of the Python-to-C++ Call Process》 using Relu as the example, chosen to keep the cost of understanding low. This article continues to follow the code with Relu as the example. The ReluFunctor code was already listed in that article; for continuity, here it is again in brief:

class ReluFunctor {
 public:
  ReluFunctor() { op_ = CHECK_JUST(one::OpBuilder("relu").Input("x", 1).Output("y", 1).Build()); }
  Maybe<Tensor> operator()(const std::shared_ptr<Tensor>& x, bool inplace) const {
    ...
    return OpInterpUtil::Dispatch<Tensor>(*op_, {x});
  }
 private:
  std::shared_ptr<OpExpr> op_;
};

The code is simple and can be divided into three parts:

  • The data structure: the class member variable op_, which is of type OpExpr. This is the subject of Section 2 below.

  • The constructor: it uses the helper class OpBuilder to initialize op_. Note that when Build() is finally called, it internally calls the static New function of UserOpExpr (described in Section 2) to create the object.

  • The function-call operator overload: it dispatches the actual computation through a Dispatch function, and the computation eventually runs on a concrete device. There are many details here; Section 3 covers the first part of this path, and the complete chain will be summarized later.

2

OpExpr

In the OneFlow framework an operator is represented by an OpExpr. Besides representing operators, OpExpr can also represent some other operations. Let's first look at the OpExpr inheritance hierarchy:

Figure 1: the OpExpr inheritance hierarchy

The OpExpr corresponding to an operator is usually UserOpExpr, at the bottom of the orange inheritance chain in Figure 1; its definition is in oneflow/core/framework/op_expr.h. I am not yet familiar with the other OpExpr types and will summarize them after studying them. Along the orange inheritance chain, the main data structures of each class are as follows:

1. OpExpr is an abstract base class with no data members.

2. BuiltinOpExpr is a relatively high-level and important base class. It mainly maintains the op_name and the input/output arg information:

class BuiltinOpExpr : public OpExpr {
  std::string op_name_;
  std::shared_ptr<const ArgTuple> input_arg_tuple_;
  std::shared_ptr<const ArgTuple> output_arg_tuple_;
};

3. BuiltinOpExprImpl mainly maintains the op proto and the grad func information. Subclasses use this class through CRTP, introduced in the earlier article 《C/C++ Miscellany: CRTP》, mainly to reuse the interface. The template parameter here is normally a type generated from a proto file, which is why it is named ProtoType. Taking the orange inheritance chain in Figure 1 as an example, the class is instantiated with UserOpConf, a data structure generated automatically from oneflow/core/framework/user_op_conf.proto. The main contents of BuiltinOpExprImpl and user_op_conf.proto are shown below:

template<typename ProtoType>
class BuiltinOpExprImpl : public BuiltinOpExpr {
  ProtoType op_proto_;
  mutable std::shared_ptr<OpExprGradFunctionIf> op_grad_func_;
};


// oneflow/core/framework/user_op_conf.proto
message UserOpConf {
  message ListString { repeated string s = 1; }
  required string op_type_name = 1;
  map<string, ListString> input = 2;
  map<string, ListString> output = 3;
  map<string, AttrValue> attr = 4;
  repeated string input_order = 5;
  repeated string output_order = 6;
}

4. Finally, UserOpExpr. It maintains the op's attrs, the shape infer function, the dtype infer function, and so on:

class UserOpExpr final : public BuiltinOpExprImpl<UserOpConf> {
  AttrMap base_attrs_;
  user_op::TensorDescInferFn shape_infer_fn_;
  user_op::DataTypeInferFn dtype_infer_fn_;
  user_op::DeviceInferFn device_infer_fn_;
  mutable HashMap<Symbol<Device>, std::shared_ptr<StatefulLocalOpKernel>> device2kernel_;
  std::shared_ptr<ConsistentTensorInferCache> consistent_tensor_infer_cache_;


public:
  static Maybe<UserOpExpr> New(const std::string& op_name, ...);
};

The interfaces of these classes largely mirror their data members, so they are easy to infer. Only one interface is listed above: the static New function of UserOpExpr, which creates a UserOpExpr object. The one::OpBuilder("relu") call seen earlier eventually calls this function to create the OpExpr object.
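
To make the connection with Section 1 concrete, here is a rough sketch of what OpBuilder("relu").Input("x", 1).Output("y", 1).Build() has to assemble before calling UserOpExpr::New. This is illustrative only: the field accessors follow the standard protobuf C++ API generated from user_op_conf.proto, but the logical blob names and the exact arguments passed to New are assumptions, and the real OpBuilder implementation differs in detail.

// Conceptual sketch, not the actual OneFlow implementation.
UserOpConf conf;
conf.set_op_type_name("relu");                    // op_type_name = 1
(*conf.mutable_input())["x"].add_s("relu/x_0");   // one input arg named "x"
(*conf.mutable_output())["y"].add_s("relu/y_0");  // one output arg named "y"

// Build() then creates the expression through the static factory listed above,
// roughly along the lines of (argument list assumed):
//   std::shared_ptr<UserOpExpr> op =
//       JUST(UserOpExpr::New(/*op_name=*/"relu", std::move(conf),
//                            /*input arg names=*/{"x_0"},
//                            /*output arg names=*/{"y_0"}));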

3

OpExprInterpreter

Simply put, OpExprInterpreter dispatches according to the concrete type of the OpExpr, routing the call into different downstream processing flows. In OneFlow these are called execution modes. OneFlow currently supports the eager and lazy execution modes, and eager is further subdivided into mirrored and consistent (note: since OneFlow v0.7.0 the latter is collectively referred to as "global"), as shown in the figure below:

Figure 2: the OpExprInterpreter inheritance hierarchy and AutogradInterpreter

Obviously, OpExprInterpreter derives the three interpreters mentioned above: mirrored, consistent, and lazy. In addition, Figure 2 shows a class marked in orange, AutogradInterpreter. It has a has-a relationship with OpExprInterpreter and provides an Apply interface through which one of the three execution modes is chosen. The simplified code is:

class AutogradInterpreter {
  std::shared_ptr<OpExprInterpreter> internal_;
public:
  Maybe<void> Apply(const OpExpr& op_expr, ...) const { ... }
};

Now let's start tracing from OpInterpUtil::Dispatch in the ReluFunctor code at the beginning of this article. The Dispatch called there is defined in oneflow/core/framework/op_interpreter/op_interpreter_util.h. It is a family of overloaded functions that can simply be regarded as a pile of helper functions: no matter which overload is called, the call is eventually funneled into the following Dispatch, located at oneflow/core/framework/op_interpreter/op_interpreter_util.cpp+142:

Maybe<void> OpInterpUtil::Dispatch(
      const OpExpr& op_expr, 
      const TensorTuple& inputs,
      TensorTuple* outputs,
      const OpExprInterpContext& ctx) {
  return JUST(GetInterpreter(inputs, ctx, op_expr))->Apply(op_expr, inputs, outputs, ctx);
}
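
For reference, the templated helper that ReluFunctor actually calls, Dispatch<Tensor>(*op_, {x}), funnels into the overload above roughly as sketched below. This is a sketch only: the exact declarations in op_interpreter_util.h differ, output_size() is an assumed accessor, and the two-argument form used in ReluFunctor presumably fills in a default OpExprInterpContext before reaching this point.

// Sketch of the Tensor-returning helper: allocate the output TensorTuple,
// forward to the four-argument Dispatch shown above, and return the single
// output tensor.
template<>
Maybe<Tensor> OpInterpUtil::Dispatch<Tensor>(const OpExpr& op_expr,
                                             const TensorTuple& inputs,
                                             const OpExprInterpContext& ctx) {
  TensorTuple outputs(op_expr.output_size());
  JUST(Dispatch(op_expr, inputs, &outputs, ctx));
  return outputs.at(0);
}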

Looking first at the parameters of the four-argument Dispatch: op_expr is the UserOpExpr object created earlier; TensorTuple can simply be thought of as a vector<Tensor>, so inputs/outputs are the corresponding input and output tensors (for details on Tensor in OneFlow, see Section 3 of the earlier article 《Concepts and Implementation of Global View》); the last parameter is of type OpExprInterpContext, which mainly carries the op's attribute information and is defined at oneflow/core/framework/op_interpreter.h+36. Its main data members are:

struct OpExprInterpContext {
  ...
  AttrMap attrs;
  Optional<Symbol<Device>> device;
  Optional<Symbol<ParallelDesc>> parallel_desc;
  Optional<Symbol<cfg::NdSbp>> nd_sbp;
  std::shared_ptr<user_op::OpKernelState> state;
};
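
As a side note on how this context gets filled: ReluFunctor has no attributes, but a functor for an op that does have them would, roughly speaking, collect them into an attribute map and hand them to Dispatch, which packs them into OpExprInterpContext::attrs. The following is a hypothetical sketch; the attribute name "alpha" and the exact Dispatch overload are assumptions.

// Hypothetical functor body for an op with one float attribute "alpha":
// the attributes travel to the interpreter inside OpExprInterpContext::attrs.
MutableAttrMap attrs;
JUST(attrs.SetAttr<float>("alpha", alpha));
return OpInterpUtil::Dispatch<Tensor>(*op_, {x}, attrs);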

Continuing with the GetInterpreter() call inside OpInterpUtil::Dispatch: based on the context information provided, it selects and returns the appropriate AutogradInterpreter object shown in Figure 2 above:

Maybe<AutogradInterpreter> GetInterpreter(const TensorTuple& inputs, const OpExprInterpContext& ctx,
                                          const OpExpr& op_expr) {
  static const auto& g_lazy_interpreter = BuildLazyInterpreter();
  static const auto& g_eager_consistent_interpreter = BuildEagerInterpreter(/*is_mirrored=*/false);
  static const auto& g_eager_mirrored_interpreter = BuildEagerInterpreter(/*is_mirrored=*/true);
  if (!LazyMode::is_enabled()) {
    if (inputs.empty()) {
      if (ctx.parallel_desc.has_value()) {
        JUST(ctx.nd_sbp);
        CHECK_OR_RETURN(!ctx.device.has_value());
        return g_eager_consistent_interpreter;
      } else {
        CHECK_OR_RETURN(!ctx.nd_sbp.has_value());
        return g_eager_mirrored_interpreter;
      }
    }
...
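
The BuildEagerInterpreter and BuildLazyInterpreter calls above are where the has-a relationship from Figure 2 is established: they wrap a concrete OpExprInterpreter inside an AutogradInterpreter. A simplified sketch of what BuildEagerInterpreter presumably does (the constructor argument is an assumption):

// Sketch: pick the concrete eager interpreter and wrap it in AutogradInterpreter.
std::shared_ptr<AutogradInterpreter> BuildEagerInterpreter(const bool& is_mirrored) {
  std::shared_ptr<OpExprInterpreter> internal;
  if (is_mirrored) {
    internal = std::make_shared<EagerMirroredInterpreter>();
  } else {
    internal = std::make_shared<EagerConsistentInterpreter>();
  }
  return std::make_shared<AutogradInterpreter>(internal);
}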

The returned AutogradInterpreter object's Apply interface is then called, which chooses among the three execution modes. Its implementation is at oneflow/core/framework/op_interpreter/op_interpreter.cpp+86:

Maybe<void> AutogradInterpreter::Apply(const OpExpr& op_expr, const TensorTuple& inputs,
                                       TensorTuple* outputs, const OpExprInterpContext& ctx) const {
  bool requires_grad = false;
  if (autograd::GradMode::is_enabled() && !JUST(op_expr.IsGradDisabled())) {
    requires_grad =
        std::any_of(inputs.begin(), inputs.end(),
                    [](const std::shared_ptr<Tensor>& tensor) { return tensor->requires_grad(); });
  }
  {
    autograd::AutoGradMode mode(false);
    JUST(internal_->Apply(op_expr, inputs, outputs, ctx));
  }
  // Lazy mode will construct backward compute graph in passes, so disable autograd if lazy mode.
  std::shared_ptr<OpExprGradClosure> grad_closure(nullptr);
  if (requires_grad && !LazyMode::is_enabled()) {
    grad_closure = JUST(op_expr.GetOrCreateOpGradClosure());
    auto backward_fn =
        std::make_shared<std::function<Maybe<void>(const TensorTuple&, TensorTuple*, bool)>>(
            [=](const TensorTuple& out_grads, TensorTuple* in_grads,
                bool create_graph) -> Maybe<void> {
              autograd::AutoGradMode mode(create_graph);
              JUST(grad_closure->Apply(out_grads, in_grads));
              return Maybe<void>::Ok();
            });
    JUST(GetThreadLocalAutogradEngine()->AddBackwardFuncPtr(op_expr.op_type_name() + "_backward",
                                                            backward_fn, inputs, outputs));
  }
  // Update outputs autograd meta
  // Note: if requires_grad is True, we will create a new autograd meta for each output
  // in `AddBackwardFuncPtr` to support inplace operation, so the update should after
  // `AddBackwardFuncPtr`
  for (auto& output : *outputs) {
    output->set_is_leaf(inputs.size() == 0 || !requires_grad);
    if (!output->requires_grad()) {
      JUST(output->set_requires_grad(
          requires_grad && IsSupportRequireGradDataType(output->dtype()->data_type())));
    }
  }
  if (requires_grad && !LazyMode::is_enabled()) {
    // Capture inputs and outputs after `AddBackwardFuncPtr` because of that grad function
    // node has been attached to them.
    JUST(grad_closure->Capture(inputs, *outputs, ctx));
  }
  return Maybe<void>::Ok();
}

The key line here is JUST(internal_->Apply(op_expr, inputs, outputs, ctx)); (most of the remaining code relates to backward, while this article only follows the forward main line). It actually calls the Apply function of the OpExprInterpreter it holds. Next, following Figure 2, let's look at the subsequent flow for the three interpreters.

3.1 Mirrored mode

If the mirrored execution mode is chosen, internal_->Apply actually calls Apply in EagerInterpreter, the base class of EagerMirroredInterpreter, located at oneflow/core/framework/op_interpreter/op_interpreter.cpp+51:

Maybe<void> EagerInterpreter::Apply(const OpExpr& op_expr, ...) const {
#define APPLY_IF(op_type)                                              \
  if (const auto* op = dynamic_cast<const op_type##Expr*>(&op_expr)) { \
    return ApplyImpl(*op, inputs, outputs, ctx);                       \
  }


  APPLY_IF(UserOp);
  APPLY_IF(VariableOp);
  APPLY_IF(CastToMirroredOp);
  ...
}

Here dynamic_cast is used to dispatch dynamically according to the actual type of the OpExpr; the following figure may help in understanding this:

Figure 3: dispatching by the concrete OpExpr type
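
For clarity, this is roughly what the macro invocation APPLY_IF(UserOp) expands to after token pasting (written out by hand for illustration):

// Expansion of APPLY_IF(UserOp): op_type##Expr becomes UserOpExpr, so the cast
// succeeds only when op_expr really is a UserOpExpr, and the matching ApplyImpl
// overload is then chosen on the casted type.
if (const auto* op = dynamic_cast<const UserOpExpr*>(&op_expr)) {
  return ApplyImpl(*op, inputs, outputs, ctx);
}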

Coming from ReluFunctor, what was created is a UserOpExpr, so the call lands in the following ApplyImpl of EagerMirroredInterpreter, located at oneflow/core/framework/op_interpreter/eager_mirrored_op_interpreter.cpp+191:

Maybe<void> EagerMirroredInterpreter::ApplyImpl(const UserOpExpr& op_expr,
                                                const TensorTuple& inputs, TensorTuple* outputs,
                                                const OpExprInterpContext& ctx) const {
  return NaiveInterpret(op_expr, inputs, outputs, ctx);
}

This in turn calls the NaiveInterpret function in the same file. The function is quite long and mainly prepares for entering the OneFlow virtual machine. One of the most important preparations is creating the vm::EagerBlobObject objects used by the virtual machine from the input and output Tensor objects. EagerBlobObject is defined at oneflow/core/eager/eager_blob_object.h+83, and its main data members are:

class EagerBlobObject final : public BlobObject {
  std::unique_ptr<Blob> blob_;
  std::unique_ptr<char[]> header_buffer_;
  std::shared_ptr<TensorStorage> tensor_storage_;
  std::atomic<bool> is_shape_synced_;
  int64_t storage_offset_;
  intrusive::shared_ptr<LocalDepObject> compute_local_dep_object_;
};

Among the data members of EagerBlobObject, Blob and TensorStorage maintain the actual data storage. In addition, as the code above shows, EagerBlobObject itself is part of an inheritance hierarchy, summarized below:

Figure 4: the EagerBlobObject inheritance hierarchy

NaiveInterpret contains quite a lot of code, mostly preparation for entering the virtual machine. Its last block, shown below, is the entry point into the OneFlow virtual machine:

Maybe<void> NaiveInterpret(const UserOpExpr& user_op_expr, ...) {
  ...
  JUST(PhysicalRun([&](InstructionsBuilder* builder) -> Maybe<void> {
    return builder->LocalCallOpKernel(
        kernel, 
        input_eager_blob_objects, 
        output_eager_blob_objects,
        ctx, 
        op_device);
  }));
  return Maybe<void>::Ok();
}

The virtual machine is outside the scope of this article; I will take time to study it further later.

3.2 Global mode

The concept of Global was analyzed in detail in the earlier article 《Concepts and Implementation of Global View》, so it is used directly here. If the Global execution mode is chosen, internal_->Apply, just as in the mirrored mode, also calls Apply in EagerInterpreter, located at oneflow/core/framework/op_interpreter/op_interpreter.cpp+51:

Maybe<void> EagerInterpreter::Apply(const OpExpr& op_expr, ...) const {
#define APPLY_IF(op_type)                                              \
  if (const auto* op = dynamic_cast<const op_type##Expr*>(&op_expr)) { \
    return ApplyImpl(*op, inputs, outputs, ctx);                       \
  }


  APPLY_IF(UserOp);
  APPLY_IF(VariableOp);
  APPLY_IF(CastToMirroredOp);
  ...
}

Here dynamic_cast dispatches according to the actual type of the OpExpr (UserOpExpr in this article's example) to the ApplyImpl function of EagerConsistentInterpreter, defined at oneflow/core/framework/op_interpreter/eager_consistent_op_interpreter.cpp+194:

Maybe<void> EagerConsistentInterpreter::ApplyImpl(const UserOpExpr& op_expr,
                                                  const TensorTuple& inputs, TensorTuple* outputs,
                                                  const OpExprInterpContext& ctx) const {
  return InterpretThenInitConsistentId(op_expr, inputs, outputs, ctx);
}

Here InterpretThenInitConsistentId is a function pointer that points to the Interpret function wrapped by the NonRecursiveInitConsistentId decorator. Let's take a brief look at the decorator code, starting with the DECORATE macro, located at oneflow/core/common/decorator.h+39:

template<template<typename...> class Decorator>
struct WithDecorator final {
  template<typename T, typename = void>
  struct Decorate;
  template<typename T, typename... Args>
  struct Decorate<T (*)(Args...)> final {
    template<T (*func)(Args...)>
    static T Call(Args... args) {
      return Decorator<T, Args...>::template Call<func>(args...);
    }
  };
};


#define DECORATE(fn_ptr, decorator) \
  (&WithDecorator<decorator>::Decorate<decltype(fn_ptr)>::Call<fn_ptr>)
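
The template machinery is easier to digest with a small, self-contained toy example. This is not OneFlow code; it only reuses the WithDecorator/DECORATE definitions above together with a made-up decorator:

#include <iostream>

// A made-up decorator that counts how many times the wrapped function is called.
// It has the shape WithDecorator expects: Decorator<T, Args...>::Call<func>(args...).
template<typename T, typename... Args>
struct CountCalls {
  template<T (*func)(Args...)>
  static T Call(Args... args) {
    static int n = 0;
    std::cout << "call #" << ++n << std::endl;
    return func(args...);
  }
};

int Add(int a, int b) { return a + b; }

int main() {
  // Same pattern as InterpretThenInitConsistentId below: a plain function pointer
  // whose body is the decorator wrapped around the original function.
  auto* decorated_add = DECORATE(&Add, CountCalls);
  std::cout << decorated_add(1, 2) << std::endl;  // prints "call #1", then "3"
}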

Here WithDecorator is the decorator wrapper; Decorator is its template template parameter, representing the actual decorator whose Call function is eventually invoked. In this case WithDecorator is instantiated with NonRecursiveInitConsistentId as the Decorator. NonRecursiveInitConsistentId is defined at oneflow/core/framework/tensor_consistent_id.h+35:

template<typename Arg0, typename Arg1, typename... Args>
struct NonRecursiveInitConsistentId<Maybe<void>, Arg0, Arg1, TensorTuple*, Args...> {
  template<Maybe<void> (*func)(Arg0, Arg1, TensorTuple*, Args...)>
  static Maybe<void> Call(Arg0 arg0, Arg1 arg1, TensorTuple* outputs, Args... args) {
    auto* recursive_depth = MutThreadLocalConsistentIdDepth();
    ++*recursive_depth;
    Maybe<void> ret = func(arg0, arg1, outputs, args...);
    --*recursive_depth;
    if (*recursive_depth == 0 && ret.IsOk()) { JUST(InitConsistentId(outputs)); }
    return ret;
  }
};
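
Putting the two pieces together, the InterpretThenInitConsistentId pointer mentioned above is presumably produced by applying the macro to Interpret, along these lines (a sketch; the actual declaration in OneFlow may differ):

// Assumed wiring: decorate Interpret so that InitConsistentId runs exactly once
// per top-level (non-recursive) interpretation.
static constexpr auto* InterpretThenInitConsistentId =
    DECORATE(&Interpret, NonRecursiveInitConsistentId);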

As can be seen above, the role of the NonRecursiveInitConsistentId decorator is to ensure that InitConsistentId is executed only once. Continuing along the main line of the eager mode, the function wrapped by this decorator is Interpret, located at oneflow/core/framework/op_interpreter/eager_consistent_op_interpreter.cpp+112. The function is also fairly long; in summary it mainly does the following things:

  • Creates the ConsistentTensorMeta information described in Section 3 of the earlier article 《Concepts and Implementation of Global View》, and stores it in the ConsistentTensorInferResult data structure

  • Creates the corresponding EagerConsistentTensorImpl and ConsistentTensor for each output

  • Based on the input and output Tensors, creates the vm::EagerBlobObject objects shown in Figure 4 above; these objects are used in the OneFlow virtual machine. A boxing operation may happen in between; I am not yet familiar with that part and will summarize it separately later

  • Enters the virtual machine, which schedules and executes the current op

The simplified code is shown below:

Maybe<void> Interpret(const UserOpExpr& user_op_expr, const TensorTuple& inputs,
                      TensorTuple* outputs, const OpExprInterpContext& ctx) {
  // step 1
  const auto& infer_args = JUST(ConsistentTensorMetaInferArgs::New(ctx.attrs, inputs));
  std::shared_ptr<const ConsistentTensorInferResult> result =
      JUST(user_op_expr.mut_consistent_tensor_infer_cache()->GetOrInfer(*infer_args));
  const auto& output_tensor_metas = result->output_tensor_metas();
  // step 2
  for (int i = 0; i < outputs->size(); ++i) {
    if (!outputs->at(i)) {
      const auto& tensor_impl = JUST(EagerConsistentTensorImpl::New(
          output_tensor_metas.at(i), tensor_device, parallel_id, false, false));
      outputs->at(i).reset(new ConsistentTensor(tensor_impl));
    }
  }
  // step 3
  for (int i = 0; i < inputs.size(); ++i) {
    const auto& local_tensor = JUST(inputs.at(i)->cur_rank_phy_tensor());
    input_eager_blob_objects->at(i) = JUST(local_tensor->eager_blob_object());
  }
  for (int i = 0; i < outputs->size(); ++i) {
    const auto& local_tensor = JUST(outputs->at(i)->cur_rank_phy_tensor());
    output_eager_blob_objects->at(i) = JUST(local_tensor->eager_blob_object());
  }
  // step 4
  JUST(PhysicalRun([&](InstructionsBuilder* builder) -> Maybe<void> {
    return builder->LocalCallOpKernel(kernel, input_eager_blob_objects, output_eager_blob_objects,
                                      result, ctx, result->op_device());
  }));
  return Maybe<void>::Ok();
}

This is the main line of work done by EagerConsistentInterpreter before entering the virtual machine.

3.3 Lazy mode

I am not yet familiar with this part; I will summarize it separately once I am.

This article mainly sorted out the main responsibilities and implementation of OpExprInterpreter. The main references are the official OneFlow code and some of the earlier related articles.


You are welcome to download and try the latest version, OneFlow v0.7.0:

GitHub - Oneflow-Inc/oneflow: OneFlow is a performance-centered and open-source deep learning framework.
https://github.com/Oneflow-Inc/oneflow/
