
Deep Deterministic Policy Gradient (DDPG)

2022-04-23 02:56:00 【Live up to your youth】

Basic concepts

Discrete actions and continuous actions

Discrete actions are actions that can be enumerated as separate categories, such as up, down, left, right, and jump. A multi-class softmax activation is generally used on the output layer to represent them as a probability distribution over the actions. If there are only two actions, a sigmoid activation can be used instead.
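As a minimal sketch of the paragraph above, the snippet below turns a vector of raw scores into action probabilities with softmax and checks that the two-action case matches a sigmoid. The logit values and the `ACTIONS` list are hypothetical, chosen only for illustration.

```python
import math

def softmax(logits):
    """Turn raw scores into a probability distribution over discrete actions."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

ACTIONS = ["up", "down", "left", "right", "jump"]
logits = [2.0, 0.5, 0.1, -1.0, 0.3]  # hypothetical network outputs
probs = softmax(logits)
best_action = ACTIONS[probs.index(max(probs))]

# With two actions, softmax reduces to a sigmoid of the logit difference.
p_first = softmax([1.2, -0.4])[0]
p_sigmoid = sigmoid(1.2 - (-0.4))
```

The probabilities sum to 1, so they can be sampled from or compared directly, and `p_first` and `p_sigmoid` agree up to floating-point error.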

A continuous action is a real-valued quantity, such as a speed, an angle, or a force, that specifies an exact value. Continuous actions cannot be enumerated as categories, so they are generally represented with a regression-style (real-valued) output activation, such as the tanh function.

[Figure: a manipulator with an upper and a lower rotating axis]
As pictured, suppose we want to use reinforcement learning to train a policy that controls a manipulator. The upper axis can rotate within $[0, 2\pi]$ and the lower axis within $[0, \pi]$, so the action space is a multi-dimensional continuous space.
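One common way to produce such bounded continuous actions is to squash a raw network output with tanh and rescale it to the allowed interval. The sketch below assumes this approach; `scale_action` and the raw output values are hypothetical names and numbers used only for illustration.

```python
import math

def scale_action(raw, low, high):
    """Squash an unbounded output with tanh, then map [-1, 1] to [low, high]."""
    squashed = math.tanh(raw)
    return low + (squashed + 1.0) * 0.5 * (high - low)

# Rotation ranges of the two axes in the manipulator example.
UPPER_RANGE = (0.0, 2.0 * math.pi)
LOWER_RANGE = (0.0, math.pi)

raw_outputs = [0.0, -50.0]  # hypothetical pre-activation outputs, one per axis
upper_angle = scale_action(raw_outputs[0], *UPPER_RANGE)
lower_angle = scale_action(raw_outputs[1], *LOWER_RANGE)
```

A raw output of 0 lands in the middle of the range, and very large or very small raw outputs saturate at the range limits, so the action never leaves the legal interval.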

DDPG

The DQN algorithm uses a neural network to approximate the Q-function, which successfully solves the problem of high-dimensional state spaces. However, DQN can only handle discrete, low-dimensional action spaces. Many real-world control tasks have continuous, high-dimensional action spaces, and DQN cannot handle them.
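One way to see the limitation is to imagine forcing a continuous action space into DQN by discretizing each action dimension into a fixed number of bins. This is an illustrative assumption, not something the article proposes. The number of discrete actions DQN would have to score then grows exponentially with the number of dimensions.

```python
def num_discrete_actions(bins_per_dim, num_dims):
    """Number of joint actions when every dimension is split into the same number of bins."""
    return bins_per_dim ** num_dims

# Hypothetical settings: 10 bins per dimension.
two_axis_arm = num_discrete_actions(10, 2)    # two rotating axes
seven_axis_arm = num_discrete_actions(10, 7)  # a larger arm with seven axes
```

Even a coarse grid becomes impractical quickly, because DQN needs one Q output per discrete action and takes a maximum over all of them.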

The DDPG algorithm was proposed to solve this problem and can be viewed as an improved version of DQN. In DQN, an $\epsilon$-greedy policy or a Boltzmann distribution policy is used to select the action a. DDPG instead uses a neural network to fit the policy function and outputs the action a directly, so it can handle continuous actions and high-dimensional action spaces. DDPG can therefore be regarded as DQN for continuous action spaces.
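The contrast between the two action-selection styles can be sketched as follows. The Q values, the temperature, and the one-parameter `deterministic_actor` are hypothetical stand-ins for real networks, used only to show the shape of each approach.

```python
import math
import random

def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon explore uniformly, otherwise take the argmax action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda i: q_values[i])

def boltzmann(q_values, temperature, rng):
    """Sample an action with probability proportional to exp(Q / temperature)."""
    m = max(q_values)  # subtract the max for numerical stability
    weights = [math.exp((q - m) / temperature) for q in q_values]
    return rng.choices(range(len(q_values)), weights=weights, k=1)[0]

def deterministic_actor(state, weight, bias):
    """Toy stand-in for the Actor: maps a state directly to one continuous action."""
    return math.tanh(weight * state + bias)

rng = random.Random(0)
q_values = [0.1, 0.9, 0.3]  # hypothetical Q values, one per discrete action
greedy_action = epsilon_greedy(q_values, epsilon=0.0, rng=rng)
sampled_action = boltzmann(q_values, temperature=0.5, rng=rng)
continuous_action = deterministic_actor(state=0.4, weight=1.5, bias=-0.2)
```

Both DQN-style selectors return an index into a finite list of actions, while the Actor returns a real number, which is why it extends naturally to continuous action spaces.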
The DDPG training process is as follows. Compared with DQN, DDPG adds a policy network (Actor), so it has two parts: a policy network (Actor) and a value network (Critic). The policy network outputs actions, and the value network evaluates them. Each has its own update rule: the policy network is updated with the policy gradient formula, while the value network is updated toward a target value. Because the value network update is similar to DQN's, the rest of this section focuses on the policy network update.
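For reference, the value network update mentioned above can be sketched as a DQN-style one-step target followed by a squared-error loss. The discount factor `gamma`, the use of separate target networks to produce `next_q`, and the mini-batch numbers are assumptions made for this illustration and are not spelled out in the article.

```python
def td_target(reward, next_q, done, gamma=0.99):
    """One-step target r + gamma * Q'(s', mu'(s')), with no bootstrap on terminal steps."""
    return reward + (0.0 if done else gamma * next_q)

def critic_loss(predicted_qs, targets):
    """Mean squared error between the Critic's predictions and the targets."""
    n = len(targets)
    return sum((q - y) ** 2 for q, y in zip(predicted_qs, targets)) / n

# Hypothetical mini-batch of (reward, next-state Q estimate, done flag).
batch = [(1.0, 2.0, False), (0.5, 3.0, True)]
targets = [td_target(r, nq, d, gamma=0.9) for r, nq, d in batch]
loss = critic_loss([2.5, 0.5], targets)
```

In a full implementation, `next_q` would come from evaluating the Critic on the next state and the action chosen by the Actor for that state, and the loss would be minimized with respect to the Critic's parameters.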
[Figure: DDPG training process]
Policy gradient formula:
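In the form usually given for DDPG, the gradient of the objective $J$ with respect to the Actor parameters is approximated over a mini-batch of $N$ sampled states. The symbols below are notation assumed here rather than taken from this article: $\mu(s \mid \theta^\mu)$ is the Actor, $Q(s, a \mid \theta^Q)$ is the Critic, and $s_i$ is the $i$-th sampled state.

$$
\nabla_{\theta^\mu} J \approx \frac{1}{N} \sum_{i=1}^{N} \nabla_a Q(s, a \mid \theta^Q)\Big|_{s=s_i,\ a=\mu(s_i)} \, \nabla_{\theta^\mu} \mu(s \mid \theta^\mu)\Big|_{s=s_i}
$$

The first factor says how the Q value changes when the action changes, and the second says how the action changes when the Actor parameters change. Their product, by the chain rule, tells the Actor how to adjust its parameters to raise Q.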

This formula is easy to understand. For example, suppose that for the same state we output two different actions, a1 and a2, and the value network (Critic) returns two Q values, Q1 and Q2. If Q1 > Q2, taking action a1 yields more reward. So what is the idea behind the policy gradient? It is to increase the probability of a1 and decrease the probability of a2; in other words, the Actor wants to obtain as large a Q value as possible. The Actor's loss can therefore be understood simply: the larger the Q value fed back, the smaller the loss, and the smaller the Q value fed back, the larger the loss.
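The "larger Q, smaller loss" idea can be sketched by defining the Actor loss as the negative mean Q value and descending its gradient. The toy Critic, the one-parameter linear Actor, the learning rate, and the sample states below are all assumptions made for illustration. A real DDPG implementation would use neural networks and automatic differentiation.

```python
def critic_q(state, action):
    """Toy Critic (assumed for illustration): Q peaks when action == 2 * state."""
    return -(action - 2.0 * state) ** 2

def actor(state, theta):
    """Toy linear Actor: a = theta * s."""
    return theta * state

def actor_loss(states, theta):
    """Actor loss = negative mean Q, so a larger Q means a smaller loss."""
    return -sum(critic_q(s, actor(s, theta)) for s in states) / len(states)

def actor_grad(states, theta):
    """Chain rule: dLoss/dtheta = -mean(dQ/da * da/dtheta)."""
    total = 0.0
    for s in states:
        a = actor(s, theta)
        dq_da = -2.0 * (a - 2.0 * s)  # gradient of Q with respect to the action
        da_dtheta = s                 # gradient of the action with respect to theta
        total += dq_da * da_dtheta
    return -total / len(states)

states = [0.5, 1.0, 1.5]  # hypothetical sampled states
theta = 0.0
initial_loss = actor_loss(states, theta)
for _ in range(200):
    theta -= 0.1 * actor_grad(states, theta)  # gradient descent on the loss
final_loss = actor_loss(states, theta)
```

As training proceeds, `theta` moves toward 2, the value at which this toy Critic assigns the highest Q to the Actor's actions, and the loss falls accordingly.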