REINFORCE
2022-04-23 02:56:00 [Live up to your youth]
Basic concepts
The goal of a reinforcement learning problem is to perform a sequence of appropriate actions according to a policy so as to maximize the cumulative return. Reinforcement learning algorithms fall into three main categories: value-function-based methods, policy-based methods, and combinations of the two. In other words, a policy can be determined indirectly by approximating a value function and acting $\epsilon$-greedily with respect to it; the policy itself can be represented as a parameterized policy function; or the two approaches can be combined, learning a value function and a policy at the same time.
REINFORCE
REINFORCE is a policy-based algorithm. It uses the policy gradient method to parameterize the policy: the policy is represented as a function of a parameter set $\theta$, written $\pi_\theta(a|s)$, and solving for the update to $\theta$ is the core computation of the policy gradient method. The goal is to find the optimal $\theta$ that maximizes the objective function (also called the loss function), i.e. the expected return, where the return is the sum of rewards from the initial state to the terminal state.
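The article does not specify how $\pi_\theta(a|s)$ is parameterized; a common choice is a linear softmax policy. The sketch below assumes linear features and a made-up dimensionality (3 actions, 4 features) purely for illustration:

```python
import numpy as np

def softmax_policy(theta, state_features):
    """Compute pi_theta(a|s) as a softmax over linear action preferences.

    theta:          (n_actions, n_features) parameter matrix
    state_features: (n_features,) feature vector for state s
    """
    prefs = theta @ state_features      # one scalar preference per action
    prefs -= prefs.max()                # subtract max for numerical stability
    exp = np.exp(prefs)
    return exp / exp.sum()              # probabilities sum to 1

theta = np.zeros((3, 4))                # 3 actions, 4 features (illustrative)
probs = softmax_policy(theta, np.ones(4))
# with zero parameters the policy is uniform over the 3 actions
```

With all-zero parameters every action preference is equal, so the policy starts uniform; gradient updates to `theta` then shift probability mass toward better actions.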
First, consider the policy gradient of a one-step Markov decision process (MDP). In this setting, the initial state $s$ follows a distribution $d(s)$, the episode ends after one time step, and the reward is $r = r(s, a)$. The objective function is then:

$$J(\theta) = \mathbb{E}_{\pi_\theta}[r] = \sum_{s} d(s) \sum_{a} \pi_\theta(a|s)\, r(s, a)$$
To maximize the objective function $J(\theta)$, gradient ascent is used:

$$\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$$
where $\alpha$ is the step size, and the policy gradient is:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(a|s)\, r(s, a)\right]$$
In the policy gradient formula for a multi-step MDP, the Q-value function $q_\pi(s, a)$ replaces $r(s, a)$, which generalizes the one-step MDP gradient formula. The learning rule for the parameters $\theta$ is therefore:

$$\theta \leftarrow \theta + \alpha \nabla_\theta \log \pi_\theta(a_t|s_t)\, q_\pi(s_t, a_t)$$
The pseudocode of the REINFORCE algorithm is shown in the figure below; it uses the sampled return $v_t$ in place of the Q-value function $q_\pi(s, a)$.

[Figure: REINFORCE pseudocode]
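As an illustration of the update rule above (not the article's own code), here is a minimal REINFORCE loop on a hypothetical two-armed bandit. A bandit is a one-step MDP, so the return $v_t$ is just the single reward; the environment, learning rate, and episode count are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def policy(theta):
    """Softmax policy over two action preferences (single-state problem)."""
    e = np.exp(theta - theta.max())
    return e / e.sum()

def reinforce_bandit(episodes=2000, alpha=0.1):
    """REINFORCE on a 2-armed bandit: action 1 pays reward 1, action 0 pays 0.

    Each episode: sample an action from pi_theta, observe the return v_t,
    then apply  theta <- theta + alpha * v_t * grad log pi_theta(a).
    """
    theta = np.zeros(2)
    for _ in range(episodes):
        p = policy(theta)
        a = rng.choice(2, p=p)              # sample action from the policy
        v = float(a == 1)                   # return of this one-step episode
        grad_log = -p
        grad_log[a] += 1.0                  # grad of log softmax: 1{k=a} - p_k
        theta += alpha * v * grad_log       # REINFORCE update
    return theta

theta = reinforce_bandit()
# the learned policy should strongly prefer the rewarding arm (action 1)
```

Because only the rewarding arm produces a nonzero update, probability mass steadily shifts onto action 1, which is exactly the "follow the score function weighted by the return" behaviour the update formula describes.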
REINFORCE with Baseline
In a multi-step MDP environment, the return at each step has high variance. If a baseline function $B(s)$ is subtracted when defining the objective, the variance can be reduced without changing the overall expectation, which makes training more stable. The policy gradient then becomes:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(a|s)\,\big(q_\pi(s, a) - B(s)\big)\right]$$
In this case, the update rule for the parameters $\theta$ is:

$$\theta \leftarrow \theta + \alpha \nabla_\theta \log \pi_\theta(a_t|s_t)\,\big(v_t - B(s_t)\big)$$
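Continuing the same hypothetical bandit, a sketch of the baseline variant. The article leaves $B(s)$ unspecified; since the bandit has a single state, a running average of observed returns is used here as the baseline (a common choice, not the article's):

```python
import numpy as np

rng = np.random.default_rng(1)

def policy(theta):
    """Softmax policy over two action preferences (single-state problem)."""
    e = np.exp(theta - theta.max())
    return e / e.sum()

def reinforce_with_baseline(episodes=2000, alpha=0.1, beta=0.05):
    """Same 2-armed bandit, with update  (v_t - b) * grad log pi.

    b is a running mean of returns; subtracting it leaves the expected
    gradient unchanged but reduces the variance of each update.
    """
    theta = np.zeros(2)
    b = 0.0
    for _ in range(episodes):
        p = policy(theta)
        a = rng.choice(2, p=p)
        v = float(a == 1)                     # return of this episode
        grad_log = -p
        grad_log[a] += 1.0                    # grad of log softmax
        theta += alpha * (v - b) * grad_log   # baseline-corrected update
        b += beta * (v - b)                   # track the mean return
    return theta, b

theta, b = reinforce_with_baseline()
```

Note that with the baseline, even episodes with zero reward now produce a (negative-advantage) update, so the policy gets a learning signal on every episode instead of only on rewarded ones.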
Copyright notice: this article was written by [Live up to your youth]; please include the original link when reposting: https://yzsam.com/2022/04/202204220657127048.html