REINFORCE
2022-04-23 02:56:00 [Live up to your youth]
Basic concepts
The goal of a reinforcement learning problem is to perform a sequence of appropriate actions according to a policy so as to maximize the cumulative return. Reinforcement learning algorithms fall into three main categories: value-function-based methods, policy-based methods, and combinations of the two. In other words, a policy can be determined indirectly by approximating a value function and acting $\epsilon$-greedily with respect to it; the policy itself can be represented as a parameterized policy function; or the two approaches can be combined, learning a value function and a policy at the same time.
REINFORCE
REINFORCE is a policy-based algorithm. It uses the policy gradient method to parameterize the policy: the policy is represented as a function of a parameter set $\theta$, written $\pi_\theta(a|s)$, and solving for the update to $\theta$ is the core computation of the policy gradient method. The goal is to find the optimal $\theta$ that maximizes the objective function (also called the loss function), i.e. the expected return, where the return is the sum of rewards from the initial state to the terminal state.
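The article does not specify how $\pi_\theta(a|s)$ is parameterized; a common choice is a linear softmax policy. The sketch below assumes linear features and a made-up dimensionality (3 actions, 4 features) purely for illustration:

```python
import numpy as np

def softmax_policy(theta, state_features):
    """Compute pi_theta(a|s) as a softmax over linear action preferences.

    theta:          (n_actions, n_features) parameter matrix
    state_features: (n_features,) feature vector for state s
    """
    prefs = theta @ state_features      # one scalar preference per action
    prefs -= prefs.max()                # subtract max for numerical stability
    exp = np.exp(prefs)
    return exp / exp.sum()              # probabilities sum to 1

theta = np.zeros((3, 4))                # 3 actions, 4 features (illustrative)
probs = softmax_policy(theta, np.ones(4))
# with zero parameters the policy is uniform over the 3 actions
```

With all-zero parameters every action preference is equal, so the policy starts uniform; gradient updates to `theta` then shift probability mass toward better actions.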
First, consider the policy gradient of a one-step Markov decision process (MDP). In this setting, the initial state $s$ follows a distribution $d(s)$, the episode ends after one time step, and the reward is $r = r(s, a)$. The objective function is then:

$$J(\theta) = \mathbb{E}_{\pi_\theta}[r] = \sum_{s} d(s) \sum_{a} \pi_\theta(a|s)\, r(s, a)$$
To maximize the objective function $J(\theta)$, gradient ascent is used:

$$\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$$
where $\alpha$ is the step size, and the policy gradient is:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(a|s)\, r(s, a)\right]$$
In the policy gradient formula for a multi-step MDP, the Q-value function $q_\pi(s, a)$ replaces $r(s, a)$, which generalizes the one-step MDP gradient formula. The learning rule for the parameters $\theta$ is therefore:

$$\theta \leftarrow \theta + \alpha \nabla_\theta \log \pi_\theta(a_t|s_t)\, q_\pi(s_t, a_t)$$
The pseudocode of the REINFORCE algorithm is shown in the figure below; it uses the sampled return $v_t$ in place of the Q-value function $q_\pi(s, a)$.

[Figure: REINFORCE pseudocode]
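As an illustration of the update rule above (not the article's own code), here is a minimal REINFORCE loop on a hypothetical two-armed bandit. A bandit is a one-step MDP, so the return $v_t$ is just the single reward; the environment, learning rate, and episode count are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def policy(theta):
    """Softmax policy over two action preferences (single-state problem)."""
    e = np.exp(theta - theta.max())
    return e / e.sum()

def reinforce_bandit(episodes=2000, alpha=0.1):
    """REINFORCE on a 2-armed bandit: action 1 pays reward 1, action 0 pays 0.

    Each episode: sample an action from pi_theta, observe the return v_t,
    then apply  theta <- theta + alpha * v_t * grad log pi_theta(a).
    """
    theta = np.zeros(2)
    for _ in range(episodes):
        p = policy(theta)
        a = rng.choice(2, p=p)              # sample action from the policy
        v = float(a == 1)                   # return of this one-step episode
        grad_log = -p
        grad_log[a] += 1.0                  # grad of log softmax: 1{k=a} - p_k
        theta += alpha * v * grad_log       # REINFORCE update
    return theta

theta = reinforce_bandit()
# the learned policy should strongly prefer the rewarding arm (action 1)
```

Because only the rewarding arm produces a nonzero update, probability mass steadily shifts onto action 1, which is exactly the "follow the score function weighted by the return" behaviour the update formula describes.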
REINFORCE with Baseline
In a multi-step MDP environment, the return at each step has high variance. If a baseline function $B(s)$ is subtracted when defining the objective, the variance can be reduced without changing the overall expectation, which makes training more stable. The policy gradient then becomes:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(a|s)\,\big(q_\pi(s, a) - B(s)\big)\right]$$
In this case, the update rule for the parameters $\theta$ is:

$$\theta \leftarrow \theta + \alpha \nabla_\theta \log \pi_\theta(a_t|s_t)\,\big(v_t - B(s_t)\big)$$
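Continuing the same hypothetical bandit, a sketch of the baseline variant. The article leaves $B(s)$ unspecified; since the bandit has a single state, a running average of observed returns is used here as the baseline (a common choice, not the article's):

```python
import numpy as np

rng = np.random.default_rng(1)

def policy(theta):
    """Softmax policy over two action preferences (single-state problem)."""
    e = np.exp(theta - theta.max())
    return e / e.sum()

def reinforce_with_baseline(episodes=2000, alpha=0.1, beta=0.05):
    """Same 2-armed bandit, with update  (v_t - b) * grad log pi.

    b is a running mean of returns; subtracting it leaves the expected
    gradient unchanged but reduces the variance of each update.
    """
    theta = np.zeros(2)
    b = 0.0
    for _ in range(episodes):
        p = policy(theta)
        a = rng.choice(2, p=p)
        v = float(a == 1)                     # return of this episode
        grad_log = -p
        grad_log[a] += 1.0                    # grad of log softmax
        theta += alpha * (v - b) * grad_log   # baseline-corrected update
        b += beta * (v - b)                   # track the mean return
    return theta, b

theta, b = reinforce_with_baseline()
```

Note that with the baseline, even episodes with zero reward now produce a (negative-advantage) update, so the policy gets a learning signal on every episode instead of only on rewarded ones.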
Copyright notice: this article was written by [Live up to your youth]; please include the original link when reposting: https://yzsam.com/2022/04/202204220657127048.html