REINFORCE
2022-04-23 02:56:00 【Live up to your youth】
Basic concepts
The goal of a reinforcement learning problem is to choose a sequence of appropriate actions according to a policy so as to maximize the cumulative return. Reinforcement learning algorithms fall into three broad categories: value-based methods, policy-based methods, and methods that combine the two. In other words, we can determine a policy indirectly by approximating a value function and acting $\epsilon$-greedily with respect to it, we can parameterize and learn a policy function directly, or we can combine the two approaches and learn both a value function and a policy.
REINFORCE
REINFORCE is a policy-based algorithm. It uses the policy gradient method to parameterize the policy: the policy is represented as a function of a parameter vector $\theta$, written $\pi_\theta(a|s)$, and updating the parameters $\theta$ is exactly the computation the policy gradient method performs. The goal of the policy gradient method is to find the optimal $\theta$ that maximizes the objective function, i.e. the expected return, where the return is the sum of rewards from the initial state to the terminal state.
First, consider the policy gradient for a one-step Markov decision process (MDP). In this setting, the initial state $s$ is drawn from a distribution $d(s)$, the episode terminates after a single time step, and the reward is $r = r(s, a)$. The objective function is then:

$$J(\theta) = \mathbb{E}_{\pi_\theta}[r] = \sum_s d(s) \sum_a \pi_\theta(a|s)\, r(s,a)$$
To maximize the objective function $J(\theta)$, gradient ascent is used:

$$\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$$
where $\alpha$ is the step size, and the policy gradient is:

$$\nabla_\theta J(\theta) = \sum_s d(s) \sum_a \pi_\theta(a|s)\, \nabla_\theta \log \pi_\theta(a|s)\, r(s,a) = \mathbb{E}_{\pi_\theta}\big[\nabla_\theta \log \pi_\theta(a|s)\, r(s,a)\big]$$
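As a sanity check on this formula, the following sketch sets up a hypothetical two-armed bandit (a one-step MDP with a single state and a softmax policy; all names and numbers here are illustrative, not from the original article) and compares the Monte-Carlo score-function estimate against the analytic gradient:

```python
# Score-function (policy gradient) estimator on a one-step MDP:
#   grad J  ≈  mean over samples of  grad log pi(a) * r(a)
# For a softmax policy over 2 actions the analytic gradient of
# J(theta) = sum_a pi(a) r(a) w.r.t. the logits is pi_a * (r_a - J).
import numpy as np

rng = np.random.default_rng(0)
r = np.array([1.0, 3.0])          # fixed reward for each action (illustrative)
theta = np.array([0.2, -0.1])     # softmax logits (illustrative)

def pi(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

p = pi(theta)
J = p @ r
analytic = p * (r - J)            # analytic gradient of J w.r.t. the logits

# Monte-Carlo score-function estimate.
N = 200_000
actions = rng.choice(2, size=N, p=p)
grad_log = np.eye(2)[actions] - p  # grad log pi(a) w.r.t. logits = one_hot(a) - pi
estimate = (grad_log * r[actions, None]).mean(axis=0)

print(analytic, estimate)          # the two should agree closely
```

With a fixed reward function the two vectors agree to within Monte-Carlo noise, which shrinks as the sample count grows.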
In a multi-step MDP, the policy gradient formula replaces $r(s,a)$ with the Q-value function $q_\pi(s,a)$, which generalizes the one-step gradient formula:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\big[\nabla_\theta \log \pi_\theta(a|s)\, q_\pi(s,a)\big]$$

Therefore, the update rule for the parameters $\theta$ is:

$$\theta \leftarrow \theta + \alpha \nabla_\theta \log \pi_\theta(a_t|s_t)\, q_\pi(s_t, a_t)$$
The REINFORCE algorithm uses the sampled return $v_t$ in place of the Q-value function $q_\pi(s,a)$ (the pseudocode figure from the original article is omitted here): sample an episode $\{s_1, a_1, r_2, \ldots, s_{T-1}, a_{T-1}, r_T\}$ from $\pi_\theta$, then for each step $t$ update $\theta \leftarrow \theta + \alpha \nabla_\theta \log \pi_\theta(a_t|s_t)\, v_t$.
REINFORCE with Baseline
In a multi-step MDP environment, the per-step returns have high variance. If a baseline function $B(s)$ is subtracted when defining the objective's gradient, the variance can be reduced without changing the overall expectation, which makes training more stable. The gradient then becomes:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\big[\nabla_\theta \log \pi_\theta(a|s)\,\big(q_\pi(s,a) - B(s)\big)\big]$$
In this case, the update rule for the parameters $\theta$ is:

$$\theta \leftarrow \theta + \alpha \nabla_\theta \log \pi_\theta(a_t|s_t)\,\big(v_t - B(s_t)\big)$$
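The claim that a baseline leaves the expectation unchanged while shrinking the variance can be checked numerically. This sketch reuses a hypothetical two-armed-bandit setup (a one-step MDP with a single state; all numbers are illustrative) and compares the gradient estimator with baseline $b = 0$ against one with $b$ equal to the expected reward:

```python
# Baseline variance reduction: E[grad log pi(a) * (r_a - b)] is the same for
# any constant b, but the per-sample variance shrinks for a well-chosen b.
import numpy as np

rng = np.random.default_rng(2)
r = np.array([1.0, 3.0])                 # rewards (illustrative)
theta = np.array([0.2, -0.1])            # softmax logits (illustrative)
z = np.exp(theta - theta.max()); p = z / z.sum()

N = 200_000
actions = rng.choice(2, size=N, p=p)
grad_log = np.eye(2)[actions] - p        # grad log pi(a) w.r.t. logits

def stats(baseline):
    samples = grad_log * (r[actions, None] - baseline)
    return samples.mean(axis=0), samples.var(axis=0).sum()

mean_no_b, var_no_b = stats(0.0)
mean_b, var_b = stats(r @ p)             # baseline = expected reward

print(mean_no_b, mean_b)                 # expectations nearly identical
print(var_no_b, var_b)                   # total variance drops with baseline
```

In practice $B(s)$ is usually a learned state-value estimate $\hat v(s)$ rather than a constant, but the mechanism is the same: centering the return reduces the magnitude of the per-sample gradients without biasing them.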
Copyright notice
This article was created by [Live up to your youth]; please include a link to the original when reposting. Thanks.
https://yzsam.com/2022/04/202204220657127048.html