AC & A2C & A3C
2022-04-23 02:56:00 【Live up to your youth】
Basic concepts
Actor-Critic(AC)
The AC algorithm combines the value-based and the policy-based approaches. A value-based algorithm outputs the value of every action and chooses the action with the highest value; this kind of algorithm cannot select continuous actions, because it has to compare the values of all actions. A policy-based algorithm outputs the probability of each action that can be taken in the next step, and then chooses the action according to those probabilities.
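The difference between the two selection rules can be sketched in a few lines of Python. The function names and the numbers below are hypothetical and only illustrate the idea for a state with three discrete actions:

```python
import random

def value_based_action(action_values):
    """Value-based rule: pick the action with the highest estimated value."""
    return max(range(len(action_values)), key=lambda a: action_values[a])

def policy_based_action(action_probs, rng=random):
    """Policy-based rule: sample an action index according to its probability."""
    return rng.choices(range(len(action_probs)), weights=action_probs, k=1)[0]

# Hypothetical outputs for one state with three discrete actions.
q_values = [1.0, 2.5, 0.3]   # value-based: one value per action
probs = [0.2, 0.7, 0.1]      # policy-based: one probability per action

greedy = value_based_action(q_values)                    # always action 1 here
sampled = policy_based_action(probs, random.Random(0))   # varies with the seed
```

The greedy rule needs a value for every action before it can choose, while the sampling rule only needs a distribution to draw from.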
A simple AC algorithm (QAC) splits policy gradient learning into two parts:
1. Critic: approximates the state-action value function $q_\omega(s,a)$ with a linear Q-value function $\phi(s,a)^T\omega$, and updates the parameter $\omega$ with $TD(\lambda)$;
2. Actor: uses the value function $q_\omega(s,a)$ obtained by the Critic to guide the update of the policy function parameter $\theta$.
The pseudocode of the QAC algorithm is shown in the figure below:
[Figure: pseudocode of the QAC algorithm]
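As a rough stand-in for the pseudocode, here is a minimal Python sketch of one QAC update. It rests on several assumptions that are not in the original: a tabular softmax policy, one-hot features for $\phi(s,a)$, a one-step TD(0) critic update instead of $TD(\lambda)$, and hypothetical step sizes `alpha` and `beta` and discount `gamma`.

```python
import math

N_STATES, N_ACTIONS = 2, 2  # hypothetical toy problem size

def phi(s, a):
    """One-hot feature vector for the pair (s, a); an illustrative choice."""
    vec = [0.0] * (N_STATES * N_ACTIONS)
    vec[s * N_ACTIONS + a] = 1.0
    return vec

def q_value(w, s, a):
    """Linear critic: q_w(s, a) = phi(s, a)^T w."""
    return sum(f * wi for f, wi in zip(phi(s, a), w))

def policy(theta, s):
    """Softmax policy over the action preferences theta[s]."""
    m = max(theta[s])
    exps = [math.exp(p - m) for p in theta[s]]
    total = sum(exps)
    return [e / total for e in exps]

def qac_step(theta, w, s, a, r, s_next, a_next,
             alpha=0.1, beta=0.1, gamma=0.9):
    """One QAC update on the transition (s, a, r, s_next, a_next)."""
    q_sa = q_value(w, s, a)
    # TD(0) error of the critic (a simplification of TD(lambda)).
    delta = r + gamma * q_value(w, s_next, a_next) - q_sa
    # Actor: theta += alpha * grad log pi(a|s) * q_w(s, a).
    # For a softmax policy, d log pi(a|s) / d theta[s][b] = 1[a == b] - pi(b|s).
    probs = policy(theta, s)
    for b in range(N_ACTIONS):
        grad_log = (1.0 if b == a else 0.0) - probs[b]
        theta[s][b] += alpha * grad_log * q_sa
    # Critic: w += beta * delta * phi(s, a).
    for i, f in enumerate(phi(s, a)):
        w[i] += beta * delta * f
    return delta

theta = [[0.0, 0.0], [0.0, 0.0]]
w = [0.0] * (N_STATES * N_ACTIONS)
first_delta = qac_step(theta, w, s=0, a=1, r=1.0, s_next=1, a_next=0)
```

On the first call the critic still predicts zero everywhere, so only `w` changes; the actor starts moving once the critic's estimate for the visited pair is non-zero.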
Advantage Actor-Critic(A2C)
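In AC, the Critic uses the state-action value function $q_\omega(s,a)$, and the corresponding policy gradient of the objective $J(\theta)$ is:
$$\nabla_\theta J(\theta) \approx \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(s,a)\, q_\omega(s,a)\right]$$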
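In A2C, a baseline function $B(s)$ is subtracted when defining the objective function, which reduces the variance. The policy gradient then becomes:
$$\nabla_\theta J(\theta) \approx \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(s,a)\,\left(q_\omega(s,a) - B(s)\right)\right]$$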
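Subtracting $B(s)$ does not change the expected gradient, because the baseline depends only on the state. The short derivation below is a sketch for a discrete action set; $d^{\pi_\theta}(s)$, the state distribution under the policy, is a symbol introduced here and not used in the original:

```latex
\begin{aligned}
\mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(s,a)\, B(s)\right]
&= \sum_s d^{\pi_\theta}(s) \sum_a \pi_\theta(s,a)\, \nabla_\theta \log \pi_\theta(s,a)\, B(s) \\
&= \sum_s d^{\pi_\theta}(s)\, B(s) \sum_a \nabla_\theta \pi_\theta(s,a) \\
&= \sum_s d^{\pi_\theta}(s)\, B(s)\, \nabla_\theta \sum_a \pi_\theta(s,a) \\
&= \sum_s d^{\pi_\theta}(s)\, B(s)\, \nabla_\theta 1 = 0
\end{aligned}
```

Since this term is zero in expectation, the baseline only affects the variance of the gradient estimate, not its mean.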
Take $B(s)=v_{\pi_\theta}(s)$ and define $A^{\pi_\theta}(s,a)=q_\omega(s,a)-v_{\pi_\theta}(s)$, where $A^{\pi_\theta}(s,a)$ is called the advantage function. It measures how good or bad a state-action pair is relative to the average over the actions available in that state. If the advantage function is greater than zero, the action is better than the average action; if it is less than zero, the current action is worse than the average action.
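A small numerical example makes the sign of the advantage concrete. The sketch below assumes the state value is the policy-weighted average of the action values, $v(s)=\sum_a \pi(a|s)\,q(s,a)$; the helper name and the numbers are hypothetical.

```python
def advantages(q_values, action_probs):
    """Return (v, A) with v = sum_a pi(a|s) * q(s, a) and A(s, a) = q(s, a) - v."""
    v = sum(p * q for p, q in zip(action_probs, q_values))
    return v, [q - v for q in q_values]

# Hypothetical values for one state with three actions.
q_values = [1.0, 3.0, 2.0]
action_probs = [0.5, 0.25, 0.25]

v, adv = advantages(q_values, action_probs)
better_than_average = [a for a, x in enumerate(adv) if x > 0]
```

Here `v` is 1.75, so the first action has a negative advantage and the other two have positive advantages. Under the policy, the advantages average to zero.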
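The Actor updates the parameter $\theta$ by gradient ascent with step size $\alpha$:
$$\theta \leftarrow \theta + \alpha\, \nabla_\theta \log \pi_\theta(s,a)\, A^{\pi_\theta}(s,a)$$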
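The Critic no longer uses the parameter $\omega$; it uses the parameter $v$ instead, which parameterizes a state value function $V_v(s)$. In the one-step TD form, the parameter $v$ is updated with step size $\beta$ as:
$$v \leftarrow v + \beta\, \delta\, \nabla_v V_v(s)$$
where:
$$\delta = r + \gamma V_v(s') - V_v(s)$$
is the TD error for the reward $r$, next state $s'$ and discount factor $\gamma$. The TD error $\delta$ also serves as a sample estimate of the advantage.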
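The Actor and Critic updates can be put together in a short Python sketch. It is not the author's implementation: it assumes a tabular softmax policy, a tabular state value table `v` (so the gradient of the value with respect to `v` is one-hot at the visited state), the one-step TD error `r + gamma * v[s_next] - v[s]` as the advantage estimate, and hypothetical step sizes `alpha` and `beta`.

```python
import math

def softmax(prefs):
    """Turn action preferences into probabilities."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def a2c_step(theta, v, s, a, r, s_next, done,
             alpha=0.1, beta=0.1, gamma=0.9):
    """One-step advantage actor-critic update; returns the TD error."""
    target = r if done else r + gamma * v[s_next]
    delta = target - v[s]  # TD error, used as the advantage estimate
    # Actor: theta += alpha * grad log pi(a|s) * delta.
    probs = softmax(theta[s])
    for b in range(len(probs)):
        grad_log = (1.0 if b == a else 0.0) - probs[b]
        theta[s][b] += alpha * grad_log * delta
    # Critic: v += beta * delta * grad_v V_v(s), one-hot at s for a table.
    v[s] += beta * delta
    return delta

theta = [[0.0, 0.0], [0.0, 0.0]]  # action preferences per state
v = [0.0, 0.0]                    # state value estimates
delta = a2c_step(theta, v, s=0, a=1, r=1.0, s_next=1, done=False)
```

A positive TD error raises the preference of the action that was taken and lowers the others; a negative one does the opposite. Only one set of Critic parameters is needed, because the advantage is estimated from state values alone.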
Asynchronous Advantage Actor-Critic(A3C)
Copyright notice
This article was written by [Live up to your youth]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/04/202204220657127007.html