Deep Deterministic Policy Gradient (DDPG)
2022-04-23 02:56:00 【Live up to your youth】
Basic concepts
Discrete actions & continuous actions
Discrete actions are actions that can be enumerated as categories, such as up, down, left, right, and jump. They are usually represented with a multi-class activation function such as softmax; if there are only two actions, a sigmoid activation can be used.
A continuous action is a continuous value, such as a speed, an angle, or a force, denoting an exact quantity. Continuous actions cannot be treated as classes, so they are usually represented with activation functions that output real values, such as tanh.
As shown in the figure, suppose we want to use reinforcement learning to train a policy to control a robotic arm whose upper joint can rotate within [0, 2π] and whose lower joint can rotate within [0, π]. Its action space is then a multi-dimensional continuous space.
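To make the two representations concrete, here is a minimal sketch of the two kinds of output heads. PyTorch and all class names below are illustrative assumptions; the post names no framework:

```python
import torch
import torch.nn as nn

class DiscreteActionHead(nn.Module):
    """Scores n discrete actions (e.g. up/down/left/right/jump) with softmax."""
    def __init__(self, state_dim, n_actions):
        super().__init__()
        self.fc = nn.Linear(state_dim, n_actions)

    def forward(self, s):
        return torch.softmax(self.fc(s), dim=-1)  # one probability per action

class ContinuousActionHead(nn.Module):
    """Outputs one exact value in [low, high] by rescaling tanh."""
    def __init__(self, state_dim, low, high):
        super().__init__()
        self.fc = nn.Linear(state_dim, 1)
        self.low, self.high = low, high

    def forward(self, s):
        t = torch.tanh(self.fc(s))  # tanh squashes the output into (-1, 1)
        return self.low + (t + 1.0) / 2.0 * (self.high - self.low)

# e.g. the arm's upper joint above: ContinuousActionHead(state_dim, 0.0, 2 * math.pi)
```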
DDPG
The DQN algorithm uses a neural network to approximate the Q-function and successfully handles high-dimensional state spaces. However, DQN can only deal with discrete, low-dimensional action spaces, while many real-world control tasks have continuous, high-dimensional action spaces that DQN cannot handle.
The DDPG algorithm was proposed to solve exactly this kind of problem and can be seen as an improved version of DQN. In DQN, an $\epsilon$-greedy strategy or a Boltzmann distribution strategy is used to select the action a; DDPG instead uses a neural network to fit the policy function and output the action a directly, which lets it handle continuous and high-dimensional action spaces. DDPG can therefore be regarded as a DQN for continuous action spaces.
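The difference in action selection can be sketched in a few lines. The names `q_net` and `actor` are hypothetical placeholders, and the Gaussian exploration noise is a common simplification (the original DDPG paper uses Ornstein-Uhlenbeck noise), not something the post specifies:

```python
import random
import torch

def dqn_select_action(q_net, state, epsilon, n_actions):
    """DQN: the network scores every discrete action; pick the argmax (epsilon-greedy)."""
    if random.random() < epsilon:
        return random.randrange(n_actions)         # explore uniformly
    with torch.no_grad():
        return q_net(state).argmax(dim=-1).item()  # exploit the highest Q value

def ddpg_select_action(actor, state, noise_std=0.1):
    """DDPG: the actor emits the continuous action directly; added noise explores."""
    with torch.no_grad():
        action = actor(state)
    return action + noise_std * torch.randn_like(action)
```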
The DDPG training process is as follows. DDPG is similar to DQN but adds a policy network (Actor), so it has two parts: a policy network (Actor) and a value network (Critic). The policy network outputs actions, and the value network evaluates them. Each has its own update rule: the policy network is updated with the policy gradient formula, while the value network is updated toward a target value. The value network update in DDPG is similar to DQN's, so the following focuses on the policy network update.
Policy gradient formula:
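The formula itself appears to have been an image that did not survive extraction. For reference, this is the sampled deterministic policy gradient as given in the DDPG paper (Lillicrap et al.), where $\mu(s \mid \theta^{\mu})$ is the Actor and $Q(s, a \mid \theta^{Q})$ is the Critic:

$$\nabla_{\theta^{\mu}} J \approx \frac{1}{N} \sum_{i} \nabla_{a} Q(s, a \mid \theta^{Q}) \Big|_{s=s_i,\, a=\mu(s_i)} \, \nabla_{\theta^{\mu}} \mu(s \mid \theta^{\mu}) \Big|_{s=s_i}$$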
This formula is easy to understand intuitively. For the same state, suppose we output two different actions a1 and a2, and the value network feeds back two Q values, Q1 and Q2. If Q1 > Q2, taking action a1 yields more reward. The idea of the policy gradient is then to increase the probability of a1 and decrease the probability of a2; in other words, the Actor wants to output actions with as large a Q value as possible. So the Actor's loss can be understood simply: the larger the Q value fed back, the smaller the loss; the smaller the Q value, the larger the loss.
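In code, this intuition comes down to one line: the Actor's loss is the negative of the Critic's score for the Actor's own action. A minimal sketch under the same PyTorch assumption, with placeholder names not taken from the post:

```python
def actor_update(actor, critic, states, actor_opt):
    """One Actor step: push the policy toward actions the Critic scores highly."""
    # Larger Q from the Critic => smaller loss, exactly the intuition above.
    actor_loss = -critic(states, actor(states)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()  # gradients flow through the Critic into the Actor
    actor_opt.step()
    return actor_loss.item()
```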
The DDPG algorithm is as follows:
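The algorithm listing also appears to have been an image. As a substitute, here is a minimal sketch of one DDPG training iteration under common assumptions (experience replay, target networks with soft Polyak updates at rate tau); it is not the author's exact listing:

```python
import torch
import torch.nn.functional as F

def ddpg_update(actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, batch, gamma=0.99, tau=0.005):
    """One DDPG training step on a replay-buffer batch (s, a, r, s', done)."""
    s, a, r, s2, done = batch  # tensors; done holds 0/1 floats

    # Critic update: regress Q(s, a) toward the bootstrapped target,
    # as in DQN, except the target actor chooses the next action a'.
    with torch.no_grad():
        q_target = r + gamma * (1 - done) * target_critic(s2, target_actor(s2))
    critic_loss = F.mse_loss(critic(s, a), q_target)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor update: maximize the Critic's score of the Actor's action.
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Soft (Polyak) update of both target networks.
    for net, target in ((actor, target_actor), (critic, target_critic)):
        for p, tp in zip(net.parameters(), target.parameters()):
            tp.data.mul_(1 - tau).add_(tau * p.data)
```

A full training loop would wrap this in an environment-interaction loop that adds exploration noise to the Actor's action (as in the selection sketch above) and stores each transition in the replay buffer.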
Copyright notice
This article was written by [Live up to your youth]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/04/202204220657127089.html