Q-Learning & Sarsa
2022-04-23 02:56:00 【Live up to your youth】
Basic concepts
Markov decision process
A Markov decision process is, simply put, the cyclic process in which an agent (Agent) interacts with an environment (Environment): it takes an action (Action) to change its state (State) and receives a reward (Reward).
A Markov decision process can be written compactly as $M = \langle S, A, P(s' \mid s, a), R \rangle$.
Here $P(s' \mid s, a)$ is the transition model: if we choose action $a$ in state $s$, then $P(s' \mid s, a)$ is the probability that $s'$ becomes the next state. It is usually represented as a table $P$.
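As a concrete illustration (a minimal sketch; the states, actions, and probabilities below are invented for this example, not from the post), a tabular transition model can be stored as a nested dictionary keyed by $(s, a)$:

```python
# Hypothetical two-state MDP: P[(s, a)] maps each possible next state s'
# to its probability P(s'|s, a). All numbers are made up for illustration.
P = {
    ("s0", "left"):  {"s0": 0.9, "s1": 0.1},
    ("s0", "right"): {"s1": 1.0},
    ("s1", "left"):  {"s0": 1.0},
    ("s1", "right"): {"s0": 0.2, "s1": 0.8},
}

prob = P[("s0", "right")]["s1"]   # P(s1 | s0, right) == 1.0
```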
Reinforcement learning
Simply put, reinforcement learning is the problem of an agent making a sequence of decisions in an environment according to some policy in order to complete a given task and collect rewards, with the goal of finding the optimal policy that maximizes the return.
Reinforcement learning resembles dynamic programming, but unlike dynamic programming it repeatedly reuses the experience gathered in earlier stages of learning; dynamic programming does the opposite, assuming complete knowledge of the environment in advance.
The interaction loop of reinforcement learning works as follows: the agent issues an action (action), and the environment responds with a state (state) and a reward (reward) as feedback; the agent then issues a new action, the environment's state changes again, and a new reward arrives. This loop is the core idea of reinforcement learning. In essence, reinforcement learning is a process in which the agent repeatedly tries and errs in the environment and learns from the environment's delayed returns.
Q-Learning
The Q-Learning algorithm is one of the main algorithms of reinforcement learning. It gives an agent the ability to learn, from the sequences of actions it has experienced in a Markov environment, how to select the best action. The interaction between the agent and the environment is treated as a Markov decision process (MDP): the agent's current state and chosen action determine a fixed state-transition probability distribution, the next state, and an immediate reward. The goal is to find a policy that maximizes the future cumulative reward.
Q-Learning updates the Q-table with the rule

$Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a^*} Q(s',a^*) - Q(s,a) \right]$

On the right-hand side, $Q(s,a)$ is the return currently remembered in the Q-table, and $r$ is the immediate reward obtained by taking action $a$ in state $s$. $\max_{a^*} Q(s',a^*)$ is the largest Q value over the set of actions $a^*$ available in the next state $s'$; its maximizer is the corresponding action $a'$. $\alpha$ is the learning rate: the lower it is, the more the agent trusts the returns it has already accumulated rather than the newly observed one. $\gamma$ is the discount factor, a value in $(0,1)$ that weighs how strongly future rewards influence the present.
If we treat $r + \gamma \max_{a^*} Q(s',a^*)$ as the "real" (target) value and $Q(s,a)$ as the estimated return, the update can be read as: new value = $(1-\alpha)$ × estimated return + $\alpha$ × real return.
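As a quick numerical check (all numbers here are made up for illustration): take $\alpha = 0.1$, $\gamma = 0.9$, a current estimate $Q(s,a) = 2$, an observed reward $r = 1$, and $\max_{a^*} Q(s',a^*) = 3$. Then

$Q(s,a) \leftarrow (1-0.1)\cdot 2 + 0.1\cdot(1 + 0.9\cdot 3) = 1.8 + 0.37 = 2.17$

so the estimate moves a small step (of size $\alpha$) toward the target value $3.7$.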
The pseudocode of the Q-learning algorithm boils down to the loop below:
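A minimal tabular Python sketch of that loop (the environment interface `env.reset()`, `env.step()`, and `env.actions` is an assumption here, loosely in the style of Gym, not part of the original post):

```python
import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    # Q-table: unseen states start with all-zero action values.
    Q = defaultdict(lambda: {a: 0.0 for a in env.actions})

    def epsilon_greedy(state):
        # Behavior policy: explore with probability epsilon, else exploit.
        if random.random() < epsilon:
            return random.choice(env.actions)
        return max(Q[state], key=Q[state].get)

    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = epsilon_greedy(s)
            s_next, r, done = env.step(a)
            # Off-policy target: greedy over next actions, independent of
            # the action the behavior policy will actually take next.
            target = r + gamma * max(Q[s_next].values())
            Q[s][a] += alpha * (target - Q[s][a])
            s = s_next
    return Q
```

Because unseen states default to all-zero Q values, the target naturally reduces to $r$ when $s'$ is terminal.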
Sarsa
The Sarsa algorithm is very similar to Q-Learning. The five letters of "sarsa" stand for s (current state), a (current action), r (reward), s' (next state), a' (next action); in other words, by the time we take the current step we have already committed to the action $a$ for the current state $s$, and also worked out the next state $s'$ together with its action $a'$. Sarsa's update rule is

$Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma Q(s',a') - Q(s,a) \right]$
The pseudocode of the Sarsa algorithm amounts to the following loop:
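A matching Python sketch, under the same assumed environment interface as the Q-learning sketch above; the structural change is that $a'$ is chosen $\epsilon$-greedily and then actually executed:

```python
import random
from collections import defaultdict

def sarsa(env, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = defaultdict(lambda: {a: 0.0 for a in env.actions})

    def epsilon_greedy(state):
        if random.random() < epsilon:
            return random.choice(env.actions)
        return max(Q[state], key=Q[state].get)

    for _ in range(episodes):
        s = env.reset()
        a = epsilon_greedy(s)              # commit to a before stepping
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = epsilon_greedy(s_next)
            # On-policy target: uses the action a' that will really be
            # executed on the next step (epsilon-greedy, not the max).
            target = r + gamma * Q[s_next][a_next]
            Q[s][a] += alpha * (target - Q[s][a])
            s, a = s_next, a_next          # a' becomes the next action taken
    return Q
```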
Differences between Q-learning and Sarsa:
Both the Q-learning algorithm and the Sarsa algorithm start from state $s$, use some policy based on the current Q-table (e.g. $\epsilon$-greedy) to choose an action $a$, then observe the next state $s'$ and consult the Q-table again for the next action; the two differ only in how that next action $a'$ is chosen. According to the algorithm descriptions, when selecting the action $a'$ for the new state $s'$, Q-learning uses the purely greedy strategy: it picks the $a'$ with the largest value, i.e. it only computes which $a'$ makes $\max_{a^*} Q(s',a^*)$ attain the maximum, without actually executing that action (it is off-policy). Sarsa instead keeps using the $\epsilon$-greedy strategy, and really does execute the chosen action $a'$ (it is on-policy).
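In terms of the sketches above (reusing the names `Q`, `s_next`, `a_next` from those sketches), the entire difference is one line in how the target is formed:

```python
# Q-learning (off-policy): the target uses the greedy action in s',
# which the agent may never actually take.
target = r + gamma * max(Q[s_next].values())

# Sarsa (on-policy): the target uses the action a_next that the agent
# really chooses and executes in s'.
target = r + gamma * Q[s_next][a_next]
```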