Q-Learning & Sarsa
2022-04-23 02:56:00 【Live up to your youth】
Basic concepts
Markov decision process
A Markov decision process is, simply put, the cyclic process in which an agent (Agent) interacts with the environment (Environment): the agent takes actions (Action) that change its state (State) and earn it rewards (Reward).
A Markov decision process can be expressed compactly as $M = \langle S, A, P(s' \mid s, a), R \rangle$.
Here $P(s' \mid s, a)$ is the transition model: if we choose an action in some state, it governs which state we enter next. It is usually represented as a table $P$. If we are in state $s$ and choose action $a$, the probability that $s'$ becomes the next state is $P(s' \mid s, a)$.
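For intuition, here is a tiny, hypothetical transition table written as a nested Python dictionary; the states, actions, and probabilities are made up for illustration and are not from the original post:

```python
# A toy transition model P(s' | s, a) stored as a nested dictionary.
# P[s][a] maps each possible next state s' to its probability.
P = {
    "s0": {
        "left":  {"s0": 0.9, "s1": 0.1},
        "right": {"s1": 1.0},
    },
    "s1": {
        "left":  {"s0": 1.0},
        "right": {"s1": 0.7, "s0": 0.3},
    },
}

# Probability of landing in s1 after taking "right" in s0:
prob = P["s0"]["right"].get("s1", 0.0)
print(prob)  # 1.0
```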
Reinforcement learning
Simply put, reinforcement learning is the setting in which an agent in an environment makes a series of decisions according to some policy in order to complete a given task and collect rewards; the goal is to find the optimal policy that maximizes the return.
Reinforcement learning is similar to dynamic programming, but unlike dynamic programming it repeatedly reuses the experience gathered during earlier learning, whereas dynamic programming assumes full knowledge of the environment in advance.
[Figure: the agent-environment interaction loop, with actions flowing from the agent and states and rewards flowing back from the environment]
The reinforcement learning model is shown above: the agent issues an action, and the environment responds with a new state and a reward as feedback; the agent then issues a new action, the environment's state changes again, and a new reward follows. This loop is the core idea of reinforcement learning. In essence, reinforcement learning is a process in which the agent continually explores the environment by trial and error and learns from the environment's delayed returns.
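This loop can be sketched in a few lines of Python. The environment and agent interfaces below (`reset()`, `step()`, `act()`, `learn()`) are hypothetical placeholders used only to make the cycle concrete, not part of the original post:

```python
def run_episode(env, agent):
    """One pass through the agent-environment interaction loop."""
    state = env.reset()                                  # environment gives the initial state
    done = False
    total_reward = 0.0
    while not done:
        action = agent.act(state)                        # agent issues an action
        next_state, reward, done = env.step(action)      # environment returns state and reward
        agent.learn(state, action, reward, next_state)   # learn from the delayed return
        state = next_state                               # continue the loop from the new state
        total_reward += reward
    return total_reward
```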
Q-Learning
Q-Learning is one of the main algorithms of reinforcement learning. It gives an agent the ability to select the best action based on the sequence of actions it has experienced in a Markov environment. The interaction between the agent and the environment is treated as a Markov decision process (MDP): given the agent's current state and the chosen action, there is a fixed state-transition probability distribution that determines the next state and an immediate reward. The goal is to find a policy that maximizes future rewards.
$$Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a^*} Q(s', a^*) - Q(s,a) \right]$$
The mathematical model used by the Q-Learning algorithm is shown above. On the right-hand side of the formula, $Q(s,a)$ is the return remembered in the Q table, $r$ is the immediate reward obtained by taking action $a$ in state $s$, and $\max Q(s', a^*)$ picks, from the set of actions $a^*$ available in the next state, the action $a'$ with the largest Q value. $\alpha$ is the learning rate: as the formula shows, the lower the learning rate, the more the agent values the returns it has already accumulated rather than the newly obtained return. $\gamma$ is the discount factor, which weighs the impact of future rewards on the present; it takes a value in (0, 1).
Suppose we treat $r + \gamma \max Q(s', a^*)$ as the realistic return and $Q(s,a)$ as the estimated return; the model is then updated as: new return = $(1-\alpha)\times$ estimated return $+\ \alpha\times$ realistic return.
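As a minimal sketch, this update rule can be written as a small Python function; the Q table is assumed to be a dictionary keyed by (state, action) pairs, and the names `Q`, `alpha`, and `gamma` are illustrative rather than taken from the original post:

```python
def q_learning_update(Q, state, action, reward, next_state, actions, alpha=0.1, gamma=0.9):
    """Q(s,a) <- (1 - alpha) * Q(s,a) + alpha * (r + gamma * max over a* of Q(s', a*))."""
    estimated = Q.get((state, action), 0.0)                        # remembered return Q(s, a)
    best_next = max(Q.get((next_state, a), 0.0) for a in actions)  # max over a* of Q(s', a*)
    realistic = reward + gamma * best_next                         # "realistic" target return
    Q[(state, action)] = (1 - alpha) * estimated + alpha * realistic
```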
The pseudo-code of the Q-learning algorithm is shown in the figure below:
[Figure: Q-learning pseudo-code]
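The pseudo-code figure itself cannot be recovered here, but a hedged Python sketch of a typical Q-learning training loop looks roughly like the following. It assumes a generic environment object exposing `reset()` and `step(action)` and a known list of discrete actions; these names are assumptions, not any specific library's API:

```python
import random
from collections import defaultdict

def train_q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    """env is assumed to expose reset() -> state and step(action) -> (next_state, reward, done)."""
    Q = defaultdict(float)                                # Q[(state, action)] -> value, starts at 0

    def epsilon_greedy(state):
        if random.random() < epsilon:                     # explore with probability epsilon
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(state, a)])  # otherwise exploit the current Q table

    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            action = epsilon_greedy(state)                          # behaviour policy: epsilon-greedy
            next_state, reward, done = env.step(action)
            best_next = max(Q[(next_state, a)] for a in actions)    # greedy target; this a* is not executed
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q
```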
Sarsa
The Sarsa algorithm is very similar to Q-Learning. The five letters of 'sarsa' stand for s (current state), a (current action), r (reward), s (next state), and a (next action). In other words, when we take the current step we have already decided the action $a$ for the current state $s$, and we have also worked out the next state $s'$ and next action $a'$. The mathematical model used by Sarsa is as follows:
$$Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma\, Q(s', a') - Q(s,a) \right]$$
The pseudo-code of the Sarsa algorithm is as follows:
[Figure: Sarsa pseudo-code]
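For comparison, here is a hedged Python sketch of the Sarsa loop under the same assumed environment interface as above; the key point is that the next action `a'` is chosen ε-greedily, used in the target, and then actually executed on the following step:

```python
import random
from collections import defaultdict

def train_sarsa(env, actions, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    """env is assumed to expose reset() -> state and step(action) -> (next_state, reward, done)."""
    Q = defaultdict(float)

    def epsilon_greedy(state):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(state, a)])

    for _ in range(episodes):
        state = env.reset()
        action = epsilon_greedy(state)                    # choose a for the current s up front
        done = False
        while not done:
            next_state, reward, done = env.step(action)
            next_action = epsilon_greedy(next_state)      # choose a' epsilon-greedily ...
            target = reward + gamma * Q[(next_state, next_action)]
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state, action = next_state, next_action       # ... and really execute a' on the next step
    return Q
```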
Differences between Q-learning and Sarsa:
Both the Q-learning and Sarsa algorithms start from state $s$, use some policy ($\epsilon$-greedy) based on the current Q-table to choose an action $a$, then observe the next state $s'$, and again choose an action $a'$ from the Q-table. The only difference between the two is how $a'$ is chosen. According to the algorithm descriptions, when selecting the action $a'$ for the new state $s'$, Q-learning uses a purely greedy choice, i.e. it selects the $a'$ with the largest value; at that point it only computes which $a'$ makes $\max Q(s', a^*)$ largest and never actually executes that action $a'$. Sarsa, by contrast, still uses the $\epsilon$-greedy strategy and really does execute the chosen action $a'$.
Copyright notice
This article was created by [Live up to your youth]. If you repost it, please include a link to the original. Thank you.
https://yzsam.com/2022/04/202204220657127253.html