【深入浅出强化学习】2 Markov Decision Processes
2022-04-22 22:02:00 【Zhao-Jichao】
2.1 Markov Decision Processes: Theory
A Markov decision process is described by the tuple $(S, A, P, R, \gamma)$, where:
$S$ is a finite set of states
$A$ is a finite set of actions
$P$ is the state-transition probability
$\gamma$ is the discount factor, used when computing the cumulative return.
$R$ is the reward function
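As a concrete illustration of the tuple, a tiny MDP can be written down directly as plain Python data. This is a minimal sketch with made-up states, actions, and numbers, not the book's example:

# A minimal sketch of an MDP as plain Python data (illustrative values).
# P and R are keyed by (state, action).
S = [1, 2, 3]                         # finite state set
A = ['left', 'right']                 # finite action set
P = {(1, 'right'): {2: 1.0},          # P[(s, a)] -> distribution over next states
     (2, 'right'): {3: 1.0},
     (2, 'left'):  {1: 1.0}}
R = {(1, 'right'): 0.0,               # R[(s, a)] -> immediate reward
     (2, 'right'): 1.0,
     (2, 'left'):  0.0}
gamma = 0.8                           # discount factor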
The goal of reinforcement learning is: given a Markov decision process, find the optimal policy.
A policy is a mapping from states to actions. A policy is usually denoted by the symbol $\pi$: given a state $s$, it is a distribution over the action set, i.e.
$$\pi(a \mid s) = p[A_t = a \mid S_t = s] \tag{2.1}$$
Equation (2.1) means that the policy $\pi$ assigns a probability to each action in every state $s$. If the given policy $\pi$ is deterministic, then it assigns one definite action to each state $s$.
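To make (2.1) concrete, here is a minimal sketch (illustrative, not from the book) of a stochastic policy stored as a table of per-state action distributions, sampled with Python's standard library:

import random

# pi[s] maps each action to its probability in state s (illustrative numbers)
pi = {1: {'e': 0.5, 's': 0.5},
      2: {'e': 0.7, 'w': 0.3}}

def sample_action(pi, s):
    """Draw an action a ~ pi(. | s)."""
    actions, probs = zip(*pi[s].items())
    return random.choices(actions, weights=probs)[0]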
Once a policy $\pi$ is given, the cumulative return can be computed.
Define the cumulative return:
$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \tag{2.2}$$
The cumulative return $G_t$ is a random variable, not a deterministic value, so it cannot itself characterize a state; its expectation, however, is deterministic and can serve as the definition of the state-value function.
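As a quick sanity check of (2.2), the return of one sampled reward sequence can be computed directly. A minimal sketch with an illustrative reward sequence:

def discounted_return(rewards, gamma):
    """G_t = sum_k gamma^k * R_{t+k+1} for one sampled reward sequence."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

# e.g. rewards [0, 0, 1] with gamma = 0.8 give G = 0.8 ** 2 = 0.64
print(discounted_return([0.0, 0.0, 1.0], 0.8))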
When the agent follows policy $\pi$, the cumulative return follows a distribution; its expected value at state $s$ is defined as the state-value function:
$$v_\pi(s) = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\Big|\, S_t = s\right] \tag{2.3}$$
Note that the state-value function is tied to a specific policy $\pi$, because $\pi$ determines the distribution of the cumulative return $G$.
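Because (2.3) is an expectation over episodes generated by $\pi$, it can be estimated by averaging sampled returns. A minimal Monte Carlo sketch, assuming a hypothetical helper `sample_episode(s)` that returns the reward sequence of one episode started in $s$ under $\pi$:

def mc_state_value(s, sample_episode, gamma, n_episodes=1000):
    """Estimate v_pi(s) by averaging sampled returns (Monte Carlo sketch).

    sample_episode(s) is a hypothetical helper that returns the rewards
    R_{t+1}, R_{t+2}, ... of one episode started in s under pi.
    """
    total = 0.0
    for _ in range(n_episodes):
        g, discount = 0.0, 1.0
        for r in sample_episode(s):
            g += discount * r
            discount *= gamma
        total += g
    return total / n_episodes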
The state-action value function is
$$q_\pi(s, a) = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\Big|\, S_t = s, A_t = a\right] \tag{2.4}$$
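The two value functions are linked in the standard way, stated here for completeness: averaging $q_\pi$ over the policy's action distribution recovers $v_\pi$,
$$v_\pi(s) = \sum_{a \in A} \pi(a \mid s)\, q_\pi(s, a)$$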
2.3 An MDP Example Based on gym
Appendix
Full code of grid_mdp.py:
import logging
import random

import gym

logger = logging.getLogger(__name__)


class GridEnv(gym.Env):
    metadata = {
        'render.modes': ['human', 'rgb_array'],
        'video.frames_per_second': 2
    }

    def __init__(self):
        self.states = [1, 2, 3, 4, 5, 6, 7, 8]  # state space
        # screen coordinates of each state, used for rendering
        self.x = [140, 220, 300, 380, 460, 140, 300, 460]
        self.y = [250, 250, 250, 250, 250, 150, 150, 150]

        self.terminate_states = dict()  # terminal states, stored as a dict
        self.terminate_states[6] = 1
        self.terminate_states[7] = 1
        self.terminate_states[8] = 1

        self.actions = ['n', 'e', 's', 'w']

        self.rewards = dict()  # reward function, stored as a dict
        self.rewards['1_s'] = -1.0
        self.rewards['3_s'] = 1.0
        self.rewards['5_s'] = -1.0

        self.t = dict()  # state transitions, stored as a dict
        self.t['1_s'] = 6
        self.t['1_e'] = 2
        self.t['2_w'] = 1
        self.t['2_e'] = 3
        self.t['3_s'] = 7
        self.t['3_w'] = 2
        self.t['3_e'] = 4
        self.t['4_w'] = 3
        self.t['4_e'] = 5
        self.t['5_s'] = 8
        self.t['5_w'] = 4

        self.gamma = 0.8  # discount factor
        self.viewer = None
        self.state = None

    def getTerminal(self):
        return self.terminate_states

    def getGamma(self):
        return self.gamma

    def getStates(self):
        return self.states

    def getAction(self):
        return self.actions

    def getTerminate_states(self):
        return self.terminate_states

    def setAction(self, s):
        self.state = s

    def _step(self, action):
        # current state of the system
        state = self.state
        if state in self.terminate_states:
            return state, 0, True, {}

        key = "%d_%s" % (state, action)  # combine state and action into a dict key
        # state transition: stay in place if the move is not defined
        if key in self.t:
            next_state = self.t[key]
        else:
            next_state = state
        self.state = next_state

        is_terminal = False
        if next_state in self.terminate_states:
            is_terminal = True

        if key not in self.rewards:
            r = 0.0
        else:
            r = self.rewards[key]

        return next_state, r, is_terminal, {}

    def _reset(self):
        self.state = random.choice(self.states)  # start in a uniformly random state
        return self.state

    def render(self, mode='human', close=False):
        if close:
            if self.viewer is not None:
                self.viewer.close()
                self.viewer = None
            return
        screen_width = 600
        screen_height = 400

        if self.viewer is None:
            from gym.envs.classic_control import rendering
            self.viewer = rendering.Viewer(screen_width, screen_height)

            # draw the grid world
            self.line1 = rendering.Line((100, 300), (500, 300))
            self.line2 = rendering.Line((100, 200), (500, 200))
            self.line3 = rendering.Line((100, 300), (100, 100))
            self.line4 = rendering.Line((180, 300), (180, 100))
            self.line5 = rendering.Line((260, 300), (260, 100))
            self.line6 = rendering.Line((340, 300), (340, 100))
            self.line7 = rendering.Line((420, 300), (420, 100))
            self.line8 = rendering.Line((500, 300), (500, 100))
            self.line9 = rendering.Line((100, 100), (180, 100))
            self.line10 = rendering.Line((260, 100), (340, 100))
            self.line11 = rendering.Line((420, 100), (500, 100))

            # first skull
            self.kulo1 = rendering.make_circle(40)
            self.circletrans = rendering.Transform(translation=(140, 150))
            self.kulo1.add_attr(self.circletrans)
            self.kulo1.set_color(0, 0, 0)

            # second skull
            self.kulo2 = rendering.make_circle(40)
            self.circletrans = rendering.Transform(translation=(460, 150))
            self.kulo2.add_attr(self.circletrans)
            self.kulo2.set_color(0, 0, 0)

            # the gold
            self.gold = rendering.make_circle(40)
            self.circletrans = rendering.Transform(translation=(300, 150))
            self.gold.add_attr(self.circletrans)
            self.gold.set_color(1, 0.9, 0)

            # the robot
            self.robot = rendering.make_circle(30)
            self.robotrans = rendering.Transform()
            self.robot.add_attr(self.robotrans)
            self.robot.set_color(0.8, 0.6, 0.4)

            for line in [self.line1, self.line2, self.line3, self.line4,
                         self.line5, self.line6, self.line7, self.line8,
                         self.line9, self.line10, self.line11]:
                line.set_color(0, 0, 0)
                self.viewer.add_geom(line)
            self.viewer.add_geom(self.kulo1)
            self.viewer.add_geom(self.kulo2)
            self.viewer.add_geom(self.gold)
            self.viewer.add_geom(self.robot)

        if self.state is None:
            return None
        # move the robot to the screen position of the current state
        self.robotrans.set_translation(self.x[self.state - 1], self.y[self.state - 1])
        return self.viewer.render(return_rgb_array=(mode == 'rgb_array'))
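A minimal driver for the environment above might look like the following sketch. It calls `_reset` and `_step` directly, matching the old-gym hook names the class overrides; whether your gym release also wires these through `reset`/`step` depends on the version, so treating them as directly callable is an assumption here.

# Minimal usage sketch (assumes grid_mdp.py is on the import path).
import random
from grid_mdp import GridEnv

env = GridEnv()
state = env._reset()
for _ in range(20):
    action = random.choice(env.getAction())  # uniformly random policy
    next_state, reward, done, _ = env._step(action)
    print(state, action, reward, next_state)
    state = next_state
    if done:
        state = env._reset()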
Copyright notice
This article was written by [Zhao-Jichao]. Please include the original link when reposting. Thanks.
https://zhaojichao.blog.csdn.net/article/details/124340164