eth slide

From bandits to Markov Decision Processes

Pasted image 20230729191740.png|500

Agent - Environment Interaction Loop

Pasted image 20230729192719.png|400
This image shows the structure of the agent-environment interaction loop, which can be seen as the basis of reinforcement learning.
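
As a rough Python sketch of this loop (the tiny environment and the random agent below are made up purely for illustration, not taken from the slides):

```python
import random

class ToyEnv:
    """A made-up two-state environment, only to illustrate the loop structure."""
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action 1 moves to the terminal state, action 0 stays put
        self.state = min(self.state + action, 1)
        reward = 1.0 if self.state == 1 else 0.0
        done = self.state == 1
        return self.state, reward, done

env = ToyEnv()
state = env.reset()
done = False
while not done:
    action = random.choice([0, 1])               # agent picks an action (random policy here)
    next_state, reward, done = env.step(action)  # environment returns R_{t+1}, S_{t+1}
    state = next_state                           # the loop: S_t -> A_t -> R_{t+1}, S_{t+1}
```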

Goals, Rewards and Returns

Rewards

The reward can be defined in several different ways:
Pasted image 20230422183944.png|500

These definitions are nested: for example, if r(s, a) only conditions on the current state being s, then there are many transitions from the previous step that could have led to s (s1, s2, s3, ... can all reach the current state s), whereas r(s, a, s') considers only one specific transition.

!Example for Reward

Returns

Episodic and non-episodic returns

Tasks are usually divided into two types: episodic tasks and continuing tasks. The difference between them is whether the interaction between the agent and the environment has a clear beginning and end.

  1. Episodic tasks: an episodic task has a clear start and end; each start-to-end cycle is called an episode. After an episode ends, the agent is usually reset to an initial state (or a set of initial states) and a new episode begins. The returns for this type of task are episodic, because they are computed within one specific episode.

  2. Continuing tasks: in a continuing task, the interaction between the agent and the environment goes on without a clear start or end point. Such tasks usually require a discount factor to ensure that the sum of future rewards stays finite. The returns for this type of task are non-episodic.

For example, board games are usually treated as episodic tasks, since every game has a clear beginning (the initial board position) and end (one side wins, or the game is drawn). Stock investing, on the other hand, can be viewed as a continuing task: decisions and outcomes keep arriving, with no clear end point.
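
The corresponding return definitions (standard formulation, with T the final time step of an episode and γ ∈ [0, 1) the discount factor) are:
$$G_{t}=R_{t+1}+R_{t+2}+\cdots+R_{T}\quad\text{(episodic)}\qquad G_{t}=\sum_{k=0}^{\infty}\gamma^{k}R_{t+k+1}\quad\text{(continuing)}$$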

Policy and Value functions

Policy

A policy π is a strategy that the agent uses to decide which action to take in each state, i.e. a mapping from states to actions.

Almost all reinforcement learning algorithms use value functions (state-value functions) to estimate how good a policy is, and "how good" is defined in terms of expected return.

  1. Deterministic Policy: A deterministic policy provides a specific action for each state. This can be represented as: $$a = \pi(s)$$ This formula means that, given state s, the policy π will always choose action a.
  2. Stochastic Policy: A stochastic policy provides a probability distribution over the possible actions for each state. This can be represented as: $$\pi(a\mid s)=\Pr\left\{A_{t}=a\mid S_{t}=s\right\}$$
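
A minimal Python sketch of the two kinds of policy (the states, actions, and probabilities below are made up for illustration):

```python
import random

# Deterministic policy: a plain mapping state -> action, i.e. a = pi(s)
pi_det = {"s0": "left", "s1": "right"}

def act_deterministic(state):
    return pi_det[state]

# Stochastic policy: for each state, a probability distribution over actions,
# i.e. pi(a|s) = Pr{A_t = a | S_t = s}
pi_sto = {"s0": {"left": 0.2, "right": 0.8},
          "s1": {"left": 0.5, "right": 0.5}}

def act_stochastic(state):
    actions, probs = zip(*pi_sto[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(act_deterministic("s0"))  # always "left"
print(act_stochastic("s0"))     # "right" with probability 0.8
```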

Value function

There are two kinds: the state-value function and the action-value function.

Why define two separate concepts, when the only difference is that one additionally fixes an action a taken in state s? What is the point of this?
These are indeed two concepts: the state value V(s) and the action value Q(s,a). The former evaluates how good a particular state of the environment is, while the latter evaluates how good taking a particular action in a given state is. The concepts are similar; the main difference lies in how they are used and in the algorithms built on them. For a discrete action space, one can search for the best action directly from the Q-values (as in DQN); if the action space is very large or continuous, one can instead estimate state values and combine them with policy gradients for iterative optimization (as in actor-critic methods).
Excerpted from a Zhihu answer.

State-value function:

Definition: the expected return from a state under a given policy. It is used to evaluate how good the state is.

$$v_{\pi}(s)=\mathbb{E}_{\pi}\left[G_{t}\mid S_{t}=s\right]=\mathbb{E}_{\pi}\left[R_{t+1}+\gamma R_{t+2}+\gamma^{2} R_{t+3}+\cdots \mid S_{t}=s\right]$$

Why does the state value require an expectation? Because the policy π is itself a probability function. For example, suppose the current policy is to make the robot turn right: under a deterministic policy the probability of turning right is 1, while under a stochastic policy it might be 0.8, with the remaining 0.2 for turning left. That is why the expectation is needed.
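
A toy numerical illustration of the 0.8 / 0.2 example above (numbers made up; immediate rewards and discounting are ignored to keep it simple): if turning right leads to a successor state worth 10 and turning left to one worth 0, then
$$v_{\pi}(s)\approx 0.8\cdot 10+0.2\cdot 0=8 .$$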

Continuing the derivation of the state-value function: $$\begin{aligned}
v_{\pi}(s) & =\mathbb{E}_{\pi}\left[G_{t} \mid S_{t}=s\right] \\
& =\mathbb{E}_{\pi}\left[R_{t+1}+\gamma G_{t+1} \mid S_{t}=s\right] \\
& =\sum_{a} \pi(a \mid s) \sum_{s^{\prime}} \sum_{r} p\left(s^{\prime}, r \mid s, a\right)\left[r+\gamma \mathbb{E}_{\pi}\left[G_{t+1} \mid S_{t+1}=s^{\prime}\right]\right] \\
& =\sum_{a} \pi(a \mid s) \sum_{s^{\prime}, r} p\left(s^{\prime}, r \mid s, a\right)\left[r+\gamma v_{\pi}\left(s^{\prime}\right)\right] \quad \text{for all } s \in \mathcal{S}
\end{aligned}$$

How do we get from the second equality to the third? The expectation is unpacked with the law of total expectation: first condition on the action a (weighted by π(a|s)), then on the next state s' and reward r (weighted by p(s', r | s, a)).
See this Zhihu answer for reference:
Pasted image 20230526133946.png|600
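
A small iterative policy-evaluation sketch that implements exactly this equation, on a hypothetical 2-state, 2-action MDP with a uniform random policy (all numbers below are made up for illustration):

```python
import numpy as np

# Hypothetical MDP: P[s, a, s'] transition probabilities, Rw[s, a] expected
# immediate rewards, pi[s, a] a uniform random policy (all values made up).
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
Rw = np.array([[1.0, 0.0],
               [0.5, 2.0]])
pi = np.full((2, 2), 0.5)
gamma = 0.9

v = np.zeros(2)
for _ in range(1000):
    # v(s) = sum_a pi(a|s) * [ r(s,a) + gamma * sum_s' p(s'|s,a) * v(s') ]
    v_new = (pi * (Rw + gamma * P @ v)).sum(axis=1)
    if np.max(np.abs(v_new - v)) < 1e-10:
        break
    v = v_new
print(v)
```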

Action value function

$$q_{\pi}(s, a)=\mathbb{E}_{\pi}\left[G_{t} \mid S_{t}=s, A_{t}=a\right]$$

Continuing the derivation: $$\begin{aligned}
q_{\pi}(s, a) & =\mathbb{E}_{\pi}\left[G_{t} \mid S_{t}=s, A_{t}=a\right] \\
& =\mathbb{E}_{\pi}\left[\sum_{i=0}^{\infty} \gamma^{i} R_{t+i+1} \mid S_{t}=s, A_{t}=a\right] \\
& =\mathbb{E}_{\pi}\left[R_{t+1}+\gamma G_{t+1} \mid S_{t}=s, A_{t}=a\right] \\
& =\sum_{s^{\prime}, r} p\left(s^{\prime}, r \mid s, a\right)\left[r+\gamma \mathbb{E}_{\pi}\left[G_{t+1} \mid S_{t+1}=s^{\prime}\right]\right] \\
& =\sum_{s^{\prime}, r} p\left(s^{\prime}, r \mid s, a\right)\left[r+\gamma \sum_{a^{\prime}} \pi\left(a^{\prime} \mid s^{\prime}\right) q_{\pi}\left(s^{\prime}, a^{\prime}\right)\right]
\end{aligned}$$

Relationship between vπ and qπ

e9f3dbbbacba337161eda5712585364.jpg
Finally, we obtain the relationship between the two: $$v_{\pi}(s)=\mathbb{E}_{\pi}\left[G_{t}\mid S_{t}=s\right]=\sum_{a}\pi\left(a\mid s\right)q_{\pi}(s,a)$$
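
In the other direction, the last step of the q_π derivation above also gives q_π in terms of v_π:
$$q_{\pi}(s,a)=\sum_{s^{\prime},r}p\left(s^{\prime},r\mid s,a\right)\left[r+\gamma v_{\pi}\left(s^{\prime}\right)\right]$$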

Backup diagram

c02bfb6155efc83dcf44f38e101cc8a.jpg

Bellman equation

Value and the Bellman equation

Transition Matrix

Pasted image 20230422185600.png

Bellman equation in matrix form

The Bellman equation can be expressed concisely using matrices:

$$V = R + \gamma P V$$

This yields a direct linear solution: $$V = (I-\gamma P)^{-1}R$$ Pasted image 20230730110550.png|400
The computational complexity is O(n³) for n states.
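
A quick sketch of this direct solution with NumPy, on a made-up 3-state Markov reward process (P, R, and γ below are arbitrary illustration values):

```python
import numpy as np

# Hypothetical 3-state MRP: P is the state-transition matrix, R the expected
# immediate reward per state, gamma the discount factor (numbers made up).
P = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.0, 0.8],
              [0.0, 0.1, 0.9]])
R = np.array([1.0, 0.0, 2.0])
gamma = 0.9

# Direct solution of V = R + gamma * P * V, i.e. (I - gamma * P) V = R.
# Solving this linear system is O(n^3) in the number of states, as noted above.
V = np.linalg.solve(np.eye(3) - gamma * P, R)
print(V)
```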

Bellman equation for optimal value

The core formula is: $$v_{*}(s)=\max_{a} q_{*}(s,a)$$

Optimal equation for v

Pasted image 20230430230905.png|400

Note that the maximization here is over actions, not over states!
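
Written out with the dynamics p, this optimality equation takes the standard form (the max is over actions a, as stressed above):
$$v_{*}(s)=\max_{a}\sum_{s^{\prime},r}p\left(s^{\prime},r\mid s,a\right)\left[r+\gamma v_{*}\left(s^{\prime}\right)\right]$$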

Optimal action-value function

$$q_{*}(s, a)=\mathbb{E}\left[R_{t+1}+\gamma \max_{a^{\prime}} q_{*}\left(S_{t+1}, a^{\prime}\right) \mid S_{t}=s, A_{t}=a\right]=\sum_{s^{\prime}, r} p\left(s^{\prime}, r \mid s, a\right)\left[r+\gamma \max_{a^{\prime}} q_{*}\left(s^{\prime}, a^{\prime}\right)\right]$$

!Backup diagram for optimal value function
!Backup diagram for optimal action-value function
File_002.png

Note that the two max operators do not sit in the same place. For the state-value function we take the maximum of q, so the max is at the outermost (first) level; for the action-value function we also take the maximum of q, but there q appears at the second level, so the max goes inside the brackets.
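
A short value-iteration sketch that applies these optimality backups on a hypothetical 2-state, 2-action MDP (all numbers are made up); note how the max over actions is taken outside the sum over s' and r, exactly as described above:

```python
import numpy as np

# Hypothetical MDP: P[a, s, s'] transition probabilities, Rw[a, s] expected
# immediate rewards (all values are illustration-only).
P = np.array([[[0.9, 0.1],    # action 0
               [0.4, 0.6]],
              [[0.2, 0.8],    # action 1
               [0.1, 0.9]]])
Rw = np.array([[1.0, 0.0],
               [0.0, 2.0]])
gamma = 0.9

V = np.zeros(2)
for _ in range(1000):
    # Bellman optimality backup:
    Q = Rw + gamma * P @ V       # Q[a, s] = r(s, a) + gamma * sum_s' p(s'|s,a) V(s')
    V_new = Q.max(axis=0)        # v*(s) = max_a q*(s, a)  (max outside the sum)
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new
print(V)
```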

Summary

Pasted image 20230730105916.png

Exercise 2

  1. Pasted image 20230808115704.png Pasted image 20230808115713.png

  2. Pasted image 20230808120412.png Pasted image 20230808120444.png This uses the geometric-series formula: $$\sum_{k=0}^{\infty}\gamma^{k}={\frac{1}{1-\gamma}}$$

  3. Pasted image 20230808121857.png

$$\mathbb{E}\left[R_{t+1} \mid S_{t}=s\right]=\sum_{a}\pi(a \mid s)\sum_{s^{\prime}, r} p\left(s^{\prime}, r \mid s, a\right) r$$
  1. Pasted image 20230808124233.png Pasted image 20230808124245.png
  2. Pasted image 20230808125353.png
  3. Pasted image 20230808130240.png Now it becomes an episodic task rather than a continuing task, and we can calculate the new value function as follows: Pasted image 20230808152933.png This means each new state-value function does not gain the same constant term, but rather a term that depends on the episode length T.
  4. Pasted image 20230808153623.png File_001.png
  5. Pasted image 20230807165436.png
  6. μ is a stochastic (probabilistic) policy that is greedy, and π is a deterministic policy; prove that v_μ(s) ≥ v_π(s) for all states s. b2ef2f43b7c925b19ed5879d32eb797.png