These notes summarise the course Reinforcement Learning at the University of Stuttgart in SS23.
Characteristics of reinforcement learning
- No supervisor, only a reward signal
- Feedback is (often) delayed, not instantaneous
- Time really matters (sequential, non-i.i.d. data)
- Agent’s actions affect the subsequent data it receives
Exploitation vs. exploration
- Exploration: finds more information about the environment
- Exploitation: exploits known information to maximise reward
- Too much exploration can waste resources; too much exploitation can limit learning and lead to suboptimal results.
- Two examples:
- Dining: go to your favourite restaurant vs. try something new
- Advertisement: place a new advert vs. the most relevant one
Estimating action-values
Sample average method:
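In reinforcement learning, the sample-average method is a way of estimating the action-value function: the value of an action is estimated as the average of the rewards received for it. Each time an action is selected, its reward is added to that action's reward history, and the total is divided by the number of times the action has been selected. Only the estimate of the selected action changes at a given time step. Because a separate estimate is kept for every action, the method is usually applied to very small action spaces. Compared with other ways of estimating action values, such as TD-learning and Q-learning, the sample-average method converges more slowly, but it is still effective for some problems.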
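- What is the difference between $q_*(a)$ and $Q_t(a)$?
$q_*(a)$ is the true value of action $a$; $Q_t(a)$ is its estimated value at time $t$. The greedy action at time $t$ is the one with the largest $Q_t(a)$.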
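This is only one very direct way of estimating action values, and not necessarily the best one. It estimates the value of an action purely from the rewards observed so far.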
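The sample-average estimate described above can be sketched in a few lines of Python. This is a minimal illustration, not the course's reference code: the function name `sample_average_estimates` and the `(action, reward)` history layout are assumptions made for this sketch.

```python
from collections import defaultdict


def sample_average_estimates(history):
    """Estimate Q(a) as the mean reward observed for each action.

    `history` is a list of (action, reward) pairs. This layout is an
    assumption of the sketch, not something the notes prescribe.
    """
    reward_sum = defaultdict(float)  # total reward received per action
    count = defaultdict(int)         # number of times each action was taken
    for action, reward in history:
        reward_sum[action] += reward
        count[action] += 1
    # Actions that were never taken have no estimate in this sketch.
    return {a: reward_sum[a] / count[a] for a in count}


# Hypothetical interaction history for two actions.
history = [("left", 1.0), ("right", 0.0), ("left", 0.0),
           ("right", 4.0), ("left", 2.0)]
estimates = sample_average_estimates(history)
```

Note that the whole history has to be stored and re-summed here, which is the cost the incremental form avoids.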
Action selection
Once the action-value estimates are available, an action has to be selected.
ε-greedy action selection
- With probability 1 - ε take the greedy action (exploitation)
- With probability ε take a random action (exploration)
- This selection method mostly chooses the exploiting (greedy) action, which is why it is called ε-greedy action selection.
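A minimal Python sketch of this selection rule, assuming the value estimates are held in a plain list indexed by action. The function name `epsilon_greedy` and the `rng` parameter are hypothetical and only serve the illustration.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Return an action index chosen by the epsilon-greedy rule."""
    if rng.random() < epsilon:
        # Explore: uniform over all actions (this may also pick the greedy one).
        return rng.randrange(len(q_values))
    # Exploit: index of the highest estimate (the first one on ties).
    return max(range(len(q_values)), key=lambda a: q_values[a])


q_values = [0.1, 0.9, 0.3]                      # hypothetical estimates
greedy_choice = epsilon_greedy(q_values, 0.0)   # epsilon = 0: purely greedy
random_choice = epsilon_greedy(q_values, 1.0)   # epsilon = 1: purely random
```

Because the exploratory draw is uniform over all actions, it ignores the estimates entirely.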
ε-greedy vs greedy
Definition: ε-greedy is a simple idea to force continued exploration. The algorithm takes the greedy action with probability 1 - ε and a random action with probability ε.
For epsilon greedy and greedy method, which one is better in each of these cases?
- What if the reward variance is very small, e.g. zero? Choose the greedy method.
- If the variance in reward is very small or zero, the rewards are (nearly) deterministic, so trying an action once already reveals its value. In this case, a greedy method might perform better because once the agent has learned the optimal action, there's no benefit from further exploration. Since the rewards for each action don't vary, the agent can quickly learn the best action and always select it.
- What if the reward variance is larger? Choose the ε-greedy method.
- When there's a larger variance in reward, the ε-greedy method could be a better choice. In environments with high variance, a single action might sometimes yield a high reward and sometimes a low reward. The ε-greedy strategy's occasional exploration can help confirm or update the agent's knowledge about the value of each action.
- What if the task is non-stationary? Choose the ε-greedy method.
- For non-stationary tasks, where the environment and the associated rewards can change over time, an ε-greedy approach would likely perform better. This is because constant exploration (which ε-greedy provides) allows the agent to keep learning about the environment and adapt to changes. The greedy method might fail to adapt in this case because it tends to stick with the action that was optimal in the past.
Softmax action selection
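- Why do we need softmax action selection?
- Because ε-greedy explores uniformly at random, so the worst action has the same probability of being chosen as the second-best action.
Concretely, softmax action selection uses the softmax function to convert the estimated value of each action into a probability and then samples an action according to these probabilities. The softmax function turns any set of real numbers into a probability distribution: its output is a probability vector in which every element lies between 0 and 1 and all elements sum to 1.
The benefit shows when the Q values computed with the sample-average method differ widely, for example in a very large action space where different actions have very different consequences. Softmax normalises these values into selection probabilities, so exploration is graded by estimated value instead of treating all non-greedy actions alike.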
Effect of temperature:
- Why do we need the temperature hyperparameter τ in softmax?
- It keeps the balance between exploitation and exploration.
- As τ → ∞ (high temperature), the softmax function approaches a uniform distribution, which means every action has the same probability, causing the agent to explore the environment more.
- As τ → 0 (low temperature), the softmax function gives a distribution that concentrates on the action with the highest value. This encourages exploitation rather than exploration.
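The effect of the temperature can be checked with a small Python sketch. It assumes the common Boltzmann form, in which the probability of action a is proportional to exp(Q(a) / τ); the function names and the example values are hypothetical.

```python
import math
import random


def softmax_probabilities(q_values, temperature):
    """Turn value estimates into selection probabilities (Boltzmann form)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Subtracting the maximum leaves the result unchanged but avoids overflow.
    q_max = max(q_values)
    exps = [math.exp((q - q_max) / temperature) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_action(q_values, temperature, rng=random):
    """Sample an action index according to the softmax probabilities."""
    probs = softmax_probabilities(q_values, temperature)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]


q_values = [1.0, 2.0, 3.0]                       # hypothetical estimates
hot = softmax_probabilities(q_values, 1000.0)    # close to uniform
cold = softmax_probabilities(q_values, 0.01)     # almost all mass on action 2
```

Unlike ε-greedy, the second-best action keeps a higher probability than the worst one at any finite temperature.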
Incremental action-value estimates
- Why do we need incremental action-value estimates?
- Because this method requires less memory and less computation.
- With the original sample-average formula, the memory and computational requirements grow over time. Each additional reward requires additional memory to store it and additional computation to recompute the sum in the numerator. Running time is O(n) per update.
- Unlike the plain sample-average method, the incremental form only computes the increment, similar to an iteration step, which greatly reduces the complexity. Running time is O(1) per update.
This makes it a very efficient method for reinforcement learning, particularly in environments with large state and action spaces.
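A minimal sketch of the incremental update, assuming the standard form Q_{n+1} = Q_n + (1/n)(R_n - Q_n), where n counts how often the action has been taken. The function name `incremental_update` and the reward values are hypothetical.

```python
def incremental_update(q, n, reward):
    """Return the new estimate after the n-th reward for an action (n >= 1).

    Only the current estimate and the count are needed, so memory and
    time per update stay constant.
    """
    return q + (reward - q) / n


rewards = [1.0, 0.0, 4.0, 3.0]   # hypothetical rewards for one action
q = 0.0                          # initial estimate; overwritten when n == 1
for n, r in enumerate(rewards, start=1):
    q = incremental_update(q, n, r)
```

After the loop, `q` agrees with the plain average of `rewards` up to floating-point rounding, without the rewards ever being stored by the update itself.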
Step size depends on n (for the sample average, α = 1/n)
- What is the implication of keeping α constant?
- It gives more weight to recent rewards than to older ones.
- In non-stationary environments, a constant step size lets the agent focus on the most recent trends.
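The difference can be made concrete with a small, deterministic Python sketch. The reward sequence (the reward jumps from 0 to 1 halfway through) and the value α = 0.1 are assumptions chosen only for illustration.

```python
def track(rewards, step_size):
    """Run Q <- Q + alpha_n * (R_n - Q) over a reward sequence.

    `step_size` maps the update count n (1, 2, ...) to alpha_n.
    """
    q = 0.0
    for n, r in enumerate(rewards, start=1):
        q += step_size(n) * (r - q)
    return q


# Hypothetical non-stationary reward: 0 for 100 steps, then 1 for 100 steps.
rewards = [0.0] * 100 + [1.0] * 100

q_sample_average = track(rewards, lambda n: 1.0 / n)  # alpha_n = 1/n
q_constant = track(rewards, lambda n: 0.1)            # constant alpha
```

With α = 1/n every reward counts equally, so the estimate ends near 0.5, the average over the whole history. With constant α the old rewards are discounted geometrically, so the estimate ends close to the current reward level of 1.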
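In the long run, the winner would be the red curve (ε = 0.01), because its probability of selecting the optimal action converges to about 1 - ε = 0.99. That is roughly 0.09 higher than the blue curve (ε = 0.1), which converges to about 1 - ε = 0.9.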
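A slightly more careful version of this limit also counts the chance that a random exploratory pick happens to be the optimal action. The worked numbers below are a sketch under stated assumptions: a stationary k-armed bandit whose estimates have converged, so that the greedy action is the optimal one. The number of arms k is not given in these notes; k = 10 is assumed purely for illustration.

```latex
\begin{align*}
P(\text{optimal action}) &\to (1 - \varepsilon) + \frac{\varepsilon}{k} \\
\varepsilon = 0.01,\ k = 10: &\quad 0.99 + 0.001 = 0.991 \\
\varepsilon = 0.1,\ k = 10: &\quad 0.9 + 0.01 = 0.91
\end{align*}
```

Under these assumptions the gap is 0.081, and it approaches 0.09 as k grows, so the ranking of the two curves is unchanged.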