Fundamentals of Reinforcement Learning
In RL, the main goal is to learn from interaction. We want agents to learn a behavior, a way of selecting actions in given situations, to achieve some goal. The main difference from classical programming or planning is that we do not want to code the decision-making logic explicitly on our own: this would require great effort, can be very inefficient, and is sometimes impossible. The RL discipline was born precisely for this reason.
RL agents usually start with no idea of what to do. They typically do not know the goal, the rules of the game, or the dynamics of the environment, that is, how their actions influence the state.
There are three main components of RL: perception, actions, and goals.
Agents should be able to perceive the current environment state to deal with a task. This perception, also called observation, might be different from the actual environment state, can be subject to noise, or can be partial.
For example, think of a robot moving in an unknown environment. For robotic applications, usually, the robot perceives the environment using cameras. Such a perception does not represent the environment state completely; it can be subject to occlusions, poor lighting, or adverse conditions. The system should be able to deal with this incomplete representation and learn a way of moving in the environment.
The other main component of an agent is the ability to act; the agent should be able to take actions that affect the environment state or the agent's state.
Agents should also have a goal defined through the environment state. Goals are described using high-level concepts such as winning a game, moving in an environment, or driving correctly.
One of the challenges of RL, a challenge that does not arise in other types of learning, is the exploration-exploitation trade-off. In order to improve, the agent has to exploit its knowledge; it should prefer actions that have proven useful in the past. There's a problem here: to discover better actions, the agent should keep exploring, trying actions it has never taken before. To estimate the effect of an action reliably, an agent has to perform each action many times. The critical thing to notice here is that neither exploration nor exploitation alone is sufficient to learn a task.
This trade-off is very similar to the challenge we face as babies when learning how to walk. At first, we try different types of movement, and we start from a simple movement yielding satisfactory results: crawling. Then, we want to improve our behavior to become more efficient. To learn a new behavior, we have to make movements we have never made before: we try to walk. At first, we perform different actions yielding unsatisfactory results: we fall many times. Once we discover the correct way of moving our legs and balancing our body, we become more efficient at walking. If we did not explore further and stopped at the first behavior yielding satisfactory results, we would crawl forever. By exploring, we learn that there can be different, more efficient behaviors. Once we learn how to walk, we can stop exploring and start exploiting our knowledge.
Elements of RL
Let's introduce the main elements of the RL framework intuitively.
Agent
In RL, the agent is the abstract concept of the entity that moves in the world, takes actions, and achieves goals. An agent can be a piece of autonomous driving software, a chess player, a Go player, an algorithmic trader, or a robot. The agent is everything that can perceive and influence the state of the environment and, therefore, can be used to accomplish goals.
Actions
An agent can perform actions based on the current situation. Actions can assume different forms depending on the specific task.
Actions can be to steer, to push the accelerator pedal, or to push the brake pedal in an autonomous driving context. Other examples of actions include moving the knight to H5 or moving the king to A5 in a chess context.
Actions can be low-level, such as controlling the voltage of the motors of a vehicle, but they can also be high-level, planning actions, such as deciding where to go. Choosing the action level is the responsibility of the algorithm's designer. Actions that are too high-level can be challenging to implement, as they might require extensive planning at lower levels. At the same time, actions that are too low-level make the problem difficult to learn.
Environment
The environment represents the context in which the agent moves and takes decisions. An environment is composed of three main elements: states, dynamics, and rewards. They can be explained as follows:
- State: This represents all of the information describing the environment at a particular timestep. The state is available to the agent through observations, which can be a partial or full representation.
- Dynamics: The dynamics of an environment describe how actions influence its state. The environment dynamics are usually very complex or unknown. An RL algorithm that uses knowledge of the environment dynamics to learn how to achieve a goal belongs to the category of model-based RL, where the model is the mathematical description of the environment. Most of the time, the environment dynamics are not available to the agent; in this case, the algorithm belongs to the model-free category. Even when the true dynamics are unavailable, too complicated, or too approximate, the agent can learn a model of the environment during training; in this case, too, the algorithm is said to be model-based.
- Rewards: Rewards are scalar values associated with each timestep that describe the agent's goal. Rewards can also be seen as environmental feedback, providing the agent with information about its behavior; this feedback is therefore necessary to make learning possible. If the agent receives a high reward, it means that it performed a good move, a move bringing it closer to its goal.
Policy
A policy describes the behavior of the agent. Agents select actions by following their policies. Mathematically, a policy is a function mapping states to actions. What does this mean? Well, it means that the input of the policy is the current state, and its output is the action to take. A policy can have different forms. It can be a simple set of rules, a lookup table, a neural network, or any function approximator. A policy is the core of the RL framework, and the (implicit or explicit) goal of all RL algorithms is to improve the agent's policy to maximize the agent's performance on a task (or on a set of tasks). A policy can be stochastic, involving a distribution over actions, or it can be deterministic. In the latter case, the selected action is uniquely determined by the environment's state.
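To make the idea of a policy as a function from states to actions concrete, here is a minimal, hypothetical sketch of a lookup-table policy in Python; the three states and two actions mirror the toy environment implemented later in Exercise 1.01, and the specific state-to-action mapping is an arbitrary illustration, not a recommended behavior.
# A lookup-table policy: a plain dictionary mapping each state to an action.
# States (1, 2, 3) and actions (0, 1) are illustrative; the mapping is arbitrary.
lookup_table_policy = {1: 0, 2: 1, 3: 1}

def policy(state: int) -> int:
    """Return the action prescribed by the lookup table for the given state."""
    return lookup_table_policy[state]

print(policy(1))  # prints 0, the action selected in state 1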
An Example of an Autonomous Driving Environment
To better understand the environment's role and its characteristics in the RL framework, let's formalize an autonomous driving environment, as shown in the following figure:
Considering the preceding figure, let's now look at each of the components of the environment:
- State: The state can be represented by the 360-degree image of the street around our car. In this case, the state is an image, that is, a matrix of pixels. It can also be represented by a series of images covering the whole space around the car. Another possibility is to describe the state using features and not images. The state can be the current velocity and acceleration of our vehicle, the distance from other cars, or the distance from the street border. In this case, we are using preprocessed information to represent the state more easily. These features can be extracted from images or other types of sensors (for example, Light Detection and Ranging – LIDAR).
- Dynamics: The dynamics of the environment in an autonomous car scenario are represented by the equations describing how the system changes when the car accelerates, brakes, or steers. For instance, suppose the vehicle is going at 30 km/h, and the next vehicle is 100 meters away from it. The state is represented by the car's speed and the proximity information concerning the next vehicle. If the car accelerates, the speed changes according to the car's properties (included in the environment dynamics). Also, the proximity information changes since the next vehicle can be closer or further away (depending on the speed). In this situation, at the next timestep, the car's speed can be 35 km/h, and the next vehicle can be closer, for example, only 90 meters away.
- Reward: The reward can represent how well the agent is driving. It's not easy to formalize a reward function. A natural reward function should reward states in which the car is aligned with the street and penalize states in which the car crashes or goes off the road. Reward function design is an open problem, and researchers are putting effort into developing algorithms where the reward function is not needed (self-motivation or curiosity-driven agents), where the agent learns from demonstrations (imitation learning), and where the agent recovers the reward function from demonstrations (Inverse Reinforcement Learning, or IRL).
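To make the reward discussion more concrete, the following is a purely illustrative sketch of a shaped reward for this driving scenario; the signals (crashed, off_road, lane_alignment) and the magnitudes are assumptions made for the sake of the example, not the output of any real simulator.
def driving_reward(crashed: bool, off_road: bool, lane_alignment: float) -> float:
    """
    Hypothetical reward for the autonomous driving example.
    lane_alignment is assumed to be a value in [0, 1],
    where 1 means perfectly aligned with the lane.
    """
    if crashed:
        return -100.0      # strongly penalize crashes
    if off_road:
        return -10.0       # penalize leaving the road
    return lane_alignment  # small positive reward for staying aligned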
Note
For further reading on curiosity-driven agents, please refer to the following paper: https://pathak22.github.io/large-scale-curiosity/resources/largeScaleCuriosity2018.pdf.
We are now ready to design and implement our first environment class using Python. We will demonstrate how to implement the state, the dynamics, and the reward of a toy problem in the following exercise.
Exercise 1.01: Implementing a Toy Environment Using Python
In this exercise, we will implement a simple toy environment using Python. The environment is illustrated in Figure 1.11. It is composed of three states (1, 2, 3) and two actions (A and B). The initial state is state 1. States are represented by nodes. Edges represent transitions between states. On the edges, we have an action causing the transition and the associated reward.
The representation of the environment in Figure 1.11 is the standard environment representation in the context of RL. In this exercise, we will become acquainted with the concept of the environment and its implementation:
In the preceding figure, the reward is associated with each state-action pair.
The goal of this exercise is to implement an Environment class with a step() method that takes as input the agent's action and returns the pair (next state, reward). In addition to this, we will write a reset() method that restarts the environment's state:
- Create a new Jupyter notebook or a simple Python script to enter the code.
- Import the Tuple type from typing:
from typing import Tuple
- Define the class constructor by initializing its properties:
class Environment:
    def __init__(self):
        """
        Constructor of the Environment class.
        """
        self._initial_state = 1
        self._allowed_actions = [0, 1]  # 0: A, 1: B
        self._states = [1, 2, 3]
        self._current_state = self._initial_state
Note
The triple-quotes ( """ ) shown in the code snippet above are used to denote the start and end points of a multi-line code comment. Comments are added into code to help explain specific bits of logic.
We have two allowed actions, action 0 and action 1, representing actions A and B. We have three environment states: 1, 2, and 3. We define the _current_state attribute to be equal to the initial state (state 1).
- Define the step function, which is responsible for updating the current state based on the previous state and the action taken by the agent:
    def step(self, action: int) -> Tuple[int, int]:
        """
        Step function: compute the one-step dynamics for the given action.

        Args:
            action (int): the action taken by the agent.

        Returns:
            The tuple (current_state, reward).
        """
        # check that the action is allowed
        if action not in self._allowed_actions:
            raise ValueError("Action is not allowed")
        reward = 0
        if action == 0 and self._current_state == 1:
            self._current_state = 2
            reward = 1
        elif action == 1 and self._current_state == 1:
            self._current_state = 3
            reward = 10
        elif action == 0 and self._current_state == 2:
            self._current_state = 1
            reward = 0
        elif action == 1 and self._current_state == 2:
            self._current_state = 3
            reward = 1
        elif action == 0 and self._current_state == 3:
            self._current_state = 2
            reward = 0
        elif action == 1 and self._current_state == 3:
            self._current_state = 3
            reward = 10
        return self._current_state, reward
Note
The # symbol in the code snippet above denotes a code comment. Comments are added into code to help explain specific bits of logic.
We first check that the action is allowed. Then, we define the new current state and reward based on the action and the previous state by looking at the transition in the previous figure.
- Now, we need to define the reset function, which simply resets the environment state:
    def reset(self) -> int:
        """
        Reset the environment, starting from the initial state.

        Returns:
            The environment state after the reset (the initial state).
        """
        self._current_state = self._initial_state
        return self._current_state
- We can use our environment class to understand whether our implementation is correct for the specified environment. We can do this with a simple loop, using a predefined set of actions to test the transitions of our environment. A possible action set, in this case, is [0, 0, 1, 1, 0, 1]. Using this set, we will test all of the environment's transitions:
env = Environment()
state = env.reset()
actions = [0, 0, 1, 1, 0, 1]
print(f"Initial state is {state}")
for action in actions:
    next_state, reward = env.step(action)
    print(f"From state {state} to state {next_state} \
with action {action}, reward: {reward}")
    state = next_state
Note
The code snippet shown here uses a backslash ( \ ) to split the logic across multiple lines. When the code is executed, Python will ignore the backslash, and treat the code on the next line as a direct continuation of the current line.
The output should be as follows:
Initial state is 1
From state 1 to state 2 with action 0, reward: 1
From state 2 to state 1 with action 0, reward: 0
From state 1 to state 3 with action 1, reward: 10
From state 3 to state 3 with action 1, reward: 10
From state 3 to state 2 with action 0, reward: 0
From state 2 to state 3 with action 1, reward: 1
To understand this better, compare the output with Figure 1.11 to discover whether the transitions and rewards are compatible with the selected actions.
Note
To access the source code for this specific section, please refer to https://packt.live/2Arr9rO.
You can also run this example online at https://packt.live/2zpMul0.
In this exercise, we implemented a simple RL environment by defining the step function and the reset function. These functions are at the core of every environment, representing the interaction between the agent and the environment.
The Agent-Environment Interface
RL considers sequential decision-making problems. In this context, we can refer to the agent as the "decision-maker." In sequential decision-making problems, actions taken by the decision-maker not only influence the immediate reward and the immediate environment's state, but they also affect future rewards and states. Markov Decision Processes (MDPs) are a natural way of formalizing sequential decision-making problems. In MDPs, an agent interacts with an environment through actions and receives rewards based on the action, on the current state of the environment, and on the environment's dynamics. The goal of the decision-maker is to maximize the cumulative sum of rewards over a given horizon (which is possibly infinite). The task the agent has to learn is defined through the rewards it receives, as you can see in the following figure:
In RL, an episode is divided into a sequence of discrete timesteps: t = 0, 1, 2, ..., T. Here, T represents the horizon length, which is possibly infinite. The interaction between the agent and the environment happens at each timestep. At each timestep, the agent receives a representation of the current environment's state, s_t. Based on this state, it selects an action, a_t, belonging to the action space given the current state, A(s_t). The action affects the environment. As a result, the environment changes its state, transitioning to the next state, s_{t+1}, according to its dynamics. At the same time, the agent receives a scalar reward, r_{t+1}, quantifying how good the action taken in that state was.
Let's now try to understand the mathematical notations used in the preceding example:
- Time horizon, T: If a task has a finite time horizon, then T is an integer representing the maximum duration of an episode. In infinite tasks, T can also be ∞.
- Action, a_t: This is the action taken by the agent at timestep t. The action belongs to the action space, A(s_t), defined by the current state, s_t.
- State, s_t: This is the representation of the environment's state received by the agent at time t. It belongs to the state space, S, defined by the environment. It can be represented by an image, a sequence of images, or a simple vector assuming different shapes. Note that the actual environment state can be different and more complex than the state perceived by the agent.
- Reward, r_{t+1}: This is a real number describing how good the taken action was. A high reward corresponds to a good action. The reward is fundamental for the agent to understand how to achieve a goal.
In episodic RL, the agent-environment interaction is divided into episodes; the agent has to achieve the goal within the episode. The purpose of the interaction is to learn a better behavior. After several episodes, the agent can decide to update its behavior by incorporating the knowledge gained from past interactions. Based on the effect of its actions on the environment and on the received rewards, the agent will more frequently perform the actions that yield higher rewards.
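The interaction just described can be summarized by the canonical agent-environment loop. The following is a minimal sketch; it assumes an environment object exposing reset() and step(action) returning (next_state, reward), such as the Environment class from Exercise 1.01, and a policy callable mapping states to actions. Both names are placeholders for whatever environment and policy you are using.
def run_episode(env, policy, horizon: int) -> float:
    """Run a single episode of the given length and return the total reward."""
    total_reward = 0.0
    state = env.reset()
    for t in range(horizon):
        action = policy(state)                 # the agent selects an action
        next_state, reward = env.step(action)  # the environment reacts
        total_reward += reward                 # accumulate the reward signal
        state = next_state                     # move to the next timestep
    return total_reward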
What's the Agent? What's in the Environment?
An important aspect to take into account when dealing with RL is the difference between the agent and the environment. This difference is not typically defined in terms of a physical distinction. Usually, we model the environment as everything that's not under the control of the agent. The environment can include physical laws, other agents, or an agent's properties or characteristics.
However, this does not imply that the agent does not know the environment. The agent can also be aware of the environment and the effect of its actions on it, but it cannot change the way the environment reacts. Also, the reward computation belongs to the environment, as it must be entirely outside the agent's control. If this is not the case, the agent can learn how to modify the reward function in such a way as to maximize its performance without learning the task. The boundary between the agent and environment is a control boundary, meaning that the agent cannot control the reaction of the environment. It is not a knowledge boundary since the agent can know the environment model perfectly and still find difficulties in learning the task.
Environment Types
In this section, we will examine some possible environment dichotomies. The characterization of the environment depends on the state space (finite or continuous), on the type of transitions (deterministic or stochastic), on the information available to the agent (fully or partially observable), and the number of agents involved in the learning problem (single versus multi-agent).
Finite versus Continuous
The first distinction is given by the state space. The state space can be divided into two main categories: a finite state space and a continuous state space. A finite state space has a finite number of possible states in which the agent can be, and it's the more straightforward case. An environment with a continuous state space has infinitely many possible states. In these types of environments, the generalization capabilities of the agent are fundamental to solving a task because the probability of arriving at exactly the same state twice is almost zero. In continuous environments, an agent cannot rely on having already visited a given state; it has to generalize using some notion of similarity with respect to previously experienced states. Note that generalization is also essential for finite state spaces with a considerable number of states (for example, when the state space is represented by the set of all possible images).
Consider the following examples:
- Chess is finite. There is a finite number of possible states in which an agent can be. The state, for chess, is represented by the configuration of the chessboard at a given time. We can enumerate all of the possible states by varying the configuration of the chessboard. The number of states is very high but still finite.
- Autonomous driving can be defined as a continuous problem. If we describe the autonomous driving problem as a problem in which the agent has to make driving decisions based on the sensors' input, we obtain a continuous problem. The sensors provide continuous input in a given range. The agent state, in this case, can be represented by the agent's speed, the agent's acceleration, or the rotation of the wheels per minute.
Deterministic versus Stochastic
A deterministic environment is an environment in which, given a state and an action performed by the agent, the following state and the following reward are uniquely determined. Deterministic environments are the simplest type of environment, but they are rarely encountered because of their limited applicability to the real world.
Almost all real-world environments are stochastic. In stochastic environments, a state and an action performed by the agent determine a probability distribution over the next state and the next reward. The following state is not uniquely determined; it is uncertain. In these types of environments, the agent should perform an action many times to obtain a reliable estimate of its consequences.
Notice that, in a deterministic environment, the agent could perform each action in each state exactly once, and based on the acquired knowledge, it can solve the task. Also, notice that solving the task does not mean taking actions that yield the highest immediate return, because this action can also bring the agent to an inconvenient part of the environment where future rewards are always low. To solve the task correctly, the agent should take actions with the highest associated future return (called a state-action value). The state-action value does not take into account only the immediate reward but also the future rewards, giving the agent a farsighted view. We will define later what a state-action value is.
Consider the following examples:
- Rubik's Cube is deterministic. Each action corresponds to a uniquely determined state transition.
- Chess is deterministic but opponent-dependent. The successive state does not depend only on the agent's action but also on the opponent's action.
- Texas Hold'em is stochastic and opponent-dependent. The transition to the next state is stochastic and depends on the deck, which is not known by the agent.
Fully Observable versus Partially Observable
The agent, to plan actions, has to receive a representation of the environment state, s_t (refer to Figure 1.12, The Agent-Environment Interface). If the state representation received by the agent completely defines the state of the environment, the environment is Fully Observable. If some parts of the environment are outside the representation observed by the agent, the environment is Partially Observable, and the problem is modeled as a Partially Observable Markov Decision Process (POMDP). Multi-agent environments, for example, are partially observable. In the case of partially observable environments, the information perceived by the agent, together with the action taken, is not sufficient to determine the next state of the environment. A technique to improve the agent's perception, making it more accurate, is to keep the history of taken actions and observations, but this requires some memory mechanism (such as a Recurrent Neural Network (RNN) or Long Short-Term Memory (LSTM)) embedded in the agent's policy.
Note
For more information on LSTMs, please refer to https://www.bioinf.jku.at/publications/older/2604.pdf.
POMDP versus MDP
Consider the following figure:
In the preceding figure, the agent does not receive the full environment state but only an observation, o_t.
To better understand the differences between these two types of environments, let's look at Figure 1.13. In partially observable environments (POMDP), the representation given to the agent is only a part of the actual environment state, and it is not enough to understand the actual environment state without uncertainty.
In fully observable environments (MDPs), the state representation given to the agent is semantically equivalent to the state of the environment. Notice that, in this case, the state given to the agent can assume a different form (for example, an image, a vector, a matrix, or a tensor). However, from this representation, it is always possible to reconstruct the actual state of the environment. The meaning of the state is precisely the same, even if under a different form.
Consider the following examples:
- Chess (and, in general, board games) is fully observable. The agent can perceive the whole environment state. In a chess game, the environment state is represented by the chessboard, and the agent can exactly perceive the position of each piece.
- Poker is partially observable. A poker agent cannot perceive the whole state of the game, which includes the opponent cards and deck cards.
Single Agents versus Multiple Agents
Another useful characteristic of environments is the number of agents involved in a task. If there is only one agent, the subject of our study, the environment is a single-agent environment. If the number of agents is more than one, the environment is a multi-agent environment. The presence of multiple agents increases the complexity of the problem since the action that influences the state becomes a joint action, the set of all the agents' actions. Usually, agents only know their individual actions and do not know the other agents' actions. For this reason, a multi-agent environment is an instance of a POMDP in which the partial visibility is due to the presence of other agents. Notice that each agent has its own observation, which can differ from the other agents' observations, as shown in the following figure:
Consider the following examples:
- Robot navigation is usually a single-agent task. We may have only one agent moving in a possible unknown environment. The goal of the agent can be to reach a given position in the environment while avoiding crashes as much as possible in the minimum amount of time.
- Poker is a multi-agent task in which we have two agents competing against each other. In this case, each agent perceives a different state and receives a different reward.
An Action and Its Types
The action set of an agent in an environment can be finite or continuous. If the action set is finite, the agent has at its disposal a finite number of actions. Consider the MountainCar-v0 (discrete) example, described in more detail later. This has a discrete action set; the agent only has to select the direction in which to accelerate, and the acceleration is constant.
If the action set is continuous, the agent has at its disposal infinite actions from which it should select the best actions in a given state. Usually, tasks with continuous action sets are more challenging to solve than those with finite actions.
Let's look at the example of MountainCar-v0:
As you can see in the preceding figure, a car is positioned in a valley between two mountains. The goal of the car is to arrive at the flag on the mountain to its right.
The MountainCar-v0 example is a standard RL benchmark in which a car tries to drive up a mountain. The car's engine is not powerful enough to climb the slope directly. For this reason, the car should use the inertia given by the shape of the valley; that is, it should first go to the left to gain speed. The state is composed of the car's velocity, acceleration, and x position. There are two versions of this task based on the action set we define, as follows:
- MountainCar-v0 discrete: We have only two possible actions, (-1, +1) or (0, 1), depending on the parameterization.
- MountainCar-v0 continuous: A continuous set of actions from -1 to +1.
Policy
We define the policy as the behavior of the agent. Formally, a policy is a function that takes as input the history of the current episode and outputs the current action. The concept of policies has huge importance in RL; all RL algorithms focus on learning the best policy for a given task.
An example of a winning policy for the MountainCar-v0 task is a policy that first brings the agent up the mountain on the left and then uses the accumulated potential energy to ramp up the mountain on the right. For negative velocities, the optimal action is LEFT, as the agent should go as high as possible up the left mountain. For positive velocities, the agent should take the RIGHT action, as its goal is to ramp up the mountain on its right.
A Markovian policy is simply a policy depending only on the current state and not the whole history.
We denote a stationary Markovian policy, π, as follows:
π: S → A
The Markovian policy goes from the state space to the action space. If we evaluate the policy in a given state, s_t, we obtain the selected action, a_t, in that state:
a_t = π(s_t)
A policy can be implemented in different ways. The most straightforward policy is just a rule-based policy, which is essentially a set of rules or heuristics.
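For instance, the winning MountainCar behavior described earlier can be written as a tiny rule-based policy. The sketch below is illustrative: LEFT and RIGHT are placeholders for the environment's actual action encoding, which depends on the chosen parameterization.
# A rule-based policy for MountainCar: accelerate in the direction of the
# current velocity. The action encoding below is an assumption.
LEFT, RIGHT = 0, 1

def mountain_car_policy(velocity: float) -> int:
    """Select LEFT for negative velocities and RIGHT otherwise."""
    return LEFT if velocity < 0 else RIGHT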
Policies that are a subject of interest in RL are usually parametric. Parametric policies are (differentiable) functions depending on a set of parameters. Usually, the policy parameters are identified as θ:
a_t = π_θ(s_t)
The set of policy parameters can be represented by a vector in a d-dimensional space. The selected action is determined by the policy structure (we will explore some possible policy structures later on), by the policy parameters, and, of course, by the current environment state.
Stochastic Policies
The policies presented so far are deterministic policies: their output is exactly one action. Stochastic policies are policies that output a distribution over the action space. Stochastic policies are usually powerful policies that mix both exploration and exploitation. With stochastic policies, it is possible to obtain complex behaviors.
A stochastic policy assigns a certain probability to each action. The actions will be selected according to the associated probability.
Figure 1.19 explains, graphically, and with an example, the differences between a stochastic policy and a deterministic policy. The policy in the figure has three possible actions.
The stochastic policy (upper part) assigns to the three actions, respectively, probabilities of 0.2, 0.7, and 0.1. The most probable action is the second action, which is associated with the highest probability. However, any of the actions can be selected.
In the bottom part, we have the same set of actions with a deterministic policy. The policy, in this case, selects only one action (the second in the figure) with a probability of 1. In this case, actions 1 and 3 will not be selected, having an associated probability of 0.
Note that we can obtain a deterministic policy from a stochastic one by taking the action associated with the highest probability:
a = argmax_a π(a|s)
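Using the probabilities from the preceding example (0.2, 0.7, and 0.1 for the three actions), the two selection strategies can be sketched in a few lines of NumPy; the probability values are taken from the figure and are otherwise arbitrary.
import numpy as np

action_probabilities = np.array([0.2, 0.7, 0.1])  # output of a stochastic policy

# Stochastic selection: sample an action index according to its probability.
stochastic_action = np.random.choice(len(action_probabilities),
                                     p=action_probabilities)

# Deterministic selection: always take the most probable action (index 1 here).
deterministic_action = np.argmax(action_probabilities)

print(stochastic_action, deterministic_action)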
Policy Parameterizations
In this section, we will analyze some possible policy parameterizations. Parameterizing a policy means giving a structure to the policy function and considering how parameters affect our output actions. Based on the parameterization, it is possible to obtain simple policies or even complex stochastic policies starting from the same input state.
Linear (Deterministic)
The resulting action is a linear combination of the state features, φ(s):
π_θ(s) = θ^T φ(s)
A linear policy is a very simple policy represented by a matrix multiplication.
Consider the example of MountainCar-v0. The state space is represented by the position, speed, and acceleration: s = [x, v, a]. We usually add a constant, 1, that corresponds to the bias term. Therefore, s = [x, v, a, 1]. The policy parameters are defined by θ = [θ_1, θ_2, θ_3, θ_4]. We can simply use as state features the identity function, φ(s) = s.
The resulting policy is as follows:
π_θ(s) = θ^T φ(s) = θ^T s = θ_1·x + θ_2·v + θ_3·a + θ_4
Note
We use a comma (,) to denote the column separator and a semicolon (;) to denote the row separator.
Therefore, θ^T = [θ_1, θ_2, θ_3, θ_4] is a row vector, and s = [x; v; a; 1] is a column vector, which is equivalent to [x, v, a, 1]^T.
If the environment state is [1, 2, 0.1], that is, the car is in position x = 1 with velocity v = 2 and acceleration a = 0.1, and the policy parameters are defined by θ = [4, 5, 1, 1], we obtain the action a = 4·1 + 5·2 + 1·0.1 + 1·1 = 15.1.
Since the action space of MountainCar-v0 is defined in the interval [-1, +1], we need to squash the resulting action using a squashing function such as tanh (hyperbolic tangent). In our case, tanh applied to the output of the multiplication results in approximately +1:
tanh(15.1) ≈ 1
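The computation above can be verified in a few lines of NumPy; this is just a quick check of the arithmetic in the worked example.
import numpy as np

theta = np.array([4, 5, 1, 1])    # policy parameters
s = np.array([1, 2, 0.1, 1])      # state [position, velocity, acceleration, bias]

action = np.dot(theta, s)         # 4*1 + 5*2 + 1*0.1 + 1*1 = 15.1
print(action)                     # 15.1
print(np.tanh(action))            # approximately 1.0 after squashing into [-1, +1]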
Even if linear policies are simple, they are usually enough to solve most tasks, provided that the state features represent the problem well.
Gaussian Policy
In the case of Gaussian parameterization, the resulting action has a Gaussian distribution in which the mean, μ_θ(s), and the variance, σ_θ²(s), depend on the state features:
π_θ(a|s) = N(μ_θ(φ(s)), σ_θ²(φ(s)))
Here, with the symbol |, we denote the conditional distribution; therefore, with π_θ(a|s), we denote the distribution over actions conditioned on the state, s.
Remember, the functional form of the Gaussian distribution, N(μ, σ²), is as follows:
f(x) = (1 / (σ √(2π))) · exp(-(x - μ)² / (2σ²))
In the case of a Gaussian policy, this becomes the following:
π_θ(a|s) = (1 / (σ_θ(s) √(2π))) · exp(-(a - μ_θ(s))² / (2σ_θ²(s)))
Gaussian parameterization is useful for continuous action spaces. Note that we are giving the agent the possibility of also changing the variance of the distribution. This means that it can decide to increase the variance, enabling it to explore scenarios where it's not sure what the best action to take is, or it can reduce the variance by increasing the amount of exploitation when it's very sure about which action to take in a given state. The effect of the variance can be visualized as follows:
In the preceding figure, if the variance increases (the lower curve), the policy becomes more exploratory. Additionally, actions that are very far from the mean have nonzero probabilities. When the variance is small (the higher curve), the policy is highly exploitative. This means that only actions that are very close to the mean have nonzero probabilities.
In the preceding diagram, the flatter Gaussian represents a more explorative policy with respect to the narrower one. Here, we can see the effect of the variance on the policy's exploration attitude.
While learning a task, in the first training episodes, the policy needs to have a high variance in order for it to explore different actions. The variance will be reduced once the agent gains some experience and becomes more and more confident about the best actions.
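The following is a minimal sketch of a Gaussian policy in which the mean is a linear function of the state features and the standard deviation is a separate parameter; this particular parameterization is an illustrative assumption (the notebook referenced in Exercise 1.02 contains a fuller implementation).
import numpy as np

class GaussianPolicy:
    """
    Sketch of a Gaussian policy: the mean is a linear function of the state
    features, and the standard deviation is a fixed positive parameter.
    """
    def __init__(self, mean_parameters: np.ndarray, std_dev: float):
        self._mean_parameters = mean_parameters
        self._std_dev = std_dev

    def __call__(self, state_features: np.ndarray) -> float:
        mean = float(np.dot(self._mean_parameters, state_features))
        # sample the action from N(mean, std_dev^2)
        return np.random.normal(loc=mean, scale=self._std_dev)

# usage: 3 state features, random mean parameters, exploratory standard deviation
pi = GaussianPolicy(mean_parameters=np.random.rand(3), std_dev=0.5)
print(pi(np.array([1.0, 2.0, 0.1])))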
The Boltzmann Policy
Boltzmann parameterization is used for discrete action spaces. The resulting action distribution is a softmax function acting on the weighted state features, as stated in the following expression:
π_θ(a|s) = exp(θ_a^T φ(s)) / Σ_{a'} exp(θ_{a'}^T φ(s))
Here, θ_a is the set of parameters associated with action a.
The Boltzmann policy is a stochastic policy. The motivation behind this is very simple; let's sum the policy over all the actions (the denominator does not depend on the action, a), as follows:
Σ_a π_θ(a|s) = Σ_a exp(θ_a^T φ(s)) / Σ_{a'} exp(θ_{a'}^T φ(s)) = 1
The Boltzmann policy becomes deterministic if we select the action with the highest probability, which is equivalent to selecting the mean action in a Gaussian distribution. What the Boltzmann parameterization represents is simply a normalization of the value θ_a^T φ(s), corresponding to the score of action a. The score is thus normalized by considering the values of all the other actions, obtaining a distribution.
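A compact sketch of the Boltzmann parameterization with NumPy follows; the parameter matrix shape and the random feature vector used in the usage line are illustrative assumptions.
import numpy as np

def boltzmann_policy(theta: np.ndarray, state_features: np.ndarray) -> np.ndarray:
    """
    Boltzmann (softmax) policy sketch.
    theta has shape (n_actions, n_features): row a holds the parameters
    associated with action a. Returns a probability distribution over actions.
    """
    scores = theta @ state_features         # one score per action
    scores -= np.max(scores)                # subtract the max for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / np.sum(exp_scores)  # softmax normalization

probabilities = boltzmann_policy(np.random.rand(3, 4), np.random.rand(4))
print(probabilities, probabilities.sum())   # the probabilities sum to 1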
In all of these parameterizations, the state features might be non-linear features depending on several parameters, for example, features coming from a neural network, radial basis function (RBF) features, or tile coding features.
Exercise 1.02: Implementing a Linear Policy
In this exercise, we will practice with the implementation of a linear policy. The goal is to write the presented parameterizations in the case of a state composed of five components. In the first case, the features can be represented by the identity function; in the second case, the features are represented by a polynomial function of order 2:
- Open a new Jupyter notebook and import NumPy to implement all of the requested policies:
from typing import Callable, List
import matplotlib
from matplotlib import pyplot as plt
import numpy as np
import scipy.stats
- Let's now implement the linear policy. A linear policy can be efficiently represented by a dot product between the policy parameters and state features. The first step is to write the constructor:
class LinearPolicy:
    def __init__(self, parameters: np.ndarray,
                 features: Callable[[np.ndarray], np.ndarray]):
        """
        Linear Policy Constructor.

        Args:
            parameters (np.ndarray): policy parameters as np.ndarray.
            features (Callable[[np.ndarray], np.ndarray]): function used
                to extract features from the state representation.
        """
        self._parameters = parameters
        self._features = features
The constructor simply stores the parameters and the features as attributes. The features parameter is actually a callable that takes, as input, a NumPy array and returns another NumPy array. The input is the environment state, whereas the output is the state features.
- Next, we will implement the call method. The __call__ method takes as input the state and returns the selected action according to the policy parameters. The call method is the actual policy implementation. What we have to do in the linear case is first apply the feature function and then compute the dot product between the parameters and the features. A possible implementation of the call function is as follows:
    def __call__(self, state: np.ndarray) -> np.ndarray:
        """
        Call method of the Policy.

        Args:
            state (np.ndarray): environment state.

        Returns:
            The resulting action.
        """
        # calculate the state features
        state_features = self._features(state)
        # the first dimension of the parameters must match that of the
        # state features, as the two arrays are multiplied together
        assert state_features.shape[0] == self._parameters.shape[0]
        # dot product between parameters and state features
        return np.dot(self._parameters.T, state_features)
- Let's try the defined policy with a state composed of a 5-dimensional array. Sample a random set of parameters and a random state vector. Create the policy object. The constructor needs the callable features, which, in this case, is the identity function. Call the policy to obtain the resulting action:
# sample a random set of parameters
parameters = np.random.rand(5, 1)
# define the state features as identity function
features = lambda x: x
# define the policy
pi: LinearPolicy = LinearPolicy(parameters, features)
# sample a state
state = np.random.rand(5, 1)
# Call the policy obtaining the action
action = pi(state)
print(action)
The output will be as follows:
[[1.33244481]]
This value is the action selected by our agent, given the state and the policy parameters. In this case, the selected action is [[1.33244481]]. The meaning of the action depends on the RL task.
Of course, you will obtain different results based on the sampled parameters and sampled state. It is always possible to seed the NumPy random number generator to obtain reproducible results.
Note
To access the source code for this specific section, please refer to https://packt.live/2Yvrku7. You can also refer to the Gaussian and Boltzmann policies that are implemented in the same notebook.
You can also run this example online at https://packt.live/3dXc4Nc.
In this exercise, we practiced with different policies and parameterizations. These are simple policies, but they are the building blocks of more complex policies. The trick is just to substitute the state features with a neural network or any other feature extractor.
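As a concrete example of swapping the feature extractor, the second case mentioned at the start of this exercise (polynomial features of order 2) can be obtained by changing only the feature callable and reusing the LinearPolicy class defined above; the specific feature construction below (the state components and their squares) is one simple, illustrative choice.
import numpy as np

def polynomial_features(state: np.ndarray) -> np.ndarray:
    """Order-2 polynomial features: the state components and their squares."""
    return np.concatenate([state, state ** 2], axis=0)

# the parameter vector must match the feature dimension (2 * 5 = 10 here)
parameters = np.random.rand(10, 1)
pi_poly = LinearPolicy(parameters, polynomial_features)
state = np.random.rand(5, 1)
print(pi_poly(state))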
Goals and Rewards
In RL, the agent's goal is to maximize the total amount of reward it receives during an episode.
This is based on the famous reward hypothesis in Sutton & Barto 1998:
"That all of what we mean by goals and purposes can be well thought of as the maximization of the expected value of the cumulative sum of a received scalar signal (called reward)."
The important thing here is that the reward should not describe how to achieve the goal; instead, it should describe the goal of the agent. The reward function is an element of the environment, but it can also be designed for a specific task. In principle, there are infinitely many reward functions for each task. Usually, reward functions that carry a lot of information help the agent to learn. Sparse reward functions (with little information) make learning difficult or, sometimes, impossible. Sparse reward functions are functions in which, most of the time, the reward is constant (or zero).
Sutton's hypothesis, which we stated earlier, is the basis of the RL framework. This hypothesis may be wrong; a scalar reward signal (and its maximization) is probably not enough to define complex goals. However, the hypothesis is very flexible and simple, and it can be applied to a wide range of tasks. At the time of writing, reward function design is more art than engineering; there are no formal practices regarding how to write a reward function, only best practices based on experience. Usually, a simple reward function works very well. Usually, we associate positive values with good actions and behaviors and negative values with bad actions or actions that are not important at that particular moment.
In a locomotion task (for example, teaching a robot how to move), the reward may be defined as proportional to the robot's forward movement. In chess, the reward may be defined as 0 for each timestep, +1 if the agent wins, and -1 if the agent loses. If we want our agent to solve Rubik's Cube, the reward may be defined similarly: 0 at every step and +1 if the cube is solved.
Sometimes, as we learned earlier, defining a scalar reward function for a task is not easy, and, nowadays, it is more art than engineering or science.
In each of these tasks, the final objective is to learn a policy, a way of selecting actions, maximizing the total reward received by the agent. Tasks can be episodic or continuing. Episodic tasks have a finite length, that is, a finite number of timesteps (for example, T is finite). Continuing tasks can last forever or until the agent reaches its goal. In the episodic case, we can simply define the total reward (return) received by an agent as the sum of the individual rewards:
G = r_1 + r_2 + r_3 + ... + r_T
Usually, we are interested in the return from a certain timestep, t. In other words, the return, G_t, quantifies the agent's performance in the long term, and it can be calculated as the sum of the immediate rewards following time t until the end of the episode (timestep T):
G_t = r_{t+1} + r_{t+2} + ... + r_T
It is straightforward to see that, with this formulation, the return for continuing tasks diverges to infinity.
In order to deal with continuing tasks, we need to introduce the notion of a discounted return. This concept formalizes, in mathematical terms, the principle that an immediate reward is (sometimes) more valuable than the same amount of reward received after many steps. This principle is widely known in economics. The discount factor, γ, with 0 ≤ γ ≤ 1, quantifies the present value of future rewards. We are now ready to present the unified notation for the return in episodic and continuing tasks.
The discounted return is the cumulative, discounted sum of rewards until the end of the episode. In mathematical terms, it can be formalized as follows:
G_t = r_{t+1} + γ·r_{t+2} + γ^2·r_{t+3} + ... = Σ_{k=0}^{T-t-1} γ^k·r_{t+k+1}
To understand how the discount affects the return, note that the value of receiving a reward r after k timesteps is γ^(k-1)·r, which is at most r, since γ is less than or equal to 1. It is worth examining the effect of the discount on the return. If γ < 1, the return, even if composed of an infinite sum, has a bounded value. If γ = 0, the agent is myopic, since it cares only about the immediate reward and does not care about future rewards. A myopic agent can cause problems: the only thing it learns is to select the action yielding the highest immediate reward. A myopic chess player can, for example, capture an opponent's pawn in a way that causes it to lose the game. Notice that, for some tasks, this isn't always a problem. This includes tasks in which the current action does not affect future rewards and has no consequences for the agent's future. These tasks can be solved by finding, for each state independently, the action that yields the highest immediate reward. Most of the time, however, the current action influences the future of the agent and its rewards. If the discount factor is near 1, the agent is farsighted; it can give up an action yielding a good immediate reward now for a higher reward in future steps.
It is important to understand the relationship between returns at different timesteps, both from a theoretical point of view and from an algorithmic point of view, because many RL algorithms are based on this principle:
G_t = r_{t+1} + γ·r_{t+2} + γ^2·r_{t+3} + ... = r_{t+1} + γ·(r_{t+2} + γ·r_{t+3} + ...) = r_{t+1} + γ·G_{t+1}
By following these simple steps, we can see that the return at timestep t equals the immediate reward plus the return at the following timestep, scaled by γ. This simple relationship will be used extensively in RL algorithms.
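The relationship can be checked numerically with a short sketch; the rewards and the discount factor below are arbitrary illustrative values.
rewards = [1.0, 0.0, 2.0, 3.0]  # r_{t+1}, r_{t+2}, ... for one short episode
gamma = 0.9                     # discount factor

# Direct computation from the definition: G_t = sum_k gamma^k * r_{t+k+1}
direct_return = sum(gamma ** k * r for k, r in enumerate(rewards))

# Recursive computation, going backwards: G_t = r_{t+1} + gamma * G_{t+1}
recursive_return = 0.0
for r in reversed(rewards):
    recursive_return = r + gamma * recursive_return

print(direct_return, recursive_return)  # both print the same value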
Why Discount?
The following are the main motivations for why many RL problems are discounted:
- It is mathematically convenient to have a bounded return, even in the case of continuing tasks.
- If the task is a financial task, immediate rewards may gain more interest than delayed rewards.
- Animal and human behavior show a preference for immediate rewards.
- A discounted reward may also represent uncertainty about the future.
- It is also possible to use an undiscounted return if all of the episodes terminate after a finite number of steps.
This section introduced the main elements of RL, including agents, actions, environments, transition functions, and policies. In the next section, we will practice with these concepts by defining agents, environments, and measuring the performance of agents on some tasks.