Reinforcement Learning Frameworks

In the previous sections, we learned the basic theory behind RL. In principle, an agent or an environment can be implemented in any way and in any language. In practice, the primary language used by both researchers and practitioners is Python, as it lets you focus on the algorithms rather than on language details. Implementing an algorithm or a complex environment (for example, an autonomous driving environment) from scratch can be difficult and error-prone. For this reason, several well-established and well-tested libraries make RL much easier for newcomers. In this section, we will explore the main Python RL libraries. We will present OpenAI Gym, a set of environments that are ready to use and easy to modify, and OpenAI Baselines, a set of high-quality, state-of-the-art algorithms. By the end of this chapter, you will have learned about and practiced with environments and agents.

OpenAI Gym

OpenAI Gym (https://gym.openai.com) is a Python library that provides a set of RL environments, ranging from toy and Atari environments to more complex ones, such as the MuJoCo and Robotics environments. Besides providing this large set of tasks, OpenAI Gym also provides a unified interface for interacting with RL tasks and a set of interfaces for describing the environment's characteristics, such as its action space and state space. An important property of Gym is that it focuses exclusively on environments: it makes no assumptions about the type of agent or the computational framework you use. We will not cover the installation details in this chapter for ease of presentation. Instead, we will focus on the main concepts and learn how to interact with these libraries.

Getting Started with Gym – CartPole

CartPole is a classical control environment provided by Gym and used by researchers as a starting point for testing algorithms. It consists of a cart that moves along the horizontal axis (one dimension) and a pole anchored to the cart at one end:

Figure 1.33: CartPole environment representation

The agent has to learn how to move the cart to balance the pole (that is, to stop the pole from falling). The episode ends when the pole angle, θ, becomes higher than a certain threshold. The state space is represented by the position of the cart along the axis, x; the cart velocity, v; the pole angle, θ; and the pole angular velocity, ω. The state space is continuous in this case, but it can also be discretized to make learning simpler.

In the following steps, we will practice with Gym and its environments.

Let's create a CartPole environment using Gym and analyze its properties in a Jupyter notebook. Please refer to the Preface for Gym installation instructions:

# Import the gym Library

import gym

# Create the environment using gym.make(env_name)

env = gym.make('CartPole-v1')

"""

Analyze the action space of cart pole using the property action_space

"""

print("Action Space:", env.action_space)

"""

Analyze the observation space of cartpole using the property observation_space

"""

print("Observation Space:", env.observation_space)

If you run these lines, you will get the following output:

Action Space: Discrete(2)

Observation Space: Box(4,)

Discrete(2) means that the action space of CartPole is a discrete action space composed of two actions: Go Left and Go Right. These actions are the only actions available to the agent. The action of Go Left, in this case, is represented by action 0, and the action of Go Right by action 1.

Box(4,) means that the state space (the observation space) of the environment is represented by a 4-dimensional box, a subspace of ℝ^4. Formally, a Box space is a Cartesian product of n intervals. The state space has a lower bound and an upper bound. The bounds may also be infinite, creating an unbounded box.
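To make these descriptions concrete, here is a minimal sketch (assuming the env object created above) that queries the attributes Gym exposes for these space types:

# Number of discrete actions available to the agent
print("Number of actions:", env.action_space.n)            # 2
# Shape of the Box observation space
print("Observation shape:", env.observation_space.shape)   # (4,)
# Check whether an element belongs to a space
print("Is 0 a valid action?", env.action_space.contains(0))  # True
print("Is 2 a valid action?", env.action_space.contains(2))  # False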

To inspect the observation space better, we can use its low and high properties:

# Analyze the bounds of the observation space

print("Lower bound of the Observation Space:", \

      env.observation_space.low)

print("Upper bound of the Observation Space:", \

      env.observation_space.high)

This will print the following:

Lower bound of the Observation Space: [-4.8000002e+00 -3.4028235e+38

-4.1887903e-01 -3.4028235e+38]

Upper bound of the Observation Space: [4.8000002e+00 3.4028235e+38

4.1887903e-01 3.4028235e+38]

Here, we can see that upper and lower bounds are arrays of 4 elements; one element for each state dimension. The following are some observations:

  • The lower bound of the cart position (the first state dimension) is -4.8, while the upper bound is 4.8.
  • The lower bound of the cart velocity (the second state dimension) is -3.4e+38, which is essentially -∞ (the lowest value representable by a 32-bit float); the upper bound is +3.4e+38, essentially +∞.
  • The lower bound of the pole angle (the third state dimension) is approximately -0.42 radians, representing an angle of -24 degrees. The upper bound is approximately 0.42 radians, representing an angle of +24 degrees.
  • The lower and upper bounds of the pole angular velocity (the fourth state dimension) are, respectively, -3.4e+38 and +3.4e+38 (essentially -∞ and +∞), just like the bounds of the cart velocity.

Gym Spaces

The Gym Space class represents the way Gym describes actions and state spaces. The most used spaces are the Discrete and Box spaces.

A discrete space is composed of a fixed number of elements. It can represent either a state space or an action space, and it describes the number of elements through the n attribute. Its elements range from 0 to n-1.

A Box space describes its shape through the shape attribute. It can have an n-dimensional shape that corresponds to an n-dimensional box. A Box space can also be unbounded. Each interval has one of the following forms: [a, b], (-∞, b], [a, ∞), or (-∞, ∞).

It is possible to sample from the action space to gain insight into the elements it is composed of using the space.sample() method.

Note

For the sampling distribution of Box spaces, each coordinate of a sample is drawn according to the form of its interval:

- [a, b]: a uniform distribution

- [a, ∞): a shifted exponential distribution

- (-∞, b]: a shifted negative exponential distribution

- (-∞, ∞): a normal distribution

Let's now demonstrate how to create simple spaces and how to sample from spaces:

# Type hinting

from typing import Tuple

import gym

# Import the spaces module

from gym import spaces

# Create a discrete space composed by N-elements (5)

n: int = 5

discrete_space = spaces.Discrete(n=n)

# Sample from the space using .sample method

print("Discrete Space Sample:", discrete_space.sample())

"""

Create a Box space with a shape of (4, 4)

Upper and lower Bound are 0 and 1

"""

box_shape: Tuple[int, int] = (4, 4)

box_space = spaces.Box(low=0, high=1, shape=box_shape)

# Sample from the space using .sample method

print("Box Space Sample:", box_space.sample())

This will print the samples from our spaces:

Discrete Space Sample: 4

Box Space Sample: [[0.09071387 0.4223234 0.09272052 0.15551752]

 [0.8507258 0.28962377 0.98583364 0.55963445]

 [0.4308358 0.8658449 0.6882108 0.9076272 ]

 [0.9877584 0.7523759 0.96407163 0.630859 ]]

Of course, the samples will change according to your seeds.

As you can see, we have sampled element 4 from our discrete space composed of 5 elements (from 0 to 4). We sampled a random 4 x 4 matrix with elements between 0 and 1, the lower and the upper bound of our space.

To obtain reproducible results, it is also possible to set the seed of a space using its seed method:

# Seed spaces to obtain reproducible samples

discrete_space.seed(0)

box_space.seed(0)

# Sample from the seeded space

print("Discrete Space (seed=0) Sample:", discrete_space.sample())

# Sample from the seeded space

print("Box Space (seed=0) Sample:", box_space.sample())

This will print the following:

Discrete Space (seed=0) Sample: 0

Box Space (seed=0) Sample: [[0.05436005 0.9653909

0.63269097 0.29001734]

 [0.10248426 0.67307633 0.39257675 0.66984606]

 [0.05983897 0.52698725 0.04029069 0.9779441 ]

 [0.46293673 0.6296479 0.9470484 0.6992778 ]]

The previous statement will always print the same sample since we set the seed to 0. Seeding an environment is very important in order to guarantee reproducible results.
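The same principle applies to the environment itself: in the classic Gym API used throughout this chapter, you can seed an environment and its spaces before resetting it. The following is a minimal sketch:

import gym

# Seed the environment and its action space for reproducibility
env = gym.make('CartPole-v1')
env.seed(0)                # seeds the environment dynamics
env.action_space.seed(0)   # seeds action sampling

# the first observation is now reproducible
first_observation = env.reset()
print("First observation (seed=0):", first_observation)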

Exercise 1.03: Creating a Space for Image Observations

In this exercise, we will create a space to represent an image observation. Image-based observations are essential in RL since they allow the agent to learn from raw pixels with minimal feature engineering and without a manual feature extraction phase. The agent can focus on what is important for its task without being limited by manually designed heuristics. We will create a space representing RGB images with dimensions of 256 x 256:

  1. Open a new Jupyter notebook and import the desired modules – gym and NumPy:

    import gym

    from gym import spaces

    import matplotlib.pyplot as plt

    %matplotlib inline

    import numpy as np # used for the dtype of the space

  2. We are dealing with 256 x 256 RGB images, so the space has a shape of (256, 256, 3). In addition, the images range from 0 to 255 (if we consider the uint8 images):

    """

    since the Space is RGB images with shape 256x256 the final shape is (256, 256, 3)

    """

    shape = (256, 256, 3)

    # If we consider uint8 images the bounds are 0-255

    low = 0

    high = 255

    # Space type: unsigned int

    dtype = np.uint8

  3. We are now ready to create the space. An image is a Box space since it has defined bounds:

    # create the space

    space = spaces.Box(low=low, high=high, shape=shape, dtype=dtype)

    # Print space representation

    print("Space", space)

    This will print the representation of our space:

    Space Box(256, 256, 3)

    The first dimension is the image width, the second dimension is the image height, and the third dimension is the number of channels.

  4. Here is a sample from the space:

    # Sample from the space

    sample = space.sample()

    print("Space Sample", sample)

    This will return the space sample; in this case, it is a huge tensor of 256 x 256 x 3 unsigned integers (between 0 and 255). The output (fewer lines are presented now) should be similar to the following:

    Space Sample [[[ 37 254 243]

      [134 179 12]

      [238 32 0]

      ...

      [100 61 73]

      [103 164 131]

      [166 31 68]]

     [[218 109 213]

      [190 22 130]

      [ 56 235 167]

  5. To visualize the returned sample, use the following code:

    plt.imshow(sample)

    The output will be as follows:

    Figure 1.34: A sample from a Box space of (256, 256) RGB

    The preceding is not very informative because it is a random image.

  6. Now, suppose we want to give our agent the opportunity to see the last n=4 frames. By adding the temporal component, we can obtain a state representation composed of 4 dimensions. The first dimension is the temporal one, the second is the width, the third is the height, and the last one is the number of channels. This is a very useful technique that allows the agent to understand its movement:

    # we want a space representing the last n=4 frames

    n_frames = 4 # number of frames

    width = 256 # image width

    height = 256 # image height

    channels = 3 # number of channels (RGB)

    shape_temporal = (n_frames, width, height, channels)

    # create a new instance of space

    space_temporal = spaces.Box(low=low, high=high, \

                                shape=shape_temporal, dtype=dtype)

    print("Space with temporal component", space_temporal)

    This will print the following:

    Space with temporal component Box(4, 256, 256, 3)

As you can see, we have successfully created a space and, on inspecting the space representation, we notice that we have another dimension: the temporal dimension.

Note

To access the source code for this specific section, please refer to https://packt.live/2AwJm7x.

You can also run this example online at https://packt.live/2UzxoAY.

Image-based environments are very important in RL. They allow the agent to learn salient features for solving the task directly from raw pixels, without any preprocessing. In this exercise, we learned how to create a Gym space for image observations and how to deal with image spaces.
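As a complement to the space definition, the following sketch shows one common way to maintain the last n=4 frames at runtime using a fixed-length buffer. It reuses the space object from step 3 only to generate stand-in frames and is not part of the exercise:

from collections import deque
import numpy as np

n_frames = 4
# a deque with maxlen discards the oldest frame automatically
frame_buffer = deque(maxlen=n_frames)

# initialize the buffer by repeating the first frame
first_frame = space.sample()  # stand-in for a real image observation
for _ in range(n_frames):
    frame_buffer.append(first_frame)

# at every step, append the newest frame and stack the buffer
new_frame = space.sample()
frame_buffer.append(new_frame)
stacked_observation = np.stack(list(frame_buffer))
print("Stacked observation shape:", stacked_observation.shape)  # (4, 256, 256, 3)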

Rendering an Environment

In the Getting Started with Gym – CartPole section, we saw a sample from the CartPole state space. However, visualizing or understanding the CartPole state from a vector representation is not an easy task, at least for a human. Gym also allows you to visualize a given task (if possible) through the env.render() function.

Note

The env.render() function is usually slow. Rendering an environment is done primarily to understand the behavior learned by the agent after training, or at intervals of many training steps. Usually, we train agents without rendering the environment state in order to improve the training speed.

If we just call the env.render() function, we will always see the same scene, that is, the environment state does not change. To see the evolution of the environment in time, we must call the env.step() function, which takes as input an action belonging to the action space and applies the action in the environment.

Rendering CartPole

The following code demonstrates how to render the CartPole environment. Here, the action is sampled randomly from the action space; in an RL algorithm, the action would instead be selected by the policy:

# Create the environment using gym.make(env_name)

env = gym.make("CartPole-v1")

# reset the environment (mandatory)

env.reset()

# render the environment for 100 steps

n_steps = 100

for i in range(n_steps):

    action = env.action_space.sample()

    env.step(action)

    env.render()

# close the environment correctly

env.close()

If you run this script, you will see that gym opens a window and displays the CartPole environment with random actions, as shown in the following figure:

Figure 1.35: A CartPole environment rendered in Gym (the initial state)

A Reinforcement Learning Loop with Gym

To understand the consequences of an action, and to come up with a better policy, the agent observes its new state and a reward. Implementing this loop with Gym is easy. The key element is the env.step() function. This function takes an action as input, applies it to the environment, and returns four values, which are described as follows (a minimal single-step example follows the list):

  • Observation: The observation is the next environmental state. This is represented as an element belonging to the observation space of the environment.
  • Reward: The reward associated with a step is a float value that is related to the action given as input to the function.
  • Done: This return value assumes the True value when the episode is finished, and it's time to call the env.reset() function to reset the environment state.
  • Info: This is a dictionary containing debugging information; usually, it is ignored.
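To make these four values concrete before implementing the full loop, here is a minimal single-step sketch using the CartPole environment introduced earlier:

import gym

env = gym.make("CartPole-v1")
observation = env.reset()

# apply a single random action and unpack the four return values
action = env.action_space.sample()
observation, reward, done, info = env.step(action)

print("Next observation:", observation)
print("Reward:", reward)
print("Episode terminated:", done)
print("Debug info:", info)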

Let's now implement the RL loop within the Gym environment.

Exercise 1.04: Implementing the Reinforcement Learning Loop with Gym

In this exercise, we will implement a basic RL loop with episodes and timesteps using the CartPole environment. You can also change the environment and use other environments; nothing else changes, as the main goal of Gym is to unify the interfaces of all possible environments so that we can build agents that are as environment-agnostic as possible. This transparency with respect to the environment is a distinctive feature of RL: the algorithms are usually not tailored to a specific task but are task-agnostic, so they can be applied to a variety of environments and still solve them.

We need to create the Gym CartPole environment as before using the gym.make() function. After that, we can loop for a defined number of episodes; for each episode, we loop for a defined number of steps or until the episode is terminated (by checking the done value). For each timestep, we have to call the env.step() function by passing an action (we will pass a random action for now), and then we collect the desired information:

  1. Open a new Jupyter notebook and define the import, the environment, and the desired number of steps:

    import gym

    import matplotlib.pyplot as plt

    %matplotlib inline

    env = gym.make("CartPole-v1")

    # each episode is composed by 100 timesteps

    # define 10 episodes

    n_episodes = 10

    n_timesteps = 100

  2. Loop for each episode:

    # loop for the episodes

    for episode_number in range(n_episodes):

        # here we are inside an episode

  3. Reset the environment and get the first observation:

        """

        the reset function resets the environment and returns

        the first environment observation

        """

        observation = env.reset()

  4. Loop for each timestep:

        """

        loop for the given number of timesteps or

        until the episode is terminated

        """

        for timestep_number in range(n_timesteps):

  5. Render the environment, select the action (randomly by using the env.action_space.sample() method), and then take the action:

            # render the environment

            env.render()

            # select the action

            action = env.action_space.sample()

            # apply the selected action by calling env.step

            observation, reward, done, info = env.step(action)

  6. Check whether the episode has been terminated using the done variable:

            """if done the episode is terminated, we have to reset

            the environment

            """

            if done:

                print(f"Episode Number: {episode_number}, \

    Timesteps: {timestep_number}")

                # break from the timestep loop

                break

  7. After the episode loop, close the environment in order to release the associated memory:

    # close the environment

    env.close()

    If you run the previous code, the output should, approximately, be like this:

    Episode Number: 0, Timesteps: 34

    Episode Number: 1, Timesteps: 10

    Episode Number: 2, Timesteps: 12

    Episode Number: 3, Timesteps: 21

    Episode Number: 4, Timesteps: 16

    Episode Number: 5, Timesteps: 17

    Episode Number: 6, Timesteps: 12

    Episode Number: 7, Timesteps: 15

    Episode Number: 8, Timesteps: 16

    Episode Number: 9, Timesteps: 16

We have the episode number and the number of timesteps taken in that episode. We can see that the average number of timesteps per episode is approximately 17. This means that, using the random policy, the pole falls and the episode finishes after about 17 timesteps, on average.
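If you want to compute this average yourself, a small variation of the loop above collects the episode lengths over many episodes. This is just a sketch that follows the same structure as the exercise:

import gym
import numpy as np

env = gym.make("CartPole-v1")
episode_lengths = []

for episode_number in range(100):
    env.reset()
    for timestep_number in range(100):
        action = env.action_space.sample()
        _, _, done, _ = env.step(action)
        if done:
            episode_lengths.append(timestep_number + 1)
            break

env.close()
print("Average episode length:", np.mean(episode_lengths))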

Note

To access the source code for this specific section, please refer to https://packt.live/2MOs5t5.

This section does not currently have an online interactive example, and will need to be run locally.

The goal of this exercise was to understand the bare bones of every RL algorithm. The only missing ingredient is the action selection phase: to be useful, it should take the environment state into account instead of being random.
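To illustrate what state-dependent (but still hand-crafted, non-learned) action selection looks like, the following sketch pushes the cart in the direction the pole is leaning, using the pole angle described in the Getting Started with Gym – CartPole section. It is only meant to build intuition:

import gym

def heuristic_policy(observation):
    # third state dimension: pole angle (positive = leaning right)
    pole_angle = observation[2]
    # push right (action 1) if the pole leans right, left (action 0) otherwise
    return 1 if pole_angle > 0 else 0

env = gym.make("CartPole-v1")
observation = env.reset()
done = False
timesteps = 0
while not done:
    action = heuristic_policy(observation)
    observation, reward, done, info = env.step(action)
    timesteps += 1
env.close()
print("Episode length with the heuristic policy:", timesteps)

Even this trivial heuristic usually keeps the pole up noticeably longer than the random policy.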

Let's now move toward completing an activity to measure the performance of an agent.

Activity 1.01: Measuring the Performance of a Random Agent

The measurement of the performance and the design of an agent is an essential phase of every RL experiment. The goal of this activity is to practice with these two concepts by designing an agent that is able to interact with an environment using a random policy and then measure the performance.

You need to design a random agent using a Python class in order to modularize the code and keep the agent independent of the main loop. After that, you have to measure the mean and the variance of the discounted return using a batch of 100 episodes. You can use any environment you want, taking into account that the agent's actions should be compatible with the environment. You can design two different types of agents: one for discrete action spaces and one for continuous action spaces. The following steps will help you to complete the activity:

  1. Import the required libraries: abc, numpy, and gym.
  2. Define the Agent abstract class in a very simple way, defining only the pi() function that represents the policy. The input should be an environment state. The __init__ method should take as input the action space and build the distribution accordingly.
  3. Define a ContinuousAgent deriving from the Agent abstract class. The agent should check that the action space it receives is indeed a continuous action space. It should also initialize a probability distribution for sampling actions (you can use NumPy to define probability distributions). The continuous agent can change the distribution type according to the sampling distributions defined for Gym spaces (see the Gym Spaces section).
  4. Define a DiscreteAgent deriving from the Agent abstract class. The discrete agent should, of course, initialize a uniform distribution.
  5. Implement the pi() function for both agents. This function is straightforward: it should simply sample an action from the distribution defined in the constructor and return it, ignoring the environment state. Of course, this is a simplification. You can also implement the pi() function in the Agent base class.
  6. Define the main RL loop in another file by importing the agent.
  7. Instantiate the correct agent according to the selected environment. Examples of environments are "CartPole-v1" or "MountainCarContinuous-v0".
  8. Take actions according to the pi function of the agent.
  9. Measure the performance of the agent by collecting (in a list or a NumPy array) the discounted return of each episode. Then, take the average and the standard deviation (you can use NumPy for this). Remember to apply the user-defined discount factor to the immediate reward: you have to keep a cumulated discount factor by multiplying it by the discount factor at each timestep (a minimal sketch of this accumulation is shown after the activity).

    The output should be similar to the following:

    Episode Number: 0, Timesteps: 27, Return: 28.0

    Episode Number: 1, Timesteps: 9, Return: 10.0

    Episode Number: 2, Timesteps: 13, Return: 14.0

    Episode Number: 3, Timesteps: 16, Return: 17.0

    Episode Number: 4, Timesteps: 31, Return: 32.0

    Episode Number: 5, Timesteps: 10, Return: 11.0

    Episode Number: 6, Timesteps: 14, Return: 15.0

    Episode Number: 7, Timesteps: 11, Return: 12.0

    Episode Number: 8, Timesteps: 10, Return: 11.0

    Episode Number: 9, Timesteps: 30, Return: 31.0

    Statistics on Return: Average: 18.1, Variance: 68.89000000000001

    Note

    The solution to this activity can be found on page 680.
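As a hint for step 9, the discounted-return accumulation inside a single episode could look like the following minimal sketch; here, env and agent stand for the environment and the agent class you define in the activity, and the complete solution remains the one on page 680:

discount_factor = 0.99   # user-defined
episode_return = 0.0
cumulated_discount = 1.0

observation = env.reset()
done = False
while not done:
    # the agent's policy defined in the activity
    action = agent.pi(observation)
    observation, reward, done, info = env.step(action)
    # discount the immediate reward with the cumulated discount factor
    episode_return += cumulated_discount * reward
    cumulated_discount *= discount_factor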

OpenAI Baselines

OpenAI Baselines (https://github.com/openai/baselines) is a set of state-of-the-art RL algorithms. The main goal of Baselines is to make it easier to reproduce results on a set of benchmarks, to evaluate new ideas, and to compare them to existing algorithms. In this section, we will learn how to use Baselines to run an existing algorithm on an environment taken from Gym (refer to the previous section) and how to visualize the behavior learned by the agent. As with Gym, we will not cover the installation instructions; these can be found in the Preface. The implementation of the Baselines algorithms is based on TensorFlow, one of the most popular libraries for machine learning.

Getting Started with Baselines – DQN on CartPole

Training a Deep Q Network (DQN) on CartPole is straightforward with Baselines; we can do it with just one line of Bash.

Just use the terminal and run this command:

# Train model and save the results to cartpole_model.pkl

python -m baselines.run --alg=deepq --env=CartPole-v0 --save_path=./cartpole_model.pkl --num_timesteps=1e5

Let's understand the parameters, as follows:

  • --alg=deepq specifies the algorithm to be used to train our agent. In our case, we selected deepq, that is, DQN.
  • --env=CartPole-v0 specifies the environment to be used. We selected CartPole, but we can also select many other environments.
  • --save_path=./cartpole_model.pkl specifies where to save the trained agent.
  • --num_timesteps=1e5 is the number of training timesteps.

After having trained the agent, it is also possible to visualize the learned behavior using the following:

# Load the model saved in cartpole_model.pkl

# and visualize the learned policy

python -m baselines.run --alg=deepq --env=CartPole-v0 --load_path=./cartpole_model.pkl --num_timesteps=0 --play

DQN is a very powerful algorithm; using it for a simple task such as CartPole is almost overkill. We can see that the agent has learned a stable policy, and the pole almost never falls. We will explore DQN in more detail in the following chapters.

In the following steps, we will train a DQN agent on the CartPole environment using Baselines:

  1. First, we import gym and baselines:

    import gym

    # Import the desired algorithm from baselines

    from baselines import deepq

  2. Define a callback to inform Baselines when to stop training. The callback should return True when the reward is satisfactory:

    def callback(locals, globals):

        """

        function called at every step with state of the algorithm.

        If callback returns true training stops.

        stop training if average reward exceeds 199

        time should be greater than 100 and the average of

        last 100 returns should be >= 199

        """

        is_solved = (locals["t"] > 100 and \

                     sum(locals["episode_rewards"]\

                               [-101:-1]) / 100 >= 199)

        return is_solved

  3. Now, let's create the environment and prepare the algorithm's parameters:

    # create the environment

    env = gym.make("CartPole-v0")

    """

    Prepare learning parameters: network and learning rate

    the policy is a multi-layer perceptron

    """

    network = "mlp"

    # set learning rate of the algorithm

    learning_rate = 1e-3

  4. We can use the deepq.learn() method to start the training and solve the task:

    """

    launch learning on this environment using DQN

    ignore the exploration parameter for now

    """

    actor = deepq.learn(env, network=network, lr=learning_rate, \

                        total_timesteps=100000, buffer_size=50000, \

                        exploration_fraction=0.1, \

                        exploration_final_eps=0.02, print_freq=10, \

                        callback=callback,)

After some time, depending on your hardware (it usually takes a few minutes), the learning phase terminates, and you will have a trained CartPole agent (the actor returned by deepq.learn) ready to be saved and reused.

We should see the baselines logs reporting the agent's performance over time.

Consider the following example:

--------------------------------------
| % time spent exploring  | 2        |
| episodes                | 770      |
| mean 100 episode reward | 145      |
| steps                   | 6.49e+04 |
--------------------------------------

The following are the observations from the preceding logs:

  • The episodes parameter reports the number of episodes run so far.
  • mean 100 episode reward is the average return obtained in the last 100 episodes.
  • steps is the number of training steps the algorithm has performed.

Now we can save our actor so that we can reuse it without retraining it:

print("Saving model to cartpole_model.pkl")

actor.save("cartpole_model.pkl")

After calling actor.save, the cartpole_model.pkl file contains the trained model.
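If you restart your session later, one common pattern is to rebuild the actor without retraining by calling deepq.learn with zero training timesteps and the load_path argument (a sketch, assuming the gym and deepq imports from step 1):

# Reload the trained actor without any additional training
env = gym.make("CartPole-v0")
actor = deepq.learn(env, network="mlp", total_timesteps=0, \
                    load_path="cartpole_model.pkl")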

Now it is possible to use the model and visualize the agent's behavior.

The actor returned by deepq.learn is actually a callable that returns an action given the current observation; in other words, it is the agent's policy. We can use it by passing the current observation, and it returns the selected action:

# Visualize the policy

n_episodes = 5

n_timesteps = 1000

for episode in range(n_episodes):

    observation = env.reset()

    episode_return = 0

    for timestep in range(n_timesteps):

        # render the environment

        env.render()

        # select the action according to the actor

        action = actor(observation[None])[0]

        # call env.step function

        observation, reward, done, _ = env.step(action)

        """

        since the reward is undiscounted we can simply add

        the reward to the cumulated return

        """

        episode_return += reward

        if done:

            break

    """

    here an episode is terminated, print the return

    and the number of steps

    """

    print(f"Episode return {episode_return}, \

Number of steps: {timestep}")

If you run the preceding code, you should see the agent's performance on the CartPole task.

You should get, as output, the return for each episode; it should be something similar to the following:

Episode return 200.0, Number of steps: 199

Episode return 200.0, Number of steps: 199

Episode return 200.0, Number of steps: 199

Episode return 200.0, Number of steps: 199

Episode return 200.0, Number of steps: 199

This means that our agent always reaches the maximum possible return for CartPole-v0 (200.0); the reported number of steps is 199 because the timestep counter starts from 0.

We can compare the return obtained by the trained DQN agent with the return obtained by a random agent (Activity 1.01, Measuring the Performance of a Random Agent). The random agent yields an average return of around 20, while DQN obtains the maximum return possible for CartPole, which is 200.0.

In this section, we presented OpenAI Gym and OpenAI Baselines, the two main frameworks for RL research and experiments. There are many other frameworks for RL, each with its pros and cons. Gym is particularly appealing due to its unified interface for the RL loop, while OpenAI Baselines is very useful for understanding how sophisticated, state-of-the-art RL algorithms are implemented and how to compare new algorithms with existing ones.

In the following section, we will explore some interesting RL applications in order to better understand the possibilities offered by the framework as well as its flexibility.