TensorFlow for Reinforcement Learning

In this section, we will learn how to create, run, and save a policy network using TensorFlow. Policy networks are one of the fundamental pieces of reinforcement learning, if not the most important one. As will be shown throughout this book, they are a powerful way to store the knowledge the agent has to learn: they tell the agent how to choose actions based on environment observations.

Implementing a Policy Network Using TensorFlow

Building a policy network is not too different from building a common deep learning model. Its goal is to output the "optimal" action given the input it receives, which represents the environment's observation. It therefore acts as a link between the environment state and the optimal agent behavior associated with it. Being optimal here means doing what maximizes the expected cumulative reward of the agent.

To make things as clear as possible, we will focus on a specific problem here, but the same approach can be adopted to solve other tasks, such as controlling a robotic arm or teaching locomotion to a humanoid robot. We will see how to create a policy network for a classic control problem that will also be at the core of an exercise later in this chapter. This problem is the "CartPole" problem: the goal is to maintain the balance of the vertical pole so that it is upright at all times. Here, the only way to do this is by moving the cart along either direction of the x-axis. The following figure shows a frame from this problem:

Figure 4.11: CartPole control problem

As we mentioned previously, the policy network links the observations of the environment with the actions that the agent can take. So, they act as the input and output, respectively.

As we saw in the previous chapter, this is the first information that you need in order to build a neural network. To retrieve the input and output dimensions, you have to instantiate the environment (in this case, this is done via OpenAI Gym) and print out information about the observation and action spaces.

Let's perform this first task by completing the following exercise.

Exercise 4.02: Building a Policy Network with TensorFlow

In this exercise, we will learn how to build a policy network with TensorFlow for a given Gym environment. We will learn how to take its observation space and action space into account, which constitute the input and output of the network, respectively. We will then create a deep learning model that is able to generate actions for the agent in response to environment observations. This network is the piece that needs to be trained, and training it is the final goal of every RL algorithm. Follow these steps to complete this exercise:

  1. Import the required modules:

    import numpy as np

    import gym

    import tensorflow as tf

  2. Instantiate the environment:

    env = gym.make('CartPole-v0')

  3. Print out the action and observation spaces:

    print("Action space =", env.action_space)

    print("Observation space =", env.observation_space)

    This prints out the following:

    Action space = Discrete(2)

    Observation space = Box(4,)

  4. Print out the action and observation space dimensions:

    print("Action space dimension =", env.action_space.n)

    print("Observation space dimension =",
          env.observation_space.shape[0])

    The output will be as follows:

    Action space dimension = 2

    Observation space dimension = 4

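    As you can see from the preceding output, the action space is a discrete space of dimension 2, meaning it can take the value 0 or 1. The observation space is of the Box type with a dimension of 4, meaning it consists of four real numbers (cart position, cart velocity, pole angle, and pole angular velocity), each lying between a lower and an upper boundary. For the CartPole environment, an episode ends once the cart position leaves ±2.4 or the pole angle leaves ±0.20943951 rad, while the two velocities are unbounded (±inf). The limits stored in the Box itself for position and angle are twice these termination thresholds.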

    With this information, it is now possible to build a policy network that can be interfaced with the CartPole environment. The following code block shows one of many possible choices: it uses two hidden layers with 64 neurons each and an output layer with 2 neurons (as this is the action space's dimension) with a softmax activation function. The model summary prints out the outline of the model.

  5. Build the policy network and print its summary:

    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation='relu',
                              input_shape=[env.observation_space.shape[0]]),
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dense(env.action_space.n, activation="softmax"),
    ])
    model.summary()

    The output will be as follows:

    Model: "sequential_2"
    _________________________________________________________________
    Layer (type)                 Output Shape              Param #
    =================================================================
    dense (Dense)                (None, 64)                320
    _________________________________________________________________
    dense_1 (Dense)              (None, 64)                4160
    _________________________________________________________________
    dense_2 (Dense)              (None, 2)                 130
    =================================================================
    Total params: 4,610
    Trainable params: 4,610
    Non-trainable params: 0

As you can see, the model has been created, and its summary lists each layer together with its output shape and number of parameters, as well as the total parameter count of the network.
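The parameter counts in the summary can be checked by hand: a fully connected layer with `n_inputs` inputs and `n_units` neurons has `n_inputs * n_units` weights plus `n_units` biases. The following minimal sketch reproduces the three figures for the layer sizes used in this exercise. The helper name `dense_params` is introduced here only for illustration and is not part of TensorFlow.

```python
def dense_params(n_inputs, n_units):
    """Trainable parameters of a fully connected layer: weights plus biases."""
    return n_inputs * n_units + n_units


# Layer sizes used in this exercise: 4 observations -> 64 -> 64 -> 2 actions
layer_sizes = [4, 64, 64, 2]
counts = [dense_params(n_in, n_out)
          for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

print(counts)       # [320, 4160, 130]
print(sum(counts))  # 4610
```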

Note

To access the source code for this specific section, please refer to https://packt.live/3fkxfce.

You can also run this example online at https://packt.live/2XSXHnF.

Once the policy network has been built and initialized, it is possible to feed it with observations. Of course, since the network hasn't been trained, its outputs depend only on its randomly initialized weights, but it can still be used, for example, to run an untrained agent in an environment of choice. This is what we will implement in the following exercise: the observation returned by the environment's step or reset function is passed to the neural network model through the predict method, which outputs the action probabilities. The action with the highest probability is chosen and used to step through the environment until the episode ends.
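To make the last two steps concrete, the following sketch shows what the softmax output layer and the greedy choice do, using only the Python standard library. The raw scores in `logits` are made-up values chosen for illustration, not outputs of the actual network, and the `softmax` helper is our own rather than a TensorFlow function.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(logits)
    # Subtracting the maximum keeps the exponentials numerically stable
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw outputs of the last layer for the two CartPole actions
logits = [0.3, -0.1]
action_probabilities = softmax(logits)

# Greedy selection: the index of the largest probability is the action
action = max(range(len(action_probabilities)),
             key=action_probabilities.__getitem__)

print("Action probabilities =", action_probabilities)
print("Action =", action)  # Action = 0
```

In the exercise, `np.argmax` plays the role of the greedy selection step shown here.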

Exercise 4.03: Feeding the Policy Network with Environment State Representation

In this exercise, we will feed the policy network with the environment state representation. This exercise is a continuation of Exercise 4.02, Building a Policy Network with TensorFlow, so you need to perform all the steps of the preceding exercise first and then begin this one right after. Follow these steps to complete this exercise:

  1. Reset the environment:

    t = 0  # index of the current timestep, starting from zero

    observation = env.reset()

  2. Start a loop that will run until the episode is complete. Render the environment and print the observations:

    while True:

        env.render()

        # Print the observation

        print("Observation = ", observation)

  3. Feed the network with the environment observation, let it choose an action, and print it:

        action_probabilities = model.predict(
            np.expand_dims(observation, axis=0))

        action = np.argmax(action_probabilities)

        print("Action = ", action)

  4. Step through the environment with the selected action. Print the received reward and close the environment if the terminal state has been reached:

        observation, reward, done, info = env.step(action)

        # Print received reward

        print("Reward = ", reward)

        # If terminal state reached, close the environment

        if done:

            print("Episode finished after {} timesteps".format(t+1))

            break

        t += 1

    env.close()

    This produces the following output (only the last few lines have been shown):

    Observation = [-0.00324467 -1.02182257 0.01504633 1.38740738]

    Action = 0

    Reward = 1.0

    Observation = [-0.02368112 -1.21712879 0.04279448 1.684757 ]

    Action = 0

    Reward = 1.0

    Observation = [-0.0480237 -1.41271906 0.07648962 1.99045154]

    Action = 0

    Reward = 1.0

    Observation = [-0.07627808 -1.60855467 0.11629865 2.30581208]

    Action = 0

    Reward = 1.0

    Observation = [-0.10844917 -1.80453455 0.16241489 2.63191088]

    Action = 0

    Reward = 1.0

    Episode finished after 11 timesteps

    Note

    To access the source code for this specific section, please refer to https://packt.live/2AmwUHw.

    You can also run this example online at https://packt.live/3kvuhVQ.

By completing this exercise, we've built a policy network and used it to guide an agent's behavior in a Gym environment. At the moment, it behaves randomly, but apart from policy network training, which will be explained in the following chapters, every other piece of the big picture is already in place.
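The loop from the exercise can also be wrapped in a reusable function. The following is a minimal sketch under two assumptions: the environment follows the classic Gym interface used in this chapter (`reset()` returns an observation and `step()` returns four values), and the policy is any callable that maps an observation to an action. `run_episode` and `FixedLengthEnv` are hypothetical names introduced here. The stand-in environment only exists so that the snippet runs without Gym installed.

```python
def run_episode(env, policy):
    """Run one episode and return (number of steps, total reward)."""
    observation = env.reset()
    steps, total_reward, done = 0, 0.0, False
    while not done:
        action = policy(observation)
        observation, reward, done, _ = env.step(action)
        total_reward += reward
        steps += 1
    return steps, total_reward


class FixedLengthEnv:
    """Hypothetical stand-in environment that ends after `length` steps."""

    def __init__(self, length):
        self.length = length
        self.t = 0

    def reset(self):
        self.t = 0
        return [0.0, 0.0, 0.0, 0.0]

    def step(self, action):
        self.t += 1
        done = self.t >= self.length
        # Same reward scheme as CartPole: +1 for every step survived
        return [0.0, 0.0, 0.0, 0.0], 1.0, done, {}


steps, total_reward = run_episode(FixedLengthEnv(length=10),
                                  policy=lambda obs: 0)
print(steps, total_reward)  # 10 10.0
```

With the objects from Exercise 4.03, the policy could be `lambda obs: np.argmax(model.predict(np.expand_dims(obs, axis=0)))` and the environment the CartPole instance `env`.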

How to Save a Policy Network

The goal of reinforcement learning is to effectively train the network so that it learns how to perform the optimal action for every given environment state. RL theory deals with how to achieve this goal and, as we will see, different approaches have been successful. Supposing one of them has been applied to the previous network, the trained model needs to be saved so that it can be loaded every time the agent needs to be run in the environment.

To save the policy network, we follow the same steps used to save a common neural network: the weights of all the layers are dumped into a file so that they can be loaded into the network again at a later stage. The following code is an example of this implementation:

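import os

# Directory and name used to build the weights file path
save_dir = "./"
model_name = "modelName"

print("Saving best model to {}".format(save_dir))

# Dump the weights of all the layers into an HDF5 file
model.save_weights(os.path.join(save_dir,
                                'model_{}.h5'.format(model_name)))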

This produces the following output:

Saving best model to ./
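Loading works in the opposite direction: a network with the same architecture is rebuilt and the saved weights are copied into it with `load_weights`. The following is a minimal, self-contained sketch rather than the book's own code. It assumes the CartPole sizes (4 observations, 2 actions), writes to a temporary directory, and uses a file name ending in `.weights.h5`, a suffix that recent Keras versions require for `save_weights`. The helper `build_policy` and the sample observation are introduced only for this illustration.

```python
import os
import tempfile

import numpy as np
import tensorflow as tf


def build_policy(n_observations=4, n_actions=2):
    """Rebuild the architecture from Exercise 4.02 (CartPole sizes assumed)."""
    return tf.keras.Sequential([
        tf.keras.Input(shape=(n_observations,)),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(n_actions, activation="softmax"),
    ])


# Assumption: a throwaway directory, so the demo leaves no files behind
save_dir = tempfile.mkdtemp()
weights_path = os.path.join(save_dir, "model_demo.weights.h5")

model = build_policy()
model.save_weights(weights_path)

# A freshly built network starts from different random weights...
restored = build_policy()
# ...until the saved ones are loaded into the identical architecture
restored.load_weights(weights_path)

# Arbitrary non-zero observation used to compare the two networks
observation = np.array([[0.1, -0.2, 0.05, 0.3]], dtype=np.float32)
same = np.allclose(model(observation).numpy(),
                   restored(observation).numpy())
print("Outputs match after loading:", same)
```

Because only the weights are stored, the loading side has to recreate exactly the same layers before calling `load_weights`.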

In this section, we learned how to create, run, and save a policy network using TensorFlow. Once the inputs (environment states/observations) and outputs (actions the agent can perform) are clear, there is no big difference with respect to standard deep neural networks. The model has also been used to run the agent: when fed with the environment state, it produced actions for the agent to take. Because the network is untrained, the agent behaved randomly. The only missing piece in this section is how to effectively train the policy network, which is the goal of reinforcement learning and will be covered in detail in this book and, partially, in the following sections.

Now that we've learned how to build a policy network with TensorFlow, let's dive into another OpenAI resource that will allow us to easily train an RL agent.