OpenAI Baselines

So far, we have studied two frameworks for setting up and working with reinforcement learning problems (OpenAI Gym and OpenAI Universe). We also studied how to create the "brain" of the agent, known as the policy network, with TensorFlow.

The next step is to train the agent so that it learns to act optimally through experience alone. Learning how to train an RL agent is the ultimate goal of this book. We will see how the most advanced methods work and examine their internal elements and algorithms. But even before we cover the details of how these approaches are implemented, we can rely on tools that make the task more straightforward.

OpenAI Baselines is a Python-based tool, built on TensorFlow, that provides a library of high-quality, state-of-the-art implementations of reinforcement learning algorithms. It can be used as an out-of-the-box module, but it can also be customized and expanded. We will be using it to solve a classic control problem and a classic Atari video game by training a custom policy network.

Note

Please make sure you have installed OpenAI Baselines by using the instructions mentioned in the preface, before moving on.

Proximal Policy Optimization

Before using it, it is worth getting a high-level idea of what Proximal Policy Optimization (PPO) is. We will stay at a high level when describing this state-of-the-art RL algorithm because understanding it in depth requires familiarity with the topics presented in the following chapters. Those chapters will also prepare you to study and build other state-of-the-art RL methods by the end of this book.

PPO is a reinforcement learning method that is part of the policy gradient family. Algorithms in this category aim to directly optimize the policy, instead of building a value function to then generate a policy. To do so, they instantiate a policy (in our case, in the form of a deep neural network) and build a method to calculate a gradient that defines where to move the policy function's approximator parameters (the weights of our deep neural network, in our case) to directly improve the policy. The word "proximal" suggests a specific feature of these methods: in the policy update step, when adjusting policy parameters, the update is constrained, thus preventing it from moving "too far" from the starting policy. All these aspects will be transparent to the user, thanks to the OpenAI Baselines tool, which will take care of carrying out the job under the hood. You will learn about these aspects in the upcoming chapters.
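To make the idea of a constrained update more concrete, the following minimal sketch evaluates the clipped surrogate objective described in the PPO paper for a single sample. It is an illustration only, not the Baselines implementation: the function name `clipped_surrogate` and its arguments are chosen here for clarity, `ratio` stands for the probability of the chosen action under the new policy divided by its probability under the old policy, `advantage` is an estimate of how much better than average the action was, and `epsilon=0.2` is assumed as a typical clipping range.

```python
def clipped_surrogate(ratio, advantage, epsilon=0.2):
    """Per-sample PPO clipped objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A).

    Illustrative helper (not part of OpenAI Baselines).
    """
    # Keep the probability ratio inside [1 - epsilon, 1 + epsilon].
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    # Taking the minimum removes any incentive to push the ratio
    # outside the clipping range in the direction that looks beneficial.
    return min(ratio * advantage, clipped_ratio * advantage)


# Compare a few (ratio, advantage) pairs to see when clipping takes effect.
for ratio, advantage in [(1.5, 1.0), (0.5, 1.0), (1.5, -1.0), (0.5, -1.0)]:
    print(ratio, advantage, clipped_surrogate(ratio, advantage))
```

When the advantage is positive, the objective stops growing once the ratio exceeds 1 + epsilon, and when it is negative, it stops improving once the ratio drops below 1 - epsilon. This is one way to read the statement that the new policy cannot profit from moving "too far" from the starting policy.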

Note

Please refer to the following paper to learn more about PPO: https://arxiv.org/pdf/1707.06347.pdf.

Command-Line Usage

As stated earlier, OpenAI Baselines allows us to train agents on OpenAI Gym problems with state-of-the-art RL algorithms easily. The following command, for example, trains an agent with PPO for 20 million steps in the Pong Gym environment:

python -m baselines.run --alg=ppo2 --env=PongNoFrameskip-v4 \
    --num_timesteps=2e7 --save_path=./models/pong_20M_ppo2 \
    --log_path=./logs/Pong/

The command saves the model in the user-defined save path, so it is possible to reload the weights into the policy network and deploy the trained agent in the environment with the following command-line instruction:

python -m baselines.run --alg=ppo2 --env=PongNoFrameskip-v4 \
    --num_timesteps=0 --load_path=./models/pong_20M_ppo2 --play

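You can train any available method on any compatible OpenAI Gym environment by changing only the command-line arguments, without knowing how the method works internally. Keep in mind that some algorithms support only certain types of action space; DQN, for example, requires a discrete one.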
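Since only the arguments change from one run to the next, the commands can also be assembled programmatically. The following sketch builds the argument list for the two commands shown earlier. The helper `baselines_command` is a hypothetical convenience function written for this example, not part of Baselines, and it uses only the flags that appear in the commands above.

```python
def baselines_command(alg, env, num_timesteps, save_path=None,
                      load_path=None, log_path=None, play=False):
    """Build the argument list for `python -m baselines.run`.

    Hypothetical helper for illustration; it only formats the flags
    used in this section and does not launch anything by itself.
    """
    args = ["python", "-m", "baselines.run",
            f"--alg={alg}", f"--env={env}",
            f"--num_timesteps={num_timesteps}"]
    if save_path:
        args.append(f"--save_path={save_path}")
    if load_path:
        args.append(f"--load_path={load_path}")
    if log_path:
        args.append(f"--log_path={log_path}")
    if play:
        args.append("--play")
    return args


# Training run: same arguments as the first command in this section.
train_cmd = baselines_command("ppo2", "PongNoFrameskip-v4", "2e7",
                              save_path="./models/pong_20M_ppo2",
                              log_path="./logs/Pong/")

# Playback run: reload the saved weights and watch the agent play.
play_cmd = baselines_command("ppo2", "PongNoFrameskip-v4", "0",
                             load_path="./models/pong_20M_ppo2",
                             play=True)

print(" ".join(train_cmd))
print(" ".join(play_cmd))
```

With Baselines installed, either list could be passed to `subprocess.run(train_cmd, check=True)` to launch the run from Python. Switching to another algorithm would then be a matter of changing the `alg` string.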

Methods in OpenAI Baselines

OpenAI Baselines gives us access to the following RL algorithm implementations:

  • A2C: Advantage Actor-Critic
  • ACER: Actor-Critic with Experience Replay
  • ACKTR: Actor-Critic using Kronecker-factored Trust Region
  • DDPG: Deep Deterministic Policy Gradient
  • DQN: Deep Q-Network
  • GAIL: Generative Adversarial Imitation Learning
  • HER: Hindsight Experience Replay
  • PPO2: Proximal Policy Optimization
  • TRPO: Trust Region Policy Optimization

For the upcoming exercise and activity, we will be using PPO.

Custom Policy Network Architecture

Despite its out-of-the-box usability, OpenAI Baselines can also be customized and expanded. In particular, it is possible to provide the module with a custom definition of the policy network architecture, a feature that we will use in the next two sections of this chapter.

It is important to understand that the custom network is used only as an encoder of the environment state or observation. OpenAI Baselines then takes care of creating the final layer, which links the latent space (the space of embeddings) to the proper output layer. The output layer is chosen according to the action space of the selected environment: whether it is discrete or continuous, and how many actions are available.
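The split between the encoder and the final layer can be pictured with a small, deliberately simplified sketch. It does not reproduce what Baselines does internally: the classes `DiscreteActions` and `ContinuousActions` and the function `output_layer_spec` are hypothetical stand-ins, used only to show how the size and meaning of the final layer could follow from the type of action space.

```python
from dataclasses import dataclass


@dataclass
class DiscreteActions:
    """Hypothetical stand-in for a discrete action space with n actions."""
    n: int


@dataclass
class ContinuousActions:
    """Hypothetical stand-in for a continuous action space with dim components."""
    dim: int


def output_layer_spec(action_space, latent_size):
    """Describe a final layer mapping the encoder's latent vector to policy outputs."""
    if isinstance(action_space, DiscreteActions):
        # One logit per available action; a softmax turns them into probabilities.
        outputs, meaning = action_space.n, "action logits"
    elif isinstance(action_space, ContinuousActions):
        # One mean value per action component of a Gaussian policy.
        outputs, meaning = action_space.dim, "Gaussian means"
    else:
        raise TypeError(f"unsupported action space: {action_space!r}")
    return {"meaning": meaning, "weight_shape": (latent_size, outputs)}


# The same 64-dimensional encoder output feeds different final layers.
print(output_layer_spec(DiscreteActions(n=6), latent_size=64))
print(output_layer_spec(ContinuousActions(dim=1), latent_size=64))
```

The point of the sketch is that the custom network only needs to produce the latent vector; the shape of what comes after it is determined by the environment, not by the user.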

First, the user needs to import the Baselines register function, which allows them to define a custom network and register it under a user-defined name. Then, they can define a custom deep learning model in the form of a function that builds the desired architecture. In this way, we are able to change the policy network architecture at will, testing different solutions to find the best one for a specific problem. A practical example will be presented in the exercise in the following section.
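As a preview of the pattern just described, here is a minimal sketch of registering a custom encoder. It assumes that `register` can be imported from `baselines.common.models` and that the builder returns an inner function receiving the input shape and returning a Keras model, which matches the TensorFlow 2 branch of Baselines as far as we know; other versions may instead pass the input tensor, so treat the inner signature as an assumption. The name `custom_mlp` and its parameters are illustrative. A small fallback is included only so that the snippet runs when Baselines is not installed.

```python
try:
    from baselines.common.models import register
except ImportError:
    # Stand-in used only when Baselines is missing, so the sketch still runs.
    # It mimics the decorator pattern: store the builder under a name.
    _FALLBACK_REGISTRY = {}

    def register(name):
        def _decorator(func):
            _FALLBACK_REGISTRY[name] = func
            return func
        return _decorator


@register("custom_mlp")
def custom_mlp(num_layers=2, num_hidden=64, activation="tanh"):
    """Builder for a small fully connected encoder (illustrative names)."""
    def network_fn(input_shape):
        # TensorFlow is imported lazily, so registration works without it.
        import tensorflow as tf

        x_input = tf.keras.Input(shape=input_shape)
        h = x_input
        for i in range(num_layers):
            h = tf.keras.layers.Dense(units=num_hidden,
                                      activation=activation,
                                      name=f"custom_mlp_fc{i}")(h)
        # The model stops at the latent representation; the output layer
        # is expected to be added by Baselines, as described above.
        return tf.keras.Model(inputs=[x_input], outputs=[h])
    return network_fn


# Calling the builder only configures the encoder; nothing is built yet.
builder = custom_mlp(num_layers=3, num_hidden=32)
print(callable(builder))
```

Keeping the architecture inside a builder function means that experiments with different depths or layer sizes only require changing the arguments passed to `custom_mlp`.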

Now, we are ready to train our first RL agent and solve a classic control problem.