Policy iteration

Unlike value iteration, in policy iteration we start with the random policy, then we find the value function of that policy; if the value function is not optimal then we find the new improved policy. We repeat this process until we find the optimal policy.

There are two steps in policy iteration:

  1. Policy evaluation: Evaluating the value function of a randomly estimated policy.
  2. Policy improvement: Upon evaluating the value function, if it is not optimal, we find a new improved policy:

The steps involved in the policy iteration are as follows:

  1. First, we initialize some random policy
  2. Then we find the value function for that random policy and evaluate to check if it is optimal which is called policy evaluation
  3. If it is not optimal, we find a new improved policy, which is called policy improvement
  4. We repeat these steps until we find an optimal policy

Let us understand intuitively by performing policy iteration manually step by step.

Consider the same grid example we saw in the section Value iteration. Our goal is to find the optimal policy:

  1. Initialize a random policy function.

Let us initialize a random policy function by specifying random actions to each state:

say A -> 0

       B -> 1

       C -> 0

  1. Find the value function for the randomly initialized policy.

Now we have to find the value function using our randomly initialized policy. Let us say our value function after computation looks like the following:

 

Now that we have a new value function according to our randomly initialized policy, let us compute a new policy using our new value function. How do we do this? It is very similar to what we did in Value iteration. We calculate Q value for our new value function and then take actions for each state which has a maximum value as the new policy.

Let us say the new policy results in:

A - > 0

B - > 1

C -> 1

We check our old policy, that is, the randomly initialized policy, and the new policy. If they are same, then we have attained the convergence, that is, found the optimal policy. If not, we will update our old policy (random policy) as a new policy and repeat from step 2.

Sound confusing? Look at the pseudo code:

policy_iteration():
Initialize random policy
for i in no_of_iterations:
Q_value = value_function(random_policy)
new_policy = Maximum state action pair from Q value
if random_policy == new policy:
break
random_policy = new_policy
return policy