The Problem with Policy Gradient

If you’ve read my article about the REINFORCE algorithm, you should be familiar with the update that’s typically used in policy gradient methods.

$$\nabla_{\theta} J(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta}(\tau)} \left[ \left( \sum_{t} \nabla_\theta \log{\pi_\theta}(a_t \mid s_t)\right) \left(\sum_t r(s_t, a_t)\right)\right]$$

It’s an extremely elegant and theoretically satisfying method that suffers from only one problem - it doesn’t work well in practice. Shocking, I know! Jokes abound about how flimsy policy gradient methods can be when applied to practical problems. One such joke goes like this: if you’d like to reproduce the results of any RL policy gradient method as reported in academic papers, make sure you contact the authors and get the settings they used for their random seed. Indeed, sometimes policy gradient can feel like nothing more than random search dressed up in mathematical formalism. The reasons for this are at least threefold (I won’t rule out the possibility that there are more problems with this method of which I’m not yet aware), namely that

  1. Policy gradient is high variance.
  2. Convergence in policy gradient algorithms is sloooow.
  3. Policy gradient is terribly sample inefficient.

I’ll walk through each of these in reverse because flouting the natural order of things is fun. 🙂

Sample Inefficiency

In order to get anything useful out of policy gradient, it’s necessary to sample from your policy and observe the resulting reward literally millions of times. Because we’re sampling directly from the policy we’re optimizing, we say that policy gradient is an on-policy algorithm. If you look at the formula for the gradient update, we’re calculating an expectation, and we’re doing so in the Monte Carlo way: by averaging over a number of trial runs. Within that, we have to sum over all the steps in a single trajectory, which itself can be frustratingly expensive to run depending on the nature of the environment you’re working with. So we’re iterating sums over sums, and the result is that we incur huge computational costs just to get anything useful. This works fine in the realms where policy gradient has been successfully applied. If all you’re interested in is training your computer to play Atari games, then policy gradient might not be a terrible choice. However, imagine using this process in anything remotely resembling a real-world task, like training a robotic arm to perform open-heart surgery, perhaps? Hello, medical malpractice lawsuits.

That said, sample inefficiency is not a problem that’s unique to policy gradient methods by any means. It’s an issue that plagues many different RL algorithms, and addressing it is key to producing models that are useful in the real world. If you’re interested in sample-efficient RL algorithms, check out the work that’s being done at Microsoft Research.
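
Returning to the cost of on-policy sampling: here is a minimal sketch of what a single gradient update has to pay for. The environment, the random policy, and all the numbers below are made up purely for illustration (this isn’t any particular library’s API); the point is just the nested $N \times T$ sampling loop.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_env_step(state, action):
    """A made-up one-dimensional environment: the state drifts with the action
    plus a little noise, and the reward penalizes distance from the origin."""
    next_state = state + 0.1 * action + 0.01 * rng.normal()
    reward = -abs(next_state)
    return next_state, reward

def sample_trajectory(policy_std=1.0, T=200):
    """Roll out one on-policy trajectory: T environment interactions."""
    state = 0.0
    states, actions, rewards = [], [], []
    for _ in range(T):
        action = rng.normal(0.0, policy_std)  # sample from the current policy
        states.append(state)
        actions.append(action)
        state, reward = toy_env_step(state, action)
        rewards.append(reward)
    return np.array(states), np.array(actions), np.array(rewards)

# A single Monte Carlo gradient estimate averages over N whole trajectories,
# i.e. N * T environment steps, and (being on-policy) those samples are thrown
# away as soon as the parameters are updated.
N, T = 100, 200
batch = [sample_trajectory(T=T) for _ in range(N)]
print(f"environment steps consumed for one gradient update: {N * T}")
```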

Slow Convergence

This issue pretty much goes hand in hand with the sample inefficiency discussed above and the problem of high variance to be discussed below. Having to sample entire trajectories on-policy before each gradient update is slow to begin with, and the high variance in the updates makes the search optimization highly inefficient which means more sampling which means more updates, ad infinitum. We’ll discuss some remedies for this in the next section.

High Variance

The updates made by policy gradient are very high variance. To get a sense for why this is, first consider that in RL we’re dealing with highly general problems, such as teaching a car to navigate an unpredictable environment or programming an agent to perform well across a diverse set of video games. When we sample multiple trajectories from our untrained policy, we’re therefore bound to observe highly variable behaviors. Without any a priori model of the system we’re seeking to optimize, we begin with a policy whose distribution over actions in a given state is effectively uniform. Of course, as we train the model we hope to shape the probability density so that it’s unimodal on a single action, or possibly multimodal over a few successful actions that can be taken in that state. However, acquiring this knowledge requires our model to observe the outcomes of many different actions taken in many different states. This is made exponentially worse in continuous action or state spaces, where visiting even close to every state-action pair is computationally intractable. Because we’re using Monte Carlo estimates in policy gradient, we trade off between computational feasibility and gradient accuracy. It’s a fine line to walk, which is why variance reduction techniques can potentially yield huge payoffs.

Another way to think about the variance introduced into the policy gradient update is as follows: at each time step in your trajectory you’re observing some stochastic event. Each such event has some noise, and the accumulation of even a small amount of noise across a number of time steps results in a high variance outcome. Yet, understanding this allows us to suggest some ways to alter policy gradient so that the variance might ultimately be reduced.
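
Before getting to those fixes, a quick way to see the accumulation-of-noise effect is to simulate it directly. The sketch below (NumPy; the reward model and numbers are made up) adds a small amount of noise to each of $T$ per-step rewards and shows that the empirical variance of the summed return grows with the trajectory length.

```python
import numpy as np

rng = np.random.default_rng(0)

def return_variance(T, noise_std=0.1, n_trajectories=10_000):
    """Empirical variance of a return built from T noisy per-step rewards."""
    step_rewards = 1.0 + noise_std * rng.normal(size=(n_trajectories, T))
    returns = step_rewards.sum(axis=1)  # one total return per trajectory
    return returns.var()

# With independent per-step noise the variance grows roughly linearly in T,
# i.e. about T * noise_std**2 -- small per-step noise, large total variance.
for T in (10, 100, 1000):
    print(f"T={T:5d}  Var[return] ~ {return_variance(T):.2f}")
```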

Improvements to Policy Gradient

Reward to Go

The first “tweak” we can use is incredibly simple. Let’s take a look again at that policy gradient update.

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta(\tau)} \left[ \left(\sum_{t} \nabla_\theta \log{\pi_\theta}(a_t \mid s_t)\right) \left(\sum_t r(s_t, a_t)\right)\right]$$

If we expand the expectation as a Monte Carlo estimate over $N$ sampled trajectories, we get

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \left[ \left(\sum_{t=1}^T \nabla_\theta \log{\pi_\theta}(a_{i,t} \mid s_{i,t})\right) \left(\sum_{t=1}^T r(s_{i,t}, a_{i,t})\right)\right]$$

If we distribute $\sum_{t=1}^T r(s_{i,t}, a_{i,t})$ into the innermost sum involving $\nabla_\theta \log \pi_{\theta}$, we see that we’re taking the gradient of $\log \pi_\theta$ at a given time step $t$ and weighting it by the sum of rewards over all timesteps. However, it would make a lot more sense to weight this gradient only by the rewards it can actually affect. In other words, the action taken at time $t$ can only influence the rewards accrued at time $t$ and beyond. To that end, we replace the full return $\sum_{t=1}^T r(s_{i,t}, a_{i,t})$ in the gradient update with the partial sum $\sum_{t'=t}^T r(s_{i,t'}, a_{i,t'})$ and call this quantity $\hat{Q}_{i,t}$, the “reward to go”. This quantity is closely related to the $Q$ function, hence the similarity in notation. For clarity, the entire policy gradient update now becomes

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T \nabla_\theta \log{\pi_\theta}(a_{i,t} \mid s_{i,t}) \left(\sum_{t'=t}^T r(s_{i,t'}, a_{i,t'})\right)$$
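
As a sanity check, the reward-to-go weights are nothing more than a reversed cumulative sum of the per-step rewards. A minimal NumPy sketch (the function name and numbers are mine):

```python
import numpy as np

def reward_to_go(rewards):
    """Q_hat[t] = sum of rewards from step t through the end of the trajectory."""
    return np.cumsum(rewards[::-1])[::-1]

rewards = np.array([1.0, 0.0, 2.0, -1.0])
print(rewards.sum())          # full return used by vanilla REINFORCE: 2.0
print(reward_to_go(rewards))  # per-step weights: [ 2.  1.  1. -1.]

# In the estimator above, grad log pi(a_t | s_t) at step t is now weighted by
# reward_to_go(rewards)[t] rather than by the full-trajectory return.
```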

Baselines

The next technique for reducing variance is not quite as obvious, but it still yields great results. If you think about how policy gradient works, you’ll notice that the optimization step depends heavily on the reward function we choose. Given a trajectory $\tau$ with a negative return $r(\tau) = \sum_{t} r(s_t, a_t)$, the update pushes probability density away from $\tau$; for trajectories with positive return, probability density is increased. However, if we do something as simple as setting $r(\tau) \leftarrow r(\tau) + b$, where $b$ is a constant large enough that the shifted return is now positive, then we will actually increase the probability weight on $\tau$, even though $\tau$ still fares worse than the trajectories whose returns were already positive. Given how sensitive the update is to shifting and scaling of the chosen reward function, it’s natural to ask whether we can find an optimal constant $b$ such that (note: we’re working with whole trajectories here, so some of the sums from the original policy gradient formulation are condensed)

$$\frac{1}{N} \sum_{i=1}^N \nabla_\theta \log \pi_\theta(\tau_i) [r(\tau_i) - b]$$

has minimum variance. We call such a $b$ a baseline. We also want to ensure that subtracting $b$ in this way doesn’t bias our estimate of the gradient. Let’s do that first. Recall the identity we used in the original policy gradient derivation

$$\pi_\theta(\tau) \nabla \log \pi_\theta(\tau) = \nabla \pi_\theta(\tau)$$

To show that our estimator remains unbiased, we need to show that $$\mathbb{E}\left[\nabla \log \pi_\theta(\tau_i)\,[r(\tau_i) - b]\right] = \mathbb{E} \left[\nabla \log \pi_\theta(\tau_i)\, r(\tau_i)\right]$$ Equivalently, we can show that $\mathbb{E} [\nabla \log \pi_\theta(\tau_i)\, b]$ is equal to zero. We have

$$ \begin{align*} \mathbb{E} [\nabla \log \pi_\theta(\tau_i)\, b] &= \int \pi_\theta(\tau_i) \nabla \log \pi_\theta(\tau_i)\, b \ d\tau_i \\ &= \int \nabla \pi_\theta(\tau_i)\, b \ d\tau_i \\ &= b \nabla \int \pi_\theta(\tau_i) \ d\tau_i \\ &= b \nabla 1 \\ &= 0 \end{align*} $$

where we use the fact that $\int \pi_\theta(\tau_i) \ d\tau_i$ is 1 because $\pi_\theta$ is a probability distribution, and the fact that $b$ is a constant with respect to $\theta$. Therefore, our baseline-enhanced version of the policy gradient remains unbiased.
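
If you’d like numerical reassurance, this zero-expectation fact is easy to check by Monte Carlo for a policy whose score function has a closed form. Here’s a small NumPy sketch for a one-dimensional Gaussian “policy” with mean $\theta$ and unit variance, where $\nabla_\theta \log \pi_\theta(a) = a - \theta$ (the setup and constants are mine, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

theta, b = 0.5, 3.0  # policy mean and an arbitrary constant baseline
n_samples = 1_000_000

# Sample actions from a unit-variance Gaussian policy with mean theta.
actions = rng.normal(theta, 1.0, size=n_samples)

# Closed-form score function for this policy: grad_theta log pi(a) = a - theta.
scores = actions - theta

# E[grad_theta log pi(a) * b] should be exactly zero; the sample mean is zero
# up to Monte Carlo noise on the order of b / sqrt(n_samples).
print(np.mean(scores * b))
```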

The question then becomes: how do we choose an optimal setting of $b$? One natural candidate is the average return $b = \frac{1}{N} \sum_{i=1}^N r(\tau_i)$ over all trajectories in the batch. In this case, our returns are “centered”: returns that are better than average end up positively weighted, whereas those that are worse are negatively weighted. This actually works quite well, but it is not, in fact, optimal. To calculate the optimal setting, let’s look at the policy gradient’s variance. In general, we have

$$ \begin{align} Var[x] &= \mathbb{E}[x^2] - \mathbb{E}[x]^2 \\ \nabla J(\theta) &= \mathbb{E}_{\tau \sim \pi_{\theta}(\tau)} \left[ \nabla \log \pi_{\theta}(\tau)\, (r(\tau) - b)\right] \\ Var[\nabla J(\theta)] &= \mathbb{E}_{\tau \sim \pi_{\theta}(\tau)} \left[\left(\nabla \log \pi_{\theta}(\tau)\, (r(\tau) - b)\right)^2\right] - \mathbb{E}_{\tau \sim \pi_{\theta}(\tau)} \left[ \nabla \log \pi_{\theta}(\tau)\, (r(\tau) - b)\right]^2 \end{align} $$

The rightmost term in this expression is just the square of the expected policy gradient, which we just showed is unaffected by the baseline, so for the purposes of optimizing $b$ we can ignore it. We therefore turn our attention to the first term. To simplify notation, we can write $$g(\tau) = \nabla \log \pi_\theta(\tau)$$ Then we take the derivative with respect to $b$ to get $$ \begin{align*} \frac{d\,Var}{db} &= \frac{d}{db} \mathbb{E}\left[ g(\tau)^2(r(\tau) - b)^2\right] \\ &= \frac{d}{db}\left(\mathbb{E}[g(\tau)^2 r(\tau)^2] - 2b\, \mathbb{E}[g(\tau)^2 r(\tau)] + b^2\, \mathbb{E}[g(\tau)^2]\right) \\ &= -2\, \mathbb{E}[g(\tau)^2 r(\tau)] + 2b\, \mathbb{E}[g(\tau)^2] \end{align*} $$ Setting this derivative to zero and solving for $b$ gives

$$b = \frac{\mathbb{E}[g(\tau)^2r(\tau)]}{\mathbb{E}[g(\tau)^2]}$$

In other words, the optimal setting for $b$ is the expected return, but with each trajectory reweighted by the squared magnitude of its gradient $g(\tau)$.
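
To make this concrete, here’s a small NumPy sketch that estimates both the average-return baseline and the optimal baseline from samples and compares the variance of the resulting per-sample gradient estimates. The one-dimensional Gaussian policy (score $= a - \theta$) and the made-up reward function are purely illustrative assumptions of mine.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: a one-dimensional Gaussian "policy" with mean theta and unit
# variance, so the per-sample score is grad_theta log pi(a) = a - theta.
theta = 0.5
n = 200_000
actions = rng.normal(theta, 1.0, size=n)
scores = actions - theta                                              # g(tau_i)
returns = actions**2 + 2.0 * actions + rng.normal(0.0, 0.5, size=n)   # made-up returns

# Average-return baseline vs. the variance-optimal baseline derived above.
b_avg = returns.mean()                                     # b = (1/N) sum_i r(tau_i)
b_opt = np.mean(scores**2 * returns) / np.mean(scores**2)  # b = E[g^2 r] / E[g^2]

# Variance of the per-sample gradient estimate under each choice of baseline.
for name, b in [("no baseline", 0.0), ("average return", b_avg), ("optimal", b_opt)]:
    grad_samples = scores * (returns - b)
    print(f"{name:>14}: b = {b:6.3f}  Var = {grad_samples.var():8.3f}")
```

On this toy problem the optimal baseline gives a strictly smaller variance than the average-return baseline, while leaving the mean of the gradient estimate unchanged.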

Conclusion

Hopefully this gave you a good overview of how to improve policy gradient implementations to speed up convergence and reduce variance. In a future article, I’ll discuss how to derive an off-policy version of policy gradient, which improves sample efficiency and speeds up convergence.
