The REINFORCE Algorithm aka Monte-Carlo Policy Differentiation

The setup for the general reinforcement learning problem is as follows. We're given an environment $\mathcal{E}$ with a specified state space $\mathcal{S}$ and an action space $\mathcal{A}$ giving the allowable actions in each of those states. Each action $a_t$ taken in a specific state $s_t$ yields a particular reward $r_t = r(s_t, a_t)$ based on a reward function $r$ that's implicitly defined by the environment. We'd like to choose a policy $\pi: \mathcal{S} \times \mathcal{A} \to [0, 1]$ giving a probability distribution over actions for each state. In other words, $\pi(a_t \mid s_t)$ gives the probability of taking action $a_t$ in state $s_t$.
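To make this concrete, here's a minimal sketch of the agent-environment interaction loop in Python. The `env` object is a hypothetical Gym-style environment (its `reset`/`step` interface and return values are assumptions, not part of the setup above), and `policy(state)` is a hypothetical function returning a vector of action probabilities.

```python
import numpy as np

def run_episode(env, policy, max_steps=1000):
    """Roll out one trajectory tau = (s_1, a_1, r_1, ..., s_T, a_T, r_T)."""
    state = env.reset()
    trajectory = []
    for _ in range(max_steps):
        probs = policy(state)                          # pi(a | s) as a probability vector
        action = np.random.choice(len(probs), p=probs)
        # Assumed interface: step returns the next state, the reward r(s_t, a_t),
        # and a flag indicating the end of the episode.
        next_state, reward, done = env.step(action)
        trajectory.append((state, action, reward))
        if done:
            break
        state = next_state
    return trajectory
```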

Since the whole problem of RL boils down to finding an optimal policy that maximizes reward, we can define an objective function that explicitly quantifies how well a policy accomplishes this goal. First, we assume that we are using some sort of function approximator (e.g. a neural network) to approximate the policy $\pi$, and we also assume that this approximator is governed by some set of parameters $\theta$. We say that the policy $\pi_\theta$ is parametrized by $\theta$.
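As a rough sketch of what such an approximator might look like, here's a small PyTorch policy network for a discrete action space; the architecture and layer sizes are arbitrary choices for illustration, and $\theta$ corresponds to the network's weights.

```python
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """pi_theta(a | s): theta is the set of weights of this network."""
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state):
        # A softmax over the logits (handled inside Categorical) gives a
        # probability distribution over actions for the given state.
        return torch.distributions.Categorical(logits=self.net(state))
```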

Define our objective function $J$ by $$J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)} \left[ \sum_t r(s_t, a_t) \right]$$ In words, our objective function returns the expected reward achieved by a given policy $\pi_\theta$ over some time horizon indexed by $t$ (which can be either finite or infinite). We write $\tau \sim p_\theta(\tau)$ to indicate that we're sampling trajectories $\tau$ from the probability distribution induced by our policy approximator governed by $\theta$. This distribution can be calculated by decomposing it into a product of conditional probabilities, i.e. $$p_\theta(\tau) = p_\theta(s_1, a_1, \ldots, s_T, a_T) = p(s_1) \prod_{t=1}^T \pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)$$ We can now specify the optimal policy $\pi^* = \pi_{\theta^*}$, where $\theta^* = \arg\max_\theta J(\theta)$.
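To see the factorization of $p_\theta(\tau)$ in action, here's a toy example for a two-state, two-action MDP; all of the probabilities below are made-up numbers purely for illustration.

```python
import numpy as np

p_s1 = np.array([0.8, 0.2])                  # initial state distribution p(s_1)
pi = np.array([[0.6, 0.4],                   # pi_theta(a | s); rows are states
               [0.3, 0.7]])
P = np.array([[[0.9, 0.1], [0.2, 0.8]],      # dynamics p(s' | s, a); P[s, a, s']
              [[0.5, 0.5], [0.1, 0.9]]])

# One particular trajectory tau = (s_1, a_1, s_2, a_2, s_3, a_3).
states, actions = [0, 1, 1], [1, 0, 1]

# p_theta(tau) = p(s_1) * prod_t pi_theta(a_t | s_t) * p(s_{t+1} | s_t, a_t)
# (the transition out of the final step is omitted since the trajectory ends there).
p_tau = p_s1[states[0]]
for t in range(len(states)):
    p_tau *= pi[states[t], actions[t]]
    if t + 1 < len(states):
        p_tau *= P[states[t], actions[t], states[t + 1]]
print(p_tau)
```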

That's all well and good, but the question becomes: how do we break our objective function down into something tractable? We need a way to accurately approximate that expectation, which in its exact form involves an integral over a probability distribution defined by our parametrized policy, a distribution we don't have access to in closed form. To do this, we can use Monte-Carlo approximation. The idea is simple and is predicated on the following fact: $$\mathbb{E}_{x \sim p(x)}[f(x)] = \lim_{N \to \infty} \frac{1}{N}\sum_{i=1}^N f(x_i), \quad x_i \sim p(x)$$ Thus, if we sample $f(x_i)$, drawing $x_i$ from the probability distribution $p(x)$, $N$ times where $N$ is large but finite, we obtain a decent approximation to $\mathbb{E}[f(x)]$. Using Monte-Carlo approximation, we can rewrite our objective function as $$J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \sum_{t} r(s_{i, t}, a_{i, t})$$ where the $N$ samples are drawn directly from the probability distribution defined by $\pi_\theta$ simply by running $\pi_\theta$ $N$ times.
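In code, the Monte-Carlo estimate of $J(\theta)$ is just an average of total rewards over rollouts. The sketch below reuses the hypothetical `run_episode` helper from earlier.

```python
def estimate_objective(env, policy, n_samples=100):
    """Monte-Carlo estimate of J(theta): (1/N) sum_i sum_t r(s_{i,t}, a_{i,t})."""
    returns = []
    for _ in range(n_samples):
        trajectory = run_episode(env, policy)          # one sample tau_i ~ pi_theta
        returns.append(sum(reward for _, _, reward in trajectory))
    return sum(returns) / len(returns)
```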

Now that we have a tractable objective function, we still need to determine how to iteratively update our parameter values $\theta$ so as to arrive at the optimal setting $\theta^*$. The simplest approach is to perform gradient ascent on $J(\theta)$ (since we're taking $\arg\max$ over $\theta$). That means it's time to take some gradients. To simplify notation a bit, define the reward of a trajectory $\tau$ as $$r(\tau) = \sum_{t} r(s_t, a_t)$$ Then we can rewrite $J(\theta)$ as $$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta(\tau)}[r(\tau)] = \int \pi_\theta(\tau) r(\tau)\ d\tau$$ where we write $\pi_\theta(\tau)$ for the trajectory distribution $p_\theta(\tau)$ from before. To get our gradient ascent update formula, we take the gradient of $J$ with respect to $\theta$ to get $$\nabla_\theta J(\theta) = \int \nabla_\theta \pi_\theta(\tau)\, r(\tau)\ d\tau$$ To get rid of the intractable integral, we can use a clever substitution, often called the log-derivative trick. Note that $$\nabla_\theta \pi_\theta(\tau) = \pi_\theta(\tau) \frac{\nabla_\theta \pi_\theta(\tau)}{\pi_\theta(\tau)} = \pi_\theta(\tau) \nabla_\theta \log{\pi_\theta(\tau)}$$ Plugging that into our expression for $\nabla_\theta J(\theta)$ gives $$\nabla_\theta J(\theta) = \int \nabla_\theta \pi_\theta(\tau)\, r(\tau)\ d\tau = \int \pi_\theta(\tau) \nabla_\theta \log{\pi_\theta(\tau)}\, r(\tau)\ d\tau = \mathbb{E}_{\tau \sim \pi_\theta(\tau)} [\nabla_\theta \log{\pi_\theta(\tau)}\, r(\tau)]$$ We've turned the gradient of an expectation into an expectation of a gradient, which is pretty cool, but we need to reduce things even further. Recall from before that the probability distribution $\pi_\theta$ defines over trajectories $\tau$ is given by $$\pi_\theta(\tau) = \pi_\theta(s_1, a_1, \ldots, s_T, a_T) = p(s_1) \prod_{t=1}^T \pi_\theta (a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)$$ Taking the log of both sides breaks this down into a nice, convenient sum. $$\log \pi_\theta(\tau) = \log p(s_1) + \sum_{t} \left[ \log \pi_\theta(a_t \mid s_t) + \log p(s_{t+1} \mid s_t, a_t) \right]$$ Note that $\nabla_\theta \log \pi_\theta(\tau)$ is a term in our revised expression for $\nabla_\theta J(\theta)$, so we'd like to take the gradient of the previous formula and substitute that back in. \begin{align*} \nabla_\theta \log{\pi_\theta(\tau)} &= \nabla_\theta \left[ \log{p(s_1)} + \sum_{t} \log{\pi_\theta(a_t \mid s_t)} + \log{p(s_{t+1} \mid s_t, a_t)} \right] \\ &= \sum_{t} \nabla_\theta \log{\pi_\theta(a_t \mid s_t)} \end{align*} We were able to eliminate all but the middle term because the others do not depend on $\theta$: the initial state distribution $p(s_1)$ and the transition dynamics $p(s_{t+1} \mid s_t, a_t)$ are properties of the environment, not of the policy parameters.
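As a quick sanity check of the log-derivative trick itself (separate from REINFORCE), consider $x \sim \mathcal{N}(\theta, 1)$ and $f(x) = x^2$. Then $\mathbb{E}[f(x)] = \theta^2 + 1$, so the true gradient is $2\theta$, and the score-function estimator $\mathbb{E}[\nabla_\theta \log p_\theta(x)\, f(x)] = \mathbb{E}[(x - \theta)\, x^2]$ should recover it.

```python
import numpy as np

rng = np.random.default_rng(0)
theta, n = 1.5, 1_000_000
x = rng.normal(theta, 1.0, size=n)       # samples x_i ~ N(theta, 1)
score = x - theta                        # grad_theta log N(x; theta, 1)
estimate = np.mean(score * x**2)         # Monte-Carlo score-function estimate
print(estimate, 2 * theta)               # both should be close to 3.0
```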

Finally, we can plug this whole thing back into our expression for $\nabla_\theta J(\theta)$ to get $$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta(\tau)} \left[ \left(\sum_{t} \nabla_\theta \log{\pi_\theta}(a_t \mid s_t)\right) \left(\sum_t r(s_t, a_t)\right)\right]$$ Once again, we're left with an expectation. Like before, we can use Monte-Carlo approximation to reduce this to a summation over samples. This gives $$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \left[ \left(\sum_{t} \nabla_\theta \log{\pi_\theta}(a_{i, t} \mid s_{i,t})\right) \left(\sum_t r(s_{i,t}, a_{i,t})\right)\right]$$ Now that we've finally reduced our expression to a usable form, we can update $\theta$ at each iteration according to the gradient ascent update rule $$\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$$ With the update rule in hand, we can present the pseudocode for the REINFORCE algorithm in its entirety.
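In an automatic differentiation framework, we don't compute this gradient by hand; instead we build the scalar surrogate $\frac{1}{N}\sum_i \left(\sum_t \log \pi_\theta(a_{i,t} \mid s_{i,t})\right) \left(\sum_t r(s_{i,t}, a_{i,t})\right)$ and let backpropagation produce $\nabla_\theta J(\theta)$. Here's a sketch in PyTorch, assuming the hypothetical `PolicyNetwork` from earlier and trajectories stored as `(states, actions, rewards)` lists:

```python
import numpy as np
import torch

def reinforce_update(policy, optimizer, trajectories):
    """One gradient-ascent step theta <- theta + alpha * grad J(theta)."""
    losses = []
    for states, actions, rewards in trajectories:
        dist = policy(torch.as_tensor(np.asarray(states), dtype=torch.float32))
        log_probs = dist.log_prob(torch.as_tensor(actions))  # log pi_theta(a_t | s_t)
        total_reward = float(sum(rewards))                   # sum_t r(s_t, a_t)
        # Minimizing the negative surrogate is equivalent to ascending J(theta).
        losses.append(-(log_probs.sum() * total_reward))
    loss = torch.stack(losses).mean()                        # 1/N sum over trajectories
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```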

The REINFORCE Algorithm
  1. Sample trajectories $\{\tau_i\}_{i=1}^N$ from $\pi_{\theta}(a_t \mid s_t)$ by running the policy.
  2. Compute $\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \left(\sum_t \nabla_\theta \log \pi_\theta(a^i_t \mid s^i_t)\right) \left(\sum_t r(s^i_t, a^i_t)\right)$
  3. $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$
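Here's how those three steps might fit together in a training loop, reusing the hypothetical `PolicyNetwork` and `reinforce_update` sketches from above; the environment interface, hyperparameters, and trajectory counts are all placeholder assumptions.

```python
import torch

def train_reinforce(env, state_dim, n_actions, iterations=500, n_traj=10, lr=1e-2):
    policy = PolicyNetwork(state_dim, n_actions)
    optimizer = torch.optim.Adam(policy.parameters(), lr=lr)
    for _ in range(iterations):
        # Step 1: sample N trajectories by running the current policy.
        trajectories = []
        for _ in range(n_traj):
            states, actions, rewards = [], [], []
            state, done = env.reset(), False
            while not done:
                dist = policy(torch.as_tensor(state, dtype=torch.float32))
                action = dist.sample().item()
                next_state, reward, done = env.step(action)  # assumed Gym-style step
                states.append(state)
                actions.append(action)
                rewards.append(reward)
                state = next_state
            trajectories.append((states, actions, rewards))
        # Steps 2 and 3: estimate grad J(theta) and take a gradient-ascent step.
        reinforce_update(policy, optimizer, trajectories)
    return policy
```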
And that's it. While the derivation of the gradient update rule was relatively involved, the three-step algorithm itself is conceptually simple. In upcoming tutorials, I'll show how to improve the REINFORCE algorithm with strategies that reduce the variance of this gradient estimate.