The REINFORCE Algorithm aka Monte-Carlo Policy Differentiation

The setup for the general reinforcement learning problem is as follows. We're given an environment $\mathcal{E}$ with a specified state space $\mathcal{S}$ and an action space $\mathcal{A}$ giving the allowable actions in each of those states. Each action $a_t$ taken in a specific state $s_t$ yields a particular reward $r_t = r(s_t, a_t)$ based off a reward function $r$ that's in some way implicitly defined by the environment. We'd like to choose a policy $\pi$ giving a probability distribution of actions over states $\pi: \mathcal{S} \times \mathcal{A} \to [0, 1]$. In other words, $\pi(a_t \mid s_t)$ gives the probability of taking action $a_t$ in state $s_t$.

Since the whole problem of RL boils down to formulating an optimal policy which maximizes reward, we can define an objective function that explicitly quantifies how a policy fares in accomplishing this goal. First, we assume that we are using some sort of function approximator (e.g. a neural network) to obtain an approximation to the policy $\pi$ and we assume also that this approximator is governed by some set of parameters $\theta$. We say that the policy $\pi_\theta$ is parametrized by $\theta$.

Define our objective function $J$ by $$J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)} \left[ \sum_t r(s_t, a_t) \right]$$ In shorthand, our objective function returns the expected reward achieved by a given policy $\pi_\theta$ over some time horizon governed by $t$ (can be either finite or infinite). We write $\tau \sim p_\theta(\tau)$ to indicate that we're sampling trajectories $\tau$ from the probability distribution of our policy approximator governed by $\theta$. This distribution can be calculated by decomposing into a product of conditional probabilities, i.e. $$p_\theta(\tau) = p_\theta(s_1, a_1, \ldots, s_T, a_T) = p(s_1) \prod_{t=1}^T \pi_\theta(a_t \mid s_t) p(s_{t+1} \mid s_t, a_t)$$ We can now specify the optimal policy $\pi^* = \pi_{\theta^*} = arg\,max_\theta J(\theta)$.

That's all well and good, but the question becomes, how do we break down our objective function to something tractable. We need a way to accurately approximate that expectation, which in its exact form involves an integral over a probability distribution defined by our parametrized policy which we don't have access to. To do this, we can use something called Monte-Carlo Approximation. The idea is simple and is predicated on the following fact $$\lim_{N \to \infty} \frac{1}{N}\sum_{i=1}^N f(x_i)_{x_i \sim p(x)} = \mathbb{E}[f(x)]$$ Thus, if we sample $f(x_i)$, drawing $x_i$ from the probability distribution $p(x)$, $N$ times where $N$ is large but finite, we obtain a decent approximation to $\mathbb{E}[f(x)]$. Using Monte-Carlo approximation, we can rewrite our objective function as $$J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \sum_{t} r(s_{i, t}, a_{i, t})$$ where the $N$ samples are being directly drawn from the probability distribution defined by $\pi_\theta$ simply by running $\pi_\theta$, $N$ times.

Now that we have a tractable objective function, we still need to determine how best to iteratively permute our $\theta$ parameter values so as to arrive at the optimal setting $\theta^*$. The simplest approach is to perform gradient ascent on $J(\theta)$ (since we're taking $arg\,max$ over $\theta$). That means it's time to take some gradients. To simplify notation a bit, define the reward of a trajectory $\tau$ as $$r(\tau) = \sum_{t} r(s_t, a_t)$$ Then we can rewrite $J(\theta)$ as $$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta(\tau)}[r(\tau)] = \int \pi_\theta(\tau) r(\tau)\ d\tau$$ To get our gradient ascent update formula, we take the gradient of $J$ with respect to $\theta$ to get $$\nabla_\theta J(\theta) = \int \nabla_\theta \pi_\theta(\tau)r(\tau)\ d\tau$$ To get rid of the intractable integral, we can use a clever substitution. Note that $$\nabla_\theta \pi_\theta(\tau) = \pi_\theta \frac{\nabla_\theta \pi_\theta(\tau)}{\pi_\theta(\tau)} = \pi_\theta(\tau) \nabla_\theta \log{\pi_\theta(\tau)}$$ Plugging that into our expression for $\nabla_\theta J(\theta)$ gives $$\nabla_\theta J(\theta) = \int \nabla_\theta \pi_\theta(\tau) r(\tau)\ d\tau = \int \pi_\theta(\tau) \nabla_\theta \log{\pi_\theta} (\tau) r(\tau)\ d\tau = \mathbb{E}_{\tau \sim \pi_\theta(\tau)} [\nabla_\theta \log{\pi_\theta(\tau)r(\tau)}]$$ We've turned a gradient of an expectation into an expectation of a gradient, which is pretty cool, but we need to reduce things even further. Recall from before that the probability distribution $\pi_theta$ defines over trajectories $\tau$ is given as $$\pi_\theta(\tau = s_1, a_1, \ldots, s_T, a_T) = p(s_1) \prod_{t=1}^T \pi_\theta (a_t \mid s_t) p(s_{t+1} \mid s_t, a_t)$$ Taking logs of both sides breaks this down into a nice, convenient sum. $$\log \pi_\theta(\tau) = \log p(s_1) + \sum_{t} \log \left[ \pi_\theta(a_t \mid s_t) + \log p(s_{t+1} \mid s_t, a_t) \right]$$ Note that $\nabla_\theta \log \pi_\theta(\tau)$ is a term in our revised expression for $\nabla_\theta J(\theta)$ so we'd like to take the gradient of the previous formula and substitute that back in. \begin{align*} \nabla_\theta \log{\pi_\theta(\tau)} &= \nabla_\theta \left[ \log{p(s_1)} + \sum_{t} \log{\pi_\theta(a_t \mid s_t)} + \log{p(s_{t+1} \mid s_t, a_t)} \right] \\ &= \sum_{t} \nabla_\theta \log{\pi_\theta(a_t \mid s_t)} \end{align*} We were able to eliminate all but the middle term because the others did not depend on $\theta$.

Finally, we can plug this whole thing back into our expression for $\nabla_\theta J(\theta)$ to get $$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta(\tau)} \left[ \left(\sum_{t} \nabla_\theta \log{\pi_\theta}(a_t \mid s_t)\right) \left(\sum_t r(s_t, a_t)\right)\right]$$ Once again, we're left with an expectation. Like before, we can use Monte-Carlo Approximation to reduce this to a summation over samples. This gives $$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \left[ \left(\sum_{t} \nabla_\theta \log{\pi_\theta}(a_{i, t} \mid s_{i,t})\right) \left(\sum_t r(s_{i,t}, a_{i,t})\right)\right]$$ Now that we've finally reduced our expression to a usable form, we can update $\theta$ at each timestep according to the gradient ascent update rule $$\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$$ Now that we've derived our update rule, we can present the pseudocode for the REINFORCE algorithm in it's entirety.

The REINFORCE Algorithm

Sample trajectories $\{\tau_i\}_{i=1}^N from \pi_{\theta}(a_t \mid s_t)$ by running the policy.
Set $\nabla_\theta J(\theta) = \sum_i (\sum_t \nabla_\theta \log \pi_\theta(a^i_t \mid s^i_t)) (\sum_t r(s^i_t, a^i_t))$
$\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$

And that's it. While the derivation of the gradient update rule was relatively complex, the three-step algorithm is itself conceptually simple. In upcoming tutorials, I'll identify how to improve the REINFORCE algorithm with strategies which minimize variance.

Daniel McNeela

Home

Resume

Portfolio

Mathematics

Machine Learning

Biotech

Books

Bikes

The REINFORCE Algorithm aka Monte-Carlo Policy Differentiation