The Personal Website of Daniel McNeela
The personal blog, resume, and portfolio of Daniel McNeela. I am an aspiring computer scientist, mathematician, and writer.
http://mcneela.github.io/
Fri, 24 Apr 2020 22:58:21 +0000
Jekyll v3.8.5
Subgradient Descent
<p><i>This post was originally featured on my other blog at <a href="https://flamethrower.ai/blog/206669/subgradient-descent">Flamethrower AI</a></i></p>
<h1>Introduction</h1>
<p>You've probably heard of the gradient descent algorithm. It's probably the most widely used algorithm
in machine learning, used to train everything from neural networks to logistic regression. What you probably
didn't know though, is that it doesn't always work. That's right, gradient descent has some preconditions that
your loss function needs to satisfy in order for the algorithm to run. One of these is that the loss function
must be differentiable. That means that when you need to optimize a loss function that's not differentiable,
such as the L1 loss or hinge loss, you're flat out of luck.</p>
</br>
<p><i>Or are you?</i> Thankfully, you're actually not. There's a little-known method that's been around for a long
time within the field of convex optimization that uses the notion of <i>subgradients</i> to perform optimization.
We're going to tackle the theory behind it in this post today, then build our own implementation of it in Python
in order to create our own soft-margin SVM classifier with hinge loss to classify text messages as spam or ham.
Let's get started.</p>
<h1>The Theory</h1>
<h2>The Definition of a Subgradient</h2>
<p>A subgradient is a generalization of the gradient which is useful for non-differentiable, but convex, functions.
The basic idea is this: When a function is convex, you can draw a line (or in higher dimensions, a hyperplane)
through each point $(x, f(x))$ on its graph and that line will
underapproximate the function everywhere. This is baked into the definition of convexity. In fact, for many points
$x$ there is more than one such line. A subgradient at $x$ is simply the slope of any one of these lines, and it is defined mathematically
as
$$g \in \mathbb{R}^n \text{ such that } f(z) \geq f(x) + g^\top (z - x) \text{ for all } z \in \text{dom}(f)$$
The definition can be a little bit confusing, so let me break it down piece by piece. The vector $g$ is the subgradient
and it's also what's called a <i>normal vector</i>.
It lies perpendicular to the hyperplane which underapproximates $f(z)$. The normal vector is all that's needed to define
the angle of the hyperplane, and by varying $z$, we can land at any point on this hyperplane.</p>
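<p>To make the definition concrete, here is a minimal numerical sketch for the one-dimensional function $f(x) = |x|$. The helper name <code>is_subgradient</code> and the grid of test points are my own choices for illustration; the check it performs is the inequality $f(z) \geq f(x) + g(z - x)$ on a finite grid, so it is a sanity check rather than a proof.</p>

```python
import numpy as np

def is_subgradient(f, g, x, zs, tol=1e-12):
    """Check f(z) >= f(x) + g * (z - x) on a grid of test points zs (1-D case)."""
    return bool(np.all(f(zs) >= f(x) + g * (zs - x) - tol))

zs = np.linspace(-5.0, 5.0, 1001)

# At the kink x = 0, every slope in [-1, 1] gives a supporting line for |x|.
for g in (-1.0, -0.3, 0.0, 0.7, 1.0):
    assert is_subgradient(np.abs, g, 0.0, zs)

# A slope outside [-1, 1] cuts through the graph, so it is not a subgradient.
assert not is_subgradient(np.abs, 1.5, 0.0, zs)

# Away from the kink, only the ordinary derivative works.
assert is_subgradient(np.abs, 1.0, 2.0, zs)
assert not is_subgradient(np.abs, 0.5, 2.0, zs)
```

<p>The kink at $0$ is exactly where there are infinitely many subgradients, while at $x = 2$ the only valid choice is the usual derivative.</p>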
</br>
<p>Another way of characterizing the subgradient $g$ is via what's called the <i>epigraph</i> of $f$. The epigraph of $f$
is just all the area above and including the graph of $f$. In other words, it's the set you'd get if you colored in above
the graph of $f$. We say that $g$ is a subgradient if the hyperplane it defines <i>supports</i> the epigraph of $f$.
</p>
</br>
<p>When we say "support" here, it has a particular meaning and is not to be confused with the support of a function in an
analysis sense. More concretely, a hyperplane is said to support a set $S$ if it intersects $S$ in at least one point
(for our purpose, this point is $(x, f(x))$) and the remainder of the entire set $S$ lies on one side of the hyperplane.
Because the epigraph of a convex function lies above each of these hyperplanes, this is a natural way to conceptualize
things.</p>
</br>
<h2>Subdifferentials</h2>
<p>Now that we've introduced subgradients, let's move on to a broader concept, that of a <i>subdifferential</i>. We say that
a function $f$ is <i>subdifferentiable</i> if for each point $x \in \text{dom}(f)$ there exists a subgradient $g_x$ at
that point. Now, it's important to note that the subgradient $g_x$ for a point $x$ need not be unique. In fact, for any point
$x$, there can exist either 0, 1, or infinitely many such subgradients. We call the set of all such subgradients at a point
$x$ the <i>subdifferential</i> of $f$ at that point, and we denote it as $\partial f(x)$. In other words,
$$\partial f(x) = \{g_x \mid f(z) \geq f(x) + g_x^\top (z - x)\quad \forall z \in \text{dom}(f)\}$$
So why talk about subdifferentials? Well, they have some nice properties and integrate nicely with the regular theory of
differentiation. Allow me to explain.</p>
</br>
<p>First off, the subdifferential $\partial f(x)$ associated with a point $x$ is <b>always</b> a closed, convex set. This
follows from the fact that it can be viewed as the intersection of an infinite collection of half-spaces, one for each $z$.
$$\partial f(x) = \bigcap_{z \in \text{dom}(f)} \{g_x \mid f(z) \geq f(x) + g_x^\top (z-x)\}$$
This means the problem of picking a subgradient $g_x$ at a given point $x$ according to some convex criterion is a tractable one, because
optimization over closed, convex sets is well understood.</p>
</br>
<p>The more important point, though, is that the subdifferential satisfies many of the same laws of calculus as the
standard differential. Concretely, much like regular differentials, we can
<ul>
<li>Scale by nonnegative constants - $\partial (\alpha f)(x) = \alpha\, \partial f(x)$ for $\alpha \geq 0$</li>
<li>Distribute over sums (and, under suitable regularity conditions, over integrals and expectations) - $\partial (f_1 + f_2)(x) = \partial f_1(x) + \partial f_2(x)$</li>
</ul>
</p>
</br>
<p>Another important point is the direct connection of gradients and subdifferentials. If a convex function $f$ is differentiable
at a point $x$, then there exists only one subgradient $g_x$ to $f$ at $x$ and for the subdifferential we have
$$\partial f(x) = \{\nabla f(x)\}$$
In other words, when $f$ is differentiable the subgradient is unique and coincides with the gradient.
Thus in some sense, the subgradient is a generalization of the gradient that applies in more general situations
(e.g. nondifferentiability).</p>
<h2>Enough with the Definitions Already</h2>
<p>Okay, let's move on from tedious definitions and see how we can start using subgradients to solve a problem that's
typically intractable with standard, gradient-based methods. For this, let's take a look at the $\ell_1$ norm.</p>
</br>
<p>The $\ell_1$ norm is defined as
$$\|x\|_1 = \sum_{i=1}^n |x_i|$$
It is a nondifferentiable and convex function of $x$. This means that while we can't tackle its optimization with gradient descent,
we can approach it using the method of subgradients. Let's see how that works.</p>
</br>
<p>Before we get started, we need to recite a quick fact about subgradients. Namely, suppose we have a sequence of functions $f_1, \ldots, f_n$
where each of the $f_i$ is convex and subdifferentiable. Then their pointwise maximum $f(x)$, defined as
$$f(x) = \max_{i} f_i(x)$$
is also subdifferentiable and has
$$\partial f(x) = {\bf Co} \left(\bigcup \{\partial f_i(x) \mid f_i(x) = f(x)\} \right)$$
In other words, to find the subdifferential of the max, $f(x)$, we take the convex hull of the union of subdifferentials
for each function $f_i(x)$ which attains the pointwise maximum of $\{f_1, \ldots, f_n\}$ at each point $x$.
It's a little bit of a convoluted and long-winded statement, but hopefully it will make more sense when we apply it to the
problem of calculating the subgradient of the $\ell_1$ norm.</p>
</br>
<h3>Applying the Rule</h3>
<p>In order to apply this rule of maximums, we have to find a way to rewrite the $\ell_1$ norm as a pointwise maximum of
convex, subdifferentiable functions. This is actually pretty easy to do. To see why, consider the vector
$$s = [s_1, \ldots, s_n] \quad s_i \in \{-1, 1\}$$
where each of the $s_i$ is either equal to 1 or -1. The idea is you could set $s_i = -1$ when $x_i < 0$ and $s_i = 1$
when $x_i > 0$. When $x_i = 0$, either $s_i = 1$ or $s_i = -1$ can be used. Thus, we have
$$\|x\|_1 = \max \{s^\top x \mid s_i \in \{-1, 1\} \}$$
Since we've rewritten the $\ell_1$ norm as a max of functions, we can now apply the rule to find its subdifferential.
Since each of the $f_i$ has the form $s^\top x$, each is differentiable with unique (sub)gradient $s$.
Therefore the subdifferential of $f = \max_i f_i$ is just the convex hull of the vectors $s$ that attain the maximum at $x$, which can be
expressed as
$$\partial \|x\|_1 = \{g \mid \|g\|_\infty \leq 1,\ g^\top x = \|x\|_1\}$$
</p>
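<p>As a quick sanity check of that result, the sketch below picks one particular subgradient of the $\ell_1$ norm, namely $\text{sign}(x)$ with $0$ chosen on the zero coordinates, and confirms it satisfies both conditions as well as the subgradient inequality at randomly drawn points. The function name <code>l1_subgradient</code> and the test vector are assumptions made for illustration.</p>

```python
import numpy as np

def l1_subgradient(x):
    """Return one subgradient of ||x||_1: sign(x_i), with 0 chosen where x_i == 0."""
    return np.sign(x)

x = np.array([1.5, -2.0, 0.0, 0.25])
g = l1_subgradient(x)

# g satisfies both conditions: ||g||_inf <= 1 and g^T x = ||x||_1.
assert np.max(np.abs(g)) <= 1.0
assert np.isclose(g @ x, np.sum(np.abs(x)))

# Spot-check the subgradient inequality ||z||_1 >= ||x||_1 + g^T (z - x).
rng = np.random.default_rng(0)
for _ in range(1000):
    z = rng.normal(size=x.shape)
    assert np.sum(np.abs(z)) >= np.sum(np.abs(x)) + g @ (z - x) - 1e-12
```

<p>On the coordinate where $x_i = 0$, any value in $[-1, 1]$ would have worked equally well; $0$ is just a convenient choice.</p>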
<h2>The Subgradient Method</h2>
<p>Now, we arrive at our culminating theoretical moment, that of applying subgradients to problems of convex optimization.
We do so using what's called the <i>subgradient method</i> which looks almost identical to gradient descent. The algorithm
is an iteration which asserts that we make steps according to
$$x^{(k+1)} = x^{(k)} - \alpha_k g^{(k)}$$
where $\alpha_k$ is our learning rate and $g^{(k)}$ is any subgradient of $f$ at $x^{(k)}$. There are a few key differences when compared with gradient descent though. The first
of these is that our $\alpha_k$ is chosen by a rule fixed in advance (for example a constant or a $1/k$ schedule) and not determined dynamically on the fly by a line search. The next is that the subgradient
method is not a true descent method. This is because each subsequent step $x^{(k+1)}$ is not guaranteed to decrease our objective
function value. In general, we keep track of our objective values and pick their min at the end of iteration, i.e. we set
$$f^{\star} = \min_k \{f^{(k)} = f(x^{(k)})\}$$</p>
<p>To see why the objective function value is not guaranteed to decrease with each iterate, we need only look to the definition
of a subgradient. Namely, we have
$$f(x^{(k+1)}) = f(x^{(k)} - \alpha_k g^{(k)}) \geq f(x^{(k)}) + {g^{(k)}}^\top(x^{(k)} - \alpha_k g^{(k)} - x^{(k)}) =
f(x^{(k)}) - \alpha_k {g^{(k)}}^\top g^{(k)} = f(x^{(k)}) - \alpha_k \|g^{(k)}\|_2^2$$
Since $f(x^{(k+1)})$ can be any value greater than or equal to $f(x^{(k)}) - \alpha_k \|g^{(k)}\|_2^2$, including $f(x^{(k)}) + C$ for
some positive constant $C$, it's clear that the subgradient method is not a guaranteed descent method, and the fact that
the decrease in $f(x)$ at each time step can be at most $\alpha_k \|g^{(k)}\|_2^2$ speaks to some of the
slowness of the method. That said, with a suitable step size schedule the best iterate is guaranteed to get within $\epsilon$ of the optimal value in some semi-reasonable
amount of time, so it's not all bad. One just needs to keep in mind the caveats of the method.</p>
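<p>Here is a minimal sketch of the iteration on a toy problem, minimizing $f(x) = \|x - c\|_1$. The target vector <code>c</code>, the $\alpha_k = 1/k$ schedule, the iteration count, and the function name <code>subgradient_method</code> are all assumptions picked for illustration, not part of any library.</p>

```python
import numpy as np

def subgradient_method(f, subgrad, x0, n_iters=500):
    """Run x <- x - alpha_k * g with alpha_k = 1/k, remembering the best iterate."""
    x = np.asarray(x0, dtype=float)
    best_x, best_f = x.copy(), f(x)
    history = [best_f]
    for k in range(1, n_iters + 1):
        g = subgrad(x)
        x = x - (1.0 / k) * g          # step size comes from a fixed schedule
        fx = f(x)
        history.append(fx)
        if fx < best_f:                # not a descent method, so track the best
            best_x, best_f = x.copy(), fx
    return best_x, best_f, history

c = np.array([1.0, -2.0, 3.0])
f = lambda x: np.sum(np.abs(x - c))     # minimized at x = c, where f = 0
subgrad = lambda x: np.sign(x - c)      # one valid subgradient of f at x

best_x, best_f, history = subgradient_method(f, subgrad, np.zeros(3))
went_up = any(b > a for a, b in zip(history, history[1:]))
print(f"best objective: {best_f:.4f}, some step increased f: {went_up}")
```

<p>Because individual steps can overshoot the kink and raise the objective, the function returns the best iterate seen rather than the last one.</p>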
<h1>Let's Get Practical</h1>
<p>Now that we've finally elucidated most of the math behind subgradients, let's start applying them to a real world problem. In this
exercise, we'll devise a soft-margin SVM to classify text messages as either "spam" or "ham". We'll then train it using hinge loss
and the subgradient method and implement the entire thing in Python code, from scratch. Let's get started.</p>
<h2>An SVM-like Classifier</h2>
<p>I'm not going to go into the entire theory of SVMs as that is worthy of a series of posts all on its own. However, I will walk through
the basics of how the classifier will work, and then jump into the coding of the subgradient algorithm.</p>
<p>In short, an SVM uses a hyperplane of the form $\mathbf{w}^\top\mathbf{x} + b = 0$ to classify points of a training set into one of two classes.
Points that lie on one side of the hyperplane are assigned to one class, whereas points that lie on the other side are assigned to the other.
Mathematically, this classification can be represented as the function
$$y = \text{sign}(\mathbf{w}^\top\mathbf{x} + b)$$
The model is trained in such a way as to maximize the <i>margin</i> that the hyperplane decision boundary produces. Roughly speaking,
this is the spacing between the decision boundary and the closest point to either side.</p>
<p>When the data in the training set is linearly separable, we can classify it perfectly using an SVM decision boundary and the definition
of margin. However, rarely in life does our data come to us with perfect separation between the classes. As such, we give the opportunity
to the SVM to make some mistakes while still aiming to classify most of the points correctly and maximize margin. For this purpose, we introduce
<i>slack variables</i> and call the resultant model a <i>soft-margin SVM</i>. To learn more about all of these architecture components,
I encourage you to read any of the many great articles available online or in textbooks.</p>
<p>In this lesson, we're going to train a variant of a soft-margin SVM. Soft-margin SVMs are trained using the <i>hinge loss</i> which is defined
mathematically as
$$\ell(y, t) = \max (0, 1 - ty)$$
where $y = \mathbf{w}^\top\mathbf{x} + b$ is our model's prediction and $t$ is the target output value. This loss function is not differentiable at $ty = 1$, so
you know what that means? That's right, it's time for the subgradient method to shine! To use it, we need to calculate the subgradient of this loss function.</p>
<h3>The Hinge Loss Subgradient</h3>
<p>In order to train the model via the subgradient method we'll need to know what the subgradients of the hinge loss actually are. Let's calculate
that now. Since the hinge loss is piecewise differentiable, this is pretty straightforward. We have
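$$\frac{\partial}{\partial w_i} (1 - t(\mathbf{w}^\top\mathbf{x} + b)) = -tx_i$$
and
$$\frac{\partial}{\partial w_i} 0 = 0$$
The first subgradient holds for $ty < 1$ and the second holds for $ty > 1$. At the kink $ty = 1$ either one (or any convex combination of the two) is a valid subgradient, and we'll use the second.</p>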
<h3>The Code</h3>
<p>Okay, now with the math out of the way, let's get to some of the code. For this task we'll be classifying text messages as "ham" or "spam" using
the data available <a href="https://www.kaggle.com/uciml/sms-spam-collection-dataset">here</a>. Download the file "spam.csv" and extract it to a
location of your choice. I created a subfolder in my code directory called <code>data</code> and placed it there.</p>
<h4>Loading the Data</h4>
<p>Let's create a function to load and transform the data. We'll use the scikit-learn <code>CountVectorizer</code> class to create
a bag-of-words representation for each input text message in the training data set. We'll also use some of the <code>gensim</code>
preprocessing utilities to help clean up the inputs. I created the following functions which do all of the above.</p>
<pre><code class="language-python">import numpy as np
from gensim.parsing.preprocessing import preprocess_string
from sklearn.feature_extraction.text import CountVectorizer

def clean_text(l):
    # Naive CSV split: keeps the message text up to its first comma.
    fields = l.strip().split(',')
    return preprocess_string(fields[1])

def get_texts(lines):
    return list(map(clean_text, lines))

def convert_label(l, hs_map):
    # The first CSV field is the label, "ham" or "spam".
    fields = l.strip().split(',')
    key = preprocess_string(fields[0])[0]
    return hs_map[key]

def get_labels(lines, hs_map):
    return list(map(lambda x: convert_label(x, hs_map), lines))

def load_data(file='data/spam.csv'):
    with open(file, 'r', encoding='ISO-8859-1') as f:
        lines = f.readlines()
    lines = lines[1:]  # remove header line
    hs_map = {'ham': 1, 'spam': -1}
    y = get_labels(lines, hs_map)
    texts = get_texts(lines)
    texts = [' '.join(x) for x in texts]
    bow = CountVectorizer()  # bag-of-words counts, returned as a sparse matrix
    X = bow.fit_transform(texts)
    return X, np.array(y)</code></pre>
<p>We make generous use of the Python <code>map</code> functionality to map various preprocessing functionality across our strings. The actual
bag of words vectorization is a simple one-line call thanks to sklearn's <code>CountVectorizer</code>. Note that <code>CountVectorizer</code>
returns a scipy <code>sparse matrix</code> which will introduce some caveats into our training code. More on that in a bit.</p>
<p>Next, let's start implementing the loss functions. This is pretty simple with <code>numpy</code>. We have,</p>
<pre><code class="language-python">def hinge_loss(t, y):
    return np.maximum(0, 1 - t * y)

def hinge_subgrad(t, y, x):
    if t * y < 1:
        subgrad = (-t * x).toarray()
    else:
        subgrad = np.zeros(x.shape)
    return subgrad</code></pre>
<p>We have to call the <code>.toarray()</code> method on the first clause of <code>hinge_subgrad</code> due to the fact that <code>x</code>
will be a sparse matrix. This method just turns the sparse result into a regular numpy array. Note also that we use <code>np.maximum</code>
rather than <code>np.max</code>. <code>np.maximum</code> takes the element-wise maximum of two arrays (or an array and a scalar), whereas <code>np.max</code> reduces a single array to its largest entry.</p>
<p>Now, the hinge loss as we've implemented it only calculates the loss for a single training example $(x, t)$. We want to add in a
function which aggregates the loss across all examples in the training set. We can do that with the following function</p>
<pre><code class="language-python">def loss(w, X, y):
    preds = X @ w
    losses = [hinge_loss(t, y_) for t, y_ in zip(y, preds)]
    return np.mean(losses)</code></pre>
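<p>Since the training loop evaluates this loss over the whole dataset many times, a vectorized variant may be worth trying. The following is only a sketch: <code>loss_vectorized</code> is a name I made up, and the tiny dense arrays stand in for the real sparse bag-of-words matrix.</p>

```python
import numpy as np

def loss_vectorized(w, X, y):
    """Mean hinge loss over the whole dataset, without a Python-level loop."""
    margins = y * np.asarray(X @ w).ravel()   # t * y for every example at once
    return float(np.mean(np.maximum(0.0, 1.0 - margins)))

# Tiny dense example; with a scipy sparse X, X @ w also works
# as long as w is a 1-D numpy array.
X = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
y = np.array([1, -1, 1])
w = np.array([0.5, 0.5])
print(loss_vectorized(w, X, y))
```

<p>It computes the same mean as the list-comprehension version, just with the per-example work pushed into numpy.</p>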
<p>Finally, we'll add a couple of functions that handle making predictions for us given <code>w</code> and <code>x</code>. Note, I'm not
including a variable <code>b</code> in either of these. That's because we can collapse <code>b</code> into <code>w</code> by appending
a constant feature with value 1 to <code>x</code> and a matching extra entry to <code>w</code>.</p>
<pre><code class="language-python">def predictor(w, x):
    return x @ w

def predict(w, X):
    preds = X @ w
    z = (preds > 0).astype(int)
    z[z == 0] = -1
    return z</code></pre>
<p>The function <code>predictor</code> takes in a single training value <code>x</code> and the weight vector <code>w</code> and returns an
unnormalized prediction <code>y = x @ w</code> which is fed to our <code>hinge_loss</code> function. On the other hand, <code>predict</code>
takes in an entire batch of training inputs <code>X</code> and returns an output vector of 1's and -1's. We'll use this to calculate the
accuracy of our model during both training and the final runtime.</p>
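<p>The trick of folding <code>b</code> into the weight vector is easy to check numerically. This is a small dense sketch with a made-up helper, <code>add_bias_column</code>; for the sparse matrix that <code>CountVectorizer</code> returns, <code>scipy.sparse.hstack</code> would be the analogous tool.</p>

```python
import numpy as np

def add_bias_column(X):
    """Append a constant-1 feature so the last weight plays the role of b."""
    ones = np.ones((X.shape[0], 1))
    return np.hstack([X, ones])

X = np.array([[2.0, 3.0], [4.0, 5.0]])
w, b = np.array([0.1, -0.2]), 0.7

X_aug = add_bias_column(X)
w_aug = np.append(w, b)          # [w_1, ..., w_n, b]

# The augmented product reproduces w . x + b for every row.
assert np.allclose(X_aug @ w_aug, X @ w + b)
```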
<p>Finally, we need a simple method that we can use to initialize our weight vector <code>w</code> at the start of training.</p>
<pre><code class="language-python">def init_w(x):
    return np.random.randn(x.shape[1])</code></pre>
<h3>Coding the Subgradient Method</h3>
<p>Finally, we arrive at our pinnacle moment, that of coding up the subgradient method. The code for this method is a bit involved, but it
contains some design choices that are interesting to consider. Let's take a look at the finished product first, then walk through it step
by step.</p>
<pre><code class="language-python">import sys

def subgrad_descent(targets, inputs, w, eta=0.5, eps=.001):
    cur_min = sys.maxsize  # best full-dataset loss seen so far
    curr_iter, curr_epoch = 0, 0
    while True:
        curr_epoch += 1
        # Shuffle the dataset at the start of each epoch.
        idxs = np.arange(targets.shape[0])
        np.random.shuffle(idxs)
        targets = targets[idxs]
        inputs = inputs[idxs]
        for i, (t, x) in enumerate(zip(targets, inputs)):
            curr_iter += 1
            # Every 100 iterations, check training accuracy and stop at 95%.
            if curr_iter % 100 == 0:
                preds = predict(w, inputs)
                curr_acc = np.mean(preds == targets)
                converged = curr_acc > .95
                if converged:
                    return w, inputs, targets
                print(f"Current epoch: {curr_epoch}")
                print(f"Running iter: {curr_iter}")
                print(f"Current loss: {cur_min}")
                print(f"Current acc: {curr_acc}\n")
            y = predictor(w, x)[0]
            subgrad = hinge_subgrad(t, y, x)
            # Only accept the step if it lowers the loss over the whole dataset.
            w_test = np.squeeze(w - eta * subgrad)
            obj_val = loss(w_test, inputs, targets)
            if obj_val < cur_min:
                cur_min = obj_val
                w = w_test</code></pre>
<p>Now, to start, I'll say that my method is subtly different from the subgradient method I detailed in the mathematical walkthrough. That's
because in that version, every step is taken, all objective function values $f(w^k)$ are kept track of from start to end, and the min is taken at the end of the
iteration. I take a slightly different approach. I start with an iterate $w^k$ and I only step to $w^{k+1}$ if it decreases the current
minimum loss seen across the dataset. This seems to work well, allowing me to achieve greater than 95% accuracy once the model finishes training,
but I encourage you to try out both approaches and see which works best. You'll see that I start with a current objective function value of
<code>sys.maxsize</code>. This is a very large integer (the largest size a Python container can have on your platform), so in practice any loss we compute will be less
than this value. Next, I iterate <code>while True</code> and only break from the iteration once I achieve some desired accuracy on the training
set (here I have this set to 95%). Every 100 iterations, I evaluate the current model on the training set and see if it achieves this threshold.
Note, due to time constraints, I'm being a bit careless and not creating validation sets, etc. That's because the purpose of this tutorial is to
demonstrate the viability of the subgradient method, but not to serve as an example of training procedure best practices. I encourage you to clean
things up in your own code. It's a great learning exercise.</p>
<p>Now, the method shuffles the dataset at the start of each epoch. You can see this in the lines</p>
<pre><code class="language-python">curr_epoch += 1
idxs = np.arange(targets.shape[0])
np.random.shuffle(idxs)
targets = targets[idxs]
inputs = inputs[idxs]</code></pre>
<p>This is a simple shuffling method based on randomizing the indices into the dataset. While sklearn has built in shuffling methods, I try to rely
on it as little as possible because, quite frankly, what's the fun in having some external library handle everything? The Flamethrower AI ethos is
built on learning by building everything from scratch, so that's what I'm aiming to do here.</p>
<p>Next, I iterate over each example in the training set. In effect, I'm performing <i>stochastic</i> subgradient descent.
At each step, I get the unnormalized prediction <code>y = predictor(w, x)[0]</code> using the current weight vector <code>w</code> and calculate
the corresponding <code>subgrad</code>. I then check what the updated weight vector as calculated by the subgradient method would be (given here
as <code>w_test = np.squeeze(w - eta * subgrad)</code>) and evaluate the loss that that updated vector achieves across the <i>entire</i> dataset.
If that <code>obj_val</code> is less than the <code>cur_min</code> then I set it to be the new <code>w</code>. Otherwise, I skip the update. Either
way, that completes the iteration, and the next round of updates is subsequently computed in the same manner, ad infinitum, until some desired accuracy
is reached.</p>
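<p>For comparison, here is a rough sketch of the other approach: always take the step, and simply remember the best weights seen. Everything in it is an assumption made for illustration, including the function names, the $\eta_0 / \sqrt{k}$ step size schedule, checking the full loss once per epoch rather than after every step, and the synthetic dense data that stands in for the SMS dataset.</p>

```python
import numpy as np

def mean_hinge(w, X, y):
    return float(np.mean(np.maximum(0.0, 1.0 - y * (X @ w))))

def stochastic_subgrad_best(X, y, n_epochs=20, eta0=0.5, seed=0):
    """Always take the step, but remember the best weights seen so far."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    best_w, best_loss = w.copy(), mean_hinge(w, X, y)
    k = 0
    for _ in range(n_epochs):
        for i in rng.permutation(len(y)):        # reshuffle every epoch
            k += 1
            if y[i] * (X[i] @ w) < 1:            # hinge active: subgradient is -t * x
                w = w + (eta0 / np.sqrt(k)) * y[i] * X[i]
            # otherwise the subgradient is 0 and w is unchanged
        epoch_loss = mean_hinge(w, X, y)         # full-data loss once per epoch
        if epoch_loss < best_loss:
            best_w, best_loss = w.copy(), epoch_loss
    return best_w, best_loss

# Synthetic, linearly separable toy data (not the SMS dataset).
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)
w, final_loss = stochastic_subgrad_best(X, y)
acc = np.mean(np.where(X @ w > 0, 1, -1) == y)
print(f"loss: {final_loss:.3f}, accuracy: {acc:.3f}")
```

<p>Evaluating the full loss once per epoch instead of once per example makes each pass much cheaper, at the cost of a coarser notion of "best".</p>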
<h1>Conclusion</h1>
<p>That about sums up the subgradient method. I was able to reach >95% accuracy with my hand-coded method, and I'm sure it's even possible to do better
than that with some simple tweaks to the data normalization process, refining the training updates, fiddling with learning rates, etc. However, I
think what I have here provides a solid demonstration of the workability of this method. Note that training <i>will be slow.</i> This is for a couple of
reasons. First, the rate of convergence of the subgradient method is on the slower side. Second, we're implementing this in pure Python, so we're
certainly not going to be breaking any speed records here. Don't get discouraged if you compare this method against the built-in sklearn SVM. The
sklearn code is actually just a thin wrapper over <code>libsvm</code> which is implemented entirely in heavily optimized C++, so naturally it's going to
be a lot faster. That said, this method converges relatively quickly on this dataset. I was getting good results in 5-10 minutes, certainly a lot less
time than you'd have to wait to train your favorite deep learning classifier on a really large dataset (not this one). Anyway, I hope this tutorial
introduced you to some of the fun of the subgradient method and advanced optimization techniques. If you want to get access to the repository with the
full solutions code as well as access to more advanced tutorials like this one, then I encourage you to sign up for a full Flamethrower AI membership.
It's only $20/month, and you get access to all our current and future courses for that one, low, monthly price. Cheers!</p>
Fri, 24 Apr 2020 00:00:00 +0000
http://mcneela.github.io/machine_learning/2020/04/24/Subgradient-Descent.html
machine_learning
<h1>Build Your Own Deep Learning Library, From Scratch</h1>
Have you ever wondered how deep learning libraries like PyTorch and Tensorflow actually work? If you're like me,
this question has probably been tugging at you for a while, and there isn't really any material online that teaches
you about these libraries' internals, outside of some obscure research papers and short of diving directly into the code.
Late last year, I had finally had enough of wondering, and I embarked on a quest to dispel the mystery of these deep learning
frameworks once and for all. I learned how they actually work, and I distilled all of the knowledge I gained into my new
course at <a href="http://flamethrower.ai">http://flamethrower.ai</a>. In this course, you'll learn about advanced deep learning
concepts like automatic differentiation, hardware level optimizations, regularization techniques, Maximum a Posteriori, and so much
more. And you'll do all of this by building your very own deep learning library, COMPLETELY FROM SCRATCH! It's the first course of
its kind, and one that will prepare you exceedingly well for a career as a data scientist or machine learning engineer. It's truly
a one of a kind course and learning experience.
</br></br>
What's more, I plan to continue researching advanced deep learning tools and topics that haven't been covered in other online courses
and create courses based on their implementation. If you sign up for the site now, you'll gain access to all these new courses as I
develop them, all for one low, monthly price. It's an unbeatable offer that you won't find anywhere else.
</br></br>
As a thank you for reading my blog, I'm going to give 25% off a Flamethrower AI membership in perpetuity to the first 30 people who email
me at <a href="mailto:hello@flamethrower.ai">hello@flamethrower.ai</a> with the subject line "Flamethrower AI Blog Discount".
I truly hope you're as excited for this course as I am, and I look forward to teaching you all about the advanced deep learning concepts
you never knew about.
</br></br>
All the best,
</br>
Daniel
Sat, 11 Apr 2020 00:00:00 +0000
http://mcneela.github.io/machine_learning/2020/04/11/Build-Your-Own-Deep-Learning-Library.html
machine_learning
<h1 id="writing-your-own-optimizers-in-pytorch">Writing Your Own Optimizers in PyTorch</h1>
<p>This article will teach you how to write your own optimizers in PyTorch - you know the kind, the ones where you can write something like</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>optimizer = MySOTAOptimizer(my_model.parameters(), lr=0.001)
for epoch in epochs:
    for batch in epoch:
        optimizer.zero_grad()  # clear gradients left over from the previous step
        outputs = my_model(batch)
        loss = loss_fn(outputs, true_values)
        loss.backward()
        optimizer.step()
</code></pre></div></div>
<p>The great thing about PyTorch is that it comes packaged with a great standard library of optimizers that will cover all of your garden variety machine learning needs.
However, sometimes you’ll find that you need something just a little more specialized. Maybe you wrote your own optimization algorithm that works particularly well
for the type of problem you’re working on, or maybe you’re looking to implement an optimizer from a recently published research paper that hasn’t yet made its way
into the PyTorch standard library. No matter. Whatever your particular use case may be, PyTorch allows you to write optimizers quickly and easily, provided you know
just a little bit about its internals. Let’s dive in.</p>
<h2 id="subclassing-the-pytorch-optimizer-class">Subclassing the PyTorch Optimizer Class</h2>
<p>All optimizers in PyTorch need to inherit from <code class="language-plaintext highlighter-rouge">torch.optim.Optimizer</code>. This is a base class which handles all general optimization machinery. Within this class,
there are two primary methods that you’ll need to override: <code class="language-plaintext highlighter-rouge">__init__</code> and <code class="language-plaintext highlighter-rouge">step</code>. Let’s see how it’s done.</p>
<h3 id="the-init-method">The <code class="language-plaintext highlighter-rouge">__init__</code> Method</h3>
<p>The <code class="language-plaintext highlighter-rouge">__init__</code> method is where you’ll set all configuration settings for your
optimizers. Your <code class="language-plaintext highlighter-rouge">__init__</code> method must take a <code class="language-plaintext highlighter-rouge">params</code> argument which specifies
an iterable of parameters that will be optimized. This iterable must have a
deterministic ordering - the user of your optimizer shouldn’t pass in something
like a dictionary or a set. Usually a list of <code class="language-plaintext highlighter-rouge">torch.Tensor</code> objects is given.</p>
<p>Other typical parameters you’ll specify in the <code class="language-plaintext highlighter-rouge">__init__</code> method include
<code class="language-plaintext highlighter-rouge">lr</code>, the learning rate, <code class="language-plaintext highlighter-rouge">weight_decay</code>, <code class="language-plaintext highlighter-rouge">betas</code> for Adam-based optimizers,
etc.</p>
<p>The <code class="language-plaintext highlighter-rouge">__init__</code> method should also perform some basic checks on passed in
parameters. For example, an exception should be raised if the provided learning
rate is negative.</p>
<p>In addition to <code class="language-plaintext highlighter-rouge">params</code>, the <code class="language-plaintext highlighter-rouge">Optimizer</code> base class requires a parameter called
<code class="language-plaintext highlighter-rouge">defaults</code> on initialization. This should be a dictionary mapping parameter
names to their default values. It can be constructed from the kwarg parameters
collected in your optimizer class’ <code class="language-plaintext highlighter-rouge">__init__</code> method. This will be important in
what follows.</p>
<p>The last step in the <code class="language-plaintext highlighter-rouge">__init__</code> method is a call to the <code class="language-plaintext highlighter-rouge">Optimizer</code> base class.
This is performed by calling <code class="language-plaintext highlighter-rouge">super()</code> using the following general signature.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>super(YourOptimizerName, self).__init__(params, defaults)
</code></pre></div></div>
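<p>Putting those requirements together, a bare-bones optimizer might look like the sketch below. <code>PlainSGD</code> is a hypothetical name of my own, and the update rule is deliberately the simplest one possible (plain gradient descent), with a minimal <code>step</code> included only so the example runs end to end.</p>

```python
import torch
from torch.optim import Optimizer

class PlainSGD(Optimizer):
    """Hypothetical minimal optimizer: p <- p - lr * grad."""

    def __init__(self, params, lr=1e-3):
        if lr < 0.0:
            raise ValueError(f"Invalid learning rate: {lr}")
        defaults = dict(lr=lr)                 # per-group default settings
        super().__init__(params, defaults)     # registers param_groups

    @torch.no_grad()
    def step(self, closure=None):
        loss = None
        if closure is not None:
            with torch.enable_grad():
                loss = closure()
        for group in self.param_groups:
            for p in group["params"]:
                if p.grad is None:
                    continue
                p.add_(p.grad, alpha=-group["lr"])
        return loss

w = torch.tensor([1.0, -2.0], requires_grad=True)
opt = PlainSGD([w], lr=0.1)
loss = (w ** 2).sum()
loss.backward()          # grad = 2 * w = [2, -4]
opt.step()               # w <- w - 0.1 * grad
print(w)
```

<p>The <code>defaults</code> dictionary is what lets each parameter group fall back to the <code>lr</code> passed at construction time.</p>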
<h2 id="implementing-a-novel-optimizer-from-scratch">Implementing a Novel Optimizer from Scratch</h2>
<p>Let’s investigate and reinforce the above methodology using an example taken
from the HuggingFace <code class="language-plaintext highlighter-rouge">pytorch-transformers</code> NLP library. They implement a PyTorch
version of a weight decay Adam optimizer from the BERT paper. First we’ll take a
look at the class definition and <code class="language-plaintext highlighter-rouge">__init__</code> method. Here are both combined.</p>
<p><img src="/images/adamw-init.png" style="height: 75%; width: 75%" /></p>
<p>You can see that the <code class="language-plaintext highlighter-rouge">__init__</code> method accomplishes all the basic requirements
listed above. It implements basic checks on the validity of all provided <code class="language-plaintext highlighter-rouge">kwargs</code>
and raises exceptions if they are not met. It also constructs a dictionary of
defaults from these required parameters. Finally, the <code class="language-plaintext highlighter-rouge">super()</code> method is called
to initialize the <code class="language-plaintext highlighter-rouge">Optimizer</code> base class using the provided <code class="language-plaintext highlighter-rouge">params</code> and <code class="language-plaintext highlighter-rouge">defaults</code>.</p>
<h3 id="the-step-method">The step() Method</h3>
<p>The real magic happens in the <code class="language-plaintext highlighter-rouge">step()</code> method. This is where the optimizer’s logic
is implemented and enacted on the provided parameters. Let’s take a look at how
this happens.</p>
<p>The first thing to note in <code class="language-plaintext highlighter-rouge">step(self, closure=None)</code> is the presence of the
<code class="language-plaintext highlighter-rouge">closure</code> keyword argument. If you consult the PyTorch documentation, you’ll
see that <code class="language-plaintext highlighter-rouge">closure</code> is an optional callable that allows you to reevaluate the
loss at multiple time steps. This is unnecessary for most optimizers, but is
used in a few such as Conjugate Gradient and LBFGS. According to the docs,
“the closure should clear the gradients, compute the loss, and return it”.
We’ll leave it at that, since a closure is unnecessary for the <code class="language-plaintext highlighter-rouge">AdamW</code> optimizer.</p>
<p>The next thing you’ll notice about the <code class="language-plaintext highlighter-rouge">AdamW</code> step function is that it iterates
over something called <code class="language-plaintext highlighter-rouge">param_groups</code>. The optimizer’s <code class="language-plaintext highlighter-rouge">param_groups</code> is a list
of dictionaries which gives a simple way of breaking a model’s parameters into
separate components for optimization. It allows the trainer of the model to
segment the model parameters into separate units which can then be optimized
at different times and with different settings. One use for multiple <code class="language-plaintext highlighter-rouge">param_groups</code>
would be in training separate layers of a network using, for example, different
learning rates. Another prominent use case arises in transfer learning. When
fine-tuning a pretrained network, you may want to gradually unfreeze layers
and add them to the optimization process as finetuning progresses. For this,
<code class="language-plaintext highlighter-rouge">param_groups</code> are vital. Here’s an example given in the PyTorch documentation
in which <code class="language-plaintext highlighter-rouge">param_groups</code> are specified for SGD in order to separately tune the
different layers of a classifier.</p>
<p><img src="/images/param-groups.png" style="height: 75%; width: 75%" /></p>
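<p>To make the "list of dictionaries" idea concrete, here is a plain-Python mock of how per-group options can fall back to optimizer-wide defaults. It is only a sketch: the helper name <code class="language-plaintext highlighter-rouge">build_param_groups</code> is hypothetical, and the strings stand in for real parameter tensors.</p>

```python
def build_param_groups(params, defaults):
    """Mimic how per-group options fall back to optimizer-wide defaults.

    `params` is a list of dicts, each with a "params" entry plus optional
    overrides. This helper is hypothetical and only illustrates the idea.
    """
    groups = []
    for group in params:
        merged = dict(group)  # copy so the caller's dict is left untouched
        for name, value in defaults.items():
            merged.setdefault(name, value)  # keep an override if one was given
        groups.append(merged)
    return groups


# Strings stand in for real parameter tensors in this mock.
groups = build_param_groups(
    [
        {"params": ["base.weight", "base.bias"]},
        {"params": ["classifier.weight"], "lr": 1e-3},
    ],
    defaults={"lr": 1e-2, "momentum": 0.9},
)

for group in groups:
    print(group)
```

<p>The first group picks up the default learning rate, while the second keeps its own override and inherits only the momentum.</p>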
<p>Now that we’ve covered some things specific to the PyTorch internals, let’s get
to the algorithm. Here’s a link to
<a href="https://arxiv.org/pdf/1711.05101.pdf">the paper</a>
which originally proposed the AdamW algorithm. And here, from the paper, is a
screenshot of the proposed update rules.</p>
<p><img src="/images/adamw-details.png" style="display: block; margin-left: auto; margin-right: auto; width: 75%" /></p>
<p>Let’s go through this line by line with the source code. First, we have the
loop</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>for p in group['params']:
</code></pre></div></div>
<p>Nothing mysterious here. For each of our parameter groups, we’re iterating over
the parameters within that group. Next.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>if p.grad is None:
    continue
grad = p.grad.data
if grad.is_sparse:
    raise RuntimeError('Adam does not support sparse gradients, please consider SparseAdam instead')
</code></pre></div></div>
<p>This is all simple stuff as well. If there is no gradient for the current
parameter, we just skip it. Next, we get the actual plain Tensor object for
the gradient by accessing <code class="language-plaintext highlighter-rouge">p.grad.data</code>. Finally, if the tensor is sparse, we
raise an error because we are not going to consider implementing this for sparse
objects.</p>
<p>Next, we access the current optimizer state with</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>state = self.state[p]
</code></pre></div></div>
<p>In PyTorch optimizers, <code class="language-plaintext highlighter-rouge">state</code> is a dictionary keyed by parameter.
The entry for a given parameter holds whatever bookkeeping the optimizer carries between steps for that parameter.</p>
<p>If this is the first time we’ve accessed the state of a given parameter, then we
set the following defaults</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>if len(state) == 0:
    state['step'] = 0
    # Exponential moving average of gradient values
    state['exp_avg'] = torch.zeros_like(p.data)
    # Exponential moving average of squared gradient values
    state['exp_avg_sq'] = torch.zeros_like(p.data)
</code></pre></div></div>
<p>We obviously start with step 0, along with zeroed out exponential average and
exponential squared average parameters, both the shape of the gradient tensor.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>exp_avg, exp_avg_sq = state['exp_avg'], state['exp_avg_sq']
beta1, beta2 = group['betas']
state['step'] += 1
</code></pre></div></div>
<p>Next, we gather the parameters from the state dict that will be used in the
computation of the update. We also increment the current step.</p>
<p>Now, we begin the actual updates. Here’s the code.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Decay the first and second moment running average coefficient
# In-place operations to update the averages at the same time
exp_avg.mul_(beta1).add_(1.0 - beta1, grad)
exp_avg_sq.mul_(beta2).addcmul_(1.0 - beta2, grad, grad)
denom = exp_avg_sq.sqrt().add_(group['eps'])
step_size = group['lr']
if group['correct_bias']:  # No bias correction for Bert
    bias_correction1 = 1.0 - beta1 ** state['step']
    bias_correction2 = 1.0 - beta2 ** state['step']
    step_size = step_size * math.sqrt(bias_correction2) / bias_correction1
p.data.addcdiv_(-step_size, exp_avg, denom)
# Just adding the square of the weights to the loss function is *not*
# the correct way of using L2 regularization/weight decay with Adam,
# since that will interact with the m and v parameters in strange ways.
#
# Instead we want to decay the weights in a manner that doesn't interact
# with the m/v parameters. This is equivalent to adding the square
# of the weights to the loss with plain (non-momentum) SGD.
# Add weight decay at the end (fixed version)
if group['weight_decay'] > 0.0:
    p.data.add_(-group['lr'] * group['weight_decay'], p.data)
</code></pre></div></div>
<p>The above code corresponds to equations 6-12 in the algorithm implementation from
the paper. Following along with the math should be easy enough. What I’d like to
take a closer look at is the built in Tensor methods that allow us to do the
in-place computations.</p>
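<p>If the tensor calls obscure the arithmetic, it can help to see the same update written for a single scalar parameter. The following is a plain-Python sketch that mirrors the code above step by step. The function name <code class="language-plaintext highlighter-rouge">adamw_step</code> and the use of floats instead of tensors are assumptions made for illustration, not part of the Hugging Face implementation.</p>

```python
import math


def adamw_step(p, grad, state, lr=1e-3, betas=(0.9, 0.999), eps=1e-6,
               weight_decay=0.0, correct_bias=True):
    """One AdamW update for a single scalar parameter (plain floats).

    Hypothetical helper mirroring the tensor code above, for illustration only.
    """
    beta1, beta2 = betas
    state["step"] += 1
    # Exponential moving averages of the gradient and the squared gradient
    state["exp_avg"] = beta1 * state["exp_avg"] + (1.0 - beta1) * grad
    state["exp_avg_sq"] = beta2 * state["exp_avg_sq"] + (1.0 - beta2) * grad * grad
    denom = math.sqrt(state["exp_avg_sq"]) + eps
    step_size = lr
    if correct_bias:
        bias_correction1 = 1.0 - beta1 ** state["step"]
        bias_correction2 = 1.0 - beta2 ** state["step"]
        step_size = step_size * math.sqrt(bias_correction2) / bias_correction1
    p = p - step_size * state["exp_avg"] / denom
    # Decoupled weight decay, applied after the Adam update
    if weight_decay > 0.0:
        p = p - lr * weight_decay * p
    return p


# Toy usage: a few steps on f(p) = p**2, whose gradient is 2 * p.
state = {"step": 0, "exp_avg": 0.0, "exp_avg_sq": 0.0}
p = 5.0
for _ in range(10):
    p = adamw_step(p, 2.0 * p, state, lr=0.01)
print(p)
```

<p>Note that the weight decay touches only the parameter itself and never enters the moving averages, which is the point made in the code comments above.</p>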
<p>A nice, relatively hidden feature of PyTorch which you might not be aware of is
that you can access any of the standard PyTorch functions, e.g. <code class="language-plaintext highlighter-rouge">torch.add()</code>,
<code class="language-plaintext highlighter-rouge">torch.mul()</code>, etc. as in-place operations on the Tensors directly by appending
an <code class="language-plaintext highlighter-rouge">_</code> to the method name. Thus, taking a closer look at the first update, we
find we can quickly compute it as</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>exp_avg.mul_(beta1).add_(1.0 - beta1, grad)
</code></pre></div></div>
<p>rather than</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>exp_avg = torch.add(torch.mul(exp_avg, beta1), grad, alpha=1.0 - beta1)
</code></pre></div></div>
<p>Of course, there are a few special operations used here with which you may not
be familiar, for example, <code class="language-plaintext highlighter-rouge">Tensor.addcmul_</code> and <code class="language-plaintext highlighter-rouge">Tensor.addcdiv_</code>. These take the
input and add to it a scaled elementwise product or quotient, respectively, of the two
latter inputs. If you need a more in-depth rundown of the various operations
available to be performed on <code class="language-plaintext highlighter-rouge">Tensor</code> objects, I highly recommend checking out
<a href="https://jhui.github.io/2018/02/09/PyTorch-Basic-operations/">this post</a>.</p>
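<p>You’ll also see that the learning rate is accessed again in the last line, where
it scales the weight decay applied to the parameter. Once every parameter has been
updated, the loss computed by the closure (if one was supplied) is returned.</p>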
<p>And…that’s it! Constructing your own optimizers is as simple as that. Of course,
you need to devise your own optimization algorithm first, which can be a little
bit trickier ;). I’ll leave that one to you.</p>
<p>Special thanks to the authors of Hugging Face for implementing the <code class="language-plaintext highlighter-rouge">AdamW</code>
optimizer in PyTorch.</p>
Tue, 03 Sep 2019 00:00:00 +0000
http://mcneela.github.io/machine_learning/2019/09/03/Writing-Your-Own-Optimizers-In-Pytorch.html
http://mcneela.github.io/machine_learning/2019/09/03/Writing-Your-Own-Optimizers-In-Pytorch.htmlmachine_learningMachine Learning for the Movies<h1 align="center">Machine Learning for the Movies</h1>
(Note: This article originally appeared at <a href="https://www.clarifai.com/blog/machine-learning-for-the-movies">https://www.clarifai.com/blog/machine-learning-for-the-movies</a>. If you are looking for a great computer
vision solution for your machine learning product, I highly recommend you
check out Clarifai!)
</br></br>
For many, machine learning is useful only insofar as the insights it generates drive business or cut costs. Within the film industry, top studios have traditionally banked huge budgets on new scripts predicated on little but studio executives’ past experience, intuition, and hopeful conjecture. However, 20th Century Fox recently demonstrated that a paradigm shift within the entertainment industry may be underway. Their team of data scientists and researchers devised a machine learning model called Merlin Video that leverages film trailer data to predict which movies a filmgoer would be most likely to see given their viewing history and other demographic information. The team chose movie trailers as their object of study because they act as the most impactful determinant in a customer’s decision as to whether or not they will go to see a movie in theatres.
</br></br>
<h2>Understanding the Model Architecture</h2>
At the heart of Merlin are convolutional neural networks (CNNs), a type of model which has classically been used to achieve state of the art results on image recognition tasks. Merlin employs these neural networks by applying them to the individual frames of a movie trailer; however, the architecture includes clever processing steps which allow the model to capture certain aspects of the trailer’s timing. The model also relies heavily on a technique called collaborative filtering that’s commonly used when devising recommender systems. The crux of the idea is that a recommendation model should incorporate a wide diversity of data sources. In addition, it relies on the belief that if user A has similar tastes to user B on known data, then that shared similarity in preferences is likely to extend to unknown data.
<div align="middle">
<img src="/images/movie-model-breakdown.png" style="height: 50%; width: 50%"></img>
</div>
The output of the model relies primarily on what are called the movie and user vectors. The idea is that if accurate representations of each can be computed, then a proxy for the affinity a given user has for a given movie can be determined by computing the distance between their respective vectors. This distance is combined with user frequency and recency data and fed into a simple logistic regression classifier which provides the final output prediction giving the probability that user i will watch movie j.
</br></br>
So how are the movie and user vectors created? The user vector is actually pretty simple. It’s just the average of the vectors corresponding to the movies that that particular user attended. As such, the real magic of the model lies in the creation of the movie vector. The movie vector is, in fact, created by the CNN previously alluded to. The global structure of the network is that it defines a number of features designed to capture specific actions relevant to a movie’s content. For example, one feature might seek to determine whether a trailer involves long, scenic shots of nature. This could indicate that the trailer is for a documentary. Another feature might try to detect a fast-paced fist fight indicative of an action movie. A key aspect of the model is that it goes beyond conventional CNNs by capturing the pacing and temporality of film sequences. That means it can tell the difference between quickly flickering frames which might indicate a flashback or a high speed chase and long, drawn out shots of dialogue or other slow-moving moments. Here’s the full diagram of the model which computes the movie vector.
<div align="middle">
<img src="/images/movie-layer-details.png" style="height: 50%; width: 50%"></img>
</div>
</br></br>
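<p>The scoring pipeline described above (average the attended movie vectors, measure the distance to a candidate movie, then feed that into a logistic regression alongside frequency and recency) can be sketched in a few lines of Python. Everything concrete here is an assumption made for illustration: the function names, the choice of Euclidean distance, the toy vectors, and the weights are all made up and are not taken from the Merlin model.</p>

```python
import math


def user_vector(attended_movie_vectors):
    """Average the vectors of the movies a user attended."""
    n = len(attended_movie_vectors)
    dim = len(attended_movie_vectors[0])
    return [sum(v[k] for v in attended_movie_vectors) / n for k in range(dim)]


def distance(u, v):
    """Euclidean distance (an assumed choice of metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def watch_probability(dist, frequency, recency, weights, bias):
    """Logistic regression over the distance plus frequency and recency features."""
    z = bias + weights[0] * dist + weights[1] * frequency + weights[2] * recency
    return 1.0 / (1.0 + math.exp(-z))


# Made-up vectors and weights, purely for illustration.
user = user_vector([[1.0, 0.0], [3.0, 2.0]])
weights, bias = (-1.0, 0.3, 0.2), 0.0
near = watch_probability(distance(user, [2.0, 1.5]), 2.0, 1.0, weights, bias)
far = watch_probability(distance(user, [6.0, 4.0]), 2.0, 1.0, weights, bias)
print(user, near, far)
```

<p>With a negative weight on the distance, a movie whose vector sits closer to the user vector receives a higher predicted probability.</p>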
<h2>Training the Model</h2>
The team at Fox trained the model on YouTube8M, a rich dataset provided by Google and consisting of 6.1 million YouTube videos annotated with any of 3800+ entities. The dataset provides 350,000 hours of video described by 2.6 billion precomputed audio and video features. This provides massive explanatory power for the Merlin model to take advantage of.
<div align="middle">
<img src="/images/model-architecture-flow.png" style="height: 50%; width: 50%"></img>
</div>
</br></br>
If you take a look at the above diagram which lays out the Merlin architecture’s data flow, you’ll see that they also feed the model film metadata, textual synopses, and data about the customer acquired at the ticket box.
</br></br>
<h2>Evaluating the Model</h2>
To assess the accuracy of their model, the team at Fox evaluated its predictions on the trailer of the recently released action flick, Logan. For those unfamiliar with the film, it’s an X-Men spinoff which focuses on the trials and travails of Wolverine as he wages war against the bad guys and saves the girl. Pretty typical Hollywood stuff. Astonishingly, the Merlin model captures the majority of the key ideas presented in the Logan trailer and uses these to accurately predict similar movies for filmgoers to see. Here’s the data that the Fox team got from the model and its comparison with actual customer behavior.
<div align="middle">
<img src="/images/movie-model-output-results.png" style="height: 50%; width: 50%"></img>
</div>
On the left, you can see the Top 20 movies that a user who saw Logan was most likely to watch. On the right, you can see the Top 20 predictions made by the model. Astonishingly, the model got all of the Top 5 actual movies within its Top 20 predictions. As a result, it’s reasonable to believe that the model was able to distill the key characteristics of Logan in order to infer its own predictions. That’s the power of machine learning.
Mon, 26 Aug 2019 00:00:00 +0000
http://mcneela.github.io/machine_learning/2019/08/26/Machine-Learning-for-the-Movies.html
http://mcneela.github.io/machine_learning/2019/08/26/Machine-Learning-for-the-Movies.htmlmachine_learningThe Basics of Homology Theory<h1>The Basics of Homology Theory</h1>
Algebraic topology is a field that seeks to establish
correspondences between algebraic structures and topological
characteristics. It then uses results from algebra to infer
and uncover results about topology. It's a pretty powerful method.
</br></br>
Roughly speaking, AT provides two different frameworks for
characterizing topological spaces. These are <i>homotopy</i>
and <i>homology</i>. In this post, we'll start to take a look
at homology, which differs from homotopy in that it's less
powerful in some senses, but significantly easier to work with
and compute which endows it with a different sort of power.
</br></br>
<h2>Some Definitions to Start</h2>
<b><u>Definition</u></b> Say we're working in $\mathbb{R}^n$. The <i>$p$-simplex</i>, defined for $p \leq n$, is
$$\Delta_p := \left\{x = \sum_{i=0}^p \lambda_i e_i \mid \sum_{i=0}^p \lambda_i\ = 1, \lambda_i \geq 0\right\}$$
$\Delta_p$ is a generalization of the triangle to $p$ dimensions. Now, there are actually two types of homology
which have developed in the annals of mathematics.
The first of these is <i>singular homology</i>, which concerns
itself with the study of topological spaces via the mapping of
simplices into these spaces. The other type is called
<i>Čech homology</i> which handles the study of topology via
the approximation of topological spaces with spaces of a certain
class, namely those which admit a triangulation. Of these two
branches of homology, singular is by far the more prevalent in
the literature and is the one we'll delve into here.
</br></br>
Given a triangular polytope like we've defined in $\Delta_p$,
one operation we might like to consider is using that region
as a way to sort of define regions of interest around not just
basis vectors, but any arbitrary collection of vectors in the space. Given such a set $\{v_0, \ldots, v_p\}$ we can denote
by $[v_0, \ldots, v_p]$ the mapping of $\Delta_p \to \mathbb{R}^n$ defined by
$$\sum_{i=0}^p \lambda_i e_i \mapsto \sum_{i=0}^p \lambda_i v_i$$
What this gives us from an intuitive perspective is the simplex
expanded or shrunken to cover the convex hull of the $v_i$. One nice
property of this map is that its image is convex. We call the
resulting simplex the <i>affine p-simplex</i>, and we sometimes
refer to the $\lambda_i$ in this context as <i>barycentric coordinates</i>.
</br></br>
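<p>As a small illustration of the map $[v_0, \ldots, v_p]$, the following Python sketch sends barycentric coordinates $(\lambda_0, \ldots, \lambda_p)$ to the point $\sum_i \lambda_i v_i$. The function name <code>affine_simplex</code> and the example triangle are hypothetical choices made for illustration.</p>

```python
def affine_simplex(vertices):
    """Return the map [v_0, ..., v_p] acting on barycentric coordinates."""
    def sigma(lambdas):
        # Barycentric coordinates must be non-negative and sum to one.
        assert len(lambdas) == len(vertices)
        assert all(l >= 0 for l in lambdas)
        assert abs(sum(lambdas) - 1.0) < 1e-12
        dim = len(vertices[0])
        return tuple(
            sum(l * v[k] for l, v in zip(lambdas, vertices)) for k in range(dim)
        )
    return sigma


# A triangle in the plane with vertices v_0, v_1, v_2.
triangle = affine_simplex([(0.0, 0.0), (2.0, 0.0), (0.0, 2.0)])
vertex = triangle((1.0, 0.0, 0.0))    # lands on v_0
midpoint = triangle((0.0, 0.5, 0.5))  # midpoint of the edge from v_1 to v_2
print(vertex, midpoint)
```

<p>Every valid choice of coordinates lands inside the triangle spanned by the three vertices, which is the convexity property mentioned above.</p>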
In addition to mapping between simplices and sets of vectors,
we'd like to define a way to map between a $p$-simplex and a
($p + 1$)-simplex and vice versa. The mapping from
$p$ to $p + 1$ is called the $i$th face map and is notated
as
$$F_i^{p+1} : \Delta_p \to \Delta_{p + 1}$$
It is formed by deleting the $i$th vertex in dimension
$p + 1$. To notate this, we can write $[e_0, \ldots, \hat{e_i}
, \ldots, e_{p+1}]$ where the hat indicates that $e_i$ is
omitted. This face map is so named because it embeds the
$p$-simplex in the $(p+1)$-simplex as the face opposite $e_i$, the
vertex that's being deleted.
</br></br>
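<p>In barycentric coordinates the face map has a very simple description: it inserts a zero in position $i$, so the image never touches the vertex $e_i$. Here is a minimal Python sketch of that idea (the function name <code>face_map</code> is a hypothetical choice).</p>

```python
def face_map(i, lambdas):
    """Apply the i-th face map to a point given in barycentric coordinates.

    A point of the p-simplex has p + 1 coordinates; its image in the
    (p + 1)-simplex has p + 2 coordinates, with a zero at index i.
    """
    lambdas = tuple(lambdas)
    return lambdas[:i] + (0.0,) + lambdas[i:]


# A point on the 1-simplex, embedded as the face of the 2-simplex opposite e_1.
point = face_map(1, (0.25, 0.75))
print(point)
```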
<b><u>Definition</u></b> For a topological space $X$, a <i>
singular $p$-simplex</i> of $X$ is simply a continuous function
$\sigma_p : \Delta_p \to X$.
</br></br>
<div align="middle">
<img src="/images/singular2simplex.png"></img>
<p>The singular 2-simplex, mapping from the standard 2-simplex to $X$.</p>
</div>
Basically, our goal in defining the singular $p$-simplices is
to provide a sort of "basis" for the triangulation of an
arbitrary topological space. In this way, the singular $p$-simplex gives us a way to cover some patch of $X$ with
a triangle-like region. Accordingly, we can define a group
which in some sense acts like a vector space of these triangular
basis regions. By allowing the linear combination of maps on
these elements, we can give entire triangulations of the space
$X$.
</br></br>
<b><u>Definition</u></b> The <i>singular p-chain group
$\Delta_p(X)$</i> is the free abelian group that's generated by the singular $p$-simplices.
</br></br>
In more concrete terms, the elements of $\Delta_p(X)$ are called
$p$-chains and are simply linear combinations
$$c = \sum_\sigma n_\sigma \sigma$$ of $p$-simplices with
coefficients $n_\sigma$ coming from some ring (usually the
integers).
</br></br>
We can recover how the singular $p$-simplex maps the faces of
$\Delta_p$ to $X$ by simply composing the map $\sigma$ with the
face map $F_i^p$. This is called the <i>ith face of $\sigma$</i>
and is written
$$\sigma^{(i)} = \sigma \circ F_i^p$$
For a given singular $p$-simplex $\sigma$, we can define a $(p-1)$-chain that gives the <i>boundary</i> of $\sigma$ as
$$\partial_p \sigma = \sum_{i=0}^p (-1)^i \sigma^{(i)}$$
The boundary operator extends to chains in the natural way, by distributing over addition, $\partial_p c = \partial_p (\sum_\sigma n_\sigma \sigma) = \sum_\sigma n_\sigma \partial_p
\sigma$. This law, in fact, makes $\partial_p$ into a
homomorphism of groups
$$\partial_p : \Delta_p(X) \to \Delta_{p-1}(X)$$
Now, we introduce a key fact about the boundary operator that
will allow us to define homology groups. You should be well
aware of the saying "the enemy of my enemy is my friend".
Well, in algebraic topology boundaries are our enemies and
we have a similar statement, "the boundary of my boundary is
a loser (zero)". In other words, for any $\sigma$,
$$\partial_p(\partial_{p + 1} \sigma) = 0$$
I'll skip the proof for this because it's just a really nasty
rearrangement of a complicated sum, but if you're truly
interested you can find it in pretty much any algebraic
topology textbook.
</br></br>
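<p>While the general proof is messy, the identity is easy to check by machine in a simplified setting. The sketch below makes an assumption that differs from the singular setup above: it represents each simplex combinatorially as an ordered tuple of vertex labels, and a chain as a dictionary from simplices to integer coefficients. The function name <code>boundary</code> is a hypothetical choice.</p>

```python
from collections import defaultdict


def boundary(chain):
    """Boundary of a chain given as {simplex (tuple of vertices): coefficient}."""
    result = defaultdict(int)
    for simplex, coeff in chain.items():
        if len(simplex) <= 1:
            continue  # the boundary of a 0-simplex is zero
        for i in range(len(simplex)):
            face = simplex[:i] + simplex[i + 1:]  # delete the i-th vertex
            result[face] += (-1) ** i * coeff
    # Drop faces whose coefficients cancelled
    return {s: c for s, c in result.items() if c != 0}


triangle = {(0, 1, 2): 1}
edges = boundary(triangle)
print(edges)
print(boundary(edges))
```

<p>The alternating signs are exactly what makes each vertex appear once with each sign in the second boundary, so everything cancels.</p>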
We call $im(\partial_{p+1}) = B_{p}(X)$ the
<i>boundary group</i> and $ker(\partial_p) = Z_p(X)$ the
<i>cycle group</i>.
Note that the above identity implies that $im(\partial_{p + 1})$ is a subgroup of $ker(\partial_p)$ which is equivalent
to saying $B_p(X) \leq Z_p(X)$.
</br></br>
<h2>The Homology Group</h2>
We define the $p$th <i>homology group</i> as the quotient
$$H_p(X) := Z_p(X) / B_p(X)$$
In other words, this is the group of cycles modded out by
the group of boundaries. To give a rough intuition, the
rank of $H_p(X)$ tells us the number of $p$-dimensional
"holes" contained in $X$. You can think about it as follows.
If each cycle in $X$ is equivalent to some boundary, then
those boundaries have no "interior" excisions. However, if a
hole does exist, then there will be some discrepancy between
boundaries and cycles, and the number of such discrepancies
(holes) will be given by the rank of $H_p(X)$.
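<p>To see the hole-counting intuition in numbers, here is a rough Python sketch that computes ranks of homology groups for a triangle. It rests on assumptions not made in the text: it works with a finite triangulation (simplicial rather than singular chains) and with rational coefficients, so it only recovers ranks. The helper names <code>rank</code> and <code>betti</code> are hypothetical.</p>

```python
from fractions import Fraction


def rank(matrix):
    """Rank of an integer matrix via Gaussian elimination over the rationals."""
    rows = [[Fraction(x) for x in row] for row in matrix]
    n_cols = len(rows[0]) if rows else 0
    r = 0
    for col in range(n_cols):
        pivot = next((i for i in range(r, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(len(rows)):
            if i != r and rows[i][col] != 0:
                factor = rows[i][col] / rows[r][col]
                rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        r += 1
    return r


def betti(dim_chain_group, rank_out, rank_in):
    """rank H_p = (dim C_p - rank of d_p) - rank of d_{p+1}."""
    return dim_chain_group - rank_out - rank_in


# Triangle on vertices 0, 1, 2 with edges ordered (0,1), (0,2), (1,2).
# Column j of d1 is the boundary of edge j: (a, b) goes to b - a.
d1 = [[-1, -1, 0],
      [1, 0, -1],
      [0, 1, 1]]
# Boundary of the 2-simplex (0,1,2) is (1,2) - (0,2) + (0,1).
d2_filled = [[1], [-1], [1]]

hollow_b0 = betti(3, 0, rank(d1))
hollow_b1 = betti(3, rank(d1), 0)
filled_b1 = betti(3, rank(d1), rank(d2_filled))
print(hollow_b0, hollow_b1, filled_b1)
```

<p>The hollow triangle has one cycle that is not a boundary, while filling in the 2-simplex turns that cycle into a boundary and the hole disappears.</p>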
Thu, 13 Jun 2019 00:00:00 +0000
http://mcneela.github.io/mathematics/2019/06/13/The-Basics-Of-Homology-Theory.html
http://mcneela.github.io/mathematics/2019/06/13/The-Basics-Of-Homology-Theory.htmlmathematicsThe Problem with Policy Gradient<h1 align="middle">The Problem(s) with Policy Gradient</h1>
If you've read my <a href="http://mcneela.github.io/math/2018/04/18/A-Tutorial-on-the-REINFORCE-Algorithm.html">article</a>
about the REINFORCE algorithm, you should be familiar with the update that's typically used in policy gradient methods.
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta(\tau)}
\left[ \left(\sum_{t} \nabla_\theta \log{\pi_\theta}(a_t \mid s_t)\right) \left(\sum_t r(s_t, a_t)\right)\right]$$
It's an extremely elegant and theoretically satisfying model that suffers from only one problem - it doesn't work well in practice.
Shocking, I know! Jokes abound about the flimsiness that occurs when policy gradient methods are applied to practical problems.
One such joke goes like this: if you'd like to reproduce the results of any sort of RL policy gradient method as reported in academic
papers, make sure you contact the authors and get the settings they used for their random seed. Indeed, sometimes policy gradient can
feel like nothing more than random search dressed up in mathematical formalism. The reasons for this are at least threefold
(I won't rule out the possibility that there are more problems with this method of which I'm not yet aware), namely that
</br></br>
<ol>
<li>Policy gradient is <b>high variance</b>.</li>
<li>Convergence in policy gradient algorithms is <b>sloooow</b>.</li>
<li>Policy gradient is terribly <b>sample inefficient</b>.</li>
</ol>
I'll walk through each of these in reverse because flouting the natural order of things is fun. :)
</br></br>
<h3>Sample Inefficiency</h3>
In order to get anything useful out of policy gradient, it's necessary to sample from your policy and observe the resultant reward
literally <i>millions of times</i>. Because we're sampling directly from the policy we're optimizing, we say that policy gradient
is an <i>on-policy</i> algorithm.
If you take a look at the formula for the gradient update, we're calculating an expectation and
we're doing that in the Monte Carlo way, by averaging over a number of trial runs. Within that, we have to sum over all the steps in
a single trajectory which itself could be frustratingly expensive to run depending on the nature of the environment you're working
with. So we're iterating sums over sums, and the result is that we incur hugely expensive computational costs in order to acquire
anything useful. This works fine in the realms where policy gradient has been successfully applied. If all you're interested
in is training your computer to play Atari games, then policy gradient might not be a terrible choice. However, imagine using this
process in anything remotely resembling a real-world task, like training a robotic arm to perform open-heart surgery, perhaps?
Hello, medical malpractice lawsuits. However, sample inefficiency is not a problem that's unique to policy gradient methods by any
means. It's an issue that plagues many different RL algorithms, and addressing this is key to generating a model that's useful
in the real world. If you're interested in sample efficient RL algorithms, check out
<a href="https://www.microsoft.com/en-us/research/blog/provably-efficient-reinforcement-learning-with-rich-observations/?ocid=msr_blog_provably_icml_hero">the work</a> that's being
done at Microsoft Research.
</br></br>
<h3>Slow Convergence</h3>
This issue pretty much goes hand in hand with the sample inefficiency discussed above and the problem of high variance to be
discussed below. Having to sample entire trajectories on-policy before each gradient update is slow to begin with, and the
high variance in the updates makes the search optimization highly inefficient which means more sampling which means more updates,
ad infinitum. We'll discuss some remedies for this in the next section.
</br></br>
<h3>High Variance</h3>
The updates made by the policy gradient are very high variance. To get a sense for why this is, first consider that in RL we're
dealing with highly general problems such as teaching a car to navigate through an unpredictable environment or programming an agent
to perform well across a diverse set of video games. Therefore, when we're sampling multiple trajectories from our untrained policy
we're bound to observe highly variable behaviors. Without any a priori model of the system we're seeking to optimize, we begin with
a policy whose distribution of actions over a given state is effectively uniform. Of course, as we train the model we hope to shape
the probability density so that it's unimodal on a single action, or possibly multimodal over a few successful actions that can be
taken in that state. However, acquiring this knowledge requires our model to observe the outcomes of many different actions taken
in many different states. This is made exponentially worse in continuous action or state spaces as visiting even close to every
state-action pair is computationally intractable. Due to the fact that we're using Monte Carlo estimates in policy gradient, we
trade off between computational feasibility and gradient accuracy. It's a fine line to walk, which is why variance reduction techniques
can potentially yield huge payoffs.
</br></br>
Another way to think about the variance introduced into the policy gradient update is as follows: at each time step in your trajectory
you're observing some stochastic event. Each such event has some noise, and the accumulation of even a small amount of noise across
a number of time steps results in a high variance outcome. Yet, understanding this allows us to suggest some ways to alter policy
gradient so that the variance might ultimately be reduced.
</br></br>
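<p>As a rough illustration of how noise accumulates, suppose (a simplifying assumption for this sketch) that each time step contributes an independent noise term $\epsilon_t$ with variance $\sigma^2$. Then over a trajectory of length $T$ we get</p>
$$Var\left[\sum_{t=1}^T \epsilon_t\right] = \sum_{t=1}^T Var[\epsilon_t] = T\sigma^2$$
<p>so even small per-step noise grows linearly with the length of the trajectory.</p>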
<h1 align="middle">Improvements to Policy Gradient</h1>
<h3>Reward to Go</h3>
The first "tweak" we can use is incredibly simple. Let's take a look again at that policy gradient update.
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta(\tau)}
\left[ \left(\sum_{t} \nabla_\theta \log{\pi_\theta}(a_t \mid s_t)\right) \left(\sum_t r(s_t, a_t)\right)\right]$$
If we break it down into the Monte Carlo estimate, we get
$$\nabla_\theta J(\theta) \approx
\frac{1}{N} \sum_{i=1}^N \left[ \left(\sum_{t=1}^T \nabla_\theta \log{\pi_\theta}(a_t \mid s_t)\right) \left(\sum_{t=1}^T r(s_t, a_t)\right)\right]$$
If we distribute $\sum_{t=1}^T r(s_t, a_t)$ into the left innermost sum involving $\nabla \log \pi_{\theta}$, we see that we're
taking the gradient of $\log \pi_\theta$ at a given time step $t$ and weighting it by the sum of rewards at all timesteps. However,
it would make a lot more sense to simply reweight this gradient by the rewards it affects. In other words, the action taken at time
$t$ can only influence the rewards accrued at time $t$ and beyond. To that end, we replace $\sum_{t=1}^T r(s_t, a_t)$ in the gradient
update with the partial sum $\sum_{t'=t}^T r(s_{t'}, a_{t'})$ and call this quantity $\hat{Q}_{t}$ or the "reward to go". This quantity
is closely related to the $Q$ function, hence the similarity in notation. For clarity, the entire policy gradient update now becomes
$$\frac{1}{N} \sum_{i=1}^N \left[ \sum_{t=1}^T \nabla_\theta \log{\pi_\theta}(a_t \mid s_t) \left(\sum_{t'=t}^T r(s_{t'}, a_{t'})\right)\right]$$
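<p>Computing the reward to go for a sampled trajectory amounts to a reversed cumulative sum over its rewards. Here is a minimal Python sketch (the function name <code>reward_to_go</code> is a hypothetical choice, and no discounting is applied, matching the formula above).</p>

```python
def reward_to_go(rewards):
    """Return the list whose t-th entry is the sum of rewards from step t onward."""
    out = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each entry reuses the sum already accumulated.
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        out[t] = running
    return out


q_hat = reward_to_go([1.0, 2.0, 3.0])
print(q_hat)
```

<p>Each gradient term at time $t$ is then weighted by the matching entry of this list instead of by the total return.</p>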
<h3>Baselines</h3>
The next technique for reducing variance is not quite as obvious but still yields great results. If you think about how policy gradient
works, you'll notice that how we take our optimization step depends heavily on the reward function we choose. Given a trajectory $\tau$,
if we have a negative return $r(\tau) = \sum_{t} r(s_t, a_t)$ then we'll actually take a step in the direction opposite the gradient,
which should have the effect of lessening the probability density on the trajectory. For those trajectories that have positive return,
their probability density will increase. However, if we do something as simple as setting $r(\tau) = r(\tau) + b$ where $b$ is a
sufficiently large constant such that the return for $r(\tau)$ is now positive, then we will actually increase the probability weight on
$\tau$ even though $\tau$ still fares worse than other trajectories with previously positive return. Given how sensitive the model is
to the shifting and scaling of the chosen reward function, it's natural to ask whether we can find an optimal $b$ such that
(note: we're using trajectories here so some of the sums from the original PG formulation are condensed)
$$\frac{1}{N} \sum_{i=1}^N \nabla_\theta \log \pi_\theta(\tau_i) [r(\tau_i) - b]$$
has minimum variance. We call such a $b$ a <i>baseline</i>.
We also want to ensure that subtracting $b$ in this way doesn't bias our estimate of the gradient. Let's do that
first. Recall the identity we used in the original policy gradient derivation
$$\pi_\theta(\tau) \nabla \log \pi_\theta(\tau) = \nabla \pi_\theta(\tau)$$
To show that our estimator remains unbiased, we need to
show that
$$\mathbb{E}\left[\nabla \log \pi_\theta(\tau_i)[r(\tau_i) - b]\right] = \mathbb{E} [\nabla \log \pi_\theta(\tau_i) r(\tau_i)]$$
We can equivalently show that $\mathbb{E} [\nabla \log \pi_\theta(\tau_i) b]$ is equal to zero. We have
\begin{align*}
\mathbb{E} [\nabla \log \pi_\theta(\tau_i) b]
&= \int \pi_\theta(\tau_i) \nabla \log \pi_\theta(\tau_i) b \ d\tau_i \\
&= \int \nabla \pi_\theta(\tau_i) b \ d\tau_i \\
&= b \nabla \int \pi_\theta(\tau_i) \ d\tau_i \\
&= b \nabla 1 \\
&= 0
\end{align*}
where we use the fact that $\int \pi_\theta(\tau_i) \ d\tau_i$
is 1 because $\pi_\theta$ is a probability distribution.
Therefore, our baseline enhanced version of the policy gradient
remains unbiased.
</br></br>
The question then becomes, how do we choose an optimal setting
of $b$? One natural candidate is the average reward
$b = \frac{1}{N} \sum_{i=1}^N r(\tau_i)$ over all trajectories
in the simulation. In this case, our returns are "centered",
and returns that are better than average end up being
positively weighted whereas those that are worse are negatively
weighted. This actually works quite well, but it is not, in fact, optimal. To calculate the optimal setting, let's look at
the policy gradient's variance. In general, we have
\begin{align*}
Var[x] &= \mathbb{E}[x^2] - \mathbb{E}[x]^2 \\
\nabla J(\theta) &= \mathbb{E}_{\tau \sim \pi_\theta(\tau)}
\left[ \nabla \log \pi_\theta(\tau) (r(\tau) - b)\right] \\
Var[\nabla J(\theta)] &= \mathbb{E}_{\tau \sim \pi_\theta(\tau)} \left[(\nabla \log \pi_\theta(\tau) (r(\tau) - b))^2\right] - \mathbb{E}_{\tau \sim \pi_\theta(\tau)} \left[ \nabla_\theta \log \pi_\theta(\tau) (r(\tau) - b)\right]^2
\end{align*}
The rightmost term in this expression is just the square of the
policy gradient, which for the purposes of optimizing $b$ we
can ignore since baselines are unbiased in expectation, so this term does not depend on $b$. Therefore, we turn our attention to the left term.
To simplify notation, we can write
$$g(\tau) = \nabla \log \pi_\theta(\tau)$$
Then we take the derivative to get
\begin{align*}
\frac{dVar}{db} &= \frac{d}{db} \mathbb{E}\left[
g(\tau)^2(r(\tau) - b)^2\right] \\
&= \frac{d}{db}(\mathbb{E}[g(\tau)^2r(\tau)^2] - 2
\mathbb{E}[g(\tau)^2r(\tau)b] + b^2\mathbb{E}[g(\tau)^2]) \\
&= 0 -2\mathbb{E}[g(\tau)^2r(\tau)] + 2b\mathbb{E}[g(\tau)^2]
\end{align*}
Setting this derivative to zero and solving for $b$ gives
$$b = \frac{\mathbb{E}[g(\tau)^2r(\tau)]}{\mathbb{E}[g(\tau)^2]}
$$
In other words, the optimal setting for $b$ is to take the
expected reward but reweight it by the squared gradient magnitudes.
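</br></br>
To make the formula concrete, here is a small sketch under stated assumptions: a Bernoulli policy with a single scalar parameter (so that $g(\tau)^2$ is unambiguous), one action per "trajectory", and made-up rewards. The function names are hypothetical. It compares the second moment $\mathbb{E}[g(\tau)^2(r(\tau) - b)^2]$ for no baseline, the average-reward baseline, and the reweighted baseline derived above.

```python
import numpy as np

def optimal_baseline(probs, g, r):
    """b* = E[g^2 r] / E[g^2] for a discrete distribution."""
    return np.sum(probs * g**2 * r) / np.sum(probs * g**2)

def second_moment(probs, g, r, b):
    """E[(g (r - b))^2], the b-dependent part of the variance."""
    return np.sum(probs * (g * (r - b))**2)

# Hypothetical toy problem: a Bernoulli policy pi(a=1) = sigmoid(theta),
# so each "trajectory" is a single action with a fixed reward.
theta = 1.0
p = 1.0 / (1.0 + np.exp(-theta))
probs = np.array([1.0 - p, p])      # P(a=0), P(a=1)
g = np.array([-p, 1.0 - p])         # d/dtheta log pi(a) for a=0, a=1
r = np.array([1.0, 3.0])            # made-up rewards

b_mean = np.sum(probs * r)          # average-reward baseline
b_opt = optimal_baseline(probs, g, r)

m_none = second_moment(probs, g, r, 0.0)
m_mean = second_moment(probs, g, r, b_mean)
m_opt = second_moment(probs, g, r, b_opt)
print(b_mean, b_opt)
print(m_none, m_mean, m_opt)
```

In this toy setting the two baselines differ whenever the policy is not uniform, and the reweighted one gives the smallest second moment of the three, as the derivation predicts.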
</br></br>
<h2>Conclusion</h2>
Hopefully this provided you with a good overview as to how
you can improve implementations of policy gradient to speed
up convergence and reduce variance. In a future article, I'll
discuss how to derive an off-policy version of policy gradient
which improves sample efficiency and speeds up convergence.
Mon, 03 Jun 2019 00:00:00 +0000
http://mcneela.github.io/machine_learning/2019/06/03/The-Problem-With-Policy-Gradient.html
<h1>The Variant Calling Problem</h1>
In bioinformatics, particularly in the subfield of oncology in which I work, we're often tasked with the issue of identifying variants
in a genomic sequence. What this means is that we have a sample sequence along with a <i>reference sequence</i>, and we want to identify
regions where the sample differs from the reference. These regions are called <i>variants</i>, and they can often be clinically
relevant, potentially indicating an oncogenic mutation or some
other telling clinical marker.
</br></br>
In general, there are two types of variant calling that one may be interested in performing. The first of these is called
<i>germline variant calling</i>. In this type, we use a reference genome that is an accepted standard for the species in
question. There exists more than one acceptable choice for a reference genome, but one of the current standards for humans
is called <i>GRCh38</i>. This stands for <i>Genome Reference Consortium human</i> and 38 is an identifying version number.
Note the <i>h</i> is used to distinguish the human genome from other genomes that the GRC puts out such as <i>GRCm</i>
(mouse genome).
</br></br>
The second type is called <i>somatic variant calling</i>. This
method uses two samples from a single individual and compares
the sequence of one with the other. In effect, the patient's own
genome serves as the reference. This is the method commonly used
in oncology, as one sample can be taken from a tumor and another
from normal, non-tumor tissue such as blood. This process
is also sometimes called a <i>tumor-normal</i> study due to the
fact that one sample is taken from the tumor and one from a
normal cell.
</br></br>
There are a few additional things to note in the discussion
above. The first is that in germline calling we're uncovering
variants that are being <i>passed from parent to child</i> via
the germ cells, i.e. the sperm and eggs. In somatic calling,
we're identifying somatic mutations and variants, those that
occur within the body over the course of a lifetime, whether
they are due to environmental factors, errors in DNA
replication and repair, or some other factor.
</br></br>
<h2>Types of Variants</h2>
In order to identify variants, it will be helpful to develop
an ontology of some of the specific classes and types of
variants we might expect to see as the output of our variant
callers.
</br></br>
<b>Indel</b> - This is one of the simplest types of variant
to describe conceptually, but also one of the most difficult
to identify via current variant calling methods. An <i>indel</i>
is either an insertion or deletion of one or more bases (typically a short run) at some
point in the DNA sequence. The reason these are difficult to
identify is that since it is not known in advance at which
point in the DNA the variant occurs, it can throw off the
alignment with the reference sequence.
</br></br>
<b>SNP</b> - This is short for <i>single nucleotide polymorphism
</i> and it refers to a population-level variant in which a
single base differs between the sample and reference sequences.
By population-level, we mean that this variant is common
across specific populations, conventionally appearing in at least 1% of
individuals. For example, an SNP might show up when comparing the genomes
of Caucasian and Chinese people at a site within a gene that influences skin tone.
</br></br>
<b>SNV</b> - This stands for <i>single-nucleotide variant</i>
and is the same as an SNP apart from the fact that it indicates
a novel mutation present in the genomes of a few rather than
being widespread across a population.
</br></br>
<b>Structural Variant</b> - A <i>structural variant</i> refers
to a larger-scale variant in the genome, typically occupying at
least 1000 base pairs. These can exhibit a number of different
behaviors. For example, a >1kb region of the DNA can be
duplicated and reinserted or it can be deleted. These types of
structural variation are commonly referred to as
<i>copy number variants (CNVs)</i>. A partial sequence can also
be reversed (called <i>inversion</i>). This image sums it up
well:
<div align="middle">
<img src="/images/structural_variation.png"></img>
</div>
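The categories above can be sketched as a small rule-based labeler. This is only an illustration: <code>classify_variant</code> is a hypothetical helper, the inputs are assumed to be the reference and sample alleles at an already-identified site, and the 1000-base cutoff is the rough structural-variant threshold quoted above. Note that alleles alone cannot separate an SNP from an SNV, since that distinction depends on population frequency.

```python
def classify_variant(ref, alt, sv_min_len=1000):
    """Label a variant from its reference and alternate alleles.

    Hypothetical helper: `ref` and `alt` are the bases at the variant site
    in the reference and the sample. `sv_min_len` is the length cutoff for
    structural variants mentioned in the text.
    """
    if ref == alt:
        return "match"
    if max(len(ref), len(alt)) >= sv_min_len:
        return "structural variant"
    if len(ref) == 1 and len(alt) == 1:
        return "single-base substitution (SNP or SNV)"
    if len(alt) > len(ref):
        return "insertion"
    if len(alt) < len(ref):
        return "deletion"
    return "multi-base substitution"

print(classify_variant("A", "G"))
print(classify_variant("A", "ATT"))
print(classify_variant("ACG", "A"))
```

A real caller has to find the site first, which is the hard part; this sketch only names what has already been found.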
<!-- <p align="middle"><i>(Credit: The European Bioinformatics Institute)</i></p> -->
<h2>Haplotype Phasing</h2>
Before getting into the nitty gritty of variant calling, it will be helpful to describe a related process called
<i>haplotype phasing</i>. A <i>haplotype</i> is the set of genetic information associated with a single chromosome.
The human genome is <i>diploid</i>, meaning it consists of pairs of chromosomes in which one chromosome of each pair is
inherited from the mother and the other from the father. The combined genetic information resulting from considering these
pairs of chromosomes as single units is called the <i>genotype</i>. However, having access to the haplotype matters because
it can provide information crucial to identifying disease-causing variants (either SNPs or SNVs) in the genome. The issue is that
current sequencing methods don't give us haplotype information for free, as often the reads produced cannot be separated by
their maternal or paternal origin. Therefore, we need to use statistical algorithms to piece together what we can about the
haplotype sequences after the fact. This is where haplotype phasing comes into play.
</br></br>
<h3>Some Preliminaries and Associated Challenges</h3>
To start with, we need to establish some background with regards to population-level frequencies of genetic variation so that we
can begin to detect deviations from that standard. What may be surprising is that a typical human genome varies
from the reference at only approximately 0.1% of the bases. That means that between any two individuals, we should expect to see
roughly 99.9% shared genetic information. It is at the remaining 0.1% of sites that we direct our attention when we're looking to
tease out the haplotypes. For reference, the condition of sharing genetic information is often called <i>homology</i>.
</br></br>
The teasing out of haplotypes is complicated by the fact that genetic information is reshuffled on its way from parent to child in a
process known as <i>recombination</i>. Recombination happens when instead of passing down just one chromosome from its diploid
pair, a parent passes down a combination of the two. At the biological level, this process occurs during <i>meiosis</i>. The
first phase of meiosis involves an alignment of the parent's homologous chromosome pairs, i.e. the copies that parent inherited from their own mother and father. This process involves
a stage in which the chromosomal arms temporarily overlap, causing a crossover which may (or may not) result in an exchange of material at the
locus at which the point of crossover occurs. This results in a recombination of genetic information which is then passed down
to the child. One benefit to this process is that it encourages genetic diversity, creating genetic patterns in the child that
appeared in neither parent. Unfortunately, it also makes haplotype phasing that much more difficult.
</br></br>
DNA sequencing gives partial haplotype information. It produces sequencing <i>reads</i> which are partial strings of the sequence up to
1000 base pairs. Each read comes from a single chromosome, but it represents only a small slice of the haplotype because the
entire chromosome is on the order of 50 million to 250 million base pairs.
</br></br>
<h3>Advantages of Knowing the Haplotype</h3>
Hopefully it's clear that having access to the haplotype for any given individual is strictly superior to simply having the genotype.
This is because, given access to a person's haplotype, we can simply combine that information in a procedural way to yield their
genotype. The same cannot be said of the reverse. What's more, having the haplotype allows us to compute statistics about various
biological properties that would otherwise be unavailable to us. For example, if we have haplotypes for a series of individuals
in a population, we can check at which loci recombination is most likely to occur, what the frequency of that recombination is, etc.
It suffices to say that having access to the haplotype is highly advantageous as a downstream input to variant calling methods, and
I'll examine some such methods that make use of this information in upcoming posts.
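</br></br>
The one-way relationship between haplotypes and genotypes can be sketched in a few lines of Python. The representation here is an assumption made for illustration: each haplotype is a string with one allele per variable site, and the genotype at a site is the unordered pair of alleles. The function name and the example sequences are made up.

```python
def genotype(hap_a, hap_b):
    """Combine two haplotypes into an unphased genotype.

    Each site becomes an unordered pair of alleles, so the information
    about which chromosome carried which allele is discarded.
    """
    if len(hap_a) != len(hap_b):
        raise ValueError("haplotypes must cover the same sites")
    return [tuple(sorted(pair)) for pair in zip(hap_a, hap_b)]

# Two made-up individuals with different haplotypes at three variable sites.
person_1 = ("ACT", "GCA")
person_2 = ("ACA", "GCT")

print(genotype(*person_1))
print(genotype(*person_2))
```

Both individuals end up with the same genotype even though their haplotypes differ, which is why the genotype alone cannot be inverted and phasing needs extra statistical machinery.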
</br></br>
<h2>The DeepVariant Caller</h2>
Now that we have an understanding of haplotypes and some of the different ways in which variants might arise, I'd like to introduce
one of the most successful models for variant calling created to date: Google's <i>DeepVariant</i>. This model is one of the simplest
to introduce in that it uses virtually no specialized knowledge of genomics in terms of encoding input features. Surprisingly, this
has no detrimental effect on its accuracy, as it outperformed virtually every other variant caller on the market at the time of its
release. The approach it uses is somewhat novel so I'll introduce that here, and I'll compare it in subsequent posts with the
methodology behind other variant callers.
</br></br>
<h3>How it Works</h3>
<i>DeepVariant</i> is unique in that it doesn't operate on the textual sequence information of the aligned reads and the reference genome.
Rather, it creates something called a <i>pileup image</i> from the alignment information. A pileup image shows the sequenced reads from
the sample aligned with the reference. Each base is assigned an RGB color, and possible variants within the reads as compared to the
reference are highlighted accordingly. Here's an example of what a pileup looks like.
<div align="middle">
<img src="/images/pileup.jpg" width="90%"></img>
</div>
<!-- <p align="middle"><i>(Credit: Melbourne Bioinformatics)</i></p> -->
The DeepVariant model is trained on a large dataset of such images in which variants have already been labeled. It can then be applied
to new alignments and samples by generating their pileup using a program such as samtools. Here's how the authors from Google
structured their model.
<div align="middle">
<img src="/images/deepvariant.png" width=""></img>
</div>
As you can see from the diagram, their workflow proceeds as follows:
<ol>
<li>Identifying Candidate Variants and Generating Pileup: Candidate variants are identified according to a simple procedure. The
reads are aligned to the reference, and in each case the CIGAR string for the read is generated. Based on how that read
compares to the reference at the point which it is aligned, it is classified as either a match, an SNV, an insertion, or a
deletion. If the number of reads which differ at that base passes a pre-defined threshold, then that site is identified as a
potential variant. The authors also apply some preprocessing to ensure that the reads under consideration are of high enough
quality and are aligned properly in order to be considered. After candidate variants have been identified, the pileup image is
generated. The notable points here are that candidate variants are colored and areas with excessive coverage are downsampled
prior to image generation.</li>
<li>The model is trained on the data, then run during inference stages. They use a CNN, specifically the Inception v2 architecture,
which takes in a 299x299 input pileup image and uses an output softmax layer to classify into one of hom-ref, het, or hom-alt.
They use SGD to train with a batch size of 32 and RMS decay set to 0.9. They initialize the CNN using the weights from one
of the ImageNet models.</li>
</ol>
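The candidate-identification step in the first item can be sketched as a simple counting procedure. This is a deliberately simplified illustration and not DeepVariant's actual code: the function name, the ungapped <code>(start, sequence)</code> read format, and the threshold value are all assumptions, and read-quality filtering is left out.

```python
from collections import Counter

def candidate_sites(reference, reads, min_alt_reads=2):
    """Return reference positions where enough reads disagree with the reference.

    `reads` is a list of (start, sequence) pairs giving ungapped alignments,
    so only substitutions are handled here; insertions and deletions would
    require parsing each read's CIGAR string.
    """
    mismatches = Counter()
    for start, seq in reads:
        for offset, base in enumerate(seq):
            pos = start + offset
            if pos < len(reference) and base != reference[pos]:
                mismatches[pos] += 1
    return sorted(pos for pos, n in mismatches.items() if n >= min_alt_reads)

reference = "ACGTACGTAC"
reads = [
    (0, "ACGTTCG"),   # mismatch at position 4
    (2, "GTTCGTA"),   # mismatch at position 4
    (3, "TACGTAG"),   # single mismatch at position 9 (likely noise)
]
print(candidate_sites(reference, reads))
```

With the threshold at two reads, only the position supported by multiple reads survives; the lone mismatch is treated as sequencing noise.
</br></br>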
The authors find that their model generalizes well, and can be used with no statistically significant loss in accuracy on reads aligned
to a reference different from the one on which <i>DeepVariant</i> was trained. They also stumble upon another surprising result, namely
that their model successfully calls variants on the mouse genome, opening up the model's use to a wide diversity of species.
</br></br>
<h2>Conclusion</h2>
I hope this post shed some light on how variant calling works and elucidated the biological underpinnings of the process well enough
for newcomers to the field to start diving into some papers. In future posts, I'll expand on some of the other popular variant calling
models as well as introduce algorithms for related processes, such as the haplotype phasing discussed earlier.
Sun, 26 May 2019 00:00:00 +0000
http://mcneela.github.io/bioinformatics/2019/05/26/Variant-Calling-Methods.html
<b> Work In Progress! </b>
<h1>A Bit About Manifolds</h1>
A <i>manifold</i>, in the broadest sense, is a structure which is locally homeomorphic to $\mathbb{R}^n$
around each of its points. In layman's terms, this means that while a manifold may be sufficiently "abstracted"
such that its elements look nothing like those of $\mathbb{R}^n$, we can regardless treat the manifold as if
it's $\mathbb{R}^n$, with a few potential caveats. But first, the formal definition.
</br></br>
<u><b>Definition 1</b></u> A <i>manifold</i> is a second countable Hausdorff space $M$ equipped with a collection of
"charts" (called an atlas), where each chart is a homeomorphism $\phi: U \to V \subset \mathbb{R}^n$ from an open set $U \subset M$
onto an open set $V$. We require that each $x \in M$ be in the domain of some
chart. We require that the collection of charts be maximal subject to the conditions above.
</br></br>
If we want to be able to perform calculus on our manifolds, however, a bit more is required.
</br></br>
<u><b>Definition 2</b></u> A <i>smooth</i> ($C^\infty$) manifold is a manifold with the additional
restriction that the change of coordinates defined by the composition of one chart and another's inverse is itself
$C^\infty$. That is, we require for two charts $\phi$ and $\psi$ that given $\phi: U \to U' \subset \mathbb{R}^n, \psi: V \to V' \subset \mathbb{R}^n$, the following be continuously differentiable of all orders
$$\phi \circ \psi^{-1} : \psi (U \cap V) \to \phi (U \cap V)$$
Both of these definitions look like quite the mouthful at first, but there's really little in the way of hidden complexity.
The de facto differentiability requirements on the change of coordinates mappings make inherent sense. If this differentiability
were not presupposed, there'd be nothing restricting the manifold from being put together as a patchwork of locally smooth
neighborhoods that don't inherently connect in a "nice" way, i.e. there may exist discontinuities at the points where they join.
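</br></br>
A concrete change of coordinates may help here. The sketch below assumes the unit circle as the manifold and uses the two standard stereographic charts (from the north and south poles); the function names are illustrative. On the overlap of the two chart domains the transition map works out to $u \mapsto 1/u$, which is smooth wherever it is defined.

```python
import numpy as np

def phi_north(x, y):
    """Stereographic chart from the north pole (0, 1); undefined at that pole."""
    return x / (1.0 - y)

def phi_south(x, y):
    """Stereographic chart from the south pole (0, -1); undefined at that pole."""
    return x / (1.0 + y)

def phi_north_inv(u):
    """Inverse of phi_north: maps a real number back to a point on the circle."""
    return 2.0 * u / (u**2 + 1.0), (u**2 - 1.0) / (u**2 + 1.0)

def transition(u):
    """Change of coordinates phi_south o phi_north^{-1}, defined for u != 0."""
    return phi_south(*phi_north_inv(u))

u = np.array([-3.0, -0.5, 0.25, 2.0])
print(transition(u))
print(1.0 / u)
```

The excluded value $u = 0$ corresponds to the south pole, which lies outside the domain of the second chart, so nothing is lost on the overlap.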
</br></br>
The notion of second countability may throw some for a loop, so I'll take some time to review it. A topological space is called
second countable if it exhibits a countable base $\mathcal{B} = \{B_i\}_{i=1}^\infty$. This is a property that's characteristic
of many conventionally "nice" mathematical spaces, and its presence in this definition is motivated by the homeomorphicity
with $\mathbb{R}^n$. $\mathbb{R}^n$ is second countable, and second countability is a topological invariant. More concretely,
since there exists a countable collection of open sets $\mathcal{B}$ which form a base for $\mathbb{R}^n$, we can take this collection and form
its equivalent inside the domain of a chart $\phi: U \to V$ by applying the inverse chart $U_i = \phi^{-1}(B_i \cap V)$ for $B_i \in \mathcal{B}$ to get a countable
base $\{U_i\}_{i=1}^\infty$ for $U$. Requiring second countability of $M$ itself ensures that countably many chart domains suffice to cover $M$. This does not follow from the charts alone (the long line is the standard counterexample), which is why it is stated as a separate condition.
</br></br>
Similarly, the Hausdorff condition in the manifold definition mirrors a property of $\mathbb{R}^n$ that we want the manifold to
share. Recall that a Hausdorff space is one in which every pair of distinct
points has a corresponding pair of disjoint neighborhoods. This is another standard regularity condition which
is also a topological invariant. Like second countability, it is not automatic from the charts (the line with two origins is locally Euclidean but not Hausdorff), so it is imposed explicitly.
</br></br>
A slightly more general way of thinking about manifolds is to consider them as topological spaces endowed with some additional structure
which varies based on the specific type of manifold under consideration. We can denote this as a pair $(T, S)$ where $T$ is the space
and $S$ is the structure. For example, we can specify the smooth $n$-dimensional real manifold as the pair $(\mathbb{R}^n, C^\infty)$.
When viewed in this light, any map from one manifold to a counterpart which "satisfies" the underlying structure of the two spaces is
called a <i>morphism</i>. We can further refine this set by considering only those morphisms whose inverses are themselves morphisms.
A map satisfying this additional condition is called an <i>isomorphism</i>. As an example, for standard manifolds the canonical
isomorphisms are the homeomorphisms. For smooth manifolds, the isomorphisms are diffeomorphisms. To formalize this, we define the
following.
</br></br>
<u><b>Definition</b></u> Suppose $X$ is a topological space. We say that the function $F_X$ is a <i>functional structure</i> on $X$
if and only if for all open $U \subset X$
</br></br>
1. $F_X(U)$ is a subalgebra of the algebra of all continuous real-valued functions on $U$.</br>
2. $F_X(U)$ contains all constant functions.</br>
3. $V \subset U, f \in F_X(U) \implies f|_V \in F_X(V)$</br>
4. $U = \bigcup U_\alpha$ and $f|_{U_\alpha} \in F_X(U_\alpha)$ for all $\alpha \implies f \in F_X(U)$.
</br></br>
Then the notions of morphism and isomorphism can be formulated accordingly as...
</br></br>
<u><b>Definition</b></u> A <i>morphism</i> of functionally structured spaces
$$(X, F_X) \to (Y, F_Y)$$
is a map $\phi:X \to Y$ such that composition $f \mapsto f \circ \phi$ carries $F_Y(U)$ into $F_X(\phi^{-1}(U))$ for every open $U \subset Y$. An isomorphism is a morphism $\phi$ such that $\phi^{-1}$ exists as a morphism.
</br></br>
<h2>Tangent Spaces, Differentials, and All Things Adjacent</h2>
The fundamental concept of differential calculus is the derivative, which allows us to compute the line tangent to a curve at any
given point. The analogue for manifolds is the differential, which allows us to define tangent vectors to arbitrary trajectories
along a surface. It will turn out that the set of these vectors in fact comprises a vector space.
</br></br>
<u><b>Definition</b></u> Take a smooth manifold $M$ and a smooth, parametrized curve $\gamma: \mathbb{R} \to M$ satisfying
$\gamma(0) = p$. Let $f$ be a smooth function defined on an open neighborhood of $p$. Then we define the <i>directional derivative</i>
of $f$ along $\gamma$ at $p$ to be
$$D_\gamma(f) = \frac{d}{dt} f(\gamma(t))|_{t=0}$$
We call the operator $D_\gamma$ the <i>tangent vector</i> to $\gamma$ at $p$. For a point $p \in M$ we denote by $T_p(M)$ the vector
space of all tangent vectors to $M$ at $p$.
</br></br>
To understand the operator $D_\gamma$ in a more global context, it behooves us to introduce <i>algebras</i>. Algebras exist as a simple
extension of vector spaces. Specifically, they are vector spaces having an associated bilinear product. That is, the product takes two elements
of the vector space and sends them to a third element of the same space, linearly in each argument. In order to understand this view of derivatives along
manifolds, we need to present the algebra on which the tangent operator is defined.
</br></br>
To do so, consider the following definition. Suppose you have a smooth, real-valued function $f$ defined on a neighborhood of some point $p \in M$ where
$M$ is a smooth manifold. We call the equivalence class of $f$ defined under the equivalence relation
$f_1 \sim f_2 \iff f_1(x) = f_2(x)$ for all $x$ in a neighborhood $U$ of $p$ a <i>germ</i>. We can extend the notion
of a germ to sets $S$ and $T$ via the relation $S \sim_p T$ if there exists a neighborhood $U$ of $p$ such that
$$S \cap U = T \cap U \neq \emptyset$$
The way we're defining a germ here is more specific than is necessary. Germs are
a more general concept which capture notions of local equivalence via some property for objects acting on a topological space.
Most often these objects are functions or maps with some sort of additional structure, e.g. smoothness or continuity, although this is
not strictly necessary. To define a germ one needs only a notion of equality and a topological space, as these are the most abstract
structures which allow us to define notions of locality via neighborhoods. When discussing germ equivalence between two maps $f$ and
$g$ we can consider that these maps need not be defined on the same domain, provided some caveats are satisfied. We require that
if $f$ has domain $S$ and $g$ has domain $T$ then $S$ and $T$ are germ equivalent via the definition for sets given above. We also
require that $f|_{S \cap V} = g|_{T \cap V}$ for some smaller neighborhood $V$ of $p$ satisfying $p \in V \subseteq U$.
</br></br>
We denote the germ at $p$ of a function $f$ as $[f]_p$ and we denote the equivalence relation it defines as $f \sim_p g$.
</br></br>
A <b>$K$-linear derivation</b> $D:A_1 \to A_2$ where $A_1, A_2$ are algebras is a $K$-linear map satisfying the product rule
$$D(ab) = aD(b) + D(a)b$$
<h3>Defining Tangent Vectors via Derivations</h3>
Suppose $M$ is a $C^\infty$ manifold. For any $x \in M$ we define a <b>derivation</b> by choosing a linear map $D:C^\infty(M) \to
\mathbb{R}$ satisfying
$$\forall f, g \in C^\infty(M) \quad D(fg) = f(x)D(g) + D(f)g(x)$$
We can define addition and scalar multiplication on the set of derivations in the same way we do for elements of a vector space.
We call the vector space obtained the tangent space to $M$ at $x$ and denote it as $T_x(M)$.
</br></br>
Let $\gamma: (-1, 1) \to M$ be a differentiable curve with $\gamma(0) = x$. Then the derivation $D_\gamma$ at $x$ is defined by
$D_\gamma(f) := (f \circ \gamma)'(0)$
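</br></br>
To see that $D_\gamma$ really behaves like a derivation, here is a small numerical sketch. It assumes the unit sphere as $M$, a great circle as the curve, and two arbitrary smooth test functions; the derivative at $t = 0$ is approximated with a central finite difference, so the product rule is only checked up to numerical error. All names are illustrative.

```python
import numpy as np

def tangent_vector(curve, h=1e-6):
    """Return the operator D_gamma: f -> d/dt f(gamma(t)) at t = 0,
    approximated with a central finite difference of step h."""
    def D(f):
        return (f(curve(h)) - f(curve(-h))) / (2.0 * h)
    return D

# Hypothetical example: a great circle on the unit sphere through x = (1, 0, 0).
def gamma(t):
    return np.array([np.cos(t), np.sin(t), 0.0])

def f(p):
    return p[0] + 2.0 * p[1]

def g(p):
    return p[0] * p[1] + p[2]

x = gamma(0.0)
D = tangent_vector(gamma)

lhs = D(lambda p: f(p) * g(p))          # D(fg)
rhs = f(x) * D(g) + D(f) * g(x)         # f(x) D(g) + D(f) g(x)
print(D(f), lhs, rhs)
```

The two sides of the product rule agree to within the finite-difference error, which is the defining property used above.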
</br></br>
<h3>Getting Infinitesimal with Differentials</h3>
When calculus is taught, the notion of a <i>differential</i> is frequently introduced and is usually vaguely described as the
infinitesimal change in some variable. Haphazard teachers may introduce the chain rule by making generalizations such as
that one can cancel differentials like $dx$ in the equation
$$\frac{df}{dz} = \frac{df}{dx} \frac{dx}{dz}$$
but this isn't entirely true, or at least it doesn't tell the whole story. We also see the differential pop up in the notation for
integration, i.e. we write $\int_{a}^b f(x) \ dx$ but we never really explain what the differential is and what it's doing there.
</br></br>
The general structure in which differential geometry operates makes things much clearer. What we'll see is that the differential
is actually an object which instructs us in how to map between the tangent spaces of different manifolds. The differential is
defined with respect to such a smooth map. In fact, the differential is an operator that takes in a map between manifolds and
spits out a map between their tangent spaces. For more specifics, let's get to the definition.
</br></br>
<b><u>Definition</u></b> Let $\phi : M \to N$ be a smooth map between manifolds. Then we say that the differential of $\phi$ at $p \in M$ is
the linear map $d\phi: T_p(M) \to T_{\phi(p)}(N)$ satisfying
$$d\phi(D)(g) = D(g \circ \phi)$$
and also satisfying
$$d\phi d\psi = d(\phi \circ \psi)$$
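</br></br>
In local coordinates the differential is represented by the Jacobian matrix, so the composition rule above becomes a statement about matrix products. The sketch below checks it numerically under simple assumptions: both maps are written directly in coordinates on $\mathbb{R}^2$, the maps themselves are made up, and Jacobians are approximated by central finite differences.

```python
import numpy as np

def jacobian(F, p, h=1e-6):
    """Central finite-difference Jacobian of F: R^n -> R^m at the point p."""
    p = np.asarray(p, dtype=float)
    columns = []
    for i in range(p.size):
        step = np.zeros_like(p)
        step[i] = h
        columns.append((F(p + step) - F(p - step)) / (2.0 * h))
    return np.stack(columns, axis=1)

# Hypothetical smooth maps between (charts of) two-dimensional manifolds.
def psi(p):
    return np.array([p[0] * p[1], p[0] + p[1] ** 2])

def phi(q):
    return np.array([np.sin(q[0]), q[0] * q[1]])

p = np.array([0.7, -0.4])
d_psi = jacobian(psi, p)                      # differential of psi at p
d_phi = jacobian(phi, psi(p))                 # differential of phi at psi(p)
d_comp = jacobian(lambda v: phi(psi(v)), p)   # differential of phi o psi at p
print(d_phi @ d_psi)
print(d_comp)
```

Note that the differential of $\phi$ has to be evaluated at $\psi(p)$, not at $p$, for the product to match the differential of the composition.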
<h3>Taking this Tangent Thing a Step Further - Tangent Bundles</h3>
Before giving the formal definition, I'd like to give a brief intuitive explanation of what a tangent bundle constitutes. The
tangent bundle is unique in that it is itself a manifold $T(M)$, obtained by construction from the tangent spaces at each point
of some different manifold $M$. It can also be thought of as the set of pairs $(p, v_p)$ of points $p \in M$ and associated tangent
vectors $v_p \in T_p(M)$. This pairing naturally gives rise to a projection map $\pi:T(M) \to M$ from the
tangent bundle back to the manifold from which it was constructed. Formally, we define the <i>tangent bundle</i> as the union of
tangent spaces
$$T(M) = \bigcup \{T_p(M) | p \in M\}$$
However, to make the manifold valid, we still need to specify an appropriate set of charts.
A manifold is said to be <i>parallelizable</i> if its tangent bundle is trivial.
<h3>An Instance of a More General Phenomenon</h3>
The tangent bundle is a specific instance of a concept called the <i>vector bundle</i>, which provides a method for constructing
a vector space parameterized by a different sort of space such as a topological space $X$ or a manifold $M$. One of the characteristics
of vector bundles is that they are <i>locally trivial</i>. They also consist of a base space and a total space. The total
space is formed as a family of vector spaces over the points of the base space (here $M$), and that family is required to be smoothly varying.
The formal definition involves both a base space $B$ and a total space $E$. A tangent
<h3>A General Instance of an Even More General Phenomenon</h3>
A vector bundle, as general as it is, is actually an instance of an even more general phenomenon. The idea is we replace the
<i>vector</i> part of <i>vector bundle</i> with a fiber of some mapping. This mapping can be anything we want, subject to a few
conditions. This is used to construct spaces which look locally like product spaces but from a global view have a more general
topological structure.
</br></br>
<b><u>Formal Definition</u></b> A <i>fiber bundle</i> is a tuple $(B, E, \pi, F)$ consisting of a base space $B$, a total space $E$,
a continuous surjective projection map $\pi:E \to B$ and a fiber $F$, all satisfying the following
</br></br>
1. For every $x \in E$, there exists an open neighborhood $U \subset B$ of $\pi(x)$ such that there exists a homeomorphism
$\varphi: \pi^{-1}(U) \to U \times F$
Wed, 30 Jan 2019 00:00:00 +0000
http://mcneela.github.io/math/2019/01/30/A-Bit-About-Manifolds.html
<h1>The Banach Contraction Principle (Banach Fixed-Point Theorem)</h1>
I wanted to write about the Banach Contraction Principle (from here out the BCP) because of
the intuitive mathematical beauty it illustrates. Its proof is elegant, but even more than
that, the theorem provides intuition into problems from fields as diverse as ODEs and differential
geometry.
</br></br>
<u><b>Theorem Statement</b></u> If $X$ is a complete metric space and $T:X \to X$ is a contraction,
then $T$ has a unique fixed point $a \in X$. Furthermore, for any $x \in X$, $a = \lim_{i \to \infty} T^i (x)$.
</br></br>
<i>Proof:</i> Before starting, it should be noted that a map $T$ is a <i>contraction</i> if for
some constant $0 \leq K < 1$, we have $d(Tx, Ty) \leq K \cdot d(x, y)$ for all $x, y \in X$.
</br></br>
Now, to begin, fix any point $x_0 \in X$ and set $x_i = Tx_{i-1}$. Letting $\delta = d(x_0, x_1)$
we have
\begin{align*}
d(x_0, x_i) &\leq d(x_0, x_1) + d(x_1, x_2) + \cdots + d(x_{i-1}, x_i) \\
&\leq d(x_0, x_1) + K \cdot d(x_0, x_1) + K^2 \cdot d(x_0, x_1) + \cdots \\
&= \delta(1 + K + K^2 + \cdots) \\
&= \delta/(1-K)
\end{align*}
Note that for $m \geq n$, the following holds:
$$d(x_m, x_n) \leq K \cdot d(x_{n-1}, x_{m-1}) \leq K^2 \cdot d(x_{n-2}, x_{m-2}) \leq \cdots \leq K^n \cdot d(x_0, x_{m-n}) \leq \delta K^n/(1-K)$$
which tends to 0 as $n \to \infty$ since $K < 1$.
</br></br>
The key insight here is that the $\{x_i\}_{i\in \mathbb{N}}$ form a Cauchy sequence.
Since we assumed that $X$ is complete, this sequence converges and we can assign $a = \lim_{i \to \infty} x_i$.
Since a contraction is continuous, $Ta = \lim T x_i = \lim x_{i+1} = a$.
</br></br>
Now we need only show the uniqueness of $a$. Suppose $b$ were another fixed point. Then we'd have
$$d(a, b) = d(Ta, Tb) \leq K \cdot d(a, b)$$
And since $K < 1$, we get $d(a, b) = 0$ implying that $a = b$. $\blacksquare$
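</br></br>
The proof is constructive, so it doubles as an algorithm: pick any starting point and keep applying $T$. Here is a minimal sketch, assuming the map $T = \cos$ on the interval $[0, 1]$ as an example contraction; the function name, tolerance, and iteration cap are arbitrary choices.

```python
import math

def fixed_point(T, x0, tol=1e-12, max_iter=1000):
    """Iterate x_{i+1} = T(x_i) until successive iterates differ by less than tol."""
    x = x0
    for _ in range(max_iter):
        x_next = T(x)
        if abs(x_next - x) < tol:
            return x_next
        x = x_next
    raise RuntimeError("iteration did not converge")

# cos maps [0, 1] into itself and |cos'(x)| = |sin(x)| <= sin(1) < 1 there,
# so it is a contraction on the complete metric space [0, 1].
a_from_0 = fixed_point(math.cos, 0.0)
a_from_1 = fixed_point(math.cos, 1.0)
print(a_from_0, a_from_1)
```

Both starting points land on the same value, illustrating the existence and uniqueness claims of the theorem in one run.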
</br></br>
<!-- I find the best way to intuitively visualize this theorem is as follows. Construct in your head some Cauchy sequence
$\{x_i\} \subset X$ in your head with the added constraint that $d(x_i, x_{i+1}) \geq d(x_j, x_{j+1})$ for all $i \leq j$.
In other words, the distance between consecutive points in the sequence limits monotonically towards 0.
-->Sat, 28 Jul 2018 00:00:00 +0000
http://mcneela.github.io/math/2018/07/28/Banach-Contraction-Principle.html
<script type="text/x-mathjax-config">
MathJax.Hub.Config({ TeX: { extensions: ["AMSmath.js"]}});
</script>
<h1>The REINFORCE Algorithm aka Monte-Carlo Policy Differentiation</h1>
The setup for the general reinforcement learning problem is as follows.
We're given an environment $\mathcal{E}$ with a specified state
space $\mathcal{S}$ and an action space $\mathcal{A}$ giving the
allowable actions in each of those states. Each action $a_t$
taken in a specific state $s_t$ yields a particular reward
$r_t = r(s_t, a_t)$ based off a reward function $r$ that's in some way implicitly defined by the environment.
We'd like to choose a policy $\pi$ giving a probability
distribution over actions in each state, $\pi: \mathcal{S} \times
\mathcal{A} \to [0, 1]$. In other words, $\pi(a_t \mid s_t)$ gives the probability of taking action $a_t$ in state $s_t$.
</br></br>
Since the whole problem of RL boils down to formulating an
optimal policy which maximizes reward, we can define an
objective function that explicitly quantifies how a policy
fares in accomplishing this goal. First, we assume that we
are using some sort of function approximator (e.g. a neural
network) to obtain an approximation to the policy $\pi$ and
we assume also that this approximator is governed by some set
of parameters $\theta$. We say that the policy $\pi_\theta$ is
<i>parametrized by</i> $\theta$.
</br></br>
Define our objective function $J$ by
$$J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)} \left[ \sum_t r(s_t, a_t) \right]$$
In shorthand, our objective function returns the expected
reward achieved by a given policy $\pi_\theta$ over some time
horizon governed by $t$ (can be either finite or infinite).
We write $\tau \sim p_\theta(\tau)$ to indicate that we're
sampling trajectories $\tau$ from the probability distribution
of our policy approximator governed by $\theta$. This
distribution can be calculated by decomposing into a product of
conditional probabilities, i.e.
$$p_\theta(\tau) = p_\theta(s_1, a_1, \ldots, s_T, a_T) = p(s_1) \prod_{t=1}^T \pi_\theta(a_t \mid s_t) p(s_{t+1} \mid s_t, a_t)$$
We can now specify the optimal policy $\pi^* = \pi_{\theta^*}$, where
$\theta^* = \arg\max_\theta J(\theta)$.
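</br></br>
To make the factorization of $p_\theta(\tau)$ concrete, here is a minimal sketch that samples one trajectory and accumulates its probability term by term. The two-state MDP, its transition table <code>P</code>, the initial distribution <code>p_s1</code>, and the tabular softmax policy are all made-up assumptions for illustration; none of them come from a particular environment.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy MDP (illustrative values): 2 states, 2 actions, horizon T = 3.
n_states, n_actions, T = 2, 2, 3
p_s1 = np.array([0.6, 0.4])                 # initial-state distribution p(s_1)
# P[s, a, s_next] = p(s_next | s, a); each P[s, a] sums to 1.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
theta = np.zeros((n_states, n_actions))     # policy parameters (softmax logits)


def policy(theta, s):
    """Softmax policy pi_theta(. | s) over the actions available in state s."""
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()


def sample_trajectory(theta):
    """Sample tau = (s_1, a_1, ..., s_T, a_T) and return it with p_theta(tau)."""
    s = rng.choice(n_states, p=p_s1)
    prob = p_s1[s]                          # p(s_1)
    states, actions = [], []
    for _ in range(T):
        pi = policy(theta, s)
        a = rng.choice(n_actions, p=pi)
        s_next = rng.choice(n_states, p=P[s, a])
        # Multiply in pi_theta(a_t | s_t) * p(s_{t+1} | s_t, a_t).
        prob *= pi[a] * P[s, a, s_next]
        states.append(s)
        actions.append(a)
        s = s_next
    return states, actions, prob


states, actions, prob = sample_trajectory(theta)
```

Only the <code>pi[a]</code> factors depend on <code>theta</code>; the initial-state and transition factors are properties of the environment.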
</br></br>
That's all well and good, but the question becomes: how do we
break down our objective function into something tractable? We
need a way to accurately approximate that expectation, which
in its exact form involves an integral over a probability
distribution defined by our parametrized policy which we don't
have access to. To do this, we can use something called
<b>Monte-Carlo Approximation</b>. The idea is simple and is
predicated on the following fact
$$\lim_{N \to \infty} \frac{1}{N}\sum_{i=1}^N f(x_i) = \mathbb{E}_{x \sim p(x)}[f(x)], \qquad x_i \sim p(x)$$
Thus, if we sample $f(x_i)$, drawing $x_i$ from the probability
distribution $p(x)$, $N$ times where $N$ is large but finite, we
obtain a decent approximation to $\mathbb{E}[f(x)]$. Using
Monte-Carlo approximation, we can rewrite our objective function
as
$$J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \sum_{t} r(s_{i, t}, a_{i, t})$$
where the $N$ samples are being directly drawn from the
probability distribution defined by $\pi_\theta$ simply by
running $\pi_\theta$, $N$ times.
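</br></br>
As a quick sanity check of the Monte-Carlo idea, the sketch below estimates $\mathbb{E}[x^2]$ for $x \sim \mathcal{N}(0, 1)$, whose exact value is $1$, using increasing sample sizes. The helper name <code>mc_expectation</code> and the choice of test function are assumptions made for this illustration; estimating $J(\theta)$ works the same way, with trajectories in place of $x$ and total reward in place of $f$.

```python
import numpy as np


def mc_expectation(f, sampler, n):
    """Approximate E[f(x)] by averaging f over n samples drawn by `sampler`."""
    xs = sampler(n)
    return np.mean(f(xs))


rng = np.random.default_rng(0)
f = lambda x: x ** 2                        # E[x^2] = 1 when x ~ N(0, 1)
sampler = lambda n: rng.standard_normal(n)

# Larger N should typically land closer to the true value of 1.
estimates = {n: mc_expectation(f, sampler, n) for n in (10, 1_000, 100_000)}
for n, est in estimates.items():
    print(f"N = {n:>7}: estimate = {est:.4f}")
```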
</br></br>
Now that we have a tractable objective function, we still need
to determine how best to iteratively update our $\theta$
parameter values so as to arrive at the optimal setting
$\theta^*$. The simplest approach is to perform gradient ascent
on $J(\theta)$ (since we're taking $\arg\max$ over $\theta$).
That means it's time to take some gradients. To simplify
notation a bit, define the reward of a trajectory $\tau$ as
$$r(\tau) = \sum_{t} r(s_t, a_t)$$
Then, writing $\pi_\theta(\tau)$ for the trajectory distribution $p_\theta(\tau)$ defined above, we can rewrite $J(\theta)$ as
$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta(\tau)}[r(\tau)]
= \int \pi_\theta(\tau) r(\tau)\ d\tau$$
To get our gradient ascent update formula, we take the gradient
of $J$ with respect to $\theta$ to get
$$\nabla_\theta J(\theta) = \int \nabla_\theta \pi_\theta(\tau)r(\tau)\ d\tau$$
To get rid of the intractable integral, we can use a clever
substitution. Note that
$$\nabla_\theta \pi_\theta(\tau) = \pi_\theta(\tau)
\frac{\nabla_\theta \pi_\theta(\tau)}{\pi_\theta(\tau)}
= \pi_\theta(\tau) \nabla_\theta \log{\pi_\theta(\tau)}$$
Plugging that into our expression for $\nabla_\theta J(\theta)$
gives
$$\nabla_\theta J(\theta) = \int \nabla_\theta \pi_\theta(\tau) r(\tau)\ d\tau = \int \pi_\theta(\tau) \nabla_\theta \log{\pi_\theta(\tau)}\, r(\tau)\ d\tau = \mathbb{E}_{\tau \sim \pi_\theta(\tau)} \left[\nabla_\theta \log{\pi_\theta(\tau)}\, r(\tau)\right]$$
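The log-derivative identity can be checked numerically on a small discrete example. In the sketch below, a softmax distribution over three outcomes stands in for $\pi_\theta(\tau)$ and a fixed vector <code>r</code> stands in for $r(\tau)$; both are assumptions chosen so that the exact gradient is available in closed form for comparison.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.array([0.5, -0.3, 0.1])          # illustrative logits
r = np.array([1.0, 2.0, 4.0])               # illustrative reward per outcome


def softmax(z):
    z = np.exp(z - z.max())
    return z / z.sum()


pi = softmax(theta)

# Exact gradient of E[r(x)] = sum_x pi(x) r(x) with respect to the logits.
exact_grad = pi * (r - pi @ r)

# Score-function estimate: average of grad_theta log pi(x) * r(x) over samples.
n = 200_000
xs = rng.choice(len(pi), size=n, p=pi)
scores = np.eye(len(pi))[xs] - pi           # grad_theta log pi(x) for a softmax
mc_grad = (scores * r[xs, None]).mean(axis=0)

print("exact:      ", exact_grad)
print("monte carlo:", mc_grad)
```

The two vectors should agree up to sampling noise, even though the estimate never differentiates through the sampling step itself.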
We've turned a gradient of an expectation into an expectation
of a gradient, which is pretty cool, but we need to reduce
things even further. Recall from before that the probability
distribution that $\pi_\theta$ defines over trajectories $\tau$ is
given as
$$\pi_\theta(\tau = s_1, a_1, \ldots, s_T, a_T)
= p(s_1) \prod_{t=1}^T \pi_\theta (a_t \mid s_t) p(s_{t+1}
\mid s_t, a_t)$$
Taking logs of both sides breaks this down into a nice,
convenient sum.
$$\log \pi_\theta(\tau) = \log p(s_1) + \sum_{t} \left[
\log \pi_\theta(a_t \mid s_t) + \log p(s_{t+1} \mid s_t, a_t) \right]$$
Note that $\nabla_\theta \log \pi_\theta(\tau)$ is a term
in our revised expression for $\nabla_\theta J(\theta)$ so
we'd like to take the gradient of the previous formula and
substitute that back in.
\begin{align*}
\nabla_\theta \log{\pi_\theta(\tau)} &= \nabla_\theta \left[ \log{p(s_1)} + \sum_{t}
\left( \log{\pi_\theta(a_t \mid s_t)} + \log{p(s_{t+1} \mid s_t, a_t)} \right) \right] \\
&= \sum_{t} \nabla_\theta \log{\pi_\theta(a_t \mid s_t)}
\end{align*}
We were able to eliminate all but the policy term because the
initial-state distribution $p(s_1)$ and the transition probabilities
$p(s_{t+1} \mid s_t, a_t)$ do not depend on $\theta$.
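</br></br>
As a concrete instance (an assumption for illustration, not something the derivation requires), suppose the policy is a tabular softmax with one parameter $\theta_{s,a}$ per state-action pair:
$$\pi_\theta(a \mid s) = \frac{e^{\theta_{s,a}}}{\sum_{b} e^{\theta_{s,b}}}$$
Then each term of the sum has a closed form. Taking the log and differentiating with respect to $\theta_{s',b}$ gives
\begin{align*}
\log \pi_\theta(a \mid s) &= \theta_{s,a} - \log \sum_{b'} e^{\theta_{s,b'}} \\
\frac{\partial}{\partial \theta_{s',b}} \log \pi_\theta(a \mid s) &= \mathbf{1}[s' = s] \left( \mathbf{1}[b = a] - \pi_\theta(b \mid s) \right)
\end{align*}
so only the parameters of the visited state receive a nonzero gradient: the chosen action's parameter is pushed up by $1 - \pi_\theta(a \mid s)$ and every other action's parameter in that state is pushed down by its own probability.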
</br></br>
Finally, we can plug this whole thing back into our expression
for $\nabla_\theta J(\theta)$ to get
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta(\tau)}
\left[ \left(\sum_{t} \nabla_\theta \log{\pi_\theta}(a_t \mid s_t)\right) \left(\sum_t r(s_t, a_t)\right)\right]$$
Once again, we're left with an expectation. Like before, we
can use Monte-Carlo Approximation to reduce this to a
summation over samples. This gives
$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \left[ \left(\sum_{t} \nabla_\theta \log{\pi_\theta}(a_{i, t} \mid s_{i,t})\right) \left(\sum_t r(s_{i,t}, a_{i,t})\right)\right]$$
Now that we've finally reduced our expression to a usable form,
we can update $\theta$ at each iteration (that is, after each batch of $N$ sampled trajectories) according to the gradient ascent update rule
$$\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$$
Now that we've derived our update rule, we can present the
pseudocode for the REINFORCE algorithm in its entirety.
</br></br>
<b>The REINFORCE Algorithm</b>
<ol>
<li>Sample trajectories $\{\tau_i\}_{i=1}^N$ from $\pi_{\theta}(a_t \mid s_t)$ by running the policy.
<li>Estimate $\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \left(\sum_t \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t})\right) \left(\sum_t r(s_{i,t}, a_{i,t})\right)$
<li>$\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$
</ol>
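As a rough illustration, the three steps above can be sketched in NumPy. Everything about the environment here is a made-up assumption: two states, two actions, a reward of 1 when the action index matches the state index, uniformly random next states, and a tabular softmax policy. The hyperparameters <code>n_iters</code>, <code>n_traj</code>, and <code>alpha</code> are likewise arbitrary choices, and the gradient is averaged over the $N$ trajectories as in the Monte-Carlo estimate derived earlier.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, T = 2, 2, 5            # hypothetical toy problem


def reward(s, a):
    """Toy reward: 1 if the action index matches the state index, else 0."""
    return 1.0 if a == s else 0.0


def softmax(z):
    z = np.exp(z - z.max())
    return z / z.sum()


def sample_trajectory(theta):
    """Run pi_theta for T steps; return sum_t grad log pi and sum_t r."""
    grad_log_pi = np.zeros_like(theta)
    total_reward = 0.0
    s = rng.integers(n_states)
    for _ in range(T):
        pi = softmax(theta[s])
        a = rng.choice(n_actions, p=pi)
        # grad of log softmax w.r.t. theta[s] is onehot(a) - pi.
        grad_log_pi[s] -= pi
        grad_log_pi[s, a] += 1.0
        total_reward += reward(s, a)
        s = rng.integers(n_states)          # toy dynamics: uniform next state
    return grad_log_pi, total_reward


def reinforce(n_iters=200, n_traj=32, alpha=0.1):
    theta = np.zeros((n_states, n_actions))
    for _ in range(n_iters):
        grad = np.zeros_like(theta)
        for _ in range(n_traj):             # step 1: sample trajectories
            g, R = sample_trajectory(theta)
            grad += g * R                   # step 2: accumulate the estimate
        theta += alpha * grad / n_traj      # step 3: gradient ascent update
    return theta


theta = reinforce()
for s in range(n_states):
    print(f"pi(. | s={s}) =", softmax(theta[s]))
```

On this toy problem the learned policy should put most of its probability on the action that matches each state.
</br></br>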
And that's it. While the derivation of the gradient update
rule was relatively complex, the three-step algorithm is
itself conceptually simple. In upcoming tutorials, I'll
show how to improve the REINFORCE algorithm with strategies
that reduce the variance of the gradient estimate.
Wed, 18 Apr 2018 00:00:00 +0000
http://mcneela.github.io/math/2018/04/18/A-Tutorial-on-the-REINFORCE-Algorithm.html