The Personal Website of Daniel McNeela

The personal blog, resume, and portfolio of Daniel McNeela. I am a computer scientist, mathematician, and writer.
http://mcneela.github.io/
Thu, 30 Nov 2023 16:08:26 +0000

<h1 id="why-adding-epsilon-works">Why Adding Epsilon Works</h1>
<p>When students new to machine learning begin their journey, they are often told (without proof) that a good way to
ensure that a matrix $A$ is invertible is to add $\epsilon I$ to it, where $\epsilon$ is some small, positive constant.
Of course, this simply amounts to adding $\epsilon$ along the diagonal of $A$. Making $A$ invertible without significantly
changing its norm is necessary for the success of a variety of ML algorithms that rely on matrix inversion to obtain an optimal
solution to some learning problem, as is the case in linear regression. This “epsilon trick” is used unreservedly by practitioners,
though often without an understanding of <strong>why</strong> it works. Here I give a proof of the validity of this technique using a
theorem from differential geometry known as <em>Sard’s Theorem</em>. Here is the statement of Sard’s Theorem from John Lee’s book on smooth manifolds.</p>
<p><strong>Sard’s Theorem</strong> <em>Suppose</em> $M$ <em>and</em> $N$ <em>are smooth manifolds with or without boundary and</em> $F : M \to N$ <em>is a smooth map. Then the set of critical values of</em> $F$ <em>has measure zero in</em> $N$.</p>
<p>To see how this theorem applies in our case, consider the determinant function, $\det : M_n(\mathbb{R}) \to \mathbb{R}$, where $M_n(\mathbb{R})$ is the space of
real-valued $n \times n$ matrices. It’s an elementary fact that $M_n(\mathbb{R})$ and $\mathbb{R}$ are both smooth manifolds. Since $\det$ can be expressed as a
polynomial function of a matrix’s entries, it is a $C^{\infty}$ smooth function. Now, a matrix is singular if and only if its determinant is zero. The differential
of $\det$ at $A$ is $B \mapsto \operatorname{tr}(\operatorname{adj}(A)B)$, which vanishes exactly when $\operatorname{rank}(A) \leq n - 2$; every such critical point
has determinant zero, so for $n \geq 2$ the only critical value of $\det$ is $0$, consistent with Sard’s Theorem. If we let $X = \left\{A \in M_n(\mathbb{R}) : \det(A) = 0 \right\}$,
then $X$ decomposes into the critical points of $\det$ (a proper algebraic subvariety, hence of measure zero) and the regular points of the level set $\det^{-1}(0)$,
which form an embedded hypersurface of dimension $n^2 - 1$ and therefore also have measure zero. Thus $\mu(X) = 0$, and the complement of $X$,
which is $GL_n(\mathbb{R})$, is dense in $M_n(\mathbb{R})$. Moreover, for a fixed $A$, $\det(A + \epsilon I) = 0$ exactly when $-\epsilon$ is an eigenvalue of $A$,
so setting $A \leftarrow A + \epsilon I$ gives a matrix $A \in GL_n(\mathbb{R})$ for all but at most $n$ values of $\epsilon$, and in particular for almost every $\epsilon > 0$.</p>
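<p>To see the trick in action, here is a quick numerical sanity check (a minimal numpy sketch of my own, not part of the proof):</p>

```python
import numpy as np

# A singular matrix: the second row is twice the first, so det(A) = 0.
A = np.array([[1.0, 2.0],
              [2.0, 4.0]])

eps = 1e-6
A_reg = A + eps * np.eye(2)  # the "epsilon trick"

# det(A + eps*I) = (1 + eps)(4 + eps) - 4 = 5*eps + eps**2, which is nonzero,
# so the regularized matrix is invertible...
A_inv = np.linalg.inv(A_reg)

# ...while the perturbation barely changes the matrix in norm:
# ||A_reg - A||_F = eps * sqrt(2).
perturbation = np.linalg.norm(A_reg - A)
```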
Sat, 25 Nov 2023 00:00:00 +0000
http://mcneela.github.io/machine_learning/2023/11/25/Why-Adding-Eps-Works.html
Introduction to Smooth Manifolds -- Some Solutions
<p><a href="http://mcneela.github.io/math/ism.pdf">Solutions</a></p>
Fri, 19 May 2023 00:00:00 +0000
http://mcneela.github.io/math/2023/05/19/Introduction-to-Smooth-Manifolds.html
<h1 id="planning-for-productivity-in-2023">Planning for Productivity in 2023</h1>
<p>2022 was a pretty good year for me. I accomplished a number of my goals (such as finishing my MS requirements!),
and I improved in a number of personal and professional areas. However, organization and planning have always been a slightly weak
point for me, and I’m working to remedy that this year by investing in a number of productivity tools and systems
that I’m hoping will help me accomplish my goals for 2023! This article is as much for myself as for others as I
trial what does and doesn’t work for me in an effort to build systems that will allow me to be more productive,
grounded, and centered than I’ve ever been!</p>
<h2 id="daily-habits">Daily Habits</h2>
<p>Daily habits form the foundation of long-term goals. In an effort to build better daily habits, I am using the
<a href="https://streaksapp.com/">Streaks App</a>. What I love most about the app is its simplicity. It allows you to set
daily goals and tracks basic statistics on your accomplishment of those goals on a daily, monthly, and yearly
basis. I also love that the app is available as a one-time $5 purchase rather than as an indefinite subscription.
The app caps you at 24 habits, which is honestly a good thing. In the past, I’ve tried to set far too many daily
habits simultaneously, which has led to inconsistency and poor long-term habit adoption. I currently have only
12 daily habits set, 4 of which are passive (e.g. don’t drink coffee). I’m hoping to pare down the remaining 8
habits to the 4 or 5 most essential and only begin to add more once I’ve consistently adhered to those.</p>
<h2 id="fiction-and-nonfiction-reading">Fiction and Nonfiction Reading</h2>
<p>As usual, I plan to track my reading progress on my <a href="https://www.goodreads.com/user/show/44168870-daniel-mcneela">Goodreads</a>.
I’ve set a goal to read 30 books this year, nearly 60% more than the 19 books I managed to read last year.</p>
<h2 id="writing">Writing</h2>
<p>I plan to use <a href="https://www.literatureandlatte.com/scrivener/overview">Scrivener</a> to help me accomplish my daily writing goals.
As I’m working on a novel, Scrivener allows me to set daily word count writing targets as well as global project targets. This
year I’m setting a daily word count target of 1,000 words and a project target of 100,000 words. I also am targeting
writing at least one blog post per week, although that may be difficult to accomplish on top of my other writing-related goals.</p>
<h2 id="academic-literature-and-notetaking">Academic Literature and Notetaking</h2>
<p>I’m using <a href="https://www.notion.so/">Notion</a> to track academic papers and articles that I’d like to read, as well as to
take notes about said articles. I also hope to use it as a miscellaneous tracker for YouTube videos, webpages, and
other educational content that I’d like to consume.</p>
<h2 id="strava">Strava</h2>
<p>As always, I will be using Strava to track my running, biking, weight training, and yoga activities.</p>
<h2 id="letterboxd">Letterboxd</h2>
<p>I’m hoping to make a habit of consuming more meaningful content in 2023.
This means watching meaningful films rather than mindless TV series.
I will be tracking the films I’ve watched (and want to watch) on
<a href="https://letterboxd.com/">Letterboxd</a>.</p>
Wed, 11 Jan 2023 00:00:00 +0000
http://mcneela.github.io/productivity/2023/01/11/Productivity-Tips-2023.html
<h1 id="my-covid-bike-journey">My Covid Bike Journey</h1>
<h2 id="first-attempt-salsa-cutthroat">First Attempt: Salsa Cutthroat</h2>
<p>Like seemingly everyone else, I decided to finally buckle down and buy a bike during Covid times. I had been meaning
to get involved in the triathlon scene for a while and I had been watching a lot of <a href="https://www.youtube.com/channel/UCVcUzl95VwxrIEQnu9xI21g">Ryan Van Duzer</a> videos on Youtube
(great channel btw). I started looking at bikes late, probably around August or September 2020. I wanted my first bike to be something versatile,
preferably a touring or adventure bike as I had gotten really into the bikepacking craze and wanted something that I could ride equally well on
the road or on trails. I had been fantasizing about training for <a href="https://www.youtube.com/watch?v=xqCYE-Smqf4">The Great Divide Mountain Bike Route</a>,
and I eventually found that the Salsa Cutthroat bike had been designed specifically for this route. I was sold. Only problem - it’s a $3000+ bike, and due to
Covid supply chain issues, they seemed to all be sold out.</p>
<p>Well, I gave up on the idea of getting a bike for a while after seeing all the supply shortages, until one fateful day in 2020 I stumbled upon a used 2019 Salsa Cutthroat
at a local bike shop. The great thing is it was priced at a steep discount ($1000 off), and it didn’t seem to be selling. I took it out for a few test rides and I fell in
love. For those who are unaware, the Cutthroat is Salsa’s carbon version of a dropbar mountain bike. It’s basically a Carbon Salsa Fargo designed for adventure racing.</p>
<p>I did notice an issue with this particular bike that made me pause. It was a size “Medium” which apparently corresponded to a 56cm top tube. Based on my height and dimensions,
I should normally ride a 54cm top tube. I did feel a bit stretched out when test riding it, but I brushed away my concerns. I figured that 2cm was not a huge difference, and since
I was new to biking I figured that I didn’t really know what size I was anyway. Surely I could make a 56cm work for me. Plus I knew there was a supply shortage and the chances of
me finding a 54 were slim to none. I bought the bike and brought it home.</p>
<figure>
<img src="/images/salsa_cutthroat.jpg" width="800" />
<figcaption>My Salsa Cutthroat</figcaption>
</figure>
<p>Big mistake. If you’re ever in the bike market, don’t let a great deal sway you into buying a bike that isn’t the right fit for you. I rode the bike a few times but it always
felt a bit off to me. I felt somewhat stretched out and I had basically zero standover (bad news if you ever need to jump or fall off your bike). There were also some flaws
in my idea that I could buy “one bike to rule them all”. Since I don’t have a lot of rideable mountain biking near me, I was riding mostly on roads. However, I could never
get a good workout due to the 1x system on the Cutthroat. I think a 1x sort of makes sense on a racing mountain bike, but personally I prefer 2x systems every time.
I don’t mind a bit of extra weight if it means I get more usable gears, especially if I’m going to be riding on the roads at all. If you’re reading this, Salsa, you should make
a Cutthroat with a 2x drivetrain! Eventually I ended up returning the bike and resigning myself to a life of wheelessness. However, once the supply chain gets back to normal,
I’d love to search around for another Cutthroat or Fargo that fits me properly.</p>
<h2 id="other-adventure-bikes-i-tried">Other Adventure Bikes I Tried</h2>
<p>I tried quite a few Kona bikes, such as the Rove, the Libre, and the Sutra (my favorite). However, they didn’t seem to fit me quite as well as the Salsa did (even though that one didn’t fit either).
I thought the Kona sizing was a bit odd, and I always felt really stretched out on their bikes, to the point where it seemed like I’d be uncomfortable on a longer ride. They still seemed
like great bikes, though, and were likely just not a good match for my body dimensions.</p>
<h2 id="a-new-hope-my-road-bike">A New Hope: My Road Bike</h2>
<p>Well, eventually my itch to bike grew too strong and I expanded my search outside of San Diego. Eventually I found a couple of bike shops in Orange County that had some road bikes I was
interested in. I first went to look at the Specialized Allez which I had tried in San Diego but had hesitated too long on buying and missed out on. I think the Allez is a great entry
level road bike - really solid components for the price point. However, I tried both 52cm and 54cm versions (their 54 is more like a 55), and neither felt quite right. I do have to
compliment the team at Specialized Costa Mesa though - they were super great about helping me try out the bike (they built the 52cm after I put a deposit, specifically so I could try it),
and they gave great fitting advice. I will definitely go back there the next time I’m in need of a Specialized product.</p>
<p>Ultimately though, I found a bike that was a better fit for me at a local Costa Mesa bike shop: the Scott Speedster 20 Disc. I purchased it, and it’s been great to me so far. There are
some minor complaints, but it is an entry level bike after all (I believe I spent $1200), so it’s not going to have all the bells and whistles. But for the price, it’s a steal. The disc
brakes are great, it has a carbon fork (although aluminum frame), and 20 gears (2x10) is more than enough for me at the moment. It’s billed as more of an endurance bike, so the gear
ratio on the cassette is really wide, and I don’t find myself using the lower climbing gears very often. That’s one thing I’m hoping to change: swapping out the cassette for one
with narrower gaps between the gears. All in all it’s been great to me, and the color scheme is absolutely sick! I think Scott bikes are super underrated, and when I do eventually
upgrade I’d love to get the Scott Addict.</p>
<figure>
<img src="/images/scott_speedster.jpg" width="800" />
<figcaption>My Scott Speedster</figcaption>
</figure>
Tue, 10 Aug 2021 00:00:00 +0000
http://mcneela.github.io/bikes/2021/08/10/My-Covid-Bike-Journey.html
<h1>Subgradient Descent</h1>
<p><i>This post was originally featured on my other blog at <a href="https://flamethrower.ai/blog/206669/subgradient-descent">Flamethrower AI</a></i></p>
<h1>Introduction</h1>
<p>You've probably heard of the gradient descent algorithm. It's perhaps the most widely used algorithm
in machine learning, employed to train everything from neural networks to logistic regression. What you probably
didn't know, though, is that it doesn't always work. That's right, gradient descent has some preconditions that
your loss function needs to satisfy in order for the algorithm to run. One of these is that the loss function
must be differentiable. That means that when you need to optimize a loss function that's not differentiable,
such as the L1 loss or hinge loss, you're flat out of luck.</p>
<br/>
<p><i>Or are you?</i> Thankfully, you're actually not. There's a little-known method that's been around for a long
time within the field of convex optimization that uses the notion of <i>subgradients</i> to perform optimization.
We're going to tackle the theory behind it in this post today, then build our own implementation of it in Python
in order to create our own soft-margin SVM classifier with hinge loss to classify text messages as spam or ham.
Let's get started.</p>
<h1>The Theory</h1>
<h2>The Definition of a Subgradient</h2>
<p>A subgradient is a generalization of the gradient which is useful for non-differentiable, but convex, functions.
The basic idea is this: when a function is convex, you can draw a line (or in higher dimensions, a hyperplane)
through any point $(x, f(x))$ on its graph such that the line
underapproximates the function everywhere. This is baked into the definition of convexity. In fact, for many points
$x$ there is more than one such line. A subgradient is the slope of any one of these lines, and it is defined mathematically
as a vector
$$g \in \mathbb{R}^n \text{ such that } f(z) \geq f(x) + g^\top (z - x) \text{ for all } z \in \text{dom}(f)$$
The definition can be a little bit confusing, so let me break it down piece by piece. The vector $g$ is the subgradient,
and the affine function $z \mapsto f(x) + g^\top (z - x)$ traces out the underapproximating hyperplane; equivalently, $(g, -1)$
is a <i>normal vector</i> perpendicular to that hyperplane. The vector $g$ is all that's needed to define
the tilt of the hyperplane, and by varying $z$, we can land at any point on it.</p>
<br/>
<p>Another way of characterizing the subgradient $g$ is via what's called the <i>epigraph</i> of $f$. The epigraph of $f$
is just all the area above and including the graph of $f$. In other words, it's the set you'd get if you colored in above
the graph of $f$. We say that $g$ is a subgradient if the hyperplane it defines <i>supports</i> the epigraph of $f$.
</p>
<br/>
<p>When we say "support" here, it has a particular meaning and is not to be confused with the support of a function in the
analysis sense. More concretely, a hyperplane is said to support a set $S$ if it intersects $S$ in at least one point
(for our purpose, this point is $(x, f(x))$) and the entire remainder of $S$ lies on one side of the hyperplane.
Because the epigraph of a convex function lies on or above each such hyperplane at every point,
this is a natural way to conceptualize things.</p>
<br/>
<h2>Subdifferentials</h2>
<p>Now that we've introduced subgradients, let's move on to a broader concept, that of a <i>subdifferential</i>. We say that
a function $f$ is <i>subdifferentiable</i> if for each point $x \in \text{dom}(f)$ there exists a subgradient $g_x$ at
that point. Now, it's important to note that the subgradient $g_x$ at a point $x$ is not unique. In fact, at any point
$x$, there can exist either 0, 1, or infinitely many such subgradients. We call the set of all subgradients at a point
$x$ the <i>subdifferential</i> of $f$ at that point, and we denote it as $\partial f(x)$. In other words,
$$\partial f(x) = \{g_x \mid f(z) \geq f(x) + g_x^\top (z - x)\quad \forall z \in \text{dom}(f)\}$$
So why talk about subdifferentials? Well, they have some nice properties and integrate nicely with the regular theory of
differentiation. Allow me to explain.</p>
<br/>
<p>First off, the subdifferential $\partial f(x)$ associated with a point $x$ is <b>always</b> a closed, convex set. This
follows from the fact that it can be viewed as an intersection of infinitely many half-spaces, one for each $z$:
$$\partial f(x) = \bigcap_{z \in \text{dom}(f)} \{g_x \mid f(z) \geq f(x) + g_x^\top (z-x)\}$$
This means that problems such as finding a minimum-norm subgradient $g_x$ at a given point $x$ are tractable, because
optimization over closed, convex sets is well understood.</p>
<br/>
<p>The more important point, though, is that the subdifferential satisfies many of the same laws of calculus as the
standard differential. Concretely, much like regular differentials, we can
<ul>
<li>Scale by positive constants - $\partial (\alpha f)(x) = \alpha\, \partial f(x)$ for $\alpha > 0$</li>
<li>Distribute over sums (and, under mild conditions, integrals and expectations) - $\partial (f_1 + f_2)(x) = \partial f_1(x) + \partial f_2(x)$</li>
</ul>
</p>
<br/>
<p>Another important point is the direct connection between gradients and subdifferentials. If a function $f$ is differentiable
at a point $x$, then there exists only one subgradient $g_x$ to $f$ at $x$, and for the subdifferential we have
$$\partial f(x) = \{\nabla f(x)\}$$
In other words, in the differentiable case the subgradient is unique and coincides with the gradient. Thus, in some sense,
the subgradient is a generalization of the gradient that applies in more general situations
(e.g. nondifferentiability).</p>
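<p>To make the correspondence concrete, here's a quick numerical check (my own illustrative sketch) that the gradient of a differentiable convex function satisfies the subgradient inequality:</p>

```python
import numpy as np

# f(x) = x**2 is convex and differentiable, with gradient 2x.
f = lambda x: x ** 2
grad = lambda x: 2 * x

x0 = 1.5
g = grad(x0)  # the unique subgradient at x0

# Subgradient inequality: f(z) >= f(x0) + g * (z - x0) for all z.
# Here the gap equals (z - x0)**2, which is always nonnegative.
zs = np.linspace(-5.0, 5.0, 101)
gap = f(zs) - (f(x0) + g * (zs - x0))
assert np.all(gap >= -1e-9)
```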
<h2>Enough with the Definitions Already</h2>
<p>Okay, let's move on from tedious definitions and see how we can start using subgradients to solve a problem that's
typically intractable with standard, gradient-based methods. For this, let's take a look at the $\ell_1$ norm.</p>
<br/>
<p>The $\ell_1$ norm is defined as
$$\|x\|_1 = \sum_{i=1}^n |x_i|$$
It is a nondifferentiable and convex function of $x$. This means that while we can't tackle its optimization with gradient descent,
we can approach it using the method of subgradients. Let's see how that works.</p>
<br/>
<p>Before we get started, we need to recall a quick fact about subgradients. Namely, suppose we have functions $f_1, \ldots, f_n$,
each of which is convex and subdifferentiable. Then their pointwise maximum $f(x)$, defined as
$$f(x) = \max_{i} f_i(x)$$
is also subdifferentiable and has
$$\partial f(x) = {\bf Co} \left(\bigcup \{\partial f_i(x) \mid f_i(x) = f(x)\} \right)$$
In other words, to find the subdifferential of the max, $f(x)$, at a point $x$ we take the convex hull of the union of the subdifferentials
of each function $f_i$ that attains the pointwise maximum at $x$.
It's a little bit of a convoluted and long-winded statement, but hopefully it will make more sense when we apply it to the
problem of calculating the subgradient of the $\ell_1$ norm.</p>
<br/>
<h3>Applying the Rule</h3>
<p>In order to apply this rule of maximums, we have to find a way to rewrite the $\ell_1$ norm as a pointwise maximum of
convex, subdifferentiable functions. This is actually pretty easy to do. To see why, consider the vector
$$s = [s_1, \ldots, s_n] \quad s_i \in \{-1, 1\}$$
where each of the $s_i$ is either equal to 1 or -1. The idea is that $s^\top x$ is maximized by setting $s_i = -1$ when $x_i < 0$ and $s_i = 1$
when $x_i > 0$; when $x_i = 0$, either choice works. Thus, we have
$$\|x\|_1 = \max \{s^\top x \mid s_i \in \{-1, 1\} \}$$
Since we've rewritten the $\ell_1$ norm as a max of functions, we can now apply the rule to find its subgradient.
Since each of the $f_i$ just has the linear form $s^\top x$, each is differentiable with unique (sub)gradient $s$.
Therefore the subdifferential of $f = \max_i f_i$ is the convex hull of the maximizing $s$ vectors, which can be
expressed as
$$\{g \mid \|g\|_\infty \leq 1,\ g^\top x = \|x\|_1\}$$
</p>
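<p>We can verify this characterization numerically with a short snippet (my own sketch, not from the original post), using the standard choice $g = \text{sign}(x)$:</p>

```python
import numpy as np

def l1_subgradient(x):
    # sign(x) picks a maximizing s; at coordinates where x_i == 0 it
    # returns 0, which lies in the allowed interval [-1, 1].
    return np.sign(x)

x = np.array([3.0, -1.5, 0.0])
g = l1_subgradient(x)

# g satisfies the two membership conditions derived above:
assert np.max(np.abs(g)) <= 1.0            # ||g||_inf <= 1
assert np.isclose(g @ x, np.abs(x).sum())  # g^T x = ||x||_1
```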
<h2>The Subgradient Method</h2>
<p>Now, we arrive at our culminating theoretical moment, that of applying subgradients to problems of convex optimization.
We do so using what's called the <i>subgradient method</i> which looks almost identical to gradient descent. The algorithm
is an iteration which asserts that we make steps according to
$$x^{(k+1)} = x^{(k)} - \alpha_k g^{(k)}$$
where $\alpha_k$ is our learning rate. There are a few key differences when compared with gradient descent though. The first
of these is that our step sizes $\alpha_k$ must be fixed in advance and not determined dynamically on the fly. The next is that the subgradient
method is not a true descent method. This is because each subsequent step $x^{(k+1)}$ is not guaranteed to decrease our objective
function value. In general, we keep track of our objective values and pick their minimum at the end of the iteration, i.e. we set
$$f^{\star} = \min_k \{f^{(k)} = f(x^{(k)})\}$$</p>
<p>To see why the objective function value is not guaranteed to decrease with each iterate, we need only look to the definition
of a subgradient. Namely, we have
$$f(x^{(k+1)}) = f(x^{(k)} - \alpha_k g^{(k)}) \geq f(x^{(k)}) + {g^{(k)}}^\top(x^{(k)} - \alpha_k g^{(k)} - x^{(k)}) =
f(x^{(k)}) - \alpha_k {g^{(k)}}^\top g^{(k)} = f(x^{(k)}) - \alpha_k \|g^{(k)}\|_2^2$$
Since $f(x^{(k+1)})$ can be any value greater than $f(x^{(k)}) - \alpha_k \|g^{(k)}\|_2^2$, including $f(x^{(k)}) + C$ for
some positive constant $C$, it's clear that the subgradient method is not a guaranteed descent method. Moreover, the fact that
the decrease in $f$ at each step is at most $\alpha_k \|g^{(k)}\|_2^2$ speaks to some of the
slowness of the method. That said, with a suitable step-size schedule the algorithm is guaranteed to converge to within $\epsilon$ of the optimum in some semi-reasonable
amount of time, so it's not all bad. One just needs to keep in mind the caveats of the method.</p>
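<p>Here's what a minimal implementation of the vanilla subgradient method can look like (a sketch of my own, with diminishing step sizes $\alpha_k = 1/(k+1)$ chosen for illustration), applied to minimizing the $\ell_1$ norm:</p>

```python
import numpy as np

def subgradient_method(f, subgrad, x0, steps=2000):
    # Step sizes alpha_k = 1 / (k + 1) are fixed in advance.
    x = x0.copy()
    f_best, x_best = f(x), x.copy()
    for k in range(steps):
        x = x - (1.0 / (k + 1)) * subgrad(x)
        # Not a descent method: record the best iterate seen so far.
        if f(x) < f_best:
            f_best, x_best = f(x), x.copy()
    return x_best, f_best

# Minimize f(x) = ||x||_1, whose minimum value is 0 at the origin.
# sign(x) is a valid subgradient of the l1 norm at every point.
f = lambda x: np.abs(x).sum()
x_best, f_best = subgradient_method(f, np.sign, np.array([2.0, -3.0]))
```

<p>Note that the iterates themselves oscillate around the optimum, which is exactly why we return the best iterate seen rather than the last one.</p>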
<h1>Let's Get Practical</h1>
<p>Now that we've finally elucidated most of the math behind subgradients, let's start applying them to a real world problem. In this
exercise, we'll devise a soft-margin SVM to classify text messages as either "spam" or "ham". We'll then train it using hinge loss
and the subgradient method and implement the entire thing in Python code, from scratch. Let's get started.</p>
<h2>An SVM-like Classifier</h2>
<p>I'm not going to go into the entire theory of SVMs as that is worthy of a series of posts all on its own. However, I will walk through
the basics of how the classifier will work, and then jump into the coding of the subgradient algorithm.</p>
<p>In short, an SVM uses a hyperplane of the form $\mathbf{w}^\top\mathbf{x} + b = 0$ to classify points of a training set into one of two classes.
Points that lie on one side of the hyperplane are assigned to one class, whereas points that lie on the other side are assigned to the other.
Mathematically, this classification can be represented as the function
$$y = \text{sign}(\mathbf{w}^\top\mathbf{x} + b)$$
The model is trained in such a way as to maximize the <i>margin</i> that the hyperplane decision boundary produces. Roughly speaking,
this is the spacing between the decision boundary and the closest point on either side.</p>
<p>When the data in the training set is linearly separable, we can classify it perfectly using an SVM decision boundary and the definition
of margin. However, rarely in life does our data come to us with perfect separation between the classes. As such, we allow the SVM
to make some mistakes while still aiming to classify most of the points correctly and maximize the margin. For this purpose, we introduce
<i>slack variables</i> and call the resultant model a <i>soft-margin SVM</i>. To learn more about all of these architecture components,
I encourage you to read any of the many great articles available online or in textbooks.</p>
<p>In this lesson, we're going to train a variant of a soft-margin SVM. Soft-margin SVMs are trained using the <i>hinge loss</i>, which is defined
mathematically as
$$\ell(y, t) = \max (0, 1 - ty)$$
where $y = \mathbf{w}^\top\mathbf{x} + b$ is our model's prediction and $t$ is the target output value. This loss function is not differentiable
at the kink where $ty = 1$, so you know what that means? That's right, it's time for the subgradient method to shine! To use it, we need to calculate the subgradient of this loss function.</p>
<h3>The Hinge Loss Subgradient</h3>
<p>In order to train the model via the subgradient method we'll need to know what the subgradients of the hinge loss actually are. Let's calculate
that now. Since the hinge loss is piecewise differentiable, this is pretty straightforward. We have
$$\frac{\partial}{\partial w_i} \left(1 - t(\mathbf{w}^\top\mathbf{x} + b)\right) = -tx_i$$
and
$$\frac{\partial}{\partial w_i}\, 0 = 0$$
The first subgradient holds for $ty < 1$ and the second holds otherwise.</p>
<h3>The Code</h3>
<p>Okay, now with the math out of the way, let's get to some of the code. For this task we'll be classifying text messages as "ham" or "spam" using
the data available <a href="https://www.kaggle.com/uciml/sms-spam-collection-dataset">here</a>. Download the file "spam.csv" and extract it to a
location of your choice. I created a subfolder in my code directory called <code>data</code> and placed it there.</p>
<h4>Loading the Data</h4>
<p>Let's create a function to load and transform the data. We'll use the scikit-learn <code>CountVectorizer</code> class to create
a bag-of-words representation for each input text message in the training data set. We'll also use some of the <code>gensim</code>
preprocessing utilities to help clean up the inputs. I created the following functions which do all of the above.</p>
<pre><code class="language-python">import numpy as np
from gensim.parsing.preprocessing import preprocess_string
from sklearn.feature_extraction.text import CountVectorizer

def clean_text(l):
    fields = l.strip().split(',')
    return preprocess_string(fields[1])

def get_texts(lines):
    return list(map(clean_text, lines))

def convert_label(l, hs_map):
    fields = l.strip().split(',')
    key = preprocess_string(fields[0])[0]
    return hs_map[key]

def get_labels(lines, hs_map):
    return list(map(lambda x: convert_label(x, hs_map), lines))

def load_data(file='data/spam.csv'):
    lines = open(file, 'r', encoding='ISO-8859-1').readlines()
    lines = lines[1:]  # remove header line
    hs_map = {'ham': 1, 'spam': -1}
    y = get_labels(lines, hs_map)
    texts = get_texts(lines)
    texts = [' '.join(x) for x in texts]
    bow = CountVectorizer()
    X = bow.fit_transform(texts)
    return X, np.array(y)</code></pre>
<p>We make generous use of the Python <code>map</code> functionality to map various preprocessing functionality across our strings. The actual
bag of words vectorization is a simple one-line call thanks to sklearn's <code>CountVectorizer</code>. Note that <code>CountVectorizer</code>
returns a scipy <code>sparse matrix</code> which will introduce some caveats into our training code. More on that in a bit.</p>
<p>Next, let's start implementing the loss functions. This is pretty simple with <code>numpy</code>. We have,</p>
<pre><code class="language-python">def hinge_loss(t, y):
    return np.maximum(0, 1 - t * y)

def hinge_subgrad(t, y, x):
    if t * y < 1:
        subgrad = (-t * x).toarray()
    else:
        subgrad = np.zeros(x.shape)
    return subgrad</code></pre>
<p>We have to call the <code>.toarray()</code> method in the first clause of <code>hinge_subgrad</code> due to the fact that <code>x</code>
will be a sparse matrix. This method just turns <code>x</code> into a regular numpy array. Note also that we use <code>np.maximum</code>
rather than <code>np.max</code>: <code>np.maximum</code> is the elementwise counterpart of <code>np.max</code>.</p>
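<p>To see the difference concretely (a quick snippet of my own):</p>

```python
import numpy as np

a = np.array([-2.0, 0.5, 3.0])

# np.max reduces an array to its single largest entry.
m = np.max(a)  # 3.0

# np.maximum compares elementwise, broadcasting the scalar 0, which is
# exactly the per-example hinge clamp max(0, .).
clamped = np.maximum(0, a)  # array([0. , 0.5, 3. ])
```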
<p>Now, the hinge loss as we've implemented it only calculates the loss for a single training example $\{(x, t)\}$. We want to add in a
function which aggregates the loss across all examples in the training set. We can do that with the following function</p>
<pre><code class="language-python">def loss(w, X, y):
    preds = X @ w
    losses = [hinge_loss(t, y_) for t, y_ in zip(y, preds)]
    return np.mean(losses)</code></pre>
<p>Finally, we'll add a couple of functions that handle making predictions for us given <code>w</code> and <code>x</code>. Note, I'm not
including a variable <code>b</code> in either of these. That's because we can fold <code>b</code> into the model by appending a constant
feature of value 1 to <code>x</code> and a corresponding bias entry to <code>w</code>.</p>
<pre><code class="language-python">def predictor(w, x):
    return x @ w

def predict(w, X):
    preds = X @ w
    z = (preds > 0).astype(int)
    z[z == 0] = -1
    return z</code></pre>
<p>The function <code>predictor</code> takes in a single training value <code>x</code> and the weight vector <code>w</code> and returns an
unnormalized prediction <code>y = x @ w</code> which is fed to our <code>hinge_loss</code> function. On the other hand, <code>predict</code>
takes in an entire batch of training inputs <code>X</code> and returns an output vector of 1's and -1's. We'll use this to calculate the
accuracy of our model during both training and the final runtime.</p>
<p>Finally, we need a simple method that we can use to initialize our weight vector <code>w</code> at the start of training.</p>
<pre><code class="language-python">def init_w(x):
    return np.random.randn(x.shape[1])</code></pre>
<h3>Coding the Subgradient Method</h3>
<p>Finally, we arrive at our pinnacle moment, that of coding up the subgradient method. The code for this method is a bit involved, but it
contains some design choices that are interesting to consider. Let's take a look at the finished product first, then walk through it step
by step.</p>
<pre><code class="language-python">import sys

def subgrad_descent(targets, inputs, w, eta=0.5, eps=.001):
    curr_min = sys.maxsize
    curr_iter, curr_epoch = 0, 0
    while True:
        curr_epoch += 1
        idxs = np.arange(targets.shape[0])
        np.random.shuffle(idxs)
        targets = targets[idxs]
        inputs = inputs[idxs]
        for i, (t, x) in enumerate(zip(targets, inputs)):
            curr_iter += 1
            if curr_iter % 100 == 0:
                preds = predict(w, inputs)
                curr_acc = np.mean(preds == targets)
                converged = curr_acc > .95
                if converged:
                    return w, inputs, targets
                print(f"Current epoch: {curr_epoch}")
                print(f"Running iter: {curr_iter}")
                print(f"Current loss: {curr_min}")
                print(f"Current acc: {curr_acc}\n")
            y = predictor(w, x)[0]
            subgrad = hinge_subgrad(t, y, x)
            w_test = np.squeeze(w - eta * subgrad)
            obj_val = loss(w_test, inputs, targets)
            if obj_val < curr_min:
                curr_min = obj_val
                w = w_test</code></pre>
<p>Now, to start, I'll say that my method is subtly different from the subgradient method I detailed in the mathematical walkthrough. That's
because in that version, all objective function values $f(w^{(k)})$ are kept track of from start to end and the minimum is taken at the end of the
iteration. I take a slightly different approach. I start with an iterate $w^{(k)}$ and I only step to $w^{(k+1)}$ if it decreases the current
minimum loss seen across the dataset. This seems to work well, allowing me to achieve greater than 95% accuracy once the model finishes training,
but I encourage you to try out both approaches and see which works best. You'll see that I start with a current objective function value of
<code>sys.maxsize</code>. This is a very large integer (the maximum size of a Python container), so any subsequent objective values are guaranteed to be less
than it. Next, I iterate <code>while True</code> and only break from the iteration once I achieve some desired accuracy on the training
set (here I have this set to 95%). Every 100 iterations, I evaluate the current model on the training set and see if it achieves this threshold.
Note, due to time constraints, I'm being a bit careless and not creating validation sets, etc. That's because the purpose of this tutorial is to
demonstrate the viability of the subgradient method, not to serve as an example of training-procedure best practices. I encourage you to clean
things up in your own code. It's a great learning exercise.</p>
<p>Now, the method shuffles the dataset at the start of each epoch. You can see this in the lines</p>
<pre><code class="language-python">curr_epoch += 1
idxs = np.arange(targets.shape[0])
np.random.shuffle(idxs)
targets = targets[idxs]
inputs = inputs[idxs]</code></pre>
<p>This is a simple shuffling method based on randomizing the indices into the dataset. While sklearn has built-in shuffling methods, I try to rely
on them as little as possible because, quite frankly, what's the fun in having some external library handle everything? The Flamethrower AI ethos is
built on learning by building everything from scratch, so that's what I'm aiming to do here.</p>
<p>Next, I iterate over each example in the training set. In effect, I'm performing <i>stochastic</i> subgradient descent.
At each step, I get the unnormalized prediction <code>y = predictor(w, x)[0]</code> using the current weight vector <code>w</code> and calculate
the corresponding <code>subgrad</code>. I then check what the updated weight vector as calculated by the subgradient method would be (given here
as <code>w_test = np.squeeze(w - eta * subgrad)</code>) and evaluate the loss that that updated vector achieves across the <i>entire</i> dataset.
If that <code>obj_val</code> is less than <code>curr_min</code>, I accept <code>w_test</code> as the new <code>w</code> and update the running minimum. Otherwise, I skip the update. Either
way, that completes the iteration, and the next round of updates is subsequently computed in the same manner, ad infinitum, until the desired accuracy
is reached.</p>
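<p>The helper functions (<code>predict</code>, <code>predictor</code>, <code>loss</code>, <code>hinge_subgrad</code>) aren't reproduced in this excerpt. As a reference point, here is one plausible sketch of <code>hinge_subgrad</code>, assuming the standard hinge loss $\max(0, 1 - ty)$; treat it as my reconstruction, not the course's actual helper.</p>

```python
import numpy as np

def hinge_subgrad(t, y, x):
    # Subgradient of max(0, 1 - t*y) with respect to w, where y = w @ x
    # is the unnormalized prediction. Hypothetical reconstruction of the
    # helper called by subgrad_descent above.
    if t * y < 1:
        return -t * x           # hinge is active: subgradient is -t*x
    return np.zeros_like(x)     # hinge is flat here: 0 is a valid subgradient
```

<p>At the kink $ty = 1$, any convex combination of these two return values is also a valid subgradient, which is exactly why this is a <i>sub</i>gradient method rather than a gradient method.</p>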
<h1>Conclusion</h1>
<p>That about sums up the subgradient method. I was able to reach >95% accuracy with my hand-coded method, and I'm sure it's even possible to do better
than that with some simple tweaks to the data normalization process, refining the training updates, fiddling with learning rates, etc. However, I
think what I have here provides a solid demonstration of the workability of this method. Note that training <i>will be slow.</i> This is for a couple of
reasons. First, the rate of convergence of the subgradient method is on the slower side. Second, we're implementing this in pure Python, so we're
certainly not going to be breaking any speed records here. Don't get discouraged if you compare this method against the built-in Sklearn SVM. The
sklearn code is actually just a thin wrapper over <code>libsvm</code> which is implemented entirely in heavily optimized C++, so naturally it's going to
be a lot faster. That said, this method converges relatively quickly on this dataset. I was getting good results in 5-10 minutes, certainly a lot less
time than you'd have to wait to train your favorite deep learning classifier on a really large dataset (not this one). Anyway, I hope this tutorial
introduced you to some of the fun of the subgradient method and advanced optimization techniques. If you want to get access to the repository with the
full solutions code as well as access to more advanced tutorials like this one, then I encourage you to sign up for a full Flamethrower AI membership.
It's only $20/month, and you get access to all our current and future courses for that one, low, monthly price. Cheers!</p>
Fri, 24 Apr 2020 00:00:00 +0000
http://mcneela.github.io/machine_learning/2020/04/24/Subgradient-Descent.html
http://mcneela.github.io/machine_learning/2020/04/24/Subgradient-Descent.html
machine_learning
<h1>Build Your Own Deep Learning Library, From Scratch</h1>
Have you ever wondered how deep learning libraries like PyTorch and Tensorflow actually work? If you're like me,
this question has probably been tugging at you for a while, and there isn't really any material online that teaches
you about these libraries' internals, outside of some obscure research papers and short of diving directly into the code.
Late last year, I had finally had enough of wondering, and I embarked on a quest to dispel the mystery of these deep learning
frameworks once and for all. I learned how they actually work, and I distilled all of the knowledge I gained into my new
course at <a href="http://flamethrower.ai">http://flamethrower.ai</a>. In this course, you'll learn about advanced deep learning
concepts like automatic differentiation, hardware-level optimizations, regularization techniques, Maximum a Posteriori estimation, and so much
more. And you'll do all of this by building your very own deep learning library, COMPLETELY FROM SCRATCH! It's the first course of
its kind, and one that will prepare you exceedingly well for a career as a data scientist or machine learning engineer. It's truly
a one of a kind course and learning experience.
</br></br>
What's more, I plan to continue researching advanced deep learning tools and topics that haven't been covered in other online courses
and create courses based on their implementation. If you sign up for the site now, you'll gain access to all these new courses as I
develop them, all for one low, monthly price. It's an unbeatable offer that you won't find anywhere else.
</br></br>
As a thank you for reading my blog, I'm going to give 25% off a Flamethrower AI membership in perpetuity to the first 30 people who email
me at <a href="mailto:hello@flamethrower.ai">hello@flamethrower.ai</a> with the subject line "Flamethrower AI Blog Discount".
I truly hope you're as excited for this course as I am, and I look forward to teaching you all about the advanced deep learning concepts
you never knew about.
</br></br>
All the best,
</br>
Daniel
Sat, 11 Apr 2020 00:00:00 +0000
http://mcneela.github.io/machine_learning/2020/04/11/Build-Your-Own-Deep-Learning-Library.html
http://mcneela.github.io/machine_learning/2020/04/11/Build-Your-Own-Deep-Learning-Library.html
machine_learning
<h1 id="writing-your-own-optimizers-in-pytorch">Writing Your Own Optimizers in PyTorch</h1>
<p>This article will teach you how to write your own optimizers in PyTorch - you know the kind, the ones where you can write something like</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>optimizer = MySOTAOptimizer(my_model.parameters(), lr=0.001)
for epoch in epochs:
    for batch in epoch:
        outputs = my_model(batch)
        loss = loss_fn(outputs, true_values)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
</code></pre></div></div>
<p>The great thing about PyTorch is that it comes packaged with a great standard library of optimizers that will cover all of your garden variety machine learning needs.
However, sometimes you’ll find that you need something just a little more specialized. Maybe you wrote your own optimization algorithm that works particularly well
for the type of problem you’re working on, or maybe you’re looking to implement an optimizer from a recently published research paper that hasn’t yet made its way
into the PyTorch standard library. No matter. Whatever your particular use case may be, PyTorch allows you to write optimizers quickly and easily, provided you know
just a little bit about its internals. Let’s dive in.</p>
<h2 id="subclassing-the-pytorch-optimizer-class">Subclassing the PyTorch Optimizer Class</h2>
<p>All optimizers in PyTorch need to inherit from <code class="language-plaintext highlighter-rouge">torch.optim.Optimizer</code>. This is a base class which handles all general optimization machinery. Within this class,
there are two primary methods that you’ll need to override: <code class="language-plaintext highlighter-rouge">__init__</code> and <code class="language-plaintext highlighter-rouge">step</code>. Let’s see how it’s done.</p>
<h3 id="the-init-method">The <code class="language-plaintext highlighter-rouge">__init__</code> Method</h3>
<p>The <code class="language-plaintext highlighter-rouge">__init__</code> method is where you’ll set all configuration settings for your
optimizers. Your <code class="language-plaintext highlighter-rouge">__init__</code> method must take a <code class="language-plaintext highlighter-rouge">params</code> argument which specifies
an iterable of parameters that will be optimized. This iterable must have a
deterministic ordering - the user of your optimizer shouldn’t pass in something
like a dictionary or a set. Usually a list of <code class="language-plaintext highlighter-rouge">torch.Tensor</code> objects is given.</p>
<p>Other typical parameters you’ll specify in the <code class="language-plaintext highlighter-rouge">__init__</code> method include
<code class="language-plaintext highlighter-rouge">lr</code> (the learning rate), <code class="language-plaintext highlighter-rouge">weight_decay</code>, and <code class="language-plaintext highlighter-rouge">betas</code> for Adam-based optimizers,
etc.</p>
<p>The <code class="language-plaintext highlighter-rouge">__init__</code> method should also perform some basic checks on passed in
parameters. For example, an exception should be raised if the provided learning
rate is negative.</p>
<p>In addition to <code class="language-plaintext highlighter-rouge">params</code>, the <code class="language-plaintext highlighter-rouge">Optimizer</code> base class requires a parameter called
<code class="language-plaintext highlighter-rouge">defaults</code> on initialization. This should be a dictionary mapping parameter
names to their default values. It can be constructed from the kwarg parameters
collected in your optimizer class’ <code class="language-plaintext highlighter-rouge">__init__</code> method. This will be important in
what follows.</p>
<p>The last step in the <code class="language-plaintext highlighter-rouge">__init__</code> method is a call to the <code class="language-plaintext highlighter-rouge">Optimizer</code> base class.
This is performed by calling <code class="language-plaintext highlighter-rouge">super()</code> using the following general signature.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>super(YourOptimizerName, self).__init__(params, defaults)
</code></pre></div></div>
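<p>Putting those requirements together, a minimal <code class="language-plaintext highlighter-rouge">__init__</code> might look like the following sketch. The class name <code class="language-plaintext highlighter-rouge">MySOTAOptimizer</code> and its hyperparameters are placeholders, not a real algorithm.</p>

```python
import torch

class MySOTAOptimizer(torch.optim.Optimizer):
    def __init__(self, params, lr=1e-3, weight_decay=0.0):
        # Basic validity checks on the provided hyperparameters.
        if lr < 0.0:
            raise ValueError(f"Invalid learning rate: {lr}")
        if weight_decay < 0.0:
            raise ValueError(f"Invalid weight_decay value: {weight_decay}")
        # Collect the kwargs into the defaults dict the base class expects.
        defaults = dict(lr=lr, weight_decay=weight_decay)
        super(MySOTAOptimizer, self).__init__(params, defaults)
```

<p>The base class then populates <code class="language-plaintext highlighter-rouge">param_groups</code> from <code class="language-plaintext highlighter-rouge">params</code>, filling in any unspecified per-group settings from <code class="language-plaintext highlighter-rouge">defaults</code>.</p>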
<h2 id="implementing-a-novel-optimizer-from-scratch">Implementing a Novel Optimizer from Scratch</h2>
<p>Let’s investigate and reinforce the above methodology using an example taken
from the HuggingFace <code class="language-plaintext highlighter-rouge">pytorch-transformers</code> NLP library. They implement a PyTorch
version of a weight decay Adam optimizer from the BERT paper. First we’ll take a
look at the class definition and <code class="language-plaintext highlighter-rouge">__init__</code> method. Here are both combined.</p>
<p><img src="/images/adamw-init.png" style="height: 75%; width: 75%" /></p>
<p>You can see that the <code class="language-plaintext highlighter-rouge">__init__</code> method accomplishes all the basic requirements
listed above. It implements basic checks on the validity of all provided <code class="language-plaintext highlighter-rouge">kwargs</code>
and raises exceptions if they are not met. It also constructs a dictionary of
defaults from these required parameters. Finally, the <code class="language-plaintext highlighter-rouge">super()</code> method is called
to initialize the <code class="language-plaintext highlighter-rouge">Optimizer</code> base class using the provided <code class="language-plaintext highlighter-rouge">params</code> and <code class="language-plaintext highlighter-rouge">defaults</code>.</p>
<h3 id="the-step-method">The step() Method</h3>
<p>The real magic happens in the <code class="language-plaintext highlighter-rouge">step()</code> method. This is where the optimizer’s logic
is implemented and enacted on the provided parameters. Let’s take a look at how
this happens.</p>
<p>The first thing to note in <code class="language-plaintext highlighter-rouge">step(self, closure=None)</code> is the presence of the
<code class="language-plaintext highlighter-rouge">closure</code> keyword argument. If you consult the PyTorch documentation, you’ll
see that <code class="language-plaintext highlighter-rouge">closure</code> is an optional callable that allows you to reevaluate the
loss at multiple time steps. This is unnecessary for most optimizers, but is
used in a few such as Conjugate Gradient and LBFGS. According to the docs,
“the closure should clear the gradients, compute the loss, and return it”.
We’ll leave it at that, since a closure is unnecessary for the <code class="language-plaintext highlighter-rouge">AdamW</code> optimizer.</p>
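<p>For completeness, here's a tiny self-contained illustration of the closure pattern with <code class="language-plaintext highlighter-rouge">torch.optim.LBFGS</code>; the quadratic objective is invented purely for demonstration.</p>

```python
import torch

# Minimize f(w) = w^2 with LBFGS, which requires a closure.
w = torch.tensor([5.0], requires_grad=True)
optimizer = torch.optim.LBFGS([w], lr=0.5)

def closure():
    optimizer.zero_grad()       # clear the gradients
    loss = (w ** 2).sum()       # compute the loss
    loss.backward()
    return loss                 # ...and return it, as the docs prescribe

for _ in range(10):
    # LBFGS may invoke the closure several times within a single step
    optimizer.step(closure)
```

<p>After these steps, <code class="language-plaintext highlighter-rouge">w</code> is driven toward the minimizer at zero.</p>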
<p>The next thing you’ll notice about the <code class="language-plaintext highlighter-rouge">AdamW</code> step function is that it iterates
over something called <code class="language-plaintext highlighter-rouge">param_groups</code>. The optimizer’s <code class="language-plaintext highlighter-rouge">param_groups</code> is a list
of dictionaries which gives a simple way of breaking a model’s parameters into
separate components for optimization. It allows the trainer of the model to
segment the model parameters into separate units which can then be optimized
at different times and with different settings. One use for multiple <code class="language-plaintext highlighter-rouge">param_groups</code>
would be in training separate layers of a network using, for example, different
learning rates. Another prominent use case arises in transfer learning. When
fine-tuning a pretrained network, you may want to gradually unfreeze layers
and add them to the optimization process as finetuning progresses. For this,
<code class="language-plaintext highlighter-rouge">param_groups</code> are vital. Here’s an example given in the PyTorch documentation
in which <code class="language-plaintext highlighter-rouge">param_groups</code> are specified for SGD in order to separately tune the
different layers of a classifier.</p>
<p><img src="/images/param-groups.png" style="height: 75%; width: 75%" /></p>
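<p>In case the screenshot doesn't render for you, the gist of that docs example can be reproduced like this, using a hypothetical two-part model:</p>

```python
import torch.nn as nn
import torch.optim as optim

# Stand-in model; `base` and `classifier` are placeholder submodules
# mirroring the structure in the docs' example.
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.base = nn.Linear(10, 5)
        self.classifier = nn.Linear(5, 2)

model = Net()
optimizer = optim.SGD([
    {'params': model.base.parameters()},                    # inherits the default lr
    {'params': model.classifier.parameters(), 'lr': 1e-3},  # overrides it
], lr=1e-2, momentum=0.9)
```

<p>Any key missing from a group's dict is filled in from the optimizer-wide defaults.</p>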
<p>Now that we’ve covered some things specific to the PyTorch internals, let’s get
to the algorithm. Here’s a link to
<a href="https://arxiv.org/pdf/1711.05101.pdf">the paper</a>
which originally proposed the AdamW algorithm. And here, from the paper, is a
screenshot of the proposed update rules.</p>
<p><img src="/images/adamw-details.png" style="display: block; margin-left: auto; margin-right: auto; width: 75%" /></p>
<p>Let’s go through this line by line with the source code. First, we have the
loop</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>for p in group['params']:
</code></pre></div></div>
<p>Nothing mysterious here. For each of our parameter groups, we’re iterating over
the parameters within that group. Next.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>if p.grad is None:
    continue
grad = p.grad.data
if grad.is_sparse:
    raise RuntimeError('Adam does not support sparse gradients, please consider SparseAdam instead')
</code></pre></div></div>
<p>This is all simple stuff as well. If there is no gradient for the current
parameter, we just skip it. Next, we get the actual plain Tensor object for
the gradient by accessing <code class="language-plaintext highlighter-rouge">p.grad.data</code>. Finally, if the tensor is sparse, we
raise an error because we are not going to consider implementing this for sparse
objects.</p>
<p>Next, we access the current optimizer state with</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>state = self.state[p]
</code></pre></div></div>
<p>In PyTorch optimizers, the <code class="language-plaintext highlighter-rouge">state</code> is simply a dictionary associated with the
optimizer that holds the current configuration of all parameters.</p>
<p>If this is the first time we’ve accessed the state of a given parameter, then we
set the following defaults</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>if len(state) == 0:
    state['step'] = 0
    # Exponential moving average of gradient values
    state['exp_avg'] = torch.zeros_like(p.data)
    # Exponential moving average of squared gradient values
    state['exp_avg_sq'] = torch.zeros_like(p.data)
</code></pre></div></div>
<p>We obviously start with step 0, along with zeroed out exponential average and
exponential squared average parameters, both the shape of the gradient tensor.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>exp_avg, exp_avg_sq = state['exp_avg'], state['exp_avg_sq']
beta1, beta2 = group['betas']
state['step'] += 1
</code></pre></div></div>
<p>Next, we gather the parameters from the state dict that will be used in the
computation of the update. We also increment the current step.</p>
<p>Now, we begin the actual updates. Here’s the code.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Decay the first and second moment running average coefficient
# In-place operations to update the averages at the same time
exp_avg.mul_(beta1).add_(1.0 - beta1, grad)
exp_avg_sq.mul_(beta2).addcmul_(1.0 - beta2, grad, grad)
denom = exp_avg_sq.sqrt().add_(group['eps'])
step_size = group['lr']
if group['correct_bias']:  # No bias correction for Bert
    bias_correction1 = 1.0 - beta1 ** state['step']
    bias_correction2 = 1.0 - beta2 ** state['step']
    step_size = step_size * math.sqrt(bias_correction2) / bias_correction1
p.data.addcdiv_(-step_size, exp_avg, denom)
# Just adding the square of the weights to the loss function is *not*
# the correct way of using L2 regularization/weight decay with Adam,
# since that will interact with the m and v parameters in strange ways.
#
# Instead we want to decay the weights in a manner that doesn't interact
# with the m/v parameters. This is equivalent to adding the square
# of the weights to the loss with plain (non-momentum) SGD.
# Add weight decay at the end (fixed version)
if group['weight_decay'] > 0.0:
    p.data.add_(-group['lr'] * group['weight_decay'], p.data)
</code></pre></div></div>
<p>The above code corresponds to equations 6-12 in the algorithm implementation from
the paper. Following along with the math should be easy enough. What I’d like to
take a closer look at is the built in Tensor methods that allow us to do the
in-place computations.</p>
<p>A nice, relatively hidden feature of PyTorch which you might not be aware of is
that you can access any of the standard PyTorch functions, e.g. <code class="language-plaintext highlighter-rouge">torch.add()</code>,
<code class="language-plaintext highlighter-rouge">torch.mul()</code>, etc. as in-place operations on the Tensors directly by appending
an <code class="language-plaintext highlighter-rouge">_</code> to the method name. Thus, taking a closer look at the first update, we
find we can quickly compute it as</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>exp_avg.mul_(beta1).add_(1.0 - beta1, grad)
</code></pre></div></div>
<p>rather than</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>exp_avg = torch.add(torch.mul(exp_avg, beta1), torch.mul(grad, 1.0 - beta1))
</code></pre></div></div>
<p>Of course, there are a few special operations used here with which you may not
be familiar, for example, <code class="language-plaintext highlighter-rouge">Tensor.addcmul_</code> and <code class="language-plaintext highlighter-rouge">Tensor.addcdiv_</code>. These take the
input tensor and add to it either the product or the quotient, respectively, of the two
latter inputs. If you need a more in-depth rundown of the various operations
available to be performed on <code class="language-plaintext highlighter-rouge">Tensor</code> objects, I highly recommend checking out
<a href="https://jhui.github.io/2018/02/09/PyTorch-Basic-operations/">this post</a>.</p>
<p>You’ll also see that the learning rate is accessed one more time in the last line, where
the decoupled weight decay is applied directly to the parameters. The <code class="language-plaintext highlighter-rouge">step()</code> method then returns the loss.</p>
<p>And…that’s it! Constructing your own optimizers is as simple as that. Of course,
you need to devise your own optimization algorithm first, which can be a little
bit trickier ;). I’ll leave that one to you.</p>
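<p>To tie the whole recipe together, here's a complete, minimal optimizer (plain SGD) written as a sketch of the pattern above. It isn't part of the PyTorch standard library; it just shows the <code class="language-plaintext highlighter-rouge">__init__</code> and <code class="language-plaintext highlighter-rouge">step()</code> skeleton end to end.</p>

```python
import torch

class PlainSGD(torch.optim.Optimizer):
    # A deliberately tiny illustration of the full subclassing pattern.
    def __init__(self, params, lr=1e-2):
        if lr < 0.0:
            raise ValueError(f"Invalid learning rate: {lr}")
        super(PlainSGD, self).__init__(params, dict(lr=lr))

    def step(self, closure=None):
        loss = None
        if closure is not None:
            loss = closure()
        for group in self.param_groups:
            for p in group['params']:
                if p.grad is None:
                    continue
                # In-place update: p <- p - lr * grad
                p.data.add_(p.grad.data, alpha=-group['lr'])
        return loss
```

<p>Swapping in a richer update rule inside the innermost loop (with per-parameter state, as AdamW does) is all it takes to turn this into a real optimizer.</p>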
<p>Special thanks to the authors of Hugging Face for implementing the <code class="language-plaintext highlighter-rouge">AdamW</code>
optimizer in PyTorch.</p>
Tue, 03 Sep 2019 00:00:00 +0000
http://mcneela.github.io/machine_learning/2019/09/03/Writing-Your-Own-Optimizers-In-Pytorch.html
http://mcneela.github.io/machine_learning/2019/09/03/Writing-Your-Own-Optimizers-In-Pytorch.html
machine_learning
<h1 align="center">Machine Learning for the Movies</h1>
(Note: This article originally appeared at <a href="https://www.clarifai.com/blog/machine-learning-for-the-movies">https://www.clarifai.com/blog/machine-learning-for-the-movies</a>. If you are looking for a great computer
vision solution for your machine learning product, I highly recommend you
check out Clarifai!)
</br></br>
For many, machine learning is useful only insofar as the insights it generates drive business or cut costs. Within the film industry, top studios have traditionally banked huge budgets on new scripts predicated on little but studio executives’ past experience, intuition, and hopeful conjecture. However, 20th Century Fox recently demonstrated that a paradigm shift within the entertainment industry may be underway. Their team of data scientists and researchers devised a machine learning model called Merlin Video that leverages film trailer data to predict which movies a filmgoer would be most likely to see given their viewing history and other demographic information. The team chose movie trailers as their object of study because they act as the most impactful determinant in a customer’s decision as to whether or not they will go to see a movie in theatres.
</br></br>
<h2>Understanding the Model Architecture</h2>
At the heart of Merlin are convolutional neural networks (CNNs), a type of model which has classically been used to achieve state of the art results on image recognition tasks. Merlin employs these neural networks by applying them to the individual frames of a movie trailer; however, the architecture includes clever processing steps which allow the model to capture certain aspects of the trailer’s timing. The model also relies heavily on a technique called collaborative filtering that’s commonly used when devising recommender systems. The crux of the idea is that a recommendation model should incorporate a wide diversity of data sources. In addition, it relies on the belief that if user A has similar tastes to user B on known data, then that shared similarity in preferences is likely to extend to unknown data.
<div align="middle">
<img src="/images/movie-model-breakdown.png" style="height: 50%; width: 50%"></img>
</div>
The output of the model relies primarily on what are called the movie and user vectors. The idea is that if accurate representations of each can be computed, then a proxy for the affinity a given user has for a given movie can be determined by computing the distance between their respective vectors. This distance is combined with user frequency and recency data and fed into a simple logistic regression classifier which provides the final output prediction giving the probability that user i will watch movie j.
</br></br>
So how are the movie and user vectors created? The user vector is actually pretty simple. It’s just the average of the vectors corresponding to the movies that that particular user attended. As such, the real magic of the model lies in the creation of the movie vector. The movie vector is, in fact, created by the CNN previously alluded to. The global structure of the network is that it defines a number of features designed to capture specific actions relevant to a movie’s content. For example, one feature might seek to determine whether a trailer involves long, scenic shots of nature. This could indicate that the trailer is for a documentary. Another feature might try to detect a fast-paced fist fight indicative of an action movie. A key aspect of the model is that it goes beyond conventional CNNs by capturing the pacing and temporality of film sequences. That means it can tell the difference between quickly flickering frames, which might indicate a flashback or a high-speed chase, and long, drawn-out shots of dialogue or other slow-moving moments. Here’s the full diagram of the model which computes the movie vector.
<div align="middle">
<img src="/images/movie-layer-details.png" style="height: 50%; width: 50%"></img>
</div>
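To make the vector arithmetic concrete, here is a toy sketch of the affinity computation just described. Every movie name and number below is invented; in Merlin the movie vectors come from the CNN.
<pre><code class="language-python">import numpy as np

# Made-up 3-dimensional movie vectors standing in for CNN outputs.
movie_vecs = {
    "logan":        np.array([0.9, 0.1, 0.0]),
    "deadpool":     np.array([0.8, 0.2, 0.1]),
    "planet_earth": np.array([0.0, 0.1, 0.9]),
}</code></pre>

```python
import numpy as np

# Made-up 3-dimensional movie vectors standing in for CNN outputs.
movie_vecs = {
    "logan":        np.array([0.9, 0.1, 0.0]),
    "deadpool":     np.array([0.8, 0.2, 0.1]),
    "planet_earth": np.array([0.0, 0.1, 0.9]),
}

def user_vector(attended):
    # The user vector is the average of the attended movies' vectors.
    return np.mean([movie_vecs[m] for m in attended], axis=0)

def affinity_distance(user_vec, movie):
    # Distance between user and movie vectors serves as an affinity proxy;
    # Merlin combines this with recency/frequency data before a logistic regression.
    return np.linalg.norm(user_vec - movie_vecs[movie])
```

A logistic regression over this distance, plus the recency and frequency features, would then produce the final probability that user i watches movie j.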
</br></br>
<h2>Training the Model</h2>
The team at Fox trained the model on YouTube8M, a rich dataset provided by Google and consisting of 6.1 million YouTube videos annotated with any of 3800+ entities. The dataset provides 350,000 hours of video described by 2.6 billion precomputed audio and video features. This provides massive explanatory power for the Merlin model to take advantage of.
<div align="middle">
<img src="/images/model-architecture-flow.png" style="height: 50%; width: 50%"></img>
</div>
</br></br>
If you take a look at the above diagram which lays out the Merlin architecture’s data flow, you’ll see that they also feed the model film metadata, textual synopses, and data about the customer acquired at the ticket box.
</br></br>
<h2>Evaluating the Model</h2>
To assess the accuracy of their model, the team at Fox evaluated its predictions on the trailer of the recently released action flick, Logan. For those unfamiliar with the film, it’s an X-Men spinoff which focuses on the trials and travails of Wolverine as he wages war against the bad guys and saves the girl. Pretty typical Hollywood stuff. Astonishingly, the Merlin model captures the majority of the key ideas presented in the Logan trailer and uses these to accurately predict similar movies for filmgoers to see. Here’s the data that the Fox team got from the model and its comparison with actual customer behavior.
<div align="middle">
<img src="/images/movie-model-output-results.png" style="height: 50%; width: 50%"></img>
</div>
On the left, you can see the Top 20 movies that a user who saw Logan was most likely to watch. On the right, you can see the Top 20 predictions made by the model. Remarkably, the model got all of the Top 5 actual movies within its Top 20 predictions. As a result, it’s reasonable to believe that the model was able to distill the key characteristics of Logan in order to infer its own predictions. That’s the power of machine learning.
Mon, 26 Aug 2019 00:00:00 +0000
http://mcneela.github.io/machine_learning/2019/08/26/Machine-Learning-for-the-Movies.html
http://mcneela.github.io/machine_learning/2019/08/26/Machine-Learning-for-the-Movies.html
machine_learning
<h1>The Basics of Homology Theory</h1>
Algebraic topology is a field that seeks to establish
correspondences between algebraic structures and topological
characteristics. It then uses results from algebra to infer
and uncover results about topology. It's a pretty powerful method.
</br></br>
Roughly speaking, algebraic topology provides two different frameworks for
characterizing topological spaces. These are <i>homotopy</i>
and <i>homology</i>. In this post, we'll start to take a look
at homology, which differs from homotopy in that it's less
powerful in some senses, but significantly easier to work with
and compute, which endows it with a different sort of power.
</br></br>
<h2>Some Definitions to Start</h2>
<b><u>Definition</u></b> Say we're working in $\mathbb{R}^n$. The <i>$p$-simplex</i>, defined for $p \leq n$, is
$$\Delta_p := \left\{x = \sum_{i=0}^p \lambda_i e_i \mid \sum_{i=0}^p \lambda_i = 1, \lambda_i \geq 0\right\}$$
$\Delta_p$ is a generalization of the triangle to $p$ dimensions. Now, there are actually two types of homology
which have developed in the annals of mathematics.
The first of these is <i>singular homology</i>, which concerns
itself with the study of topological spaces via the mapping of
simplices into these spaces. The other type is called
<i>Cech homology</i> which handles the study of topology via
the approximation of topological spaces with spaces of a certain
class, namely those which admit a triangulation. Of these two
branches of homology, singular is by far the more prevalent in
the literature and is the one we'll delve into here.
</br></br>
Given a triangular polytope like we've defined in $\Delta_p$,
one operation we might like to consider is using that region
as a way to sort of define regions of interest around not just
basis vectors, but any arbitrary collection of vectors in the space. Given such a set $\{v_0, \ldots, v_p\}$ we can denote
by $[v_0, \ldots, v_p]$ the mapping of $\Delta_p \to \mathbb{R}^n$ defined by
$$\sum_{i=0}^p \lambda_i e_i \mapsto \sum_{i=0}^p \lambda_i v_i$$
What this gives us from an intuitive perspective is the simplex
expanded or shrunken to cover the span of the $v_i$. One nice
property of this map is that its image is convex. We call the
resulting simplex the <i>affine p-simplex</i>, and we sometimes
refer to the $\lambda_i$ in this context as <i>barycentric coordinates</i>.
</br></br>
In addition to mapping between simplices and sets of vectors,
we'd like to define a way to map between a $p$-simplex and a
($p + 1$)-simplex and vice versa. The mapping from
$p$ to $p + 1$ is called the $i$th face map and is notated
as
$$F_i^{p+1} : \Delta_p \to \Delta_{p + 1}$$
It is formed by deleting the $i$th vertex in dimension
$p + 1$. To notate this, we can write $[e_0, \ldots, \hat{e_i}
, \ldots, e_{p+1}]$ where the hat indicates that $e_i$ is
omitted. This face map is so named because it embeds the
$p$-simplex in the $p+1$-simplex as the face opposite $e_i$, the
vertex that's being deleted.
</br></br>
<b><u>Definition</u></b> For a topological space $X$, a <i>
singular $p$-simplex</i> of $X$ is simply a continuous function
$\sigma_p : \Delta_p \to X$.
</br></br>
<div align="middle">
<img src="/images/singular2simplex.png"></img>
<p>The singular 2-simplex, mapping from the standard 2-simplex to $X$.</p>
</div>
Basically, our goal in defining the singular $p$-simplices is
to provide a sort of "basis" for the triangulation of an
arbitrary topological space. In this way, the singular $p$-simplex gives us a way to cover some patch of $X$ with
a triangle-like region. Accordingly, we can define a group
which in some sense acts like a vector space of these triangular
basis regions. By allowing the linear combination of maps on
these elements, we can give entire triangulations of the space
$X$.
</br></br>
<b><u>Definition</u></b> The <i>singular p-chain group
$\Delta_p(X)$</i> is the free abelian group that's generated by the singular $p$-simplices.
</br></br>
In more concrete terms, the elements of $\Delta_p(X)$ are called
$p$-chains and are simply linear combinations
$$c = \sum_\sigma n_\sigma \sigma$$ of $p$-simplices with
coefficients $n_\sigma$ coming from some ring (usually the
integers).
</br></br>
We can recover how the singular $p$-simplex maps the faces of
$\Delta_p$ to $X$ by simply composing the map $\sigma$ with the
face map $F_i^p$. This is called the <i>ith face of $\sigma$</i>
and is written
$$\sigma^{(i)} = \sigma \circ F_i^p$$
For a given singular $p$-simplex $\sigma$, we can define a $(p-1)$-chain that gives the <i>boundary</i> of $\sigma$ as
$$\partial_p \sigma = \sum_{i=0}^p (-1)^i \sigma_p^{(i)}$$
The boundary operator extends to chains in the natural way, by distributing over addition, $\partial_p c = \partial_p (\sum_\sigma n_\sigma \sigma) = \sum_\sigma n_\sigma \partial_p
\sigma$. This law, in fact, makes $\partial_p$ into a
homomorphism of groups
$$\partial_p : \Delta_p(X) \to \Delta_{p-1}(X)$$
Now, we introduce a key fact about the boundary operator that
will allow us to define homology groups. You should be well
aware of the saying "the enemy of my enemy is my friend".
Well, in algebraic topology boundaries are our enemies and
we have a similar statement, "the boundary of my boundary is
a loser (zero)". In other words, for any $\sigma$,
$$\partial_p(\partial_{p + 1} \sigma) = 0$$
I'll skip the proof for this because it's just a really nasty
rearrangement of a complicated sum, but if you're truly
interested you can find it in pretty much any algebraic
topology textbook.
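</br></br>
To make the identity concrete, here is a small Python sketch of the boundary operator acting on abstract simplicial chains, a combinatorial stand-in for the singular theory. A chain is represented as a dict from vertex tuples to integer coefficients (a representation chosen for illustration, not standard library code), and applying the operator twice always yields the empty chain:

```python
from collections import defaultdict

def boundary(chain):
    """Boundary of a chain: each p-simplex (a tuple of vertices)
    maps to the alternating sum of its (p-1)-faces, and the
    operator extends linearly over the chain's coefficients."""
    result = defaultdict(int)
    for simplex, coeff in chain.items():
        for i in range(len(simplex)):
            face = simplex[:i] + simplex[i + 1:]  # delete the i-th vertex
            result[face] += (-1) ** i * coeff
    # Drop faces whose coefficients cancel to zero.
    return {s: c for s, c in result.items() if c != 0}

# Check "the boundary of a boundary is zero" on a 2-simplex:
sigma = {(0, 1, 2): 1}
print(boundary(sigma))            # {(1, 2): 1, (0, 2): -1, (0, 1): 1}
print(boundary(boundary(sigma)))  # {}
```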
</br></br>
We call $im(\partial_{p+1}) = B_{p}(X)$ the
<i>boundary group</i> and $ker(\partial_p) = Z_p(X)$ the
<i>cycle group</i>.
Note that the above identity implies that $im(\partial_{p + 1})$ is a subgroup of $ker(\partial_p)$ which is equivalent
to saying $B_p(X) \leq Z_p(X)$.
</br></br>
<h2>The Homology Group</h2>
We define the $p$th <i>homology group</i> as the quotient
$$H_p(X) := Z_p(X) / B_p(X)$$
In other words, this is the group of cycles modded out by
the group of boundaries. To give a rough intuition, the
rank of $H_p(X)$ tells us the number of $p$-dimensional
"holes" contained in $X$. You can think about it as follows.
If each cycle in $X$ is equivalent to some boundary, then
those boundaries have no "interior" excisions. However, if a
hole does exist, then there will be some discrepancy between
boundaries and cycles, and the number of such discrepancies
(holes) will be given by the rank of $H_p(X)$.
Thu, 13 Jun 2019 00:00:00 +0000
http://mcneela.github.io/mathematics/2019/06/13/The-Basics-Of-Homology-Theory.html
<h1 align="middle">The Problem(s) with Policy Gradient</h1>
If you've read my <a href="http://mcneela.github.io/math/2018/04/18/A-Tutorial-on-the-REINFORCE-Algorithm.html">article</a>
about the REINFORCE algorithm, you should be familiar with the update that's typically used in policy gradient methods.
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta(\tau)}
\left[ \left(\sum_{t} \nabla_\theta \log{\pi_\theta}(a_t \mid s_t)\right) \left(\sum_t r(s_t, a_t)\right)\right]$$
It's an extremely elegant and theoretically satisfying model that suffers from only one problem - it doesn't work well in practice.
Shocking, I know! Jokes abound about the flimsiness of policy gradient methods when applied to practical problems.
One such joke goes like this: if you'd like to reproduce the results of any sort of RL policy gradient method as reported in academic
papers, make sure you contact the authors and get the settings they used for their random seed. Indeed, sometimes policy gradient can
feel like nothing more than random search dressed up in mathematical formalism. The reasons for this are at least threefold
(I won't rule out the possibility that there are more problems with this method of which I'm not yet aware), namely that
</br></br>
<ol>
<li>Policy gradient is <b>high variance</b>.</li>
<li>Convergence in policy gradient algorithms is <b>sloooow</b>.</li>
<li>Policy gradient is terribly <b>sample inefficient</b>.</li>
</ol>
I'll walk through each of these in reverse because flouting the natural order of things is fun. :)
</br></br>
<h3>Sample Inefficiency</h3>
In order to get anything useful out of policy gradient, it's necessary to sample from your policy and observe the resultant reward
literally <i>millions of times</i>. Because we're sampling directly from the policy we're optimizing, we say that policy gradient
is an <i>on-policy</i> algorithm.
If you take a look at the formula for the gradient update, we're calculating an expectation and
we're doing that in the Monte Carlo way, by averaging over a number of trial runs. Within that, we have to sum over all the steps in
a single trajectory which itself could be frustratingly expensive to run depending on the nature of the environment you're working
with. So we're iterating sums over sums, and the result is that we incur hugely expensive computational costs in order to acquire
anything useful. This works fine in the realms where policy gradient has been successfully applied. If all you're interested
in is training your computer to play Atari games, then policy gradient might not be a terrible choice. However, imagine using this
process in anything remotely resembling a real-world task, like training a robotic arm to perform open-heart surgery, perhaps?
Hello, medical malpractice lawsuits. However, sample inefficiency is not a problem that's unique to policy gradient methods by any
means. It's an issue that plagues many different RL algorithms, and addressing this is key to generating a model that's useful
in the real world. If you're interested in sample efficient RL algorithms, check out
<a href="https://www.microsoft.com/en-us/research/blog/provably-efficient-reinforcement-learning-with-rich-observations/?ocid=msr_blog_provably_icml_hero">the work</a> that's being
done at Microsoft Research.
</br></br>
<h3>Slow Convergence</h3>
This issue pretty much goes hand in hand with the sample inefficiency discussed above and the problem of high variance to be
discussed below. Having to sample entire trajectories on-policy before each gradient update is slow to begin with, and the
high variance in the updates makes the search optimization highly inefficient which means more sampling which means more updates,
ad infinitum. We'll discuss some remedies for this in the next section.
</br></br>
<h3>High Variance</h3>
The updates made by the policy gradient are very high variance. To get a sense for why this is, first consider that in RL we're
dealing with highly general problems such as teaching a car to navigate through an unpredictable environment or programming an agent
to perform well across a diverse set of video games. Therefore, when we're sampling multiple trajectories from our untrained policy
we're bound to observe highly variable behaviors. Without any a priori model of the system we're seeking to optimize, we begin with
a policy whose distribution of actions over a given state is effectively uniform. Of course, as we train the model we hope to shape
the probability density so that it's unimodal on a single action, or possibly multimodal over a few successful actions that can be
taken in that state. However, acquiring this knowledge requires our model to observe the outcomes of many different actions taken
in many different states. This is made exponentially worse in continuous action or state spaces as visiting even close to every
state-action pair is computationally intractable. Due to the fact that we're using Monte Carlo estimates in policy gradient, we
trade off between computational feasibility and gradient accuracy. It's a fine line to walk, which is why variance reduction techniques
can potentially yield huge payoffs.
</br></br>
Another way to think about the variance introduced into the policy gradient update is as follows: at each time step in your trajectory
you're observing some stochastic event. Each such event has some noise, and the accumulation of even a small amount of noise across
a number of time steps results in a high variance outcome. Yet, understanding this allows us to suggest some ways to alter policy
gradient so that the variance might ultimately be reduced.
</br></br>
<h1 align="middle">Improvements to Policy Gradient</h1>
<h3>Reward to Go</h3>
The first "tweak" we can use is incredibly simple. Let's take a look again at that policy gradient update.
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta(\tau)}
\left[ \left(\sum_{t} \nabla_\theta \log{\pi_\theta}(a_t \mid s_t)\right) \left(\sum_t r(s_t, a_t)\right)\right]$$
If we break it down into the Monte Carlo estimate, we get
$$\nabla_\theta J(\theta) =
\frac{1}{N} \sum_{i=1}^N \left[ \left(\sum_{t=1}^T \nabla_\theta \log{\pi_\theta}(a_t \mid s_t)\right) \left(\sum_{t=1}^T r(s_t, a_t)\right)\right]$$
If we distribute $\sum_{t=1}^T r(s_t, a_t)$ into the left innermost sum involving $\nabla \log \pi_{\theta}$, we see that we're
taking the gradient of $\log \pi_\theta$ at a given time step $t$ and weighting it by the sum of rewards at all timesteps. However,
it would make a lot more sense to simply reweight this gradient by the rewards it affects. In other words, the action taken at time
$t$ can only influence the rewards accrued at time $t$ and beyond. To that end, we replace $\sum_{t=1}^T r(s_t, a_t)$ in the gradient
update with the partial sum $\sum_{t'=t}^T r(s_{t'}, a_{t'})$ and call this quantity $\hat{Q}_{t}$ or the "reward to go". This quantity
is closely related to the $Q$ function, hence the similarity in notation. For clarity, the entire policy gradient update now becomes
$$\frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T \nabla_\theta \log{\pi_\theta}(a_t \mid s_t) \left(\sum_{t'=t}^T r(s_{t'}, a_{t'})\right)$$
Note that the reward-to-go sum now sits inside the sum over $t$, since $\hat{Q}_t$ differs at each time step.
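As a quick sketch, the reward-to-go weights $\hat{Q}_t$ for a single sampled trajectory can be computed in one backward pass over its reward sequence (the function name here is illustrative, not from any particular library):

```python
def rewards_to_go(rewards):
    """Compute Q_hat_t = sum of r_{t'} for t' from t to T, for every t,
    via a single backward pass with a running suffix sum."""
    q = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        q[t] = running
    return q

print(rewards_to_go([1.0, 0.0, 2.0, 3.0]))  # [6.0, 5.0, 5.0, 3.0]
```

Each $\hat{Q}_t$ then weights the corresponding $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$ term in the update.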
<h3>Baselines</h3>
The next technique for reducing variance is not quite as obvious but still yields great results. If you think about how policy gradient
works, you'll notice that how we take our optimization step depends heavily on the reward function we choose. Given a trajectory $\tau$,
if we have a negative return $r(\tau) = \sum_{t} r(s_t, a_t)$ then we'll actually take a step in the direction opposite the gradient,
which should have the effect of lessening the probability density on the trajectory. For those trajectories that have positive return,
their probability density will increase. However, if we do something as simple as setting $r(\tau) = r(\tau) + b$ where $b$ is a
sufficiently large constant such that the return for $r(\tau)$ is now positive, then we will actually increase the probability weight on
$\tau$ even though $\tau$ still fares worse than other trajectories with previously positive return. Given how sensitive the model is
to the shifting and scaling of the chosen reward function, it's natural to ask whether we can find an optimal $b$ such that
(note: we're using trajectories here so some of the sums from the original PG formulation are condensed)
$$\frac{1}{N} \sum_{i=1}^N \nabla_\theta \log \pi_\theta(\tau_i) [r(\tau_i) - b]$$
has minimum variance. We call such a $b$ a <i>baseline</i>.
We also want to ensure that subtracting $b$ in this way doesn't bias our estimate of the gradient. Let's do that
first. Recall the identity we used in the original policy gradient derivation
$$\pi_\theta(\tau) \nabla \log \pi_\theta(\tau) = \nabla \pi_\theta(\tau)$$
To show that our estimator remains unbiased, we need to
show that
$$\mathbb{E}\left[\nabla \log \pi_\theta(\tau_i)[r(\tau_i) - b]\right] = \mathbb{E} [\nabla \log \pi_\theta(\tau_i) r(\tau_i)]$$
We can equivalently show that $\mathbb{E} [\nabla \log \pi_\theta(\tau_i) b]$ is equal to zero. We have
\begin{align*}
\mathbb{E} [\nabla \log \pi_\theta(\tau_i) b]
&= \int \pi_\theta(\tau_i) \nabla \log \pi_\theta(\tau_i) b \ d\tau_i \\
&= \int \nabla \pi_\theta(\tau_i) b \ d\tau_i \\
&= b \nabla \int \pi_\theta(\tau_i) \ d\tau_i \\
&= b \nabla 1 \\
&= 0
\end{align*}
where we use the fact that $\int \pi_\theta(\tau_i) \ d\tau_i$
is 1 because $\pi_\theta$ is a probability distribution.
Therefore, our baseline enhanced version of the policy gradient
remains unbiased.
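</br></br>
We can also sanity-check this unbiasedness numerically. The sketch below uses a deliberately tiny setting of my own choosing: a one-step Bernoulli "policy" with $p = \sigma(\theta)$, whose score function works out to $\nabla_\theta \log \pi_\theta(a) = a - p$. The Monte Carlo average of (score $\times$ baseline) should hover near zero:

```python
import math
import random

random.seed(0)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def score(a, theta):
    # d/dtheta of log pi_theta(a) for a Bernoulli policy with p = sigmoid(theta)
    return a - sigmoid(theta)

theta, b, n = 0.3, 10.0, 200_000
p = sigmoid(theta)
total = 0.0
for _ in range(n):
    a = 1 if random.random() < p else 0
    total += score(a, theta) * b  # contribution of the baseline term alone
est = total / n
print(est)  # should sit near zero, up to Monte Carlo noise
```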
</br></br>
The question then becomes: how do we choose an optimal setting
of $b$? One natural candidate is the average reward
$b = \frac{1}{N} \sum_{i=1}^N r(\tau_i)$ over all trajectories
in the simulation. In this case, our returns are "centered",
and returns that are better than average end up being
positively weighted whereas those that are worse are negatively
weighted. This actually works quite well, but it is not, in fact, optimal. To calculate the optimal setting, let's look at
the policy gradient's variance. In general, we have
\begin{align*}
Var[x] &= \mathbb{E}[x^2] - \mathbb{E}[x]^2 \\
\nabla J(\theta) &= \mathbb{E}_{\tau \sim \pi_\theta(\tau)}
\left[ \nabla \log \pi_\theta(\tau) (r(\tau) - b)\right] \\
Var[\nabla J(\theta)] &= \mathbb{E}_{\tau \sim \pi_\theta(\tau)} \left[(\nabla \log \pi_\theta(\tau) (r(\tau) - b))^2\right] - \mathbb{E}_{\tau \sim \pi_\theta(\tau)} \left[ \nabla_\theta \log \pi_\theta(\tau) (r(\tau) - b)\right]^2
\end{align*}
The rightmost term in this expression is just the square of the
expected policy gradient, which we can ignore for the purposes of
optimizing $b$: we showed above that subtracting a baseline leaves the
expectation unchanged, so this term does not depend on $b$. Therefore, we turn our attention to the left term.
To simplify notation, we can write
$$g(\tau) = \nabla \log \pi_\theta(\tau)$$
Then we take the derivative to get
\begin{align*}
\frac{dVar}{db} &= \frac{d}{db} \mathbb{E}\left[
g(\tau)^2(r(\tau) - b)^2\right] \\
&= \frac{d}{db}(\mathbb{E}[g(\tau)^2r(\tau)^2] - 2
\mathbb{E}[g(\tau)^2r(\tau)b] + b^2\mathbb{E}[g(\tau)^2]) \\
&= 0 -2\mathbb{E}[g(\tau)^2r(\tau)] + 2b\mathbb{E}[g(\tau)^2]
\end{align*}
Solving for $b$ in the final equation gives
$$b = \frac{\mathbb{E}[g(\tau)^2r(\tau)]}{\mathbb{E}[g(\tau)^2]}
$$
In other words, the optimal setting for $b$ is the expected reward,
reweighted by the squared magnitude of the gradient.
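</br></br>
A minimal sketch of estimating this optimal baseline from samples, assuming for simplicity that the score $g(\tau)$ is a scalar (in practice $g$ is a vector, and this ratio is typically computed per component):

```python
def optimal_baseline(scores, returns):
    """Estimate b* = E[g(tau)^2 r(tau)] / E[g(tau)^2] from
    per-trajectory samples of the score g(tau_i) and return r(tau_i).
    The 1/N factors in numerator and denominator cancel."""
    num = sum(g * g * r for g, r in zip(scores, returns))
    den = sum(g * g for g in scores)
    return num / den

print(optimal_baseline([1.0, -2.0, 0.5], [3.0, 1.0, 2.0]))  # 7.5 / 5.25
```

Trajectories whose score has large magnitude pull the baseline toward their returns, which is exactly the gradient-magnitude reweighting described above.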
</br></br>
<h2>Conclusion</h2>
Hopefully this provided you with a good overview as to how
you can improve implementations of policy gradient to speed
up convergence and reduce variance. In a future article, I'll
discuss how to derive an off-policy version of policy gradient
which improves sample efficiency and speeds up convergence.
Mon, 03 Jun 2019 00:00:00 +0000
http://mcneela.github.io/machine_learning/2019/06/03/The-Problem-With-Policy-Gradient.html