*This post was originally featured on my other blog at Flamethrower AI*

# Introduction

You've probably heard of the gradient descent algorithm. It's probably the most widely used algorithm in machine learning, used to train everything from neural networks to logistic regression. What you probably didn't know though, is that it doesn't always work. That's right, gradient descent has some preconditions that your loss function needs to satisfy in order for the algorithm to run. One of these is that the loss function must be differentiable. That means that when you need to optimize a loss function that's not differentiable, such as the L1 loss or hinge loss, you're flat out of luck.

*Or are you?* Thankfully, you're actually not. There's a little-known method that's been around for a long
time within the field of convex optimization that uses the notion of *subgradients* to perform optimization.
We're going to tackle the theory behind it in this post today, then build our own implementation of it in Python
in order to create our own soft-margin SVM classifier with hinge loss to classify text messages as spam or ham.
Let's get started.

# The Theory

## The Definition of a Subgradient

A subgradient is a generalization of the gradient which is useful for non-differentiable, but convex, functions.
The basic idea is this: When a function is convex, you can draw a line (or in higher dimensions, a hyperplane)
through each point $f(x)$ and that line will
underapproximate the function everywhere. This is baked into the definition of convexity. In fact, for many points
$x$ there is more than one such line. A subgradient is simply any one of these lines, and it is defined mathematically
as
$$g \in \mathbb{R}^n \text{ such that } f(z) \geq g^\top (z - x) \text{ for all } z \in \text{dom}(f)$$
The definition can be a little bit confusing, so let me break it down piece by piece. The vector $g$ is the subgradient
and it's also what's called a *normal vector*.
It lies perpendicular to the hyperplane which underapproximates $f(z)$. The normal vector is all that's needed to define
the angle of the hyperplane, and by varying $z$, we can land at any point on this hyperplane.

Another way of characterizing the subgradient $g$ is via what's called the *epigraph* of $f$. The epigraph of $f$
is just all the area above and including the graph of $f$. In other words, it's the set you'd get if you colored in above
the graph of $f$. We say that $g$ is a subgradient if the hyperplane it defines *supports* the epigraph of $f$.

When we say "support" here, it has a particular meaning and is not to be confused with the support of a function in an analysis sense. More concretely, a hyperplane is said to support a set $S$ if it intersects $S$ in at least one point (for our purpose, this point is $f(x)$) and the remainder of the entire set $S$ lies on one side of the hyperplane. Because the epigraph of a convex function lies above its subgradient at every point, this is a natural way to conceptualize things.

## Subdifferentials

Now that we've introduced subgradients, lets move onto a broader concept, that of a *subdifferential*. We say that
a function $f$ is *subdifferentiable* if for each point $x \in \text{dom}(f)$ there exists a subgradient $g_x$ at
that point. Now, its important to note that the subgradient $g_x$ for a point $x$ is not unique. In fact, for any point
$x$, there can exist either 0, 1, or infinitely many such subgradients. We call the set of all such subgradients at a point
$x$ the *subdifferential* of $f$ at that point, and we denote it as $\partial f(x)$. In other words,
$$\partial f(x) = \{g_x \mid f(z) \geq g_x^\top (z - x)\quad \forall z \in \text{dom}(f)\}$$
So why talk about subdifferentials? Well, they have some nice properties and integrate nicely with the regular theory of
differentiation. Allow me to explain.

First off, the subdifferential $\partial f(x)$ associated with a point $x$ is **always** a closed, convex set. This
follows from the fact that it can be viewed as the intersection of a set of infinite half-spaces.
$$\partial f(x) = \bigcap_{z \in \text{dom}(f)} \{g_x \mid f(z) \geq g_x^\top (z-x)\quad \forall z \in \text{dom}(f)\}$$
This means the problem of finding the optimal subgradient $g_x$ at a given point $x$ is a solvable one, because doing
optimization over convex sets is a solved problem.

The more important point, though, is that the subdifferential satisfies many of the same laws of calculus as the standard differential. Concretely, much like regular differentials, we can

- Scale differentials by constants - $\partial f(\alpha x) = \alpha \partial f(x)$
- Distribute over sums, integrals, and expectation - $\partial \int f(x)\ dx = \int \partial f(x)\ dx$

Another important point is the direct connection of gradients and subdifferentials. If a function $f$ is differentiable at a point $x$, then there exists only one subgradient $g_x$ to $f$ at $x$ and for the subdifferential we have $$\partial f(x) = \{g_x = \nabla f(x)\}$$ In other words, there is a direct correspondence between and uniqueness of gradients and subgradients in the case where $f$ is differentiable. Thus in some sense, the subgradient is a generalization of the gradient that applies in more general situations, (e.g. nondifferentiability).

## Enough with the Definitions Already

Okay, let's move on from tedious definitions and see how we can start using subgradients to solve a problem that's typically intractable with standard, gradient-based methods. For this, let's take a look at the $\ell_1$ norm.

The $\ell_1$ norm is defined as $$\|x\|_1 = \sum_{i=1}^n |x_i|$$ It is a nondifferentiable and convex function of $x$. This means that while we can't tackle its optimization with gradient descent, we can approach it using the method of subgradients. Let's see how that works.

Before we get started, we need to recite a quick fact about subgradients. Namely, if we have a sequence of functions $f_1, \ldots, f_n$ where each of the $f_i$ are convex and subdifferentiable. Then their pointwise maximum $f(x)$, defined as $$f(x) = \max_{i} f_i(x)$$ is also subdifferentiable and has $$\partial f(x) = {\bf Co} \left(\bigcup \{\partial f_i(x) \mid f_i(x) = f(x)\} \right)$$ In other words, to find the subdifferential of the max, $f(x)$, we take the convex hull of the union of subdifferentials for each function $f_i(x)$ which attains the pointwise maximum of $\{f_1, \ldots, f_n\}$ at each point $x$. It's a little bit of a convoluted and long-winded statement, but hopefully it will make more sense when we apply it to the problem of calculating the subgradient of the $\ell_1$ norm.

### Applying the Rule

In order to apply this rule of maximums, we have to find a way to rewrite the $\ell_1$ norm as a pointwise maximum of convex, subdifferentiable functions. This is actually pretty easy to do. To see why, consider the vector $$s = [s_1, \ldots, s_n] \quad s_i \in \{-1, 1\}$$ where each of the $s_i$ is either equal to 1 or -1. The idea is you could set $s_i = -1$ when $x_i < 0$ and $s_i = 1$ when $x_i > 0$. When $x_i = 0$, either $s_i = 1$ or $s_i = -1$ can be used. Thus, we have $$\ell_1(x) = \max \{s^\top x | s_i \in \{-1, 1\} \}$$ Since we've rewritten the $\ell_1$ norm as a max of functions, we can now apply the rule to find its subgradient. Since each of the $f_i$ just have the form $s^T x$, they are each differentiable with unique (sub)gradient $s$. Therefore the subgradient of $f = \max_i f_i$ is just the convex hull of this set of subgradients, which can be expressed as $$\{g \mid \|g\|_\infty \leq 1, g^Tx = \|x\|_1\}$$

## The Subgradient Method

Now, we arrive at our culminating theoretical moment, that of applying subgradients to problems of convex optimization.
We do so using what's called the *subgradient method* which looks almost identical to gradient descent. The algorithm
is an iteration which asserts that we make steps according to
$$x^{(k+1)} = x^{(k)} - \alpha_k g^{(k)}$$
where $\alpha_k$ is our learning rate. There are a few key differences when compared with gradient descent though. The first
of these is that our $\alpha_k$ must be fixed in advance and not determined dynamically on the fly. The next is that the subgradient
method is not a true descent method. This is because each subsequent step $x^{(k+1)}$ is not guaranteed to decrease our objective
function value. In general, we keep track of our objective values and pick their max at the end of iteration, i.e. we set
$$f^{\star} = \max_k \{f^{(k)} = f(x^{(k)})\}$$

To see why the objective function value is not guaranteed to decrease with each iterate, we need only look to the definition of a subgradient. Namely, we have $$f(x^{(k+1)}) = f(x^{(k)} - \alpha_k g^{(k)}) \geq f(x^{(k)}) + {g^{(k)}}^\top(x^{(k)} - \alpha_k g^{(k)} - x^{(k)}) = f(x^{(k)}) - g^{(k)}\alpha_k g^{(k)} = f(x^{(k)}) - \alpha_k \|g^{(k)}\|_2^2$$ Since $f(x^{(k+1)})$ can be any value greater than $f(x^{(k)}) - \alpha_k \|g^{(k)}\|_2^2$ including $f(x^{(k)}) + C$ for some positive constant $C$, it's clear that the subgradient method is not a guaranteed descent method, and the fact that the decrease in $f(x)$ at each time step is bounded below by $f(x^{(k)}) - \alpha_k \|g^{(k)}\|_2^2$ speaks to some of the slowness of the method. That said, the algorithm is guaranteed to converge to within $\epsilon$ of the optimum in some semi-reasonable amount of time, so it's not all bad. One just needs to keep in mind the caveats of the method.

# Let's Get Practical

Now that we've finally elucidated most of the math behind subgradients, let's start applying them to a real world problem. In this exercise, we'll devise a soft-margin SVM to classify text messages as either "spam" or "ham". We'll then train it using hinge loss and the subgradient method and implement the entire thing in Python code, from scratch. Let's get started.

## An SVM-like Classifier

I'm not going to go into the entire theory of SVMs as that is worthy of a series of posts all on its own. However, I will walk through the basics of how the classifier will work, and then jump into the coding of the subgradient algorithm.

In short, an SVM uses a hyperplane of the form $\mathbf{w}\mathbf{x} + b$ to classify points of a training set into one of two classes.
Points that lie to the left of the hyperplane are assigned to one class, whereas points that lie to the right are assigned to the other.
Mathematically, this classification can be represented as the function
$$y = \text{sign}(\mathbf{w}\mathbf{x} + b)$$
The model is trained in such a way as to maximize the *margin* that the hyperplane decision boundary produces. Roughly speaking,
this is the spacing between the decision boundary and the closest point to either side.

When the data in the training set is linearly separable, we can classify it perfectly using an SVM decision boundary and the definition
of margin. However, rarely in life does our data come to us with perfect separation between the classes. As such, we give the opportunity
to the SVM to make some mistakes while still aiming to classify most of the points correctly and maximize margin. For this purpose, we introduce
*slack variables* and call the resultant model a *soft-margin SVM*. To learn more about all of these architecture components,
I encourage you to read any of the many great articles available online or in textbooks.

In this lesson, we're going to train a variant of a soft-margin SVM. Soft-margin SVMs are trained using the *hinge loss* which is defined
mathematically as
$$\ell(y, t) = \max (0, 1 - ty)$$
where $y = \mathbf{w}{x} + b$ is our model's prediction and $t$ is the target output value. This loss function is not differentiable at $0$, so
you know what that means? That's right, it's time for the subgradient method to shine! To use it, we need to calculate the subgradient of this loss function.

### The Hinge Loss Subgradient

In order to train the model via the subgradient method we'll need to know what the subgradients of the hinge loss actually are. Let's calculate that now. Since the hinge loss is piecewise differentiable, this is pretty straightforward. We have $$\frac{\partial}{\partial w_i} (1 - t(\mathbf{w}\mathbf{x} + b)) = -tx_i$$ and $$\frac{\partial}{\partial w_i} \mathbf{0} = \mathbf{0}$$ The first subgradient holds for $ty < 1$ and the second holds otherwise.

### The Code

Okay, now with the math out of the way, let's get to some of the code. For this task we'll be classifying text messages as "ham" or "spam" using
the data available here. Download the file "spam.csv" and extract it to a
location of your choice. I created a subfolder in my code directory called `data`

and placed it there.

#### Loading the Data

Let's create a function to load and transform the data. We'll use the scikit-learn `CountVectorizer`

class to create
a bag-of-words representation for each input text message in the training data set. We'll also use some of the `gensim`

preprocessing utilities to help clean up the inputs. I created the following functions which do all of the above.

```
import numpy as np
from gensim.parsing.preprocessing import preprocess_string
from sklearn.feature_extraction.text import CountVectorizer
def clean_text(l):
fields = l.strip().split(',')
return preprocess_string(fields[1])
def get_texts(lines):
return list(map(clean_text, lines))
def convert_label(l, hs_map):
fields = l.strip().split(',')
key = preprocess_string(fields[0])[0]
return hs_map[key]
def get_labels(lines, hs_map):
return list(map(lambda x: convert_label(x, hs_map), lines))
def load_data(file='data/spam.csv'):
lines = open(file, 'r', encoding='ISO-8859-1').readlines()
lines = lines[1:] # remove header line
hs_map = {'ham': 1, 'spam': -1}
y = get_labels(lines, hs_map)
texts = get_texts(lines)
texts = [' '.join(x) for x in texts]
bow = CountVectorizer()
X = bow.fit_transform(texts)
return X, np.array(y)
```

We make generous use of the Python `map`

functionality to map various preprocessing functionality across our strings. The actual
bag of words vectorization is a simple one-line call thanks to sklearn's `CountVectorizer`

. Note that `CountVectorizer`

returns a scipy `sparse matrix`

which will introduce some caveats into our training code. More on that in a bit.

Next, let's start implementing the loss functions. This is pretty simple with `numpy`

. We have,

```
def hinge_loss(t, y):
return np.maximum(0, 1 - t * y)
def hinge_subgrad(t, y, x):
if t * y < 1:
subgrad = (-t * x).toarray()
else:
subgrad = np.zeros(x.shape)
return subgrad
```

We have to call the `.toarray()`

method on the first clause of `hinge_subgrad`

due to the fact that `x`

will be a sparse matrix. This method just turns `x`

into a regular numpy array. Note also that we use `np.maximum`

rather than `np.max`

. `np.maximum`

is more along the lines of the vector-based version of `np.max`

.

Now, the hinge loss as we've implemented it only calculates the loss for a single training example $\{(x, t)\}$. We want to add in a function which aggregates the loss across all examples in the training set. We can do that with the following function

```
def loss(w, X, y):
preds = X @ w
losses = [hinge_loss(t, y_) for t, y_ in zip(y, preds)]
return np.mean(losses)
```

Finally, we'll add a couple of functions that handle making predictions for us given `w`

and `x`

. Note, I'm not
including a variable `b`

in either of these. That's because we can collapse `b`

into `x`

by adding an
extra dimension to `x`

and `w`

with starting value 1.

```
def predictor(w, x):
return x @ w
def predict(w, X):
preds = X @ w
z = (preds > 0).astype(int)
z[z == 0] = -1
return z
```

The function `predictor`

takes in a single training value `x`

and the weight vector `w`

and returns an
unnormalized prediction `y = x @ w`

which is fed to our `hinge_loss`

function. On the other hand, `predict`

takes in an entire batch of training inputs `X`

and returns an output vector of 1's and -1's. We'll use this to calculate the
accuracy of our model during both training and the final runtime.

Finally, we need a simple method that we can use to initialize our weight vector `w`

at the start of training.

```
def init_w(x):
return np.random.randn(x.shape[1])
```

### Coding the Subgradient Method

Finally, we arrive at our pinnacle moment, that of coding up the subgradient method. The code for this method is a bit involved, but it contains some design choices that are interesting to consider. Let's take a look at the finished product first, then walk through it step by step.

```
def subgrad_descent(targets, inputs, w, eta=0.5, eps=.001):
curr_min = sys.maxsize
curr_iter, curr_epoch = 0, 0
while True:
curr_epoch += 1
idxs = np.arange(targets.shape[0])
np.random.shuffle(idxs)
targets = targets[idxs]
inputs = inputs[idxs]
for i, (t, x) in enumerate(zip(targets, inputs)):
curr_iter += 1
if curr_iter % 100 == 0:
preds = predict(w, inputs)
curr_acc = np.mean(preds == targets)
converged = curr_acc > .95
if converged:
return w, inputs, targets
print(f"Current epoch: {cur_epoch}")
print(f"Running iter: {curr_iter}")
print(f"Current loss: {cur_min}")
print(f"Current acc: {curr_acc}\n")
y = predictor(w, x)[0]
subgrad = hinge_subgrad(t, y, x)
w_test = np.squeeze(w - eta * subgrad)
obj_val = loss(w_test, inputs, targets)
if obj_val < cur_min:
cur_min = obj_val
w = w_test
```

Now, to start, I'll say that my method is subtly different from the subgradient method I detailed in the mathematical walkthrough. That's
because in that version, all objective function values $f(w^k)$ are kept track of from start to end and the max is taken at the end of the
iteration. I take a slightly different approach. I start with an iteration $w^k$ and I only step to $w^{k+1}$ if it decreases the current
minimum loss seen across the dataset. This seems to work well, allowing me to achieve greater than 95% accuracy once the model finishes training,
but I encourage you to try out both approaches and see which works best. You'll see that I start with a current objective function value of
`sys.maxsize`

. This is the max value that Python can represent, so any subsequent function value iterates are guaranteed to be less
than this value. Next, I iterate `while True`

and only break from the iteration once I achieve some desired accuracy on the training
set (here I have this set to 95%). Every 100 iterations, I evaluate the current model on the training set and see if it achieves this threshold.
Note, due to time constraints, I'm being a bit careless and not creating validation sets, etc. That's because the purpose of this tutorial is to
demonstrate the viability of the subgradient method, but not to serve as an example of training procedure best practices. I encourage you to clean
things up in your own code. It's a great learning exercise.

Now, the method shuffles the dataset at the start of each epoch. You can see this in the lines

```
curr_epoch += 1
idxs = np.arange(targets.shape[0])
np.random.shuffle(idxs)
targets = targets[idxs]
inputs = inputs[idxs]
```

This is a simple shuffling method based on randomizing the indices into the dataset. While sklearn has built in shuffling methods, I try to rely on it as little as possible because, quite frankly, what's the fun in having some external library handle everything? The Flamethrower AI ethos is built on learning by building everything from scratch, so that's what I'm aiming to do here.

Next, I iterate over each example in the training set. In effect, I'm performing *stochastic* subgradient descent.
At each step, I get the unnormalized prediction `y = predictor(w, x)[0]`

using the current weight vector `w`

and calculate
the corresponding `subgrad`

. I then check what the updated weight vector as calculated by the subgradient method would be (given here
as `w_test = np.squeeze(w - eta * subgrad)`

) and evaluate the loss that that updated vector achieves across the *entire* dataset.
If that `obj_val`

is less than the `cur_min`

then I set it to be the new `w`

. Otherwise, I skip the update. Either
way, that completes the iteration, and the next round of updates is subsequently computed in the same manner, ad infinitum, until some desired accuracy
is reached.

# Conclusion

That about sums up the subgradient method. I was able to reach >95% accuracy with my hand-coded method, and I'm sure it's even possible to do better
than that with some simple tweaks to the data normalization process, refining the training updates, fiddling with learning rates, etc. However, I
think what I have here provides a solid demonstration of the workability of this method. Note that training *will be slow.* This is for a couple of
reasons. First, the rate of convergence os the subgradient method is on the slower side. Second, we're implementing this in pure Python, so we're
certainly not going to be breaking any speed records here. Don't get discouraged if you compare this method against the built-in Sklearn SVM. The
sklearn code is actually just a thin wrapper over `libsvm`

which is implemented entirely in heavily optimized C++, so naturally it's going to
be a lot faster. That said, this method converges relatively quickly on this dataset. I was getting good results in 5-10 minutes, certainly a lot less
time than you'd have to wait to train your favorite deep learning classifier on a really large dataset (not this one). Anyway, I hope this tutorial
introduced you to some of the fun of the subgradient method and advanced optimization techniques. If you want to get access to the repository with the
full solutions code as well as access to more advanced tutorials like this one, then I encourage you to sign up for a full Flamethrower AI membership.
It's only $20/month, and you get access to all our current and future courses for that one, low, monthly price. Cheers!