**This tutorial is currently a work in progress.**

A Synopsis of DeepMind's Sobolev Training of Neural Networks

Here is a link to the original paper so that you can follow along as you read.
Here is a link to the code used for the tutorial.

I'll preface this tutorial by stating that despite the fancy analytic formalism in which this paper is dressed, its core premise is quite simple: take a neural network and achieve better results by training it to match not only target function values but target derivative values as well.

That is, when training an ordinary neural network, we have a set of training data $\{x_i, f(x_i)\}_{i=1}^N$ along with a model $y_{\theta}(x)$ parametrized by weights $\theta$, and we seek to have $y_{\theta}$ match $f$ as closely as possible while achieving low generalization error by minimizing the training objective $$\sum_{i=1}^N l(y_{\theta}(x_i), f(x_i))$$ where $l$ is a loss function such as mean-squared error or the cross-entropy loss.
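To ground the notation, here is a minimal sketch of that ordinary training setup. I'm assuming PyTorch with a mean-squared-error loss; the architecture and names are placeholders of mine rather than anything from the paper or the tutorial code.

```python
import torch
import torch.nn as nn

# Placeholder regression model y_theta(x); any architecture would do.
model = nn.Sequential(nn.Linear(1, 64), nn.Tanh(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

def ordinary_training_step(x, f_x):
    """One step of standard supervised training on a batch of (x_i, f(x_i)) pairs."""
    optimizer.zero_grad()
    y = model(x)               # y_theta(x_i)
    loss = loss_fn(y, f_x)     # l(y_theta(x_i), f(x_i)), averaged over the batch
    loss.backward()
    optimizer.step()
    return loss.item()
```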

Sobolev training augments this traditional paradigm by assuming that, in addition to the training data for $f$, we have access to training datasets for all of $f$'s derivatives up to order $K$, i.e. we have sets $\{x_i, D^1_{\mathbf{x}}(f(x_i))\}_{i=1}^N, \ldots, \{x_i, D^K_{\mathbf{x}}(f(x_i))\}_{i=1}^N$. With this data, we seek not only to match the outputs of our model $y_{\theta}$ to the sampled function values in the training data but also to match our model's derivatives to the sampled training derivatives. The training objective is generalized in the way you would expect: $$\sum_{i=1}^N \left[ l(y_{\theta}(x_i), f(x_i)) + \sum_{j=1}^K l(D^j_{\mathbf{x}}(y_{\theta}(x_i)), D^j_{\mathbf{x}}(f(x_i)))\right]$$ Of course, this is all well and good, but the natural question is: how does one go about acquiring derivative samples to supervise the Sobolev training objective? The authors don't give an answer in the general case, but they do put forth a couple of circumstances under which such training data is readily available.
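In code, the first-order case ($K = 1$) might look like the sketch below, again assuming PyTorch. The model's input-derivatives come from `torch.autograd.grad`, and `df_x` stands for the supplied derivative targets $D^1_{\mathbf{x}}(f(x_i))$; the names and the `alpha` weighting are my own, not the paper's.

```python
def sobolev_training_step(x, f_x, df_x, alpha=1.0):
    """One step of first-order (K=1) Sobolev training.

    x     : inputs x_i, shape (batch, d)
    f_x   : function-value targets f(x_i), shape (batch, 1)
    df_x  : derivative targets D_x f(x_i), shape (batch, d)
    alpha : weight on the derivative-matching term
    """
    optimizer.zero_grad()
    x = x.clone().requires_grad_(True)   # we need dy/dx, so x must track gradients
    y = model(x)

    # D_x y_theta(x_i): derivative of the model's output w.r.t. its input.
    # create_graph=True so this quantity can itself be backpropagated through.
    dy_dx = torch.autograd.grad(y.sum(), x, create_graph=True)[0]

    value_loss = loss_fn(y, f_x)         # l(y_theta(x_i), f(x_i))
    deriv_loss = loss_fn(dy_dx, df_x)    # l(D_x y_theta(x_i), D_x f(x_i))
    loss = value_loss + alpha * deriv_loss

    loss.backward()
    optimizer.step()
    return loss.item()
```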

The first example they give pertains to instances where the ground-truth function is itself a neural network to which the practitioner has a priori access. This is the case for policy distillation in RL, for neural network compression (expressing an equivalent model using fewer weights), and for the prediction of synthetic gradients. In that setting, derivative targets can be generated on demand, as sketched below.
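A minimal sketch of generating those targets, assuming a differentiable PyTorch `teacher` network (a placeholder name of mine) standing in for the ground truth $f$:

```python
def make_sobolev_targets(teacher, x):
    """Produce (f(x_i), D_x f(x_i)) targets from a differentiable teacher network."""
    x = x.clone().requires_grad_(True)
    f_x = teacher(x)
    # Gradient of the teacher's scalar output w.r.t. its input, one row per sample.
    df_x = torch.autograd.grad(f_x.sum(), x)[0]
    return f_x.detach(), df_x.detach()
```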

They also mention that in certain online settings, even when analytic gradients are unavailable for sampling, those gradients can be approximated using finite differences. More on that in a bit.
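As a quick illustration of the idea (not the paper's implementation), a central-difference estimate of each partial derivative of a scalar black-box function `f` could look like this:

```python
def finite_difference_grad(f, x, eps=1e-4):
    """Central-difference estimate of the gradient of a scalar black-box function f at x."""
    grad = torch.zeros_like(x)
    for i in range(x.numel()):
        e = torch.zeros_like(x)
        e.view(-1)[i] = eps
        grad.view(-1)[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return grad
```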

What's a Sobolev Space and Why the Formalism?

Simply put, a Sobolev space is a vector space whose elements are functions, equipped with a norm that takes into account both the functions and their derivatives. The norm gives us a way of quantifying the distance between any two functions (and between their derivatives), and thus lets us optimize objectives like the Sobolev training objective above. There's a lot of fancy math going on with Sobolev spaces under the hood. For example, when one defines such a space, differentiability is normally extended to hold in a *weak* sense. This is done so that Cauchy sequences in the space actually converge within it, which makes the space complete, i.e. a Banach space. These are functional-analytic notions, and if you'd like a primer on the underlying math, I encourage you to consult my notes on functional analysis on this blog.
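A canonical example of a weak derivative: $f(t) = |t|$ is not classically differentiable at $0$, but it has the weak derivative $\mathrm{sgn}(t)$, since integration by parts gives $$\int |t| \, \varphi'(t) \, dt = -\int \mathrm{sgn}(t) \, \varphi(t) \, dt$$ for every smooth, compactly supported test function $\varphi$.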

In order to make things crystal clear, I'll quickly introduce the natural norm induced on a Sobolev space.

Definition: A (one-dimensional) Sobolev space $W^{k,p}$ is the subset of functions $f$ in $L^p(\mathbb{R})$ such that $f$ and its weak derivatives up to order $k$ have finite $L^p$ norm, for a given $p$ with $1 \leq p \leq \infty$.

With that, the norm on $f$ is defined as $$\| f \|_{k, p} = \left(\sum_{i=0}^k \| f^{(i)}\|_p^p\right)^{\frac{1}{p}} = \left( \sum_{i=0}^k \int |f^{(i)}(t)|^p \ dt\right)^{\frac{1}{p}}$$
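To see what this measures, take $k = 1$ and $p = 2$, the setting most relevant to first-order Sobolev training. The norm reduces to $$\| f \|_{1,2} = \left( \int |f(t)|^2 \, dt + \int |f'(t)|^2 \, dt \right)^{\frac{1}{2}}$$ so a function is small in this norm only when both its values and its first derivative are small in the $L^2$ sense. With squared losses, the Sobolev training objective above can be read as a finite-sample analogue of this quantity (squared) applied to the error $y_{\theta} - f$, with the integral replaced by a sum over the training inputs.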