The Variational Calculus: Part 1
In introductory calculus, we learn to take derivatives of functions of a single variable. In vector calculus, we learn to do the same for vector-valued functions, although here the analysis becomes more complex, as there is no longer a single, canonical notion of a derivative. Depending on the perspective you wish to take, you can work with directional derivatives, partial derivatives, gradients, or Jacobians (total derivatives). A more technical view of the derivative flows from differential geometry, whose focal point is the differential: a map between the tangent spaces of manifolds.
Yet, outside of physics, it’s rare to encounter one of the most powerful tools in the mathematician’s toolbox: the functional (or variational) derivative. The variational derivative differentiates functionals: maps whose domain is an infinite-dimensional space of functions and whose codomain is the space of reals. It is relied upon heavily in the Lagrangian and Hamiltonian formulations of mechanics, in electrodynamics, and in functional analysis more generally. Despite this, its use is hardly explored outside the advanced echelons of mathematics and physics.
In this post, I hope to explain the variational derivative in an intuitive way and show how it can be used outside of physics, for example in applied fields such as machine learning.
Defining the Variational Derivative
Classically speaking, the variational derivative is defined in the context of functionals defined in terms of an integral, though this is not the only setting in which it applies. We’ll use this formulation as our starting point. We typically define an action functional as follows $$J[y] = \int_a^b f(x, y, y')\ dx$$ where $y$ is a function of $x$, $y'$ is the derivative of $y$ with respect to $x$, and $f$ is a function of $x$, $y$, and $y'$. In the physics context, we typically take $f$ to be the Lagrangian of the system, i.e. $L = T - U$ where $T$ is the kinetic energy and $U$ is the potential energy.
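To make this concrete, here’s a minimal numerical sketch (pure Python; the helper name `action` is my own, not a standard API) that approximates $J[y]$ with the trapezoidal rule for the simple choice $f(x, y, y') = (y')^2$ and $y(x) = x$ on $[0, 1]$, where the exact value is $J[y] = 1$:

```python
def action(f, y, dy, a, b, n=10_000):
    """Approximate J[y] = ∫_a^b f(x, y(x), y'(x)) dx with the trapezoidal rule."""
    h = (b - a) / n
    total = 0.0
    for i in range(n + 1):
        x = a + i * h
        w = 0.5 if i in (0, n) else 1.0  # endpoint weights for the trapezoidal rule
        total += w * f(x, y(x), dy(x))
    return total * h

# Example: f(x, y, y') = (y')^2 with y(x) = x on [0, 1], so J[y] = ∫ 1 dx = 1.
J = action(lambda x, y, yp: yp ** 2, lambda x: x, lambda x: 1.0, 0.0, 1.0)
print(J)  # ≈ 1.0
```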
It’s worth thinking about the relationship between $f$ and $J$, as well as their domains. $f$ is typically taken to be a real-valued function, though it need not be; all that is required is that it be integrable. Furthermore, its domain is high-dimensional in that $x$ is typically taken to be an element of the configuration space of some physical system, i.e. the set of $\mathbb{R}^3$ coordinates of all particles in the system. Related to the configuration space of a system is the phase space, which is the set of all possible positions and momenta of the particles. The phase space gives a complete picture of the system’s dynamics, and is typically used in the Hamiltonian formulation of mechanics.
There’s a neat way to conceptualize the relationship between configuration space and phase space. One can think of the points in the configuration space of a system as defining a high-dimensional manifold, $\mathcal{M}$. At any point $q \in \mathcal{M}$, we can examine the tangent space $T_q\mathcal{M}$. Because velocity is the derivative of position with respect to time, $T_q\mathcal{M}$ is the set of all possible velocity vectors of the system at the configuration $q$. To get momenta from velocities, we use the definition $\mathbf{p} = m\mathbf{v}$, where $m$ is the mass of the particle; identifying these vectors with the linear functionals they induce, we can think of the momentum as living in the cotangent space $T^*_q\mathcal{M}$. Furthermore, by applying the Legendre transform to the Lagrangian, we get the Hamiltonian, a function on the cotangent bundle $T^*\mathcal{M}$. More on that later.
Mechanics on Manifolds
Defining mechanics in the setting of manifolds leads to a number of nice generalizations. For example, taking $\mathcal{M}$ to be a Riemannian manifold, we can view the kinetic energy, $T$, as a quadratic form on the tangent bundle $T\mathcal{M}$, i.e. $$T = \frac{1}{2} \langle \mathbf{v}, \mathbf{v} \rangle, \quad \mathbf{v} \in T_q\mathcal{M}$$ and, similarly, the potential energy is just a differentiable function $U: \mathcal{M} \to \mathbb{R}$.
By dropping in different potential energy functions, we can represent a variety of physical systems. For example, if we use the gravitational potential, $$U = -\frac{Gm_1m_2}{r}$$ where $G$ is the gravitational constant, $m_1$ and $m_2$ are the masses of the two particles, and $r$ is the distance between them, we can model systems such as those involving planetary motion. Alternatively, we could use the electrostatic potential, $$U = k\frac{q_1q_2}{r}$$ where $k$ is Coulomb’s constant and $q_1$, $q_2$ are the charges, to model particles interacting via the electromagnetic force.
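As a tiny numerical illustration of the gravitational potential, here’s a sketch using rounded textbook values for the Earth–Sun pair (illustrative constants, not a library API):

```python
# Gravitational potential energy of the Earth–Sun pair (SI units, rounded values).
G  = 6.674e-11   # gravitational constant, m^3 kg^-1 s^-2
m1 = 5.972e24    # mass of Earth, kg
m2 = 1.989e30    # mass of Sun, kg
r  = 1.496e11    # mean Earth–Sun distance, m

U = -G * m1 * m2 / r
print(U)  # ≈ -5.3e33 J (negative: the pair is gravitationally bound)
```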
Hamilton’s Principle
Hamilton’s principle states that the motion of a system of particles from time $t_0$ with configuration $\mathbf{q_0} = \mathbf{q}(t_0)$ to time $t_1$ with configuration $\mathbf{q_1} = \mathbf{q}(t_1)$ in configuration space is such that the action functional involving the Lagrangian $$J[\mathbf{q}] = \int_{t_0}^{t_1} L(t, \mathbf{q}, \dot{\mathbf{q}}) dt$$ is minimized or stationary, i.e. at the point where its variation is zero.
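We can see Hamilton’s principle at work numerically. For a free particle of unit mass ($L = T$, $U = 0$), the straight-line path between two configurations should have a smaller action than any wiggly detour with the same endpoints. A sketch (the helper name `free_particle_action` is my own):

```python
import math

def free_particle_action(qdot, t0=0.0, t1=1.0, n=10_000):
    """Action J[q] = ∫ ½ q'(t)² dt for a unit-mass free particle (L = T, U = 0)."""
    h = (t1 - t0) / n
    total = 0.0
    for i in range(n + 1):
        t = t0 + i * h
        w = 0.5 if i in (0, n) else 1.0  # trapezoidal-rule endpoint weights
        total += w * 0.5 * qdot(t) ** 2
    return total * h

# Straight path q(t) = t vs. detour q(t) = t + 0.3 sin(πt); both go from q(0)=0 to q(1)=1.
straight = free_particle_action(lambda t: 1.0)
wiggly = free_particle_action(lambda t: 1.0 + 0.3 * math.pi * math.cos(math.pi * t))
print(straight, wiggly)  # the straight path has the smaller action
```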
Defining the Variation
The variational derivative is the function-space analogue of the derivative in single-variable calculus or the gradient in vector calculus. By deriving the variational derivative and setting it equal to zero, we can solve for the path that the system will take in configuration space.
Some subtleties arise when working in the infinite-dimensional function space, however. Much like in the vector setting, we need to define a norm on the function space in order to measure how close together any two functions are. There exist a variety of standard choices. For example, in the space $C^n[x_0, x_1]$, the space of functions that are continuously differentiable up to order $n$ on the interval $[x_0, x_1]$, we can use any of the $\lVert \cdot \rVert_p$ norms, where $p \in [1, \infty]$. The $L^p$ norms are another standard choice and are defined as
$$\lVert f \rVert_{L^p} = \left( \int_{x_0}^{x_1} |f(x)|^p\ dx \right)^{1/p}.$$
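Here’s a quick sketch of computing this norm numerically (the helper name `lp_norm` is my own, approximating the integral with the trapezoidal rule):

```python
def lp_norm(f, x0, x1, p, n=10_000):
    """Approximate the L^p norm (∫ |f(x)|^p dx)^(1/p) with the trapezoidal rule."""
    h = (x1 - x0) / n
    total = 0.0
    for i in range(n + 1):
        x = x0 + i * h
        w = 0.5 if i in (0, n) else 1.0
        total += w * abs(f(x)) ** p
    return (total * h) ** (1.0 / p)

# For f(x) = x on [0, 1], the L^2 norm is (∫ x² dx)^(1/2) = 1/√3 ≈ 0.5774.
norm2 = lp_norm(lambda x: x, 0.0, 1.0, 2)
print(norm2)
```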
Unlike in the vector case, it turns out that the choice of norm is critically important, as different norms may lead to different extrema of the functional. Functions that are close in one norm need not be close in another.
Using Limits
Having noted this, we can begin to define the variational derivative. Let’s briefly revisit our notation. Let $J: X \to \mathbb{R}$ be a functional of the form $$J[y] = \int_a^b f(x, y, y')\ dx$$ defined on $(X, \lVert \cdot \rVert)$, and let $S \subseteq X$. We say that $J$ has a local maximum at $y \in S$ if there exists an $\epsilon > 0$ such that for all $\hat{y} \in S$ satisfying $\lVert \hat{y} - y \rVert < \epsilon$, we have $J[\hat{y}] - J[y] \leq 0$. Similarly, $J$ is said to have a local minimum at $y$ in $S$ if $y$ is a local maximum of $-J$.
Now, given any $\hat{y}$ satisfying $\lVert \hat{y} - y \rVert < \epsilon$, we can define a perturbation of $y$ to be a function $\eta$ satisfying $$\hat{y} = y + \epsilon \eta.$$ Such an $\eta$ always exists, and we can generate all the functions within an $\epsilon$-ball of $y$ as perturbations of this type. To that end, we can define the set $$H = \{\eta \in X : y + \epsilon \eta \in S \}.$$
Now, we can define the first variation of $J$ at $y$ in the direction $\eta$ to be $$\delta J(\eta, y) = \lim_{\epsilon \to 0} \frac{J[y + \epsilon \eta] - J[y]}{\epsilon} = \frac{d}{d\epsilon} J[y + \epsilon \eta]\big|_{\epsilon = 0}.$$ This is also sometimes called the Gateaux derivative of $J$. The Gateaux derivative generalizes the notion of the directional derivative to function spaces, i.e. it takes the derivative of $J$ at $y$ in the direction of $\eta$. There are infinitely many Gateaux derivatives at a single point $y$, one for each direction $\eta$.
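We can estimate a Gateaux derivative numerically with a central difference in $\epsilon$. Below is a sketch (the helper names are my own) using the functional $J[y] = \int_0^1 y(x)^2\ dx$, whose first variation in the direction $\eta$ is $\int_0^1 2 y \eta\ dx$; with $y(x) = x$ and $\eta(x) = 1$ this evaluates to $1$:

```python
def J(y, a=0.0, b=1.0, n=2_000):
    """J[y] = ∫_0^1 y(x)² dx, approximated with the trapezoidal rule."""
    h = (b - a) / n
    return sum((0.5 if i in (0, n) else 1.0) * y(a + i * h) ** 2
               for i in range(n + 1)) * h

def gateaux(J, y, eta, eps=1e-5):
    """Central-difference estimate of δJ(η, y) = d/dε J[y + ε η] at ε = 0."""
    plus = J(lambda x: y(x) + eps * eta(x))
    minus = J(lambda x: y(x) - eps * eta(x))
    return (plus - minus) / (2 * eps)

# δJ(η, y) = ∫ 2 y η dx; for y(x) = x and η(x) = 1, this is ∫ 2x dx = 1.
d = gateaux(J, lambda x: x, lambda x: 1.0)
print(d)  # ≈ 1.0
```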
$\delta J$ is a nice operator to consider, because it gives us a necessary condition for establishing a local extremum of $J$.
Theorem. If $y$ is a local extremum of $J$, then $\delta J(\eta, y) = 0$ for all $\eta \in H$.
Proof. Assume wlog that $y$ is a local minimum of $J$. Then, $J[y + \epsilon \eta] - J[y] \geq 0$ for all sufficiently small $\epsilon$, of either sign. Taking the limit as $\epsilon \to 0$ from each side, we see that $$\lim_{\epsilon \to 0^+} \frac{J[y + \epsilon \eta] - J[y]}{\epsilon} \geq 0$$ and $$\lim_{\epsilon \to 0^-} \frac{J[y + \epsilon \eta] - J[y]}{\epsilon} \leq 0.$$ Since the two-sided limit defining $\delta J(\eta, y)$ exists, it must equal both one-sided limits. Thus, $\delta J(\eta, y) = 0. \quad \square$
Another Formulation
So far, we’ve defined the first variation using the Gateaux derivative. However, we can arrive at it in another way. First, note that it will be convenient to rewrite $J$ as a function of $\epsilon$, i.e. we define $$J(\epsilon) \equiv J[y + \epsilon \eta] = \int_a^b f(x, y + \epsilon \eta, y' + \epsilon \eta')\ dx.$$ Then, we can define the total variation of $J$ as $$\Delta J = J(\epsilon) - J(0).$$ Expanding using the definition of $J$ gives us $$\Delta J = \int_a^b \left[ f(x, y + \epsilon \eta, y' + \epsilon \eta') - f(x, y, y') \right]\ dx.$$ Assuming $f$ has a sufficient number of continuous partial derivatives, we can expand this in a standard Taylor series in $\epsilon$ to get $$\Delta J = \epsilon\, \delta J + \frac{\epsilon^2}{2} \delta^2 J + \mathcal{O}(\epsilon^3).$$ Do you see now why we call $\delta J$ the first variation of $J$? In the same vein, we call $\delta^2 J$ the second variation of $J$.
An equivalent way to write the first variation of $J$ is as the derivative $$\delta J = \frac{dJ(\epsilon)}{d\epsilon} \big|_{\epsilon = 0}$$ which evaluates, by the chain rule, to $$\delta J = \int_a^b \left[ \frac{\partial f}{\partial y} \eta + \frac{\partial f}{\partial y'} \eta' \right]\ dx.$$
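We can check this integral formula numerically against the finite-difference definition of the Gateaux derivative. The sketch below (helper names are my own) uses $f(x, y, y') = (y')^2$, $y(x) = x^2$, and $\eta(x) = \sin(\pi x)$ on $[0, 1]$; both computations should agree, and the exact value works out to $-8/\pi$:

```python
import math

def trapz(g, a, b, n=4_000):
    """Trapezoidal-rule approximation of ∫_a^b g(x) dx."""
    h = (b - a) / n
    return sum((0.5 if i in (0, n) else 1.0) * g(a + i * h)
               for i in range(n + 1)) * h

# f(x, y, y') = (y')^2, so ∂f/∂y = 0 and ∂f/∂y' = 2y'.
dy = lambda x: 2 * x                            # y(x) = x², so y'(x) = 2x
eta = lambda x: math.sin(math.pi * x)
deta = lambda x: math.pi * math.cos(math.pi * x)

# δJ from the integral formula: ∫ [ (∂f/∂y) η + (∂f/∂y') η' ] dx.
delta_J = trapz(lambda x: 0.0 * eta(x) + 2 * dy(x) * deta(x), 0.0, 1.0)

# The same quantity via d/dε J[y + ε η] at ε = 0 (central difference).
eps = 1e-5
J_eps = lambda e: trapz(lambda x: (dy(x) + e * deta(x)) ** 2, 0.0, 1.0)
fd = (J_eps(eps) - J_eps(-eps)) / (2 * eps)

print(delta_J, fd)  # both ≈ -8/π ≈ -2.546
```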
As we’ve shown already, a necessary condition for $y$ to be a local extremum of $J$ is that $\delta J(\eta, y) = 0$ for all $\eta \in H$. However, it is not a sufficient condition. $y$ may satisfy $\delta J(\eta, y) = 0$ for all $\eta \in H$ and still not be a local extremum (e.g., think saddle points).
However, this condition is important enough to warrant its own terminology. If $\delta J(\eta, y) = 0$ for all $\eta \in H$, then we say that $J$ is stationary at $y$.
Setting the first variation equal to zero is the infinite-dimensional analogue of setting the gradient of a function equal to zero.
We still have a ways to go in applying the first variation to solving variational calculus problems involving optimization of an action functional. In the next post, I’ll cover the Euler-Lagrange equation and show how it can be used to solve the types of problems with which we’re concerned. Stay tuned!