Everyone Should Learn Optimal Transport, Part 1
In my opinion, optimal transport (OT) is a seriously underrated topic. I think part of the reason is the way OT is often introduced: as an optimization problem or as a metric on probability distributions. While these are interesting to study in their own right, OT provides a tool that is far more powerful, and arguably fundamental, to the study of probability. In a series of two posts, I’m going to present a message that is well understood by experts but often missed by the uninitiated:
Optimal transport gives us calculus on the space of probability distributions.
I will begin by discussing at a high level what I mean by this message, and then present two concrete examples of using OT as a tool of calculus that make the point clear. These two examples will form the basis of the two blog posts.
From General (But Weak) to Specific (But Powerful)
To start, I think we should review why we want to have calculus in the first place. The space of probability measures $\mathscr{P}$ is, at the most basic level, just a set, which carries essentially no structure. However, when studying probability measures, we often desire more structure. For example, we often want to equip the space with the topology of weak convergence (really weak-$*$ convergence, also known as convergence in distribution):
\[\mu_n \xrightarrow{d} \mu \quad \text{ if } \quad \int \varphi \, d\mu_n \to \int \varphi \, d\mu \,, \text{ for all } \varphi \text{ continuous and bounded} \,.\]In simpler terms, this determines which subsets of $\mathscr{P}$ are open, and the collection of open sets defines a topology. (In fact, this topology is metrizable, but we won’t discuss that here.)
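As a quick numerical illustration of this definition, here is a minimal sketch (my own toy example, with the hypothetical choices $\mu_n = N(1/n, 1)$, $\mu = N(0, 1)$, and test function $\varphi = \tanh$) that checks the defining condition for a single bounded continuous $\varphi$; of course, one test function does not establish weak convergence, it only shows how the condition is read.

```python
# Toy check of the weak convergence condition: integrate a bounded continuous
# test function phi against mu_n = N(1/n, 1) and compare with mu = N(0, 1).
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

phi = np.tanh  # bounded and continuous

def integral_against(mean):
    # computes the integral of phi against N(mean, 1) by numerical quadrature
    return quad(lambda x: phi(x) * norm.pdf(x, loc=mean), -np.inf, np.inf)[0]

limit = integral_against(0.0)
for n in [1, 10, 100, 1000]:
    print(n, abs(integral_against(1.0 / n) - limit))  # gap shrinks as n grows
```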
Most of you reading this post have probably seen optimal transport defined via a variant of the following problem. Let $\mu, \nu \in \mathscr{P}(\mathbb{R}^d)$ be two probability distributions, and $\Gamma(\mu,\nu)$ be the set of all couplings (i.e. joint distributions with marginals $\mu$ and $\nu$). Then we can define the 2-Wasserstein distance as
\[W_2(\mu,\nu)^2 := \inf_{ \gamma \in \Gamma(\mu,\nu) } \int \|x-y\|^2 \, d\gamma(x,y) \,.\]We can indeed show that $W_2$ is a metric on $\mathscr{P}_2$ (the space of distributions with finite second moments), and this provides a specific notion of distance. Consequently, $(\mathscr{P}_2, W_2)$ forms a metric space. Now, as many of you know or suspect, a metric structure is much more powerful than a topology, and there are many more theorems we can prove about metric spaces than about topological spaces.
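To make the definition a bit more tangible, here is a minimal numerical sketch (my own illustration, not part of the standard development) for the special case of one-dimensional empirical measures with equal weights, where the optimal coupling is simply the monotone matching of sorted samples; in higher dimensions one would instead use a dedicated solver such as the POT library.

```python
# W_2 between two 1D empirical measures with n equal-weight atoms each:
# the optimal coupling matches sorted samples, so W_2^2 is a mean of squared
# differences of order statistics.
import numpy as np

def w2_empirical_1d(x, y):
    """2-Wasserstein distance between (1/n) sum delta_{x_i} and (1/n) sum delta_{y_i}."""
    x, y = np.sort(x), np.sort(y)
    return np.sqrt(np.mean((x - y) ** 2))

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=10_000)  # samples from N(0, 1)
y = rng.normal(2.0, 1.0, size=10_000)  # samples from N(2, 1)
print(w2_empirical_1d(x, y))           # ≈ 2, the true W_2 between N(0,1) and N(2,1)
```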
At the same time, Euclidean spaces like $\mathbb{R}^d$ are also metric spaces, yet they are far easier to work with than general metric spaces. The key gap between Euclidean spaces and general metric spaces is differentiability. In fact, spaces with both a metric and a differentiable structure have a name: they are called Riemannian manifolds. In particular, we need a special inner product called the Riemannian metric $g$, and an affine connection $\nabla$, which can be thought of as singling out a unique notion of “second derivative.” The above discussion is summarized in the table below:
Space | Notation | Structure |
---|---|---|
Set | $\mathscr{P}$ | None |
Topological Space | $(\mathscr{P}, \tau)$ | Open Sets |
Metric Space | $(\mathscr{P}, d)$ | Distance |
Riemannian Manifold | $(\mathscr{P}, g, \nabla)$ | Calculus |
Now, fully developing the theory of the Riemannian manifold structure of optimal transport, sometimes called Otto calculus [Ott01], would require a whole book, which I won’t attempt here. I will instead direct the interested reader to Villani’s Topics in Optimal Transportation [Vil03] as an excellent introduction to the subject.
However, I will sketch out a concrete example next to illustrate an application of the calculus tools that OT provides us.
The PL-Inequality and Exponential Convergence
In optimization (and beyond), we often care about the convergence of gradient descent (or gradient flow in continuous time). It turns out that the exact characterization of exponential convergence is an inequality due to Polyak and Łojasiewicz [Pol63, Łoj63], which we define as follows.
Definition We say a function $F \in C^1( \mathbb{R}^d )$ where $\inf F = 0$ satisfies a PL-inequality with constant $\alpha>0$ if
\[\| \nabla F(x) \|^2 \geq 2 \alpha F(x) \,, \quad \forall x \in \mathbb{R}^d \,.\]For example, the quadratic $F(x) = \frac{\alpha}{2}\|x\|^2$ satisfies this inequality with equality. Once again, the importance of the PL-inequality is that it exactly characterizes the exponential convergence rate of gradient flow. We will make this precise in the following statement.
Theorem Suppose $\{X_t\}_{t \geq 0}$ is the solution of the gradient flow ODE with respect to the potential $F \in C^1( \mathbb{R}^d )$ and initial condition $x_0 \in \mathbb{R}^d$
\[\dot X_t = - \nabla F(X_t) \,, \quad X_0 = x_0 \,.\]Then $F$ satisfies a PL-inequality with constant $\alpha > 0$ if and only if the gradient flow converges exponentially fast for all initial conditions, i.e.
\[F(X_t) \leq e^{-2 \alpha t} F(X_0) \,, \quad \forall X_0 = x_0 \in \mathbb{R}^d \,, t \geq 0 \,.\]While this may seem a bit mysterious at the moment, the point of this characterization is that its derivation is a straightforward exercise in calculus. In fact, we can quickly sketch the forward direction.
We start by computing the time derivative of $F(X_t)$
\[\partial_t F(X_t) = \langle \nabla F(X_t) , \dot X_t \rangle = \langle \nabla F(X_t), -\nabla F(X_t) \rangle = - \|\nabla F(X_t)\|^2 \,.\]Now we see that this is exactly the quantity controlled by the PL-inequality! Applying the inequality gives us
\[\partial_t F(X_t) \leq - 2 \alpha F(X_t) \,.\]At this point, it’s a direct application of Grönwall’s inequality to get
\[F(X_t) \leq e^{-2\alpha t} F(X_0) \,, \quad \forall t \geq 0 \,,\]which is the desired exponential convergence result.
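To see the theorem in action, here is a small numerical sanity check of my own (using the hypothetical potential $F(x) = \frac{\alpha}{2}\|x\|^2$, which satisfies the PL-inequality with constant $\alpha$): we discretize the gradient flow by explicit Euler, i.e. gradient descent with a small step size, and verify the exponential decay bound along the trajectory.

```python
# Explicit Euler discretization of the gradient flow dX_t = -grad F(X_t)
# for F(x) = (alpha/2) ||x||^2, checked against F(X_t) <= exp(-2 alpha t) F(X_0).
import numpy as np

alpha, h, T = 1.5, 1e-3, 3.0
F = lambda x: 0.5 * alpha * np.dot(x, x)
grad_F = lambda x: alpha * x

x = np.array([2.0, -1.0])  # initial condition x_0
f0 = F(x)
for k in range(int(T / h)):
    x = x - h * grad_F(x)                               # one Euler step
    t = (k + 1) * h
    assert F(x) <= np.exp(-2 * alpha * t) * f0 + 1e-12  # PL => exponential decay
print("F(X_T) =", F(x), " bound =", np.exp(-2 * alpha * T) * f0)
```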
To summarize this section, I want to remark once again that this derivation is only straightforward because we had access to calculus. We will see in the next section that this is not always the case, and how Otto calculus provides exactly what we need.
On the Convergence of Langevin Diffusion
A widely studied process in physics, statistics, and machine learning is the Langevin diffusion, defined by the following stochastic differential equation (SDE)
\[dX_t = - \nabla F(X_t) \, dt + \sqrt{2} \, dB_t \,,\]where $\{B_t\}_{t \geq 0}$ is a standard Brownian motion. Effectively, it is the gradient flow ODE from the previous section plus Brownian noise.
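For concreteness, here is a minimal Euler–Maruyama discretization of this SDE (my own sketch, using the illustrative potential $F(x) = \frac{x^2}{2}$, for which the stationary distribution discussed next is simply $N(0,1)$): running many chains for a long time, the empirical mean and variance approach 0 and 1.

```python
# Euler–Maruyama discretization of dX_t = -grad F(X_t) dt + sqrt(2) dB_t
# with F(x) = x^2 / 2, so the stationary distribution is the standard Gaussian N(0, 1).
import numpy as np

grad_F = lambda x: x
h, n_steps, n_chains = 1e-2, 2_000, 5_000
rng = np.random.default_rng(0)

x = np.full(n_chains, 5.0)  # start every chain far from equilibrium
for _ in range(n_steps):
    xi = rng.standard_normal(n_chains)
    x = x - h * grad_F(x) + np.sqrt(2 * h) * xi  # X_{t+h} ≈ X_t - h grad F(X_t) + sqrt(2h) xi
print("mean ≈", x.mean(), " variance ≈", x.var())  # ≈ 0 and ≈ 1 for large t
```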
Now, there are many reasons why the Langevin diffusion is nice, but one particularly helpful property is that its stationary distribution is the Gibbs distribution, i.e.
\[\nu(dx) \propto e^{ -F(x) } \, dx \,.\]For example, when $F(x) = \frac{\alpha}{2}\|x\|^2$, the stationary distribution is simply the Gaussian $N(0, \alpha^{-1} I)$. It is rare to be able to characterize the stationary distribution of a stochastic process in such an explicit closed form, and we can use this fact to establish many desirable properties. However, when the time $t$ is finite, how well does the Langevin diffusion approximate its stationary distribution?
This is a question of convergence, one that has been well studied over the years. In particular, we would like to understand the necessary and sufficient conditions for exponential convergence. To this end, we can consider the following result from the excellent reference by Bakry, Gentil, and Ledoux [BGL14].
Theorem (Exponential Decay in Entropy) Let $\mu_t = \mathscr{L}(X_t)$ be the time marginal distribution of Langevin diffusion, and let $H_\nu(\mu_t)$ denote the Kullback–Leibler (KL) divergence of $\mu_t$ with respect to the Gibbs distribution $\nu$. Then the exponential decay of KL with rate $2\alpha > 0$, i.e.
\[H_\nu(\mu_t) \leq e^{ - 2\alpha t} H_\nu(\mu_0) \,, \quad \forall t \geq 0 \,, \mu_0 \text{ such that } H_\nu(\mu_0) < \infty \,,\]is equivalent to the logarithmic Sobolev inequality (LSI) with constant $\alpha > 0$, i.e.
\[H_\nu(\mu) \leq \frac{1}{2 \alpha} I_\nu(\mu) = \frac{1}{2\alpha} \int \left\| \nabla \log \frac{d\mu}{d\nu} \right\|^2 d\mu \,, \quad \forall \mu \text{ such that } H_\nu(\mu) < \infty \,.\]When I first read this result and the somewhat technical derivation behind the proof, it was daunting to say the least. More importantly, I didn’t know what to make of it. In other words, I couldn’t see the structure of the proof. It was one of those partial differential equation (PDE) proofs that just kind of “fell into place,” if you know what I mean.
However, the curious reader might have already noticed something interesting: the logarithmic Sobolev inequality looks quite similar to the PL-inequality. As we will see next, this is not a coincidence at all.
Wasserstein Gradient Flow
Finally, we come to the main result of this blog post: the famous Jordan–Kinderlehrer–Otto (JKO) theorem [JKO98]. In the simplest terms possible, it says the following.
The time marginal $\mu_t = \mathscr{L}(X_t)$ of the Langevin diffusion is following the path of a certain gradient flow of the KL-divergence in the space of probability distributions.
This result directly led to the insight behind Otto calculus: the space of probability distributions $\mathscr{P}$ is naturally equipped with a differentiable structure! Perhaps I’m a slow learner, but over time I have come to appreciate this result more and more.
To be precise, the gradient flow is defined as the continuous-time limit of a proximal (minimizing movement) scheme:
\[\mu_{t+h} \approx \operatorname*{arg\,min}_\rho \; H_\nu(\rho) + \frac{1}{2h} W_2( \mu_t, \rho )^2 \,, \quad \text{ for small } h > 0 \,.\]In Euclidean space, the analogous step $x_{t+h} = \operatorname*{arg\,min}_x F(x) + \frac{1}{2h}\|x - x_t\|^2$ is precisely the implicit Euler discretization of the gradient flow ODE. If the same operation is performed on a Riemannian manifold, with $W_2$ replaced by the geodesic distance, it exactly recovers the Riemannian gradient flow. However, the surprising part of this result is that [JKO98] showed the limit solves the Fokker–Planck equation, i.e. the equation characterizing the law of the Langevin diffusion
\[\partial_t \mu_t = \text{div} \left( \mu_t \nabla \log \frac{d\mu_t}{d\nu} \right) \,.\]Therefore, from now on, we will use the following notation to denote the Fokker–Planck equation
\[\partial_t \mu_t = - \text{grad}_{W_2} H_\nu(\mu_t) \,.\]In fact, the famous de Bruijn identity is a straightforward consequence of Otto calculus:
\[\partial_t H_\nu(\mu_t) = - \left\| \text{grad}_{W_2} H_\nu(\mu_t) \right\|_{W_2}^2 = - I_\nu(\mu_t) \,,\]which directly mirrors the computation of $\partial_t F(X_t)$ in the gradient flow section above.
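Before finishing the argument, here is a small closed-form sanity check of my own (restricted to the Gaussian case $F(x) = \frac{\alpha}{2} x^2$ in one dimension, where the Langevin diffusion is an Ornstein–Uhlenbeck process and $\mu_t$ stays Gaussian): we verify numerically both de Bruijn's identity and the exponential decay of the KL divergence stated in the theorem above.

```python
# Gaussian (Ornstein–Uhlenbeck) case: F(x) = alpha*x^2/2, nu = N(0, 1/alpha), and
# mu_t = N(m_t, s_t^2) with m_t = m_0 e^{-alpha t}, s_t^2 = 1/alpha + (s_0^2 - 1/alpha) e^{-2 alpha t}.
# Both H_nu(mu_t) and I_nu(mu_t) have closed forms, so we can check de Bruijn's identity
# dH/dt = -I and the decay H(mu_t) <= exp(-2 alpha t) H(mu_0) directly.
import numpy as np

alpha, m0, s0 = 2.0, 3.0, 0.5
sig2 = 1.0 / alpha  # variance of nu

def moments(t):
    m = m0 * np.exp(-alpha * t)
    s2 = sig2 + (s0**2 - sig2) * np.exp(-2 * alpha * t)
    return m, s2

def H(t):  # KL( N(m, s2) || N(0, sig2) )
    m, s2 = moments(t)
    return 0.5 * (np.log(sig2 / s2) + (s2 + m**2) / sig2 - 1.0)

def I(t):  # relative Fisher information of N(m, s2) with respect to N(0, sig2)
    m, s2 = moments(t)
    return s2 * (1.0 / sig2 - 1.0 / s2) ** 2 + m**2 / sig2**2

eps = 1e-6
for t in np.linspace(0.0, 2.0, 201):
    dH = (H(t + eps) - H(t - eps)) / (2 * eps)  # finite-difference time derivative
    assert abs(dH + I(t)) < 1e-4                # de Bruijn: dH/dt = -I
    assert H(t) <= np.exp(-2 * alpha * t) * H(0.0) + 1e-12
print("de Bruijn and exponential entropy decay verified in the OU example")
```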
At this point, we can return to the logarithmic Sobolev inequality and recognize that this is exactly the PL-inequality using Otto calculus!
\[H_\nu(\mu) \leq \frac{1}{2 \alpha} \| \text{grad}_{W_2} H_\nu(\mu) \|_{W_2}^2 \,, \quad \forall \mu \text{ such that } H_\nu(\mu) < \infty \,.\]Combining this with de Bruijn's identity gives $\partial_t H_\nu(\mu_t) \leq - 2 \alpha H_\nu(\mu_t)$, and applying Grönwall's inequality once again recovers the desired exponential decay result
\[H_\nu(\mu_t) \leq e^{-2\alpha t} H_\nu(\mu_0) \,, \quad \forall t \geq 0\,, \mu_0 \text{ such that } H_\nu(\mu_0) < \infty \,.\]
Final Words
I think the biggest insight I gained from learning optimal transport is just how powerful calculus is. The drastic simplification of several well-known PDEs via Wasserstein gradient flow is an incredible breakthrough, and we have just seen one important application of it to the Fokker–Planck equation. I hope this post alone is sufficient motivation for “everyone” to learn optimal transport.
That being said, as I mentioned in the introduction, this is intended to be the first of two blog posts on the topic. In the next post, I plan to explore the geometric aspects of the Wasserstein manifold interpretation and discuss an application to mean-field neural networks.
Furthermore, I intentionally left out many important steps and details. In particular, the Wasserstein gradient can be computed explicitly (I sketch the formula just below). There are many nice references for optimal transport today that cover these details, and I strongly recommend [Vil03] as an accessible introduction.
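For the curious, here is the formula being referenced, stated informally and without any of the required regularity assumptions (this is my quick sketch, not the post's derivation): for a functional $\mathcal{F}$ on $\mathscr{P}_2$ with first variation $\frac{\delta \mathcal{F}}{\delta \mu}$, Otto calculus identifies the Wasserstein gradient as
\[\text{grad}_{W_2} \mathcal{F}(\mu) = - \text{div} \left( \mu \, \nabla \frac{\delta \mathcal{F}}{\delta \mu} \right) \,.\]Taking $\mathcal{F} = H_\nu$, the first variation is $\log \frac{d\mu}{d\nu} + 1$, and since constants have zero gradient we get $- \text{grad}_{W_2} H_\nu(\mu) = \text{div} \left( \mu \nabla \log \frac{d\mu}{d\nu} \right)$, matching the Fokker–Planck equation written above.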
References
- [BGL14] Bakry, D., Gentil, I., & Ledoux, M. (2014). Analysis and geometry of Markov diffusion operators (Vol. 103). Cham: Springer.
- [JKO98] Jordan, R., Kinderlehrer, D., & Otto, F. (1998). The variational formulation of the Fokker–Planck equation. SIAM Journal on Mathematical Analysis, 29(1), 1-17.
- [Łoj63] Łojasiewicz, S. (1963). A topological property of real analytic subsets. Colloques du CNRS, Les équations aux dérivées partielles, 117, 87-89.
- [Ott01] Otto, F. (2001). The geometry of dissipative evolution equations: the porous medium equation. Communications in Partial Differential Equations, 26(1-2), 101-174.
- [Pol63] Polyak, B. T. (1963). Gradient methods for minimizing functionals. Zhurnal Vychislitel’noi Matematiki i Matematicheskoi Fiziki, 3(4), 643-653.
- [Vil03] Villani, C. (2003). Topics in optimal transportation (Vol. 58). American Mathematical Soc.