To compute the conditional dynamics of Markov processes, we will use the h-transform by Joseph Doob [Bl10]. Let us consider a time-homogeneous Markov process \(\{ X(t) \}_{t \geq 0}\) to be conditioned on a shift invariant event \(A\), i.e.
\[\mathbb{P}( \{ X(t) \}_{t\geq 0} \in A | X(0) = x ) = \mathbb{P}( \{ X(t+s) \}_{t\geq 0} \in A | X(s) = x ) \,.\]An important example of a shift invariant event comes from gambler’s ruin, where \(X(t) \in (0,c)\) is a martingale, and the event can be defined as
\[A := \left\{ X(t) \text{ hits } c \text{ before } 0 \right\} \,.\]Here \(X(t)\) is intended to model a gambler’s wealth process in a fair betting game, and it’s well known that \(\mathbb{P}(A) = \frac{X(0)}{c}\) (a direct consequence of the optional stopping theorem).
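Before moving on, here is a quick Monte Carlo sanity check of the gambler’s ruin formula (my own addition; a simple symmetric random walk stands in for the martingale \(X(t)\)):

```python
import random

def ruin_prob(x, c, trials=20_000, seed=0):
    """Estimate P(simple symmetric random walk started at x hits c before 0)."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        w = x
        while 0 < w < c:
            w += 1 if rng.random() < 0.5 else -1
        wins += (w == c)  # True counts as 1
    return wins / trials
```

With \(x = 3\) and \(c = 10\), the estimate concentrates around \(x/c = 0.3\), as the optional stopping theorem predicts.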
We will provide a simple sketch of the h-transform result (see [Bl10] for a rigorous proof). Here we introduce the following notations:
\[\begin{split} \mathbb{P}_x( \cdot ) &:= \mathbb{P}( \cdot | X(0) = x ) \,, \\ h(x) &:= \mathbb{P}_x(A) \,, \\ P^t(x, dy) &:= \mathbb{P}_x( X(t) \in dy ) \,, \\ \tilde{P}^t(x, dy) &:= \mathbb{P}_x( X(t) \in dy | A) \,, \end{split}\]where \(h(x)\) is the key transform function, and \(P^t(x,dy)\) is the transition kernel, which completely characterizes the dynamics of a Markov process. Our goal is therefore to compute \(\tilde{P}^t(x,dy)\), which we can do via Bayes’ rule
\[\begin{aligned} \tilde{P}^t(x, dy) &= \mathbb{P}_x( X(t) \in dy | A) \\ &= \frac{ \mathbb{P}_x(A | X(t) \in dy) \mathbb{P}_x( X(t) \in dy ) }{ \mathbb{P}_x(A) } \\ &= \frac{h(y)}{h(x)} P^t(x, dy) \,, \end{aligned}\]where we used the shift invariance of \(A\) to write \(\mathbb{P}_x(A \vert X(t) \in dy) = h(y)\). In other words, the Radon–Nikodym derivative for the transition kernel is simply the ratio \(h(y)/h(x)\)!
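As a quick sanity check (added here; see [Bl10] for the rigorous treatment), note that \(\tilde{P}^t\) is indeed a probability kernel: shift invariance and the tower property give \(\mathbb{E}_x h(X(t)) = \mathbb{E}_x \mathbb{P}(A \vert X(t)) = \mathbb{P}_x(A) = h(x)\), i.e. \(h(X(t))\) is a martingale, hence

\[\int \tilde{P}^t(x, dy) = \frac{1}{h(x)} \int h(y) \, P^t(x, dy) = \frac{\mathbb{E}_x h(X(t))}{h(x)} = 1 \,.\]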
To make calculations even simpler, we will also compute the effect on the infinitesimal generator, namely the operator \(L\) defined as follows:
\[L[f](x) := \lim_{t \to 0} \frac{ \mathbb{E}_x f(X(t)) - f(x) }{t} \,, \quad \text{ if it exists, }\]where we define \(\mathbb{E}_x(\cdot) := \mathbb{E}( \cdot \vert X(0) = x)\).
We want to then compute
\[\begin{aligned} \tilde{L}f &:= \lim_{t \to 0} \frac{ \mathbb{E}_x( f(X(t)) | A ) - f(x) }{t} \\ &= \lim_{t \to 0} \frac{1}{t} \left( \int f(y) \tilde{P}^t(x,dy) - f(x) \right) \\ &= \lim_{t \to 0} \frac{1}{t} \left( \int f(y) \frac{h(y)}{h(x)} P^t(x,dy) - f(x) \right) \\ &= \frac{1}{h(x)} \lim_{t \to 0} \frac{1}{t} \left( \int (f(y) h(y) - f(x) h(x)) P^t(x,dy) \right) \\ &= \frac{1}{h(x)} L[fh](x) \,, \end{aligned}\]where we used the h-transform result above and the definition of the generator.
At this point, we will recall the well known fact that \(h\) is harmonic, i.e. \(Lh = 0\) (which follows since \(h(X(t))\) is a martingale), to save some calculations (I suspect the letter “h” in h-transform stands for “harmonic”). We also recall that for a diffusion process \(dX(t) = \mu(X(t))\,dt + dB(t)\), the generator follows from Itô’s Lemma
\[L[f](x) := \langle \mu(x), \nabla f(x) \rangle + \frac{1}{2} \Delta f(x) \,.\]Using the harmonic property, we have the following clean formula for the transformed generator
\[\tilde{L} [f](x) = \frac{ L[fh](x) }{h(x)} = \langle \mu(x) + \nabla \log h(x), \nabla f(x) \rangle + \frac{1}{2} \Delta f(x) \,,\]which corresponds to the diffusion process
\[d\tilde{X}(t) = ( \mu(\tilde{X}(t)) + \nabla \log h(\tilde{X}(t)) ) \, dt + dB(t) \,.\]To summarize, the h-transform simply adds a drift term to the original process! Here we remark that although the above formula looks simple in terms of \(h(x)\), the function \(h(x)\) itself is often quite complicated, making this calculation convoluted at best, if not completely intractable. This is why it is such a surprise that the proof in the next section comes out so clean.
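To see the added drift in action, here is a rough Euler-Maruyama sketch (my own addition; the step size and path count are arbitrary). For Brownian motion on \((0,c)\) conditioned to hit \(c\) before \(0\), gambler’s ruin gives \(h(x) = x/c\), so \(\nabla \log h(x) = 1/x\) and the conditioned process is \(d\tilde{X}(t) = dt/\tilde{X}(t) + dB(t)\), a Bessel(3)-type process that almost surely never hits \(0\):

```python
import math, random

def conditioned_bm_hits_c(x0=1.0, c=2.0, dt=1e-3, paths=200, seed=1):
    """Euler-Maruyama for dX = dt / X + dB: Brownian motion h-transformed
    by h(x) = x / c, i.e. conditioned to hit c before 0.
    Returns the fraction of simulated paths reaching c before 0."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(paths):
        x = x0
        for _ in range(400_000):
            x += dt / x + math.sqrt(dt) * rng.gauss(0.0, 1.0)
            if x >= c:
                hits += 1
                break
            if x <= 0.0:
                break  # discretization artifact: the true process never hits 0
    return hits / paths
```

Nearly every simulated path reaches \(c\) before \(0\); the rare exceptions are discretization artifacts near the origin.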
Here we will first state the main result.
Theorem Let \(\{ \lambda_i(t) \}_{t\geq 0, i \in [n]}\) be the Dyson Brownian motions, i.e. \(\lambda(t)\) satisfies the following stochastic differential equation (SDE)
\[d\lambda_i(t) = dB_i(t) + \sum_{j \neq i} \frac{dt}{ \lambda_i(t) - \lambda_j(t) } \,,\]where the initial conditions satisfy \(\lambda_1(0) > \lambda_2(0) > \cdots > \lambda_n(0)\), and \(\{B_i(t)\}\) are independent standard Brownian motions. Then we have the following equality in distribution
\[\{ \lambda_i(t) \}_{t \geq 0, i \in [n]} \overset{d}{=} \{ B_i(t) \}_{t \geq 0, i \in [n]} \vert A \,,\]where we define \(A := \{ \{B_i(t)\} \text{ do not intersect } \}\).
Before we start, we note that the event that \(n\) Brownian motions never intersect has probability zero. Then what does it even mean to condition on an event of zero probability? We will consider a collection of events \(\{A_c\}_{c>0}\) converging to the null event \(A\), such that \(\mathbb{P}(A_c) > 0\) for all \(c>0\), and compute the conditioned dynamics of these Brownian motions in the limit \(c\to \infty\).
To define these events \(A_c\), we will define the Vandermonde determinant:
\[\Delta_n( \lambda ) = \prod_{1 \leq i < j \leq n} ( \lambda_i - \lambda_j ) \,.\]Here we observe that since \(\lambda\) is sorted in decreasing order, we have that
\[\Delta_n( \lambda ) > 0 \iff \lambda_i \neq \lambda_j \quad \forall i \neq j \,.\]Therefore, we can define the events \(A_c := \{ \Delta_n( B(t) ) \text{ hits } c \text{ before } 0 \}\). Observe that we can indeed recover the non-intersection event \(A\) in the limit
\[A = \lim_{c \to \infty} A_c \,.\]Recalling the gambler’s ruin example, if \(\Delta_n( B(t) )\) is a martingale, we have a very simple formula for the h-transform
\[h_c( x ) := \mathbb{P}_{x}( A_c ) = \frac{ \Delta_n(x) }{ c } \,.\]Indeed we will first prove this result.
Lemma \(\Delta_n(B(t))\) is a martingale.
proof (of Lemma): We will directly compute the SDE of \(\Delta_n(B(t))\) using Itô’s Lemma
\[d \Delta_n(B(t)) = \frac{1}{2} \Delta \Delta_n(B(t)) \, dt + \cdots dB(t) \,,\]where we hide the diffusion term, since the Itô integral with respect to Brownian motion is again a martingale. Therefore it’s sufficient to show the drift term is zero.
Using the identity
\[\frac{1}{(a-b)(a-c)} + \frac{1}{(b-a)(b-c)} + \frac{1}{(c-a)(c-b)} = 0 \,,\]it’s a simple calculation to show that \(\Delta \Delta_n(x) = 0\) via this symmetry, and hence the desired result follows.
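Since the harmonicity \(\Delta \Delta_n(x) = 0\) is the crux of the lemma, here is a small numerical check (my own addition): central second differences are exact up to roundoff for polynomials of per-variable degree at most three, which covers \(\Delta_n\) for \(n \leq 4\):

```python
def vandermonde(xs):
    """The Vandermonde determinant: product of (x_i - x_j) over i < j."""
    n = len(xs)
    out = 1.0
    for i in range(n):
        for j in range(i + 1, n):
            out *= xs[i] - xs[j]
    return out

def laplacian(f, xs, h=1e-3):
    """Sum of central second differences in each coordinate; exact up to
    roundoff when f has per-variable degree <= 3 in every variable."""
    total = 0.0
    for i in range(len(xs)):
        up = list(xs); up[i] += h
        dn = list(xs); dn[i] -= h
        total += (f(up) - 2.0 * f(xs) + f(dn)) / h ** 2
    return total
```

The computed Laplacian vanishes to roundoff for both \(n=3\) and \(n=4\).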
\[\tag*{$\Box$}\]proof (of Theorem): It remains to compute the h-transform dynamics, in particular the drift term
\[\begin{aligned} \partial_i \log h_c(x) &= \partial_i ( \log \Delta_n(x) - \log c ) \\ &= \frac{1}{\Delta_n(x)} \sum_{j \neq i} \frac{\Delta_n(x)}{ x_i - x_j } \\ &= \sum_{j \neq i} \frac{1}{ x_i - x_j } \,, \end{aligned}\]which implies that conditioned on the event \(A_c\), the h-transformed process satisfies the SDE
\[\begin{aligned} d \lambda_i(t) &= \partial_i \log h_c( \lambda(t) ) \, dt + dB_i(t) \\ &= \sum_{j \neq i} \frac{ dt }{ \lambda_i(t) - \lambda_j(t) } + dB_i(t) \,. \end{aligned}\]Finally, to complete the proof, we observe that the dynamics of \(\lambda(t)\) do not depend on \(c>0\), therefore taking \(c \to \infty\) recovers the (unbounded) dynamics of Dyson Brownian motion.
\[\tag*{$\Box$}\]That’s it! That’s the proof! Having played around with h-transforms before, and getting only ridiculously ugly expressions, it’s quite remarkable to me that this proof was able to avoid messy calculations altogether.
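For the skeptical reader, here is a rough Euler-Maruyama simulation of the Dyson SDE (my own addition; the step size, horizon, and particle count are arbitrary), where the repulsion term keeps the particles strictly ordered:

```python
import math, random

def dyson_ordered_fraction(paths=20, dt=1e-4, steps=5000, seed=2):
    """Euler-Maruyama simulation of 3-particle Dyson Brownian motion.
    Returns the fraction of paths whose strict ordering never breaks."""
    rng = random.Random(seed)
    n, ok = 3, 0
    for _ in range(paths):
        lam = [2.0, 0.0, -2.0]  # well separated initial condition
        ordered = True
        for _ in range(steps):
            drift = [sum(dt / (lam[i] - lam[j]) for j in range(n) if j != i)
                     for i in range(n)]
            for i in range(n):
                lam[i] += drift[i] + math.sqrt(dt) * rng.gauss(0.0, 1.0)
            if any(lam[i] <= lam[i + 1] for i in range(n - 1)):
                ordered = False
                break
        ok += ordered
    return ok / paths
```

Starting from well separated particles, essentially all simulated paths keep the strict ordering \(\lambda_1 > \lambda_2 > \lambda_3\).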
What helped simplify this proof? To quote Bálint Virág: “[t]his is because the Vandermonde [determinant] is harmonic, so this fits into the h-transform language.” Indeed, we saw above that \(\Delta \Delta_n(x) = 0\) played an important role in the calculations. So let this be a rule of thumb for future problems: try to define events using harmonic functions in h-transforms!
For those interested in random matrix theory, Terence Tao wrote some very nice lecture notes on the applications of Dyson Brownian motion, which can also be found in his book [Ta12]. In particular, we can use Dyson Brownian motion to derive the eigenvalue density of a Gaussian unitary ensemble (GUE) matrix, commonly known as the Ginibre formula:
\[\rho(\lambda) = \frac{1}{(2\pi)^{n/2} 1! \cdots (n-1)!} e^{ -|\lambda|^2/2 } |\Delta_n(\lambda)|^2 \,,\]which can then be used to derive the famous Wigner’s semicircle law in the limit \(n \to \infty\).
Let us start with a potential function \(F : \mathbb{R}^d \to \mathbb{R}\) and an inverse temperature parameter \(\beta > 0\), and define the Gibbs density as
\[\nu(x) := \frac{1}{Z} e^{ -\beta F(x) } \,,\]where \(Z = \int e^{-\beta F(x)} dx\) is the normalizing constant.
We say \(\nu\) satisfies the Poincaré inequality with constant \(\kappa > 0\), denoted \(\text{PI}(\kappa)\), if
\[\int f^2 \, d\nu - \left( \int f \, d\nu \right)^2 \leq \frac{1}{\kappa \, \beta} \int | \nabla f |^2 \, d\nu \,,\]for all \(f \in C^1(\mathbb{R}^d) \cap L^2(\nu)\). Note we adopt the convention of [BGL13] which adjusts the right hand side by a factor of \(\beta\), and the two conventions agree when \(\beta = 1\).
\(\text{PI}(\kappa)\) is well known to be equivalent to exponential convergence of Langevin diffusion [BGL13, Theorem 4.2.5], quadratic-linear cost transport inequality [Vil08, Theorem 22.25], and Cheeger’s isoperimetric inequality [LV18, Theorem 11]. Furthermore, \(\text{PI}(\kappa)\) also implies dimension free exponential concentration [Vil08, Theorem 22.32], and serves as a key tool for deriving existence, uniqueness, and smoothness results in partial differential equations [Eva10]. Therefore a tight lower bound for the Poincaré constant is widely desired for a large range of applications.
Firstly, we recall that the (overdamped) Langevin diffusion is defined by the following stochastic differential equation (SDE)
\[dX_t = \underbrace{ - \nabla F(X_t) \, dt }_{ \text{gradient flow} } + \underbrace{ \sqrt{ 2/\beta } \, dW_t }_{ \text{perturbation} }\,,\]where \(\{W_t\}_{ t \geq 0 }\) is a standard \(d\)-dimensional Brownian motion. Observe that when \(\beta\) becomes large, the Brownian motion term becomes very small. Therefore Langevin diffusion can be interpreted as a perturbed gradient flow.
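As an illustration (my own sketch; the step size and run length are arbitrary), the Euler-Maruyama discretization of this SDE for the quadratic potential \(F(x) = x^2/2\) produces samples from the Gibbs density \(\nu = N(0, 1/\beta)\):

```python
import math, random

def langevin_sample_variance(beta=2.0, eta=1e-2, steps=200_000, seed=3):
    """Euler-Maruyama discretization of dX = -F'(X) dt + sqrt(2/beta) dW
    for F(x) = x^2 / 2, whose Gibbs density is N(0, 1/beta).
    Returns the empirical variance of the trajectory after burn-in."""
    rng = random.Random(seed)
    x, acc, acc2, count = 0.0, 0.0, 0.0, 0
    noise = math.sqrt(2.0 * eta / beta)
    for k in range(steps):
        x += -x * eta + noise * rng.gauss(0.0, 1.0)
        if k >= steps // 10:  # discard burn-in
            acc += x
            acc2 += x * x
            count += 1
    mean = acc / count
    return acc2 / count - mean * mean
```

With \(\beta = 2\), the empirical variance settles near \(1/\beta = 0.5\), up to Monte Carlo error and a small discretization bias of order \(\eta\).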
Since the Gibbs density \(\nu\) concentrates on the global minimum of \(F\) as \(\beta \to \infty\), a dimension and temperature free Poincaré constant implies fast convergence of Langevin diffusion to the global minimum when \(\beta\) is large. Therefore it is no surprise that strongly convex potentials admit such a constant (via the Bakry-Émery criterion), while for general non-convex potentials the Poincaré constant can degrade exponentially in \(\beta\) and \(d\).
In other words, strongly convex functions are easy to optimize, general non-convex functions are hard, what else is new? What is new are the cases in between: non-strongly convex functions with a unique minimum. However, even when weakening to \(F\) being only convex, this problem remains open - this is an equivalent formulation of the KLS conjecture [LV18].
Conjecture (Kannan-Lovász-Simonovits, Poincaré version) There exists a universal constant \(\kappa > 0\), such that for every positive integer \(d\) and every convex function \(F: \mathbb{R}^d \to \mathbb{R}\) such that the Gibbs density \(\nu(x) = \frac{1}{Z} e^{-F(x)}\) (note \(\beta = 1\) here) has zero mean and identity covariance matrix, we have that \(\nu\) satisfies \(\text{PI}(\kappa)\).
I should briefly mention that a recent arXiv preprint [Che20] proposed a result equivalent to an almost constant lower bound on the Poincaré constant of order \(d^{-o(1)}\), which in the limit of \(d\to\infty\) converges to \(0\) more slowly than \(d^{-r}\) for every \(r>0\). For the sake of staying on topic, we will leave this extremely interesting subject for a future post.
Furthermore, it is already known that an adaptive perturbation of gradient descent escapes saddle points at a dimension free rate [JGN+17]. This hints at the possibility of establishing a dimension and temperature free Poincaré inequality for even non-convex potential functions! Indeed, we will discuss this next.
The main result of this blog post is actually an intermediate result of [LE20, Proposition 9.11]. While the original proposition is proved for a product manifold of spheres, it can be easily adapted to \(\mathbb{R}^d\) with a containment type condition, see for example [Vil06, Theorem A.1] and [MS14, Assumption 1.4]. Since as of writing this post, there is no complete proof of this adaptation, we will state it as a claim.
Claim (Adapting [LE20, Proposition 9.11]) Suppose \(F:\mathbb{R}^d \to \mathbb{R}\) has a unique local (and therefore global) minimum, and all saddle points are strict, i.e. the minimum eigenvalue satisfies \(\lambda_{\text{min}}( \nabla^2 F ) < - \lambda\) at every saddle point, for some constant \(\lambda>0\). Then under appropriate containment conditions, and choosing \(\beta\) sufficiently large, we have that \(\nu(x) = \frac{1}{Z} e^{-\beta F(x)}\) satisfies \(\text{PI}(\kappa)\) for a constant \(\kappa>0\) independent of \(\beta, d\).
As we discussed earlier, perturbation helps gradient descent escape saddle points, therefore this result is very intuitive. However, deriving a quantitative bound is completely non-trivial. We remark that for non-convex potentials, most approaches to establishing a Poincaré inequality will yield exponentially poor dependence on both \(\beta\) and \(d\). To our best knowledge, only [MS14] has established a bound independent of \(\beta\), and (by our calculations) exponential in \(d\).
Let us start by emphasizing this result does not imply the KLS conjecture. Indeed, the conditions of the conjecture do not require \(F\) to have a unique minimum, nor do they force \(F\) to be strongly convex around the minimum - the latter being a key requirement of the proof technique.
It does, however, imply that the KLS conjecture can be extended beyond convex functions. More precisely, if a potential \(F\) satisfies a dimension and temperature free Poincaré inequality, we can add saddle points to \(F\) without losing this property. Therefore, we can replace convex potentials \(F\) in the statement of the KLS conjecture with modifications of convex potentials \(F\) with strict saddle points. In fact, this naturally leads us to further conjecture that the strictness of saddle points can be relaxed as well, since that would parallel relaxing strong convexity to convexity.
Additionally, notice that we can take \(\beta\) to be as large as we want. This implies that the amount of randomness added to gradient flow does not affect its ability to escape saddle points. In other words, any tiny amount of perturbation will help escape saddle points. Furthermore, we emphasize this implies a discretization of Langevin diffusion, i.e. a perturbed gradient descent, will also escape strict saddle points - this was the main result of [LE20]. This is in sharp contrast with (deterministic) gradient descent taking up to exponential time to escape a saddle point [DJL+17], implying the addition of noise, even arbitrarily small, fundamentally changes the behaviour of gradient descent.
Now to my favourite part of this post, where we actually describe the proof techniques. We will see that despite the lengthy calculations in [LE20], the proof idea is quite straightforward to explain. We start by stating a Lyapunov criterion for the Poincaré inequality.
Theorem [BBCG08, Theorem 1.4 Adapted] Let \(U \subset \mathbb{R}^d\) be such that \(\nu\) restricted to \(U\) satisfies \(\text{PI}(\kappa_U)\). Suppose there exist constants \(\theta > 0, b \geq 0\) and a function \(V \in C^2(\mathbb{R}^d)\) such that \(V \geq 1\) and
\[LV := \langle -\nabla F, \nabla V \rangle + \frac{1}{\beta} \Delta V \leq -\theta \, V + b \, \mathbf{1}_{U} \,.\]Then \(\nu\) satisfies \(\text{PI}(\kappa)\) with constant
\[\kappa = \frac{ \theta }{ 1 + b / \kappa_U} \,.\]Intuitively, we can think of the Lyapunov function \(V\) as an energy measure of the Langevin diffusion \(\{X_t\}_{t \geq 0}\), \(LV\) as the time evolution of \(V\) via Itô’s Lemma, and the Lyapunov condition (inequality) describes the rate of energy dissipation over time. This energy \(V\) will decrease as \(X_t\) gets closer to \(U\), hence \(U\) behaves like an attractor. Once \(X_t\) reaches \(U\), the process begins to “mix” due to the Poincaré inequality on \(U\). In our case, we will choose \(U\) to be a small neighbourhood of the global minimum, and use the strong convexity (Bakry-Émery criterion) to get a Poincaré constant \(\kappa_U\). For those that find this description familiar, indeed this is the diffusion equivalent of the drift and minorization conditions for Markov chain mixing [MT09].
Similar to other Lyapunov function based methods in differential equations, constructing such a function \(V\) is the main difficulty. [MS14] observed that when \(F\) has only strict saddle points, the choice of \(V = \exp\left( \frac{\beta}{2} F \right)\) works very nicely away from saddle points. In fact, we can directly compute the Lyapunov condition
\[\frac{LV}{V} = \frac{1}{2} \Delta F - \frac{\beta}{4} |\nabla F|^2 \,,\]and observe that as long as \(|\nabla F|\) is bounded away from zero, we can choose \(\beta\) to be large, hence forcing \(\frac{LV}{V}\) to be negative. In other words, excluding small neighbourhoods around saddle points, \(\nu\) satisfies a Poincaré inequality.
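The identity above is easy to verify numerically (my own check, on an arbitrary smooth one-dimensional test potential \(F(x) = \sin(x) + x^2/2\)):

```python
import math

def check_lyapunov_identity(x=0.7, beta=4.0, h=1e-4):
    """Check LV / V = (1/2) F'' - (beta/4) (F')^2 for V = exp(beta F / 2)
    and L = -F' d/dx + (1/beta) d^2/dx^2, using central differences on V.
    The smooth test potential F(x) = sin(x) + x^2 / 2 is arbitrary."""
    F = lambda y: math.sin(y) + 0.5 * y * y
    V = lambda y: math.exp(0.5 * beta * F(y))
    Fp = math.cos(x) + x                       # F'(x), exact
    Fpp = -math.sin(x) + 1.0                   # F''(x), exact
    Vp = (V(x + h) - V(x - h)) / (2.0 * h)     # V'(x), numeric
    Vpp = (V(x + h) - 2.0 * V(x) + V(x - h)) / h ** 2
    lhs = (-Fp * Vp + Vpp / beta) / V(x)       # LV / V
    rhs = 0.5 * Fpp - 0.25 * beta * Fp ** 2
    return lhs, rhs
```

Both sides agree to the accuracy of the finite differences.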
The precise version of this result can be found in [LE20, Lemma 9.10]. In particular, we can compute this constant to be dimension and temperature free.
Now that we have narrowed the problem down to constructing a Lyapunov function for the neighbourhoods around saddle points, we observe that replacing the inequality in the Lyapunov condition with an equality gives us a Poisson equation. Consequently, we have a stochastic representation of the solution in terms of the escape time.
Theorem [BdH16, Theorem 7.15 Adapted] [LE20, Corollary 9.3] Let \(B \subset \mathbb{R}^d\), \(\{X_t\}_{t\geq 0}\) be the Langevin diffusion, and \(\tau_{B^c}\) be the first escape time of \(X_t\) from \(B\). Suppose there exists a constant \(\theta>0\) such that
\[V(x) := \mathbb{E} [ \, \exp( \theta \, \tau_{B^c} ) \, | \, X_0 = x \, ] < \infty \,, \quad \forall x \in B \,,\]then \(V\) is the unique solution to the Poisson equation
\[\begin{split} LV &= - \theta \, V \,, \quad & x \in B \,, \\ V &= 1 \,, \quad & x \in \partial B \,. \end{split}\]Readers with a Markov chain background may recognize this escape time based condition to be equivalent to drift and minorization [DMPS18, Theorem 14.1.3]. In fact, this method was inspired by the nice connection drawn between diffusions and Markov chains.
Additionally, readers familiar with concentration inequalities may recognize the theorem’s condition is known as exponential integrability, and it’s one of the equivalent characterizations for sub-exponential random variables. Indeed, we will actually use a slightly easier equivalent form for calculations.
Theorem [Wai19, Theorem 2.13] For a zero mean random variable \(\tau\), the following are equivalent:
At this point, it’s then sufficient to establish an exponentially decaying tail bound for \(\tau_{B^c}\). To this goal, we will make several observations:
To illustrate this point clearly, let us consider the quadratic function \(f(x,y) = x^2 - \frac{\lambda}{2} y^2\) with a saddle point at \((x,y) = (0,0)\). For the Langevin diffusion to escape a neighbourhood of radius \(r>0\), it’s sufficient to ensure the \(y\)-component exceeds \(r\). Therefore, it’s sufficient to restrict \(f\) to only its \(y\)-component, which makes \(y=0\) a local maximum. Hence it suffices to study the Langevin diffusion escaping a one-dimensional local maximum, i.e.
\[dX_t = \lambda X_t \, dt + \sqrt{ 2/\beta } \, dW_t \,,\]where \(-\lambda\) upper bounds the smallest eigenvalue of \(\nabla^2 F\) at saddle points.
We observe that this SDE is the “negative” Ornstein-Uhlenbeck process, and it has a closed form solution
\[X_t = X_0 e^{\lambda t} + \sqrt{2/\beta} \, \int_0^t e^{\lambda(t-s)} dW_s \,,\]which corresponds to \(X_t \sim N( X_0 e^{\lambda t} \,, \frac{1}{\lambda \beta}(e^{2\lambda t} - 1) )\). Finally, plugging in the Gaussian density and a few calculations later, we get the desired result of
\[\mathbb{P} [ \, \tau_{B^c} \geq t \, ] \leq \mathbb{P} [ \, X_t \in B \, ] \leq c e^{ -\lambda t } \,,\]where the constant \(c\) does not depend on \(t\). That is, this escape time tail bound implies that \(V(x)\) is a valid Lyapunov function, and hence implies a Poincaré inequality.
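We can see the claimed decay directly from the Gaussian law of \(X_t\) (my own computation; the parameters are arbitrary): the standard deviation grows like \(e^{\lambda t}\), so \(\mathbb{P}[X_t \in (-r,r)]\) decays like \(e^{-\lambda t}\) and consecutive ratios approach \(e^{-\lambda}\):

```python
import math

def stay_prob(t, lam=1.0, beta=1.0, r=1.0):
    """P(X_t in (-r, r)) for the 'negative' OU process started at X_0 = 0,
    using the closed form X_t ~ N(0, (e^{2 lam t} - 1) / (lam beta))."""
    sigma = math.sqrt((math.exp(2.0 * lam * t) - 1.0) / (lam * beta))
    return math.erf(r / (sigma * math.sqrt(2.0)))
```

For example, the ratio `stay_prob(7.0) / stay_prob(6.0)` is already essentially \(e^{-\lambda} = e^{-1}\).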
Quite a few technical details were swept under the rug to simplify the proof sketch, as the reader might expect. Probably the most significant is the approximation of \(F\) by a quadratic function - it is actually not straightforward to connect an approximation bound to an escape time bound.
At the same time, the requirement of \(\beta\) to be sufficiently large is quite unsatisfying. Intuitively, why would adding noise hurt the mixing of a Markov process? It feels to me that this condition is merely a technical constraint, and a more careful analysis could sharpen or remove this condition. Hopefully the readers will have more thoughts and ideas than I do.
Thanks for reading up to this point, and I wish everyone a happy new year!
For the sake of writing a self-contained blog post, we will not attempt to provide a description of spin glass models. Instead, we will state the problem in the most mathematically interesting form, without explaining where the quantities came from.
Let \(\xi:[0,1]\to \mathbb{R}\) be twice differentiable, strictly increasing, and strictly convex (i.e. \(\xi', \xi'' > 0\)), and let \(\zeta:[0,1] \to [0,1]\) be a cumulative distribution function (CDF). We will consider the Parisi partial differential equation (PDE) defined as follows
\[\begin{cases} \partial_t \Phi = \frac{ - \xi''(t) }{2} \left[ \partial_{xx} \Phi + \zeta(t) \left( \partial_x \Phi \right)^2 \right], \\ \Phi(1,x) = \log \cosh(x) \,, \end{cases}\]where the time derivative is defined by the right limit at points where \(\zeta(t)\) is discontinuous.
It is well known that we can solve this PDE backwards in time using a Hopf-Cole transformation; in fact, we will provide a sketch in a later section. This allows us to state an optimization objective as follows:
\[\inf_{\zeta} \Phi_\zeta(0,x),\]where we are minimizing over the set of all CDFs on \([0,1]\) for each \(x \in \mathbb{R}\). Finally we can state the question as follows:
Question Does there exist a unique minimizer to the optimization problem \(\inf_{\zeta} \Phi_\zeta(0,x)\) for each \(x\in\mathbb{R}\)?
The main difficulty comes from the unclear dependence on \(\zeta\), even if we can write down a closed form solution to the Parisi PDE. At the very least, the closed form would be extremely unpleasant and tedious to work with. Additionally, we remark that the problem is already stated in a simplified form, as opposed to the original framing in spin glass models.
Before we jump into the main results, we observe that existence of a minimizer is straightforward to prove. Since we are restricted to the domain \([0,1]\), any sequence of probability measures is tight. It is then sufficient to consider any sequence of probability measures \(\{\zeta_n\}\) that minimizes \(\Phi_\zeta(0,x)\); tightness implies there exists a subsequence \(\zeta_{n_k} \to \zeta^*\) converging weakly, and \(\zeta^*\) is a minimizer of \(\Phi(0,x)\).
To complete the proof, it is sufficient to show \(\Phi(0,x)\) is strictly convex in \(\zeta\). In this section, we will use a stochastic representation to show convexity, which is the main difficulty of the problem. Readers unfamiliar with stochastic analysis can find a brief introduction in a previous blog post, in particular we will use Itô’s Lemma in the upcoming proofs.
We start by defining \(B_t := W_{\xi'(t)}\), where \(\{W_t\}\) is a standard Brownian motion. Let \(\{\mathcal{F}_t\}_{t\geq 0}\) be \(\{W_t\}\)’s canonical filtration, and then we define a collection of processes
\[\mathcal{D} := \left\{ (u_t)_{0 \leq t \leq 1} : u_t \text{ is adapted to } \mathcal{F}_t, |u_t| \leq 1 \right\}.\]For simplicity of notation, we will write \(\sigma(t) = \sqrt{\xi''(t)}\) for this section. At this point we will state the main result.
Theorem (Auffinger-Chen Representation) For all \(\zeta\) a probability distribution on \([0,1]\), we have the following
\[\begin{split} \Phi(0,x) = \max_{u \in \mathcal{D}} \bigg[ \mathbb{E} & \Phi\left(1, x + \int_0^1 \sigma^2(s) \, \zeta(s) \, u_s \, ds + \int_0^1 \sigma(s) \, dW_s \right) \\ &- \frac{1}{2} \int_0^1 \sigma^2(s) \, \zeta(s) \, \mathbb{E} u_s^2 \, ds \bigg]. \end{split}\]In particular, the maximizer is unique, and is given by \(u_s = \partial_x \Phi(s, x + X_s)\), where \(X_s\) is the strong solution of the following stochastic differential equation (SDE)
\[dX_s = \sigma^2(s) \, \zeta(s) \, \partial_x \Phi(s, x + X_s) \, ds + \sigma(s) \, dW_s, \quad X_0 = 0.\]Remark Before we begin the proof, we will observe that \(\Phi(0,x)\)’s convexity follows directly from this representation. Firstly, both integral terms containing \(\zeta\) are linear in \(\zeta\). Since \(\Phi(1,x) = \log \cosh (x)\) is convex in \(x\), the \(\Phi\) term is convex in \(\zeta\). Next, the expectation of a sum of convex functions remains convex. Finally, a maximum (or supremum) over convex functions remains convex, proving the desired convexity result!
Before we start, we will state several technical (but not difficult to prove) Lemmas. To guarantee a strong solution of the SDE, it is sufficient to have \(\partial_x \Phi(s, x)\) be Lipschitz in \(x\). We will omit the proof of these results as they are not important to the main goal of this blog post. Instead we will state the following Lemma containing the desired estimates.
Lemma (Derivative Estimates) For all \(\zeta\) probability distributions on \([0,1]\), we have that
\[|\partial_x \Phi(t, x)| \leq 1, |\partial_{xx} \Phi(t,x)| \leq 1.\]Another important result we will omit is the continuity of \(\Phi\) in \(\zeta\).
Lemma (Lipschitz in \(L^1\)) For any discrete distributions \(\zeta_1, \zeta_2\), and for all \(k \in \mathbb{N}\), we have that
\[\begin{split} \left| \Phi_{\zeta_1} - \Phi_{\zeta_2} \right| &\leq \xi''(1) \int_0^1 |\zeta_1(t) - \zeta_2(t)| dt, \\ \left| \partial_x^k \Phi_{\zeta_1}(t,x) - \partial_x^k \Phi_{\zeta_2}(t,x) \right| &\leq c_k \, \xi''(1) \int_0^1 |\zeta_1(t) - \zeta_2(t)| dt. \end{split}\]Since we can approximate any distribution in \(L^1\) by discrete distributions, we can extend the definition of \(\mathcal{P}(\cdot)\) and \(\Phi(t,x)\) to all distributions by continuity. Therefore it is sufficient to prove the result for only finitely supported distributions.
proof (of the Auffinger-Chen representation): The proof will be a straightforward application of Itô’s Lemma, and the results follow almost directly from invoking the Parisi PDE.
We start with discrete \(\zeta\), i.e. \(\zeta\) is a piecewise constant function. Let \(u \in \mathcal{D}\), and define
\[dX_s := \sigma^2(s) \, \zeta(s) \, u_s \, ds + \sigma(s) \, dW_s, \quad X_0 = 0,\]and let \(Y_s := \Phi(s, x + X_s)\). Then we observe that
\[X_1 = \int_0^1 \sigma^2(s) \, \zeta(s) \, u_s \, ds + \int_0^1 \sigma(s) \, dW_s\]appears exactly inside the first \(\Phi\) term of the Auffinger-Chen representation.
At this point we adopt concise notation and write \(\Phi := \Phi(s, x + X_s)\), and apply Itô’s Lemma to \(Y_s\) to get
\[dY_s = \left[ \partial_s \Phi + \sigma^2(s) \, \zeta(s) \, u_s \, \partial_x \Phi + \frac{1}{2} \sigma^2(s) \, \partial_{xx} \Phi \right] ds + \sigma(s) \partial_x \Phi \, dW_s.\]Here we note that while the time derivative \(\partial_s \Phi\) does not exist at finitely many points, we will eventually only use it in integral form. Using the Parisi PDE at points of continuity, we can make the following substitution
\[\partial_s \Phi + \frac{1}{2} \sigma^2(s) \partial_{xx} \Phi = - \frac{1}{2} \sigma^2(s) \, \zeta(s) \, (\partial_x \Phi)^2.\]We will make the substitution and complete the square to get
\[\begin{split} dY_s &= \left[ \sigma^2(s) \, \zeta(s) \, u_s \, \partial_x \Phi - \frac{1}{2} \sigma^2(s) \, \zeta(s) \, (\partial_x \Phi)^2 \right] ds + \sigma(s) \, \partial_x \Phi \, dW_s \\ &= \left[ \frac{1}{2} \sigma^2(s) \, \zeta(s) \, u_s^2 - \frac{1}{2} \sigma^2(s) \, \zeta(s) \, (u_s - \partial_x \Phi )^2 \right] ds + \sigma(s) \, \partial_x \Phi \, dW_s. \end{split}\]Next we write this equation as an integral over \([0,1]\), and taking expectation to remove the martingale term we get
\[\begin{split} \mathbb{E} \Phi(1, x + X_1) - \Phi(0,x) =& \int_0^1 \frac{1}{2} \sigma^2(s) \, \zeta(s) \, \mathbb{E} u_s^2 \, ds \\ &- \int_0^1 \frac{1}{2} \sigma^2(s) \, \zeta(s) \, \mathbb{E} (u_s - \partial_x \Phi)^2 ds. \end{split}\]Since \(\Phi, \partial_x \Phi\) are continuous in \(\zeta\), we can extend this equation to all \(\zeta\). Furthermore, since the second integral is always nonnegative, we must have the inequality
\[\Phi(0,x) \geq \mathbb{E} \Phi(1, x + X_1) - \int_0^1 \frac{1}{2} \sigma^2(s) \, \zeta(s) \, \mathbb{E} u_s^2 \, ds,\]and the inequality must be strict unless \(u_s = \partial_x \Phi\) almost surely.
Observe that this proves the inequality direction of the representation. Since \(|\partial_x \Phi| \leq 1\), we have \(u_s = \partial_x \Phi \in \mathcal{D}\), hence achieving equality in the representation.
\[\tag*{$\Box$}\]At this point, the author believes the goal of the blog post is already achieved: we have demonstrated the key technique with only very basic manipulations. That being said, to complete the story, we will provide a short sketch on how to prove strict convexity - hence proving there is a unique minimizer of \(\Phi_\zeta(0,x)\).
We once again start with a key technical lemma.
Lemma (Strict Convexity in \(x\)) For all \(\zeta\) a probability distribution on \([0,1]\), and for all \(s \in [0,1]\), we have
\[\partial_{xx} \Phi(s,x) > 0.\]Here we remind the reader that strict convexity in \(x\) does not directly imply strict convexity in \(\zeta\). We could just take this result for granted, but there is a nice proof using the Hopf-Cole transform and another stochastic representation, so why not?
sketch (of Lemma): Since \(\Phi(t,x)\) is continuous in \(\zeta\), we will only consider a discrete \(\zeta\). Then using an appropriate time change and time reversal, we can get a new PDE
\[\partial_t \Phi = \frac{1}{2 \widehat \zeta(t)} \partial_{xx} \Phi + \frac{1}{2} (\partial_x \Phi)^2,\]with initial conditions (as opposed to terminal conditions) \(\Phi(0,x)=\log \cosh(x)\), and \(\widehat \zeta(t) = \zeta(1 - t)\) changed due to time reversal. To simplify the PDE, we use the Hopf-Cole transformation to substitute \(\phi = \exp\left( \widehat\zeta(t) \, \Phi \right)\), which leads to the simplified linear PDE
\[\partial_t \phi = \frac{1}{2 \widehat \zeta(t)} \partial_{xx} \phi,\]with initial condition \(\phi(0,x) = \exp\left( \widehat \zeta(t) \log \cosh(x) \right) = \cosh(x)^{\widehat \zeta(t)}\). Using another time change, we can also remove the \(\widehat \zeta(t)\) above.
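As a sanity check of the linearization (my own addition, in the special case \(\widehat\zeta \equiv 1\) so that the exponent drops out), the heat semigroup applied to \(\cosh\) has the closed form \(\mathbb{E} \cosh(x + W_t) = \cosh(x) e^{t/2}\), which Monte Carlo confirms:

```python
import math, random

def feynman_kac_cosh(t=0.5, x=0.3, samples=200_000, seed=4):
    """Monte Carlo estimate of phi(t, x) = E[cosh(x + W_t)], the solution
    of the heat equation started at cosh, versus the closed form
    cosh(x) * exp(t / 2). Returns (estimate, closed form)."""
    rng = random.Random(seed)
    s = math.sqrt(t)  # standard deviation of W_t
    acc = sum(math.cosh(x + s * rng.gauss(0.0, 1.0)) for _ in range(samples))
    return acc / samples, math.cosh(x) * math.exp(t / 2.0)
```

The two values agree up to Monte Carlo error.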
Here we can use any of the reader’s favourite methods (the Feynman-Kac representation, the Kolmogorov backward equation, or the heat kernel) to write
\[\phi(t,x) = \mathbb{E} \cosh( x + W_t )^{\widehat \zeta(t)},\]where \(W_t\) is a standard Brownian motion, and \(\widehat \zeta\) is constant in \([0,t)\). At this point it is sufficient to show strict convexity for this \(t\), since we can piece together the constant intervals later. To this end, we will write
\[\Phi(t,x) = \frac{1}{\widehat \zeta(t)} \log \mathbb{E} \cosh(x + W_t)^{\widehat \zeta(t)},\]and define
\[\langle f(W_t) \rangle := \frac{ \mathbb{E} f(W_t) \cosh(x + W_t)^{\widehat \zeta(t)} }{ \mathbb{E} \cosh(x + W_t)^{\widehat \zeta(t)} } \, ,\]where we observe since \(\cosh(x) > 0\) and \(\mathbb{E}\cosh(x + W_t)^{\widehat \zeta(t)} < \infty\), we have that \(\langle \cdot \rangle\) defines a new probability measure. In particular, Jensen’s inequality holds under \(\langle \cdot \rangle\).
With this we can take the second derivative of \(\Phi\) to get
\[\begin{split} \partial_{xx} \Phi(t,x) &= - \widehat \zeta(t) \left\langle \tanh(x + W_t) \right\rangle^2 + \left\langle \widehat \zeta(t) \tanh(x + W_t)^2 + 1 - \tanh(x + W_t)^2 \right\rangle \\ &\geq \left\langle 1 - \tanh(x + W_t)^2 \right\rangle \\ &> 0 \, , \end{split}\]where we used Jensen’s inequality \(\langle \tanh \rangle^2 \leq \langle \tanh^2 \rangle\) and the fact that \(\tanh(x)^2 < 1\).
Finally we return to strict convexity in \(\zeta\).
sketch (of Strict Convexity in \(\zeta\)): We will start by introducing quantities related to convexity. Let \(\zeta_1 \neq \zeta_2\), and let \(\zeta = \lambda \zeta_1 + (1-\lambda) \zeta_2\) for some \(\lambda \in (0,1)\). Recall \(\Phi(1,x) = \log \cosh (x)\), and use the optimal control \(u_s = \partial_x \Phi_\zeta(s, x + X_s)\), where \(X_s\) is defined with respect to \(\zeta\). Note this \(u_s\) is not necessarily optimal for \(\zeta_1, \zeta_2\).
Since \(\log\cosh(x)\) is convex, we can write
\[\Phi_\zeta(0,x) \leq \lambda A_1 + (1-\lambda) A_2,\]where each \(A_i\) is defined as
\[\begin{split} A_i := \mathbb{E}& \, \log \cosh \left( x + \int_0^1 \sigma^2(s) \, \zeta_i(s) \, u_s \, ds + \int_0^1 \sigma(s) dW_s \right) \\ & - \frac{1}{2} \int_0^1 \sigma^2(s) \, \zeta_i(s) \, \mathbb{E} u_s^2 \, ds. \end{split}\]Since \(\log \cosh(x)\) is strictly convex, the inequality is strict unless
\[\int_0^1 \sigma^2(s) \, \zeta_1(s) \, u_s \, ds = \int_0^1 \sigma^2(s) \, \zeta_2(s) \, u_s \, ds,\]almost surely. Using the Auffinger-Chen representation, we have that \(A_i \leq \Phi_{\zeta_i}(0,x)\). Therefore, to prove that the convexity is strict, it is sufficient to prove a gap in the first inequality, which is equivalent to showing that
\[Z := \int_0^1 \sigma^2(s) \, (\zeta_1(s) - \zeta_2(s)) \, u_s \, ds\]has positive variance. The variance can be computed as
\[\text{Var}(Z) = \int_0^1 \int_0^1 \varphi(s) \, \varphi(t) \, \text{Cov}(u_s, u_t) \, ds dt,\]where \(\varphi(s) = \sigma^2(s) \, (\zeta_1(s) - \zeta_2(s))\).
While we omit the technical details, it’s not hard to believe that \(u_s = \partial_x \Phi(s, x + X_s)\) satisfies the following SDE (obtained from Itô’s Lemma and differentiating the Parisi PDE)
\[du_s = \sigma(s) \partial_{xx} \Phi(s, x + X_s) dW_s.\]Observing that \(u_s\) is a martingale, we can compute \(\text{Cov}(u_s, u_t)\) as
\[\text{Cov}(u_s, u_t) = \text{Var}(u_{s \wedge t}) = \int_0^{s \wedge t} \sigma^2(v) \mathbb{E} (\partial_{xx} \Phi(v, x + X_v))^2 dv,\]where the last step followed from Itô’s Isometry. Defining \(\tau(s) := \text{Var}(u_s)\), we can also write \(\text{Cov}(u_s, u_t) = \tau(s) \wedge \tau(t)\). With a bit of algebra we can derive
\[\text{Var}(Z) = \int_0^1 \left( \int_v^1 \varphi(s) ds \right)^2 \tau'(v) dv.\]Since \(\tau'(v) = \sigma^2(v) \mathbb{E} (\partial_{xx} \Phi(v, x + X_v))^2\), the desired result follows from the fact \(\partial_{xx} \Phi > 0\).
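The “bit of algebra” is the identity \(\tau(s \wedge t) = \int_0^1 \mathbb{1}\{v \leq s\} \mathbb{1}\{v \leq t\} \, \tau'(v) \, dv\) (using \(\tau(0) = 0\) and that \(\tau\) is increasing), followed by Fubini. A numerical check of the resulting identity, with toy choices of \(\varphi\) and \(\tau\) that are purely illustrative:

```python
import numpy as np

# toy choices, purely for illustration
phi = lambda s: np.sin(3 * s) + 0.5
tau = lambda s: s**2           # increasing, tau(0) = 0
dtau = lambda s: 2 * s         # tau'(s)

n = 2000
s = (np.arange(n) + 0.5) / n   # midpoint grid on [0, 1]
ds = 1.0 / n

# left-hand side: double integral of phi(s) phi(t) tau(s ∧ t)
S, T = np.meshgrid(s, s)
lhs = np.sum(phi(S) * phi(T) * tau(np.minimum(S, T))) * ds * ds

# right-hand side: ∫ (∫_v^1 phi(u) du)^2 tau'(v) dv
tail = np.cumsum((phi(s) * ds)[::-1])[::-1]   # tail[i] ≈ ∫_{s_i}^1 phi
rhs = np.sum(tail**2 * dtau(s)) * ds

print(lhs, rhs)  # should agree up to discretization error
```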
Recall the original problem of \(\inf_\zeta \Phi_\zeta(0,x)\). We have shown that, although the dependence of \(\Phi_\zeta\) on \(\zeta\) is far from explicit, we can prove its strict convexity quite easily using a stochastic representation. The author would like to point out that most techniques used here are quite basic, which is surprising for an originally very difficult problem.
The author would also like to point to a more general variational stochastic representation by Boué and Dupuis (1998), which is perhaps more useful for other applications.
Finally, this post would not have been possible without an excellent graduate course on spin glasses taught by Dmitry Panchenko, who has done a much better job explaining this topic. In particular, Dmitry has written an excellent book (Panchenko, 2013) with a bonus chapter covering this topic that can be found online. I would also highly recommend Dmitry’s notes on probability theory, which have been very helpful to the author’s studies and research.
With this motivation in mind, it was quite pleasant to discover a set of excellent lecture notes by Jason Miller (2016), which contain an alternative proof built on the Stone-Weierstrass Theorem. We shall see that not only is this proof more interpretable, but the technique also generalizes beyond stochastic calculus. In particular, this blog post intends to illustrate the technique in detail through Itô’s Lemma.
We will introduce (without too much rigour) some basic definitions and results to support the proofs in later sections. The reader need not carefully analyze the technical details here to understand the proofs to come. Readers familiar with stochastic calculus may skip to the next section.
First we let \((\Omega, \mathcal{F}, \{\mathcal{F}_t\}_{t\geq 0}, \mathbb{P})\) be a probability space equipped with a filtration (also satisfying the usual conditions to be rigorous). With this we can define several useful objects.
Definition A stochastic process \(X := \{X_t\}_{t\geq 0}\) is said to be a martingale if
(i) \(\forall t \geq 0\), we have \(\mathbb{E}|X_t| < \infty\) and \(X_t\) is measurable with respect to \(\mathcal{F}_t\), denoted \(X_t \in \mathcal{F}_t\);
(ii) \(\forall 0 \leq s \leq t\), we have \(\mathbb{E}[ X_t | \mathcal{F}_s ] = X_s\) a.s.
Definition We say a random variable \(\tau:\Omega \to [0,\infty]\) is a stopping time if \(\forall t \geq 0, \{\tau \leq t \} \in \mathcal{F}_t\).
An important property of stopping times is that if \(X_t\) is a martingale and \(\tau\) a stopping time, then \(X_{t \wedge \tau}\) is also a martingale.
Definition Let the interval \([0,T]\) be partitioned using increments of \(2^{-n}\), i.e. \(\{t_k^n\}_{k=0}^{\lceil T 2^n \rceil}\), where \(t_k^n = k 2^{-n} \wedge T\). Let \(X_t\) be a continuous martingale, and \(f_t\) be a continuous (possibly stochastic) process. We define the Itô integral as
\[ \int_0^T f_t \, dX_t := \lim_{n\to\infty} \sum_{k=0}^{\lfloor T 2^n \rfloor} f_{t_k^n} (X_{t_{k+1}^n} - X_{t_k^n}), \]
if the limit converges u.c.p. (uniformly on compact intervals in probability to be precise).
Remark Observe the above definition uses a left Riemann sum to define the integral, whereas other choices will lead to different integrals. This is in contrast to deterministic integrals, where all choices are equivalent.
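This distinction is easy to see numerically. For \(\int_0^T W_t\,dW_t\), the left Riemann sum approaches the Itô answer \((W_T^2 - T)/2\), while an averaged (Stratonovich-type) sum telescopes to \(W_T^2/2\). A minimal Python sketch, with arbitrary simulation parameters:

```python
import numpy as np

rng = np.random.default_rng(1)
T, n = 1.0, 200_000
dW = rng.normal(0.0, np.sqrt(T / n), size=n)     # Brownian increments
W = np.concatenate(([0.0], np.cumsum(dW)))       # the sampled path

ito = np.sum(W[:-1] * dW)                        # left endpoint rule
strat = np.sum(0.5 * (W[:-1] + W[1:]) * dW)      # averaged (Stratonovich-type) rule

print(ito, (W[-1]**2 - T) / 2)    # Itô:          (W_T^2 - T)/2
print(strat, W[-1]**2 / 2)        # Stratonovich:  W_T^2 / 2
```

The two sums differ by \(\frac{1}{2}\sum (\Delta W)^2 \approx T/2\), which is exactly the quadratic variation correction.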
Definition Consider the same partition \(\{t_k^n\}\) as above. Let \(M,N\) be two continuous martingales, we define the quadratic covariation as
\[ [M,N]_T := \lim_{n\to\infty} [M,N]^n_T := \lim_{n\to\infty} \sum_{k=0}^{\lfloor T 2^n \rfloor} (M_{t_{k+1}^n} - M_{t_k^n}) (N_{t_{k+1}^n} - N_{t_k^n}), \]
where the limit is also u.c.p. We also define the quadratic variation as \([M]_T := [M,M]_T\).
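As a quick numerical illustration (with arbitrary simulation parameters): the quadratic variation of a Brownian path over \([0,T]\) is approximately \(T\), while the covariation between a smooth finite-variation process and the same path is approximately zero, previewing the finite-variation proposition stated below.

```python
import numpy as np

rng = np.random.default_rng(2)
T, n = 1.0, 100_000
dW = rng.normal(0.0, np.sqrt(T / n), size=n)
W = np.concatenate(([0.0], np.cumsum(dW)))
t = np.linspace(0.0, T, n + 1)
A = t**2                                   # a smooth finite-variation process

qv_W = np.sum(np.diff(W) ** 2)             # [W]_T, should be close to T
cov_AW = np.sum(np.diff(A) * np.diff(W))   # [A, W]_T, should be close to 0

print(qv_W, cov_AW)
```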
Several useful results are stated next.
Proposition (Finite Variation) Let \(X,Y\) be continuous stochastic processes such that \(X\) has finite variation, i.e.
\[\lim_{n\to\infty} \sum_{k=0}^{\lfloor T 2^n \rfloor} | X_{t_{k+1}^n} - X_{t_k^n} | < \infty,\]and \(Y\) has finite quadratic variation, i.e. \([Y]_T < \infty\) a.s. Then we have
\[[X,Y]_t = 0 \;\text{a.s.}\]Proposition (Itô’s Product Rule) Let \(X,Y\) be continuous martingales, then we have
\[X_t Y_t - X_0 Y_0 = \int_0^t X_s dY_s + \int_0^t Y_s dX_s + [X,Y]_t \,.\]Proposition (Fundamental Theorem) Let \(X,Y,Z\) be continuous martingales, then we have
\[\int_0^t X_s d\left( \int_0^s Y_u dZ_u \right) = \int_0^t X_s Y_s dZ_s.\]Proposition (Kunita-Watanabe Identity) Let \(X,Y,Z\) be continuous martingales, then we have
\[\left[ \int_0^\cdot X_s dY_s, Z \right]_t = \int_0^t X_s d[Y,Z]_s,\]where both uses of \([\;,\;]\) denote the covariation.
Proposition (Itô’s Isometry)
Let \(M\) be a continuous martingale, and \(H\) be a continuous stochastic process. Then we have
\[\mathbb{E} \left[ \left( \int_0^t H_s dM_s \right)^2 \right] = \mathbb{E} \int_0^t H_s^2 d[M]_s.\]For the purposes of this blog post, we will only state and prove a much simpler version of the lemma, but it is not difficult to adapt the argument to more general conditions.
Theorem (Itô’s Lemma) Let \(X_t\) be a continuous martingale, and \(f \in C^2(\mathbb{R})\). Then we have
\[ f(X_t) = f(X_0) + \int_0^t \frac{\partial f}{\partial x}(X_s) dX_s + \frac{1}{2} \int_0^t \frac{\partial^2 f}{\partial x^2} (X_s) d[X]_s. \]
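Before turning to the proof, a quick numerical illustration of the statement may help: take \(X = W\) a standard Brownian motion (so \(d[X]_s = ds\)) and the illustrative choice \(f(x) = x^3\), and discretize both integrals with left Riemann sums:

```python
import numpy as np

rng = np.random.default_rng(3)
T, n = 1.0, 200_000
dW = rng.normal(0.0, np.sqrt(T / n), size=n)
W = np.concatenate(([0.0], np.cumsum(dW)))
ds = T / n

f = lambda x: x**3
fx = lambda x: 3 * x**2     # f'
fxx = lambda x: 6 * x       # f''

lhs = f(W[-1]) - f(W[0])
# Itô's formula: ∫ f'(W) dW  +  (1/2) ∫ f''(W) d[W],  with d[W]_s = ds
rhs = np.sum(fx(W[:-1]) * dW) + 0.5 * np.sum(fxx(W[:-1]) * ds)

print(lhs, rhs)  # should agree up to discretization error
```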
Here we will sketch the proof from Karatzas and Shreve (1991).
proof sketch: We start by defining a stopping time \(\tau_r := \inf \{t \geq 0 : |X_t| + [X]_t > r\}\), and replace \(X_t\) with \(X_{t \wedge \tau_r}\). This localization technique allows us to consider \(f\) only on the interval \(B_r := [-r, r]\) (or a ball in higher dimensions), on which \(f\) has bounded derivatives.
By inspecting the lemma’s statement, the reader may notice the formula resembles a second order Taylor expansion of \(f(X_t)\). Indeed we can write
\[\begin{align*} f(X_t) - f(X_0) =& \lim_{n\to\infty} \sum_{k=0}^{\lfloor t 2^n \rfloor} f(X_{t_{k+1}^n}) - f(X_{t_{k}^n}) \\ =& \lim_{n\to\infty} \sum_{k=0}^{\lfloor t 2^n \rfloor} \Big\{ \frac{\partial f}{\partial x}(X_{t_{k}^n}) [X_{t_{k+1}^n} - X_{t_{k}^n}] \\ &+ \frac{1}{2} \frac{\partial^2 f}{\partial x^2} (\eta_k^n) [X_{t_{k+1}^n} - X_{t_{k}^n}]^2 \Big\}, \end{align*}\]where \(\eta_k^n \in [X_{t_{k}^n}, X_{t_{k+1}^n}]\) is chosen as part of Taylor’s theorem to satisfy the above equality. It’s not difficult to see the first sum converges to the first stochastic integral, then it remains to show the second term converges.
To this goal, we will define
\[\begin{align*} J_1^n &:= \sum_{k=0}^{\lfloor t 2^n \rfloor} \frac{\partial^2 f}{\partial x^2} (\eta_k^n) [X_{t_{k+1}^n} - X_{t_{k}^n}]^2, \\ J_2^n &:= \sum_{k=0}^{\lfloor t 2^n \rfloor} \frac{\partial^2 f}{\partial x^2} (X_{t_{k}^n}) [X_{t_{k+1}^n} - X_{t_{k}^n}]^2, \\ J_3^n &:= \sum_{k=0}^{\lfloor t 2^n \rfloor} \frac{\partial^2 f}{\partial x^2} (X_{t_{k}^n}) \{ [X]_{t_{k+1}^n} - [X]_{t_{k}^n} \}, \end{align*}\]where we observe \(J_3^n\) converges to the desired integral. Next we will use the following technical inequality: let \(X\) be a martingale with \(|X_s| \leq K < \infty\) for all \(s \leq T\); then we have
\[\mathbb{E} ([X]^n_T)^2 \leq 6 K^4.\]Without stating the details, using this and the Cauchy-Schwarz inequality, we can show
\[\lim_{n\to\infty} |J_1^n - J_2^n| = 0 \; \text{a.s.}\]To complete the proof, we will need one more technical lemma: let \(|X_s| \leq K < \infty\) for all \(s \leq T\); then we have
\[\lim_{n\to\infty} \mathbb{E} \sum_{k=0}^{\lfloor t 2^n \rfloor} [ X_{t_{k+1}^n} - X_{t_k^n} ]^4 = 0.\]Then once again omitting the details, we can get
\[\mathbb{E} |J_2^n - J_3^n| \leq 2 \sup_{x \in B_r} \left| \frac{\partial^2 f}{\partial x^2}(x) \right|^2 \mathbb{E} \left[ \sum_{k=0}^{\lfloor t 2^n \rfloor} [ X_{t_{k+1}^n} - X_{t_k^n} ]^4 + [X]_t \max_{k} ( [X]_{t_{k+1}^n} - [X]_{t_{k}^n} ) \right],\]which, combined with the previous lemma and the bounded convergence theorem, gives the desired result
\[\lim_{n\to\infty} |J_2^n - J_3^n| = 0 \; \text{a.s.}\]Putting everything together gives us the desired formula as stated.
Remark The use of the propositions listed in the previous section is implicit in the two technical lemmas stated above, which is also where most of the difficulty of the proof is hidden.
Interpretation This proof naturally leads to interpreting Itô’s Lemma as a consequence of Taylor expansion. However, the proof provides no clear intuition on why second order is the correct order of approximation, and pushes the justification into complicated technical details. Perhaps the most troubling consequence is that a different integration scheme (e.g. Stratonovich, which arises from a mid-point Riemann sum) leads to a different change of variables formula, so the Taylor expansion intuition can lead to further confusion.
At this point, we will first take a step back from Itô’s Lemma and look at a rough sketch of the proof technique.
Suppose we want to prove that a collection of functions (e.g. \(C^2([a,b])\)) satisfies a certain property \((P)\). We will start by defining \(\mathcal{A}\) as the subset of \(C^2([a,b])\) that satisfies the desired property \((P)\).
(Step 1) We will identify an algebraic structure under which \(\mathcal{A}\) is closed, e.g. for an algebra (over a field) we have that if \(f,g \in \mathcal{A}\), then \(cf + g, fg \in \mathcal{A}\). In other words, an algebra is a vector space with an associative vector multiplication.
(Step 2) Then we check that \(\mathcal{A}\) contains some very simple generating functions, e.g. in an algebra, the functions \(\{1, x\}\) generate the entire collection of polynomials.
(Step 3) At this point, we use a density argument such as Weierstrass approximation to show \(\mathcal{A}\) is dense in \(C^2([a,b])\). Specifically, \(\forall f \in C^2([a,b])\), \(\exists \{f_n\}_{n \geq 1} \subset \mathcal{A}\) such that \(f_n \to f\) with respect to some metric \(\rho\).
(Step 4) Finally, it is sufficient to show \(\mathcal{A}\) is closed under this metric \(\rho\). I.e. if \(\{f_n\}_{n \geq 1} \subset \mathcal{A}\) and \(f_n \to f\) in \(\rho\), then \(f\) also satisfies \((P)\), hence \(f \in \mathcal{A}\).
Remark The reader may already recognize that the sketch above was intentionally phrased in a very general sense, so we can observe the flexibility of the technique. In fact we can even generalize beyond function spaces, as long as we have an equivalent approximation technique.
We start by stating the key theorem.
Theorem (Stone-Weierstrass, Real Numbers) Let \(S\) be a compact Hausdorff space, and \(\mathcal{A} \subset C(S, \mathbb{R})\) an algebra which contains a non-zero constant function. Then \(\mathcal{A}\) is dense in \(C(S, \mathbb{R})\) if and only if it separates points.
Clearly, if we let \(S = B_r\), we have a compact Hausdorff space, and the collection of polynomials contains the functions \(\{1,x\}\) and separates points. Therefore the polynomials are dense in \(C(B_r, \mathbb{R})\) for all \(r > 0\) with respect to the sup-norm.
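This density is very concrete: Bernstein polynomials give an explicit, constructive Weierstrass approximation of any continuous function. A short Python sketch on \([0,1]\) (the interval and the target \(\cosh\) are purely illustrative):

```python
import numpy as np
from math import comb

def bernstein(f, n, x):
    # degree-n Bernstein polynomial of f on [0, 1], evaluated at points x
    k = np.arange(n + 1)
    binom = np.array([comb(n, j) for j in k], dtype=float)
    return (f(k / n) * binom * x[:, None]**k * (1 - x[:, None])**(n - k)).sum(axis=1)

f = np.cosh                        # any continuous target works
x = np.linspace(0.0, 1.0, 401)
errors = [np.max(np.abs(bernstein(f, n, x) - f(x))) for n in (5, 50, 500)]
print(errors)  # sup-norm errors shrink as the degree grows
```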
Applying the same theorem to the derivatives, we then have the same result for \(C^2(B_r, \mathbb{R})\) with respect to a similar norm
\[\| f \|_{B_r} := \sup_{x \in B_r, \, m = 0,1,2} \left| \frac{\partial^m f}{\partial x^m} (x) \right|.\]proof (of Itô’s Lemma): We will similarly use a localization argument, i.e. define \(\tau_r := \inf \{t \geq 0 : |X_t| + [X]_t > r \}\), and replace \(X_t\) with \(X_{t \wedge \tau_r}\).
(Step 1, 2) Let \(\mathcal{A} \subset C^2(\mathbb{R})\) be the collection of functions where Itô’s Lemma is satisfied. Trivially we have that \(\{1,x\}\) are in \(\mathcal{A}\), and \(\mathcal{A}\) forms a vector space.
Next we show that \(\mathcal{A}\) forms an algebra. In particular, suppose \(f,g \in \mathcal{A}\), and define \(F_t := f(X_t), G_t := g(X_t)\). Using the product rule gives us
\[F_t G_t - F_0 G_0 = \int_0^t F_s dG_s + \int_0^t G_s dF_s + [F,G]_t \,.\]Using the Fundamental Theorem and Itô’s Lemma on \(g\), we get
\[\int_0^t F_s dG_s = \int_0^t f(X_s) \frac{\partial g}{\partial x}(X_s) dX_s + \frac{1}{2} \int_0^t f(X_s) \frac{\partial^2 g}{\partial x^2}(X_s) d[X]_s \,,\]and observe the same holds with the roles of \(F,G\) switched. Next we use Itô’s Lemma and expand with the Kunita-Watanabe identity to get
\[[F,G]_t = \int_0^t \frac{\partial f}{\partial x}(X_s) \frac{\partial g}{\partial x}(X_s) d[X]_s \, ,\]where the extra terms vanish because the covariation with a finite variation process is zero, i.e. \([ \,[X]\, ,Y ]_t = 0\) as \([X]_t\) has finite variation. By grouping the integrals by their integrators (e.g. \(d[X]_t\)), we get that \(fg\) satisfies Itô’s Lemma, or simply \(fg \in \mathcal{A}\).
(Step 3) Here we can apply the Stone-Weierstrass Theorem to get that \(\mathcal{A}\) is dense in \(C^2(B_r)\) with respect to the norm \(\|\cdot\|_{B_r}\).
(Step 4) It remains to show that \(\mathcal{A}\) is closed with respect to \(\|\cdot\|_{B_r}\). In particular, let \((f_n)_{n \geq 1}\) be a sequence in \(\mathcal{A}\) such that \(f_n \to f\) in \(\|\cdot\|_{B_r}\). Then we have
\[\int_0^t \left| \frac{\partial^2 f_n}{\partial x^2}(X_s) - \frac{\partial^2 f}{\partial x^2}(X_s) \right| d[X]_s \leq \|f_n - f\|_{B_r} [X]_t \, .\]At the same time, we also have by Itô’s Isometry
\[\begin{align*} \mathbb{E} \left( \int_0^t \frac{\partial f_n}{\partial x}(X_s) - \frac{\partial f}{\partial x}(X_s) dX_s \right)^2 &= \mathbb{E} \int_0^t \left(\frac{\partial f_n}{\partial x}(X_s) - \frac{\partial f}{\partial x}(X_s) \right)^2 d[X]_s \\ &\leq \|f_n - f\|_{B_r}^2 \, \mathbb{E}[X]_t \, . \end{align*}\]Since the process is localized we have \([X]_t \leq r\), and therefore we can pass the limit in the Itô formula and get
\[\begin{align*} f(X_t) - f(X_0) &= \lim_{n\to\infty} f_n(X_t) - f_n(X_0) \\ &= \lim_{n\to\infty} \int_0^t \frac{\partial f_n}{\partial x}(X_s) dX_s + \frac{1}{2} \int_0^t \frac{\partial^2 f_n}{\partial x^2}(X_s) d[X]_s \\ &= \int_0^t \frac{\partial f}{\partial x}(X_s) dX_s + \frac{1}{2} \int_0^t \frac{\partial^2 f}{\partial x^2}(X_s) d[X]_s \,. \end{align*}\]Finally, since Itô’s Lemma holds for all \(r>0\), we can simply take \(r\to\infty\) to complete the proof.
\[\tag*{$\Box$}\]Remark Clearly the alternative proof is not necessarily easier; however, let us observe a couple of advantages.
Firstly, none of the steps above were very complicated, as most followed directly from useful (and well known) propositions. Notably, a first-time reader of this subject will have a much easier time following the steps and seeing the bigger picture, rather than getting trapped in technical details.
Secondly, we now have an additional interpretation of the second integral in the formula, which clearly arises as a consequence of Itô’s product rule and the Kunita-Watanabe identity. For readers who have not seen it, the proof of the product rule follows almost directly from the definition, i.e. it is a direct consequence of choosing the left Riemann sum.
We have shown that the Stone-Weierstrass Theorem is not only a strong result on its own, but also leads to a powerful general technique. In particular, we saw a nice alternative proof of Itô’s Lemma with a much better interpretation. Ideally, the author would have liked to add another example, but the post is already quite long at this point. Hopefully the readers will still have enjoyed an interesting blog post, and added another proof technique to their arsenal.
Please comment below (new feature!) for any questions or feedback!
For the longest time, the lemma was credited to Kiyosi Itô alone, based on his 1950 paper. This changed in the 1990s with a resurgence of interest in the late French-German mathematician Wolfgang Doeblin, who was well known to be quite gifted. The interest led to a demand to open the remaining “pli cacheté” (sealed envelope) held by the French Academy of Sciences, which Doeblin had submitted just before he passed away in 1940 - he burned his notes and took his own life so that the German soldiers could not take advantage of his work. To everyone’s surprise, Doeblin’s letter contained significant research progress ahead of his time, including a statement of the same change of variables formula! To honour his contribution, the result is sometimes referred to as the Itô-Doeblin Lemma.
For the interested readers, I would strongly recommend an excellent commentary by Bernard Bru and Marc Yor (2002) for further details on this topic.
In this blog post, I hope to put together some excellent content I studied recently, specifically from:
We first state a very simple version of the inequality:
Theorem (A Simple Poincaré Inequality) Let \(\Omega \subset \mathbb{R}^n\) be open and bounded, and let \(f \in C^1_c(\Omega)\) (continuously differentiable with compact support). Then there exists a constant \(C\) depending only on \(\Omega\) such that:
\[ \left\lVert f \right\rVert_{L^2(\Omega)} \leq C \lVert \nabla f \rVert_{L^2(\Omega)} \]
Quick aside: we say a function \(f\) has compact support if the set \(S = \{ x \in \Omega : f(x) \neq 0 \}\) has compact closure. This implies \(f(x) = 0\) near the boundary.
Observe that the inequality bounds the \(L^2\)-norm of a function in terms of the \(L^2\)-norm of its gradient. Note that compact support is an important assumption when we are integrating with respect to the Lebesgue measure: consider for example a non-zero constant function, for which the inequality fails since the gradient is zero. The reader may be comforted that more general forms require much weaker assumptions, and can be generalized to all \(L^p\) norms.
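To make the inequality concrete, take \(\Omega = (0,1)\) and \(f(x) = \sin(\pi x)\), which vanishes at the boundary. Then \(\|f\|_{L^2} = 1/\sqrt{2}\) and \(\|\nabla f\|_{L^2} = \pi/\sqrt{2}\), so the inequality holds with any \(C \geq 1/\pi\). A quick numerical check (the grid size is arbitrary):

```python
import numpy as np

# check  ||f||_{L^2} <= C ||f'||_{L^2}  on (0, 1) for f(x) = sin(pi x)
n = 100_000
x = (np.arange(n) + 0.5) / n           # midpoint grid on (0, 1)
dx = 1.0 / n
f = np.sin(np.pi * x)
df = np.pi * np.cos(np.pi * x)

norm_f = np.sqrt(np.sum(f**2) * dx)    # exact value: 1/sqrt(2)
norm_df = np.sqrt(np.sum(df**2) * dx)  # exact value: pi/sqrt(2)
print(norm_f, norm_df)                 # ratio is 1/pi
```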
The reason we start with this inequality is because the proof is quite straightforward:
proof (of the Simple Poincaré Inequality):
Without loss of generality, we let \(\Omega \subset [0,M]^n\) for some large \(M > 0\), and by the Cauchy-Schwarz inequality we have
\[ \vert f(x) \vert^2 = \left\vert \int_0^{x_1} \frac{\partial f}{\partial x_1} (y_1, x_2, \ldots) dy_1 \right\vert^2 \leq \left[ \int_0^M 1^2 dy_1 \right] \left[ \int_0^M \left\vert \frac{\partial f}{\partial x_1} \right\vert^2 dy_1 \right] \]
Summing the analogous bound over all \(n\) coordinates, and integrating over \(\Omega\), we have
\[ n \int_\Omega \vert f(x) \vert^2 \leq \int_\Omega \sum_{i=1}^n M \int_0^M \left\vert \frac{\partial f}{\partial x_i} \right\vert^2 dy_i = \sum_{i=1}^n M^2 \left\lVert \frac{\partial f}{\partial x_i} \right\rVert^2_{L^2(\Omega)} \]
where in the last step we exchanged the order of integration and used the fact that the inner integral does not depend on \(x_i\), yielding another factor of \(M\). Rewriting the above we get the desired result
\[ \lVert f \rVert_{L^2(\Omega)} \leq \frac{M}{\sqrt{n}} \lVert \nabla f \rVert_{L^2(\Omega)} \]
\[\tag*{$\Box$}\]We now state the inequality in a form most useful for probability theory, see Theorem 3.20 from Boucheron, Lugosi, Massart (2013):
Theorem (Gaussian-Poincaré Inequality) Let \(X = (X_1, \ldots, X_n)\) be a vector of i.i.d. standard Gaussian random variables. Let \(\, f : \mathbb{R}^n \to \mathbb{R}\) be any continuously differentiable function. Then
\[ \text{Var}[f(X)] \leq \mathbb{E}\left[ | \nabla f(X)|^2 \right] \]
Observe that the inequality is slightly different. Firstly, this time the norm is centered, although centering is not an issue since \(\text{Var}[f(X)] \leq \mathbb{E}[f(X)^2]\). Secondly, since the measure is a probability measure, we have a much smaller constant \(C=1\) in the inequality. In combination, we were also able to drop the compact support assumption.
An immediate consequence is to consider \(f\) Lipschitz with coefficient \(1\), i.e. \(| f(x) - f(y) | \leq \|x - y\|\), then we have
\[ \text{Var}[f(X)] \leq 1 \]
In other words, we just found a constant bound on the variance for a huge class of random functions! In general, we can consider \(f\) to be a smooth estimator based on a dataset with noise \(X\). The Poincaré inequality will provide a very useful bound on estimation error.
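A hedged Monte Carlo check of the inequality, with the arbitrary test function \(f(x_1,x_2) = \sin(x_1)\cos(x_2)\), whose gradient is \((\cos x_1 \cos x_2, -\sin x_1 \sin x_2)\):

```python
import numpy as np

rng = np.random.default_rng(4)
X1, X2 = rng.normal(size=(2, 1_000_000))   # i.i.d. standard Gaussian coordinates

# an arbitrary smooth test function and its squared gradient norm
f = np.sin(X1) * np.cos(X2)
grad_sq = (np.cos(X1) * np.cos(X2))**2 + (np.sin(X1) * np.sin(X2))**2

var_f = np.var(f)
bound = np.mean(grad_sq)
print(var_f, bound)   # Var[f(X)] should not exceed E|∇f(X)|^2
```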
To prove this inequality, we will use a famous result from 1981 (Theorem 3.1 in Boucheron, Lugosi, Massart (2013)):
Theorem (Efron-Stein Inequality) Let \(X = (X_1, \ldots, X_n)\) be a vector of i.i.d. random variables and let \(Z = f(X)\) be a square-integrable function of \(X\). Then
\[ \text{Var}(Z) \leq \sum_{i=1}^n \mathbb{E} \left[ \left( Z - \mathbb{E}^{(i)}Z \right)^2 \right] \]
where \(\mathbb{E}^{(i)}Z = \int f(X_1, \ldots, X_{i-1}, x_i, X_{i+1},\ldots) d\mu_i(x_i)\), i.e. the expectation over \(X_i\) only.
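As a quick illustration, we can check the inequality by simulation for \(Z = \max(X_1,\ldots,X_5)\) with \(X_i\) uniform on \([0,1]\), using the equivalent form \(\text{Var}(Z) \leq \frac{1}{2}\sum_{i} \mathbb{E}[(Z - Z_i')^2]\), where \(Z_i'\) recomputes \(Z\) with \(X_i\) replaced by an independent copy (this form also appears in Theorem 3.1 of Boucheron, Lugosi, Massart (2013)). All numerical choices here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(5)
m, n = 200_000, 5
X = rng.random((m, n))          # m i.i.d. samples of (X_1, ..., X_n), uniform
Z = X.max(axis=1)

bound = 0.0
for i in range(n):
    Xp = X.copy()
    Xp[:, i] = rng.random(m)    # replace coordinate i by an independent copy
    bound += 0.5 * np.mean((Z - Xp.max(axis=1))**2)

print(np.var(Z), bound)  # the variance should not exceed the Efron-Stein bound
```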
The Efron-Stein inequality can be proved by decomposing the variance as a sum of telescoping differences of conditional expectations, and applying Jensen’s inequality to the individual terms. While we omit the proof here, we should remark that the simple Efron-Stein inequality has wide ranging applications; we will only look at one such use for the proof of the Poincaré inequality, taken from Theorem 3.20 in Boucheron, Lugosi, Massart (2013):
proof (of Gaussian-Poincaré Inequality):
First we observe that a direct application of the Efron-Stein inequality can reduce the problem down to \(n=1\), i.e. it is sufficient to show
\[ \mathbb{E}^{(i)} \left[ \left( Z - \mathbb{E}^{(i)}Z \right)^2 \right] \leq \mathbb{E}^{(i)} \frac{\partial f}{\partial x_i}(X)^2 \]
From here we assume without loss of generality \(n=1\). Then we notice that it is sufficient to prove this inequality for compactly supported, twice differentiable functions, i.e. \(f \in C_c^2(\mathbb{R})\), since otherwise we can just take a limit to the original function.
Here we let \(\epsilon_1,\ldots,\epsilon_n\) be i.i.d. Rademacher random variables, i.e. \(\mathbb{P}[\epsilon_j = 1] = \mathbb{P}[\epsilon_j = -1] = \frac{1}{2} \,\forall j \in \{ 1,2,\ldots,n \}\), and we define
\[ S_n = n^{-1/2} \sum_{j=1}^n \epsilon_j \]
Observe that for every \(i\) we have
\[ \text{Var}^{(i)}[f(S_n)] = \frac{1}{4} \left[ f\left( S_n + \frac{1-\epsilon_i}{\sqrt{n}} \right) - f\left( S_n - \frac{1+\epsilon_i}{\sqrt{n}} \right) \right]^2 \]
Applying the Efron-Stein inequality, we get
\[ \text{Var}[f(S_n)] \leq \frac{1}{4} \sum_{i=1}^n \mathbb{E} \left[ \left( f\left( S_n + \frac{1-\epsilon_i}{\sqrt{n}} \right) - f\left( S_n - \frac{1+\epsilon_i}{\sqrt{n}} \right) \right)^2 \right] \]
Let \(K = \sup_x \vert f''(x) \vert\), then we have that
\[ \left|f\left( S_n + \frac{1-\epsilon_i}{\sqrt{n}} \right) - f\left( S_n - \frac{1+\epsilon_i}{\sqrt{n}} \right)\right| \leq \frac{2}{\sqrt{n}} |f'(S_n)| + \frac{2K}{n} \]
which implies
\[ \frac{n}{4} \left( f\left( S_n + \frac{1-\epsilon_i}{\sqrt{n}} \right) - f\left( S_n - \frac{1+\epsilon_i}{\sqrt{n}} \right) \right)^2 \leq f'(S_n)^2 + \frac{2K}{\sqrt{n}} | f'(S_n) | + \frac{K^2}{n} \]
Finally, the central limit theorem implies the desired result
\[ \limsup_{n\to\infty} \frac{1}{4} \sum_{i=1}^n \mathbb{E} \left[ \left( f\left( S_n + \frac{1-\epsilon_i}{\sqrt{n}} \right) - f\left( S_n - \frac{1+\epsilon_i}{\sqrt{n}} \right) \right)^2 \right] = \mathbb{E} \left[ f'(X)^2 \right] \]
\[\tag*{$\Box$}\]Remark There are also Poincaré type inequalities for non-Gaussian random variables, for example if \(X\sim\)Poisson\((\mu)\):
\[ \text{Var}[f(X)] \leq \mu \mathbb{E}\left[ (f(X+1) - f(X))^2 \right] \]
Or if \(X\) is double exponential i.e. with density \(\frac{1}{2}e^{-\vert x \vert}\), then we have:
\[ \text{Var}[f(X)] \leq 4 \mathbb{E}\left[ (f'(X))^2 \right] \]
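These discrete variants are also easy to check by simulation. For the Poisson case, with the illustrative choices \(\mu = 3\) and \(f(x) = x^2\) (for which the exact values are \(\text{Var}[f(X)] = 165\) and \(\mu\,\mathbb{E}[(f(X+1)-f(X))^2] = 183\)):

```python
import numpy as np

rng = np.random.default_rng(6)
mu = 3.0
X = rng.poisson(mu, size=1_000_000).astype(float)

f = lambda x: x**2                           # an arbitrary test function
lhs = np.var(f(X))                           # exact value: 165 for Poisson(3)
rhs = mu * np.mean((f(X + 1) - f(X))**2)     # exact value: 183
print(lhs, rhs)  # lhs should not exceed rhs
```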
To do proper justice to the theory of PDEs, we would need a significant background in functional analysis. In this section, we will try to side-step the technical details and focus on one single application, that is, showing the existence and uniqueness of a weak solution for Poisson’s equation:
\[ -\Delta u = f \; \text{ in } \Omega \] \[ u = 0 \; \text{ on } \partial \Omega \]
where \(\Omega \subset \mathbb{R}^n\) is open and bounded with smooth boundary \(\partial \Omega\). By a weak solution, we mean there exists a \(u \in C^1(\Omega)\) such that \(\forall v \in C^1_c(\Omega)\) we have
\[ B[u,v] := \int_\Omega \nabla u \cdot \nabla v = \int_\Omega f v \]
Note if \(u\) is a solution to the (original) Poisson’s equation, then we have the above weak equation by Green’s identity. The main tool we will use to prove existence and uniqueness is the following result:
Theorem (Lax-Milgram) Let \(H\) be a Hilbert space, and \(B: H \times H \to \mathbb{R}\) be a continuous, coercive, bilinear form. Then \(\forall \varphi \in H^*\), there exists a unique \(u\in H\) such that
\[ B(u,v) = \langle \varphi, v \rangle \quad \forall v \in H \]
where \(\langle \varphi,v \rangle\) is the linear functional \(\varphi\) applied to \(v\).
Remark Before going into the definitions and technical details, we observe that Lax-Milgram Theorem gives us exactly what we want - the existence and uniqueness! Now we just have to fill in the blanks:
Step 1 To define our Hilbert space, we will consider the following inner product:
\[ (u,v) := \int_\Omega [uv + \nabla u \cdot \nabla v] \]
which corresponds to the following Sobolev norm:
\[ \lVert u \rVert_{H^{1}(\Omega)} := (u,u)^{1/2} = \left[ \lVert u \rVert_{L^2(\Omega)}^2 + \lVert \nabla u \rVert_{L^2(\Omega)}^2 \right]^{1/2} \]
By equipping the space \(C^1_c(\Omega)\) with the above inner product, we almost have a Hilbert space! Here we will simply take the completion of \(C^1_c(\Omega)\) with respect to the Sobolev norm, i.e. add all the limit points to the space. We call this (completed) Hilbert space \(H_0^1(\Omega)\).
Step 2 We now turn our attention to \(B(u,v)\), the bilinear form (a fancy term for a map that is linear in each input separately). Then we say \(B\) is continuous if
\[ \exists C_1 > 0 : \forall u,v \in H, \vert B(u,v) \vert \leq C_1 \lVert u \rVert_{H^1(\Omega)} \lVert v \rVert_{H^1(\Omega)} \]
Note this is an immediate consequence of Cauchy-Schwarz inequality
\[ \vert B(u,v) \vert \leq \lVert \nabla u \rVert_{L^2(\Omega)} \lVert \nabla v \rVert_{L^2(\Omega)} \leq \lVert u \rVert_{H^1(\Omega)} \lVert v \rVert_{H^1(\Omega)} \]
We say \(B\) is coercive if
\[ \exists C_2 > 0 : \forall u \in H, B(u,u) \geq C_2 \lVert u \rVert_{H^1(\Omega)}^2 \]
We notice this is the only non-trivial condition left to check, and to prove it we will finally use the Poincaré inequality! Start by rewriting
\[ B(u,u) = \int_\Omega \nabla u \cdot \nabla u = \lVert \nabla u \rVert_{L^2(\Omega)}^2 \]
Applying the Poincaré inequality on half of the norm we have
\[ \frac{1}{2} \lVert \nabla u \rVert_{L^2(\Omega)}^2 \geq \frac{1}{2C^2} \lVert u \rVert_{L^2(\Omega)}^2 \]
Therefore
\[ B(u,u) \geq \frac{1}{2} \lVert \nabla u \rVert_{L^2(\Omega)}^2 + \frac{1}{2C^2} \lVert u \rVert_{L^2(\Omega)}^2 \geq \min\left(\frac{1}{2}, \frac{1}{2C^2}\right) \lVert u \rVert_{H^1(\Omega)}^2 \]
And voilà, we have existence and uniqueness! A rigorous and careful reader may notice that \(u\) does not necessarily have compact support - this is correct. However every \(u \in H_0^1(\Omega)\) is a limit of compactly supported functions, therefore we just need to take a limit to get our result!
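For intuition, we can compare the abstract existence result with a concrete computation. The hypothetical finite-difference sketch below (an illustration, not part of the Lax-Milgram machinery) solves \(-u'' = 1\) on \((0,1)\) with zero boundary values, whose unique solution is \(u(x) = x(1-x)/2\):

```python
import numpy as np

# finite-difference discretization of -u'' = 1 on (0, 1), u(0) = u(1) = 0
n = 1000
h = 1.0 / n
x = np.linspace(0.0, 1.0, n + 1)

# tridiagonal matrix for the second difference operator -u''
A = (np.diag(2.0 * np.ones(n - 1))
     - np.diag(np.ones(n - 2), 1)
     - np.diag(np.ones(n - 2), -1)) / h**2
f = np.ones(n - 1)

u = np.zeros(n + 1)
u[1:-1] = np.linalg.solve(A, f)   # interior values; boundary stays 0

exact = x * (1 - x) / 2           # the exact solution for f = 1
print(np.max(np.abs(u - exact)))  # discretization is exact for quadratics
```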
Remark In fact, we can use similar Lax-Milgram based methods to show existence and uniqueness for a large class of elliptic PDEs. We should note that being able to “convert” between \(\|u\|\) and \(\|\nabla u\|\) is highly useful for studying Sobolev norms. We refer curious readers to Evans (2010) for an excellent chapter on Sobolev spaces and related inequalities.
I have a weak spot for connections between different fields, probably because they are always surprising, and surprises are intriguing in math! I hope to have presented a readable introduction to the inequality and its applications in both topics, without drowning readers in technical details. On this note, I should remark that to study Sobolev spaces rigorously, the reader will need to go through all the details carefully!
As this is my first blog post, any constructive feedback or suggestions on future topics will be appreciated!