Jekyll2023-09-21T17:10:32-04:00https://mufan-li.github.io/feed.xmlMufan (Bill) Li{"name"=>"", "avatar"=>"Profile3.jpg", "bio"=>nil, "location"=>nil, "employer"=>nil, "pubmed"=>nil, "googlescholar"=>"https://scholar.google.com/citations?user=9dSlc_cAAAAJ", "email"=>"mufan.li@princeton.edu", "researchgate"=>nil, "uri"=>nil, "bitbucket"=>nil, "codepen"=>nil, "dribbble"=>nil, "flickr"=>nil, "facebook"=>nil, "foursquare"=>nil, "github"=>nil, "google_plus"=>nil, "keybase"=>nil, "instagram"=>nil, "impactstory"=>nil, "lastfm"=>nil, "linkedin"=>"mufan-bill-li-35749833", "orcid"=>nil, "pinterest"=>nil, "soundcloud"=>nil, "stackoverflow"=>nil, "steam"=>nil, "tumblr"=>nil, "twitter"=>"mufan_li", "vine"=>nil, "weibo"=>nil, "xing"=>nil, "youtube"=>nil, "wikipedia"=>nil}mufan.li@princeton.eduAn Unusually Clean Proof: Dyson Brownian Motion via Conditioning on Non-intersection2021-06-26T00:00:00-04:002021-06-26T00:00:00-04:00https://mufan-li.github.io/dyson_doob<p>Dyson Brownian motion [Dy62] is best known for characterizing the eigenvalues of special random matrices [Ta12]. Most interestingly, it is also equal in distribution to \(n\) independent Brownian motions conditioned to not intersect [Gr99]. In a topics course by <a href="http://www.math.toronto.edu/balint/">Bálint Virág</a>, I came across a proof of this result that is <em>just too clean</em> for this type of calculation. After picking up my jaw from the ground months later, I finally decided to write up this surprisingly elegant proof.</p>
<h2 id="background-doobs-h-transform">Background: Doob’s h-Transform</h2>
<p>To compute the conditional dynamics of Markov processes, we will use the h-transform by <a href="https://en.wikipedia.org/wiki/Joseph_L._Doob">Joseph Doob</a> [Bl10]. Let us consider a time-homogeneous Markov process \(\{ X(t) \}_{t \geq 0}\) to be conditioned on a shift invariant event \(A\), i.e.</p>
\[\mathbb{P}( \{ X(t) \}_{t\geq 0} \in A | X(0) = x )
= \mathbb{P}( \{ X(t+s) \}_{t\geq 0} \in A | X(s) = x ) \,.\]
<p>An important example of a shift invariant event comes from the <em>gambler’s ruin</em> problem, where \(X(t) \in (0,c)\) is a martingale and the event is defined as</p>
\[A := \left\{ X(t) \text{ hits } c \text{ before } 0 \right\} \,.\]
<p>Here \(X(t)\) is intended to model a gambler’s wealth process in a fair betting game, and it’s well known that \(\mathbb{P}(A) = \frac{X(0)}{c}\) (a direct consequence of the optional stopping theorem).</p>
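<p>As a quick sanity check of this formula, here is a minimal sketch in a discrete analogue, assuming the gambler’s wealth is a simple symmetric random walk on \(\{0, 1, \ldots, c\}\) (this discrete model and the function name <code>ruin_win_probabilities</code> are assumptions made for illustration). It solves the harmonic equations \(h(k) = \frac{1}{2}( h(k-1) + h(k+1) )\) with \(h(0) = 0\) and \(h(c) = 1\) exactly, and recovers \(h(k) = k/c\).</p>

```python
from fractions import Fraction


def ruin_win_probabilities(c):
    """Return [h(0), ..., h(c)], where h(k) is the probability that a simple
    symmetric random walk started at k hits c before 0.

    Illustrative discrete stand-in for the martingale X(t) in the text.
    """
    n = c - 1  # number of interior states 1, ..., c-1
    if n == 0:
        return [Fraction(0), Fraction(1)]
    half = Fraction(1, 2)
    # Tridiagonal system: -1/2 h(k-1) + h(k) - 1/2 h(k+1) = 0,
    # with boundary values h(0) = 0 and h(c) = 1 moved to the right-hand side.
    diag = [Fraction(1)] * n
    rhs = [Fraction(0)] * n
    rhs[-1] = half
    # Forward elimination (Thomas algorithm); off-diagonal entries are -1/2.
    for k in range(1, n):
        w = -half / diag[k - 1]
        diag[k] -= w * (-half)
        rhs[k] -= w * rhs[k - 1]
    # Back substitution.
    h = [Fraction(0)] * n
    h[-1] = rhs[-1] / diag[-1]
    for k in range(n - 2, -1, -1):
        h[k] = (rhs[k] + half * h[k + 1]) / diag[k]
    return [Fraction(0)] + h + [Fraction(1)]


if __name__ == "__main__":
    print(ruin_win_probabilities(5))
```

<p>Exact rational arithmetic is used so that the comparison with \(k/c\) is an equality rather than a floating point approximation.</p>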
<p>We will provide a simple sketch of the h-transform result (see [Bl10] for a rigorous proof). Here we introduce the following notation:</p>
\[\begin{split}
\mathbb{P}_x( \cdot ) &:= \mathbb{P}( \cdot | X(0) = x ) \,,
\\
h(x) &:= \mathbb{P}_x(A) \,,
\\
P^t(x, dy) &:= \mathbb{P}_x( X(t) \in dy ) \,,
\\
\tilde{P}^t(x, dy) &:= \mathbb{P}_x( X(t) \in dy | A) \,,
\end{split}\]
<p>where \(h(x)\) is the key transform function, and \(P^t(x,dy)\) is the transition kernel, which completely characterizes the dynamics of a Markov process. Our goal is therefore to compute \(\tilde{P}^t(x,dy)\), for which we can simply use Bayes’ rule</p>
\[\begin{aligned}
\tilde{P}^t(x, dy) &= \mathbb{P}_x( X(t) \in dy | A)
\\
&= \frac{ \mathbb{P}_x(A | X(t) \in dy) \mathbb{P}_x( X(t) \in dy ) }{ \mathbb{P}_x(A) }
\\
&= \frac{h(y)}{h(x)} P^t(x, dy) \,,
\end{aligned}\]
<p>where we used the shift invariance of \(A\) to write \(\mathbb{P}_x(A \vert X(t) \in dy) = h(y)\). In other words, the Radon–Nikodym derivative for the transition kernel is simply the ratio \(h(y)/h(x)\)!</p>
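<p>To see the ratio \(h(y)/h(x)\) in action, here is a small sketch in a discrete setting, assuming a simple symmetric random walk on \(\{0, \ldots, c\}\) with \(h(k) = k/c\) (the discrete model and the name <code>conditioned_step_probabilities</code> are illustrative assumptions). It reweights the one-step kernel \(P(k, k \pm 1) = \frac{1}{2}\) by \(h(k \pm 1)/h(k)\).</p>

```python
from fractions import Fraction


def conditioned_step_probabilities(k, c):
    """One-step (down, up) probabilities from an interior state k for the
    simple symmetric walk on {0, ..., c}, conditioned to hit c before 0.

    Uses the h-transform: P~(k, y) = h(y) / h(k) * P(k, y) with h(k) = k / c.
    """
    if not 0 < k < c:
        raise ValueError("k must be an interior state, 0 < k < c")

    def h(x):
        return Fraction(x, c)

    base = Fraction(1, 2)  # unconditioned step probability
    down = h(k - 1) / h(k) * base
    up = h(k + 1) / h(k) * base
    return down, up


if __name__ == "__main__":
    for k in range(1, 7):
        print(k, conditioned_step_probabilities(k, 7))
```

<p>The reweighted probabilities still sum to one because \(h\) is harmonic for the walk, and the step down from \(k = 1\) has probability zero, so the conditioned walk never reaches \(0\). The upward bias \(\frac{k+1}{2k} - \frac{k-1}{2k} = \frac{1}{k}\) plays the role of a drift.</p>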
<p>To make calculations even simpler, we will also compute the effect on the infinitesimal generator, namely the operator \(L\) defined as follows:</p>
\[L[f](x) := \lim_{t \to 0} \frac{ \mathbb{E}_x f(X(t)) - f(x) }{t} \,,
\quad \text{ if it exists, }\]
<p>where we define \(\mathbb{E}_x(\cdot) := \mathbb{E}( \cdot \vert X(0) = x)\).</p>
<p>We want to then compute</p>
\[\begin{aligned}
\tilde{L}f
&:=
\lim_{t \to 0} \frac{ \mathbb{E}_x( f(X(t)) | A ) - f(x) }{t}
\\
&=
\lim_{t \to 0} \frac{1}{t} \left(
\int f(y) \tilde{P}^t(x,dy) - f(x)
\right)
\\
&=
\lim_{t \to 0} \frac{1}{t} \left(
\int f(y) \frac{h(y)}{h(x)} P^t(x,dy) - f(x)
\right)
\\
&=
\frac{1}{h(x)} \lim_{t \to 0} \frac{1}{t} \left(
\int (f(y) h(y) - f(x) h(x)) P^t(x,dy)
\right)
\\
&=
\frac{1}{h(x)} L[fh](x) \,,
\end{aligned}\]
<p>where we used the h-transform result above and the definition of the generator.</p>
<p>At this point, we recall the well-known fact that \(h\) is harmonic, i.e. \(Lh = 0\), to save some calculations (I suspect the letter “h” in h-transform stands for “harmonic”). We also recall that for a diffusion process \(dX(t) = \mu(X(t))\,dt + dB(t)\), the generator follows from Itô’s Lemma</p>
\[L[f](x) := \langle \mu(x), \nabla f(x) \rangle + \frac{1}{2} \Delta f(x) \,.\]
<p>Using the harmonic property, we have the following clean formula for the transformed generator</p>
\[\tilde{L} [f](x) = \frac{ L[fh](x) }{h(x)} = \langle \mu(x) + \nabla \log h(x), \nabla f(x) \rangle + \frac{1}{2} \Delta f(x) \,,\]
<p>which corresponds to the diffusion process</p>
\[d\tilde{X}(t) = ( \mu(\tilde{X}(t)) + \nabla \log h(\tilde{X}(t)) ) \, dt + dB(t) \,.\]
<p>To summarize, <em>h-transform simply adds a</em> <strong>drift term</strong> <em>to the original process</em>! Here we remark that although the above formula looks simple in terms of \(h(x)\), the function \(h(x)\) itself is often quite complicated, making this calculation at least convoluted if not <em>completely intractable</em>. This is why the proof coming out so clean in the next section is a huge surprise.</p>
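<p>For readers who want the intermediate step behind the transformed generator, here is a short derivation via the product rule, under the same assumptions as above (unit diffusion coefficient and \(Lh = 0\)):</p>
\[\begin{aligned}
L[fh]
&= \langle \mu, \nabla (fh) \rangle + \frac{1}{2} \Delta (fh)
\\
&= f \left( \langle \mu, \nabla h \rangle + \frac{1}{2} \Delta h \right)
+ h \left( \langle \mu, \nabla f \rangle + \frac{1}{2} \Delta f \right)
+ \langle \nabla h, \nabla f \rangle
\\
&= f \, L[h] + h \, L[f] + \langle \nabla h, \nabla f \rangle \,.
\end{aligned}\]
<p>Dividing by \(h\), using \(L[h] = 0\), and writing \(\nabla h / h = \nabla \log h\) gives the formula for \(\tilde{L}\). As a worked example, take the gambler’s ruin event with \(X(t)\) a standard Brownian motion on \((0,c)\): then \(h(x) = x/c\) and \(\nabla \log h(x) = 1/x\), so the conditioned process satisfies</p>
\[d\tilde{X}(t) = \frac{dt}{\tilde{X}(t)} + dB(t) \,,\]
<p>which is the SDE of a three-dimensional Bessel process. Note that the drift does not depend on \(c\).</p>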
<h2 id="dyson-brown-motion-via-conditioning-on-non-intersection">Dyson Brownian Motion via Conditioning on Non-intersection</h2>
<p>Here we will first state the main result.</p>
<hr />
<p><strong>Theorem</strong> Let \(\{ \lambda_i(t) \}_{t\geq 0, i \in [n]}\) be a Dyson Brownian motion, i.e. \(\lambda(t)\) satisfies the following stochastic differential equation (SDE)</p>
\[d\lambda_i(t) = dB_i(t) + \sum_{j \neq i} \frac{dt}{ \lambda_i(t) - \lambda_j(t) } \,,\]
<p>where the initial conditions satisfy \(\lambda_1(0) > \lambda_2(0) > \cdots > \lambda_n(0)\), and \(\{B_i(t)\}\) are independent standard Brownian motions. Then we have the following equality in distribution</p>
\[\{ \lambda_i(t) \}_{t \geq 0, i \in [n]} \overset{d}{=} \{ B_i(t) \}_{t \geq 0, i \in [n]} \vert A \,,\]
<p>where we define \(A := \{ \{B_i(t)\} \text{ do not intersect } \}\).</p>
<hr />
<p>Before we start, we note that the event that \(n\) Brownian motions never intersect has zero probability. What does it even mean, then, to condition on an event of zero probability? We consider a collection of events \(\{A_c\}_{c>0}\) converging to the null event \(A\), such that \(\mathbb{P}(A_c) > 0\) for all \(c>0\), and compute the dynamics of these Brownian motions in the limit as \(c\to \infty\).</p>
<p>To define these events \(A_c\), we will define the <em>Vandermonde determinant</em>:</p>
\[\Delta_n( \lambda ) = \prod_{1 \leq i < j \leq n} ( \lambda_i - \lambda_j ) \,.\]
<p>Here we observe that since \(\lambda\) is sorted in decreasing order, we have that</p>
\[\Delta_n( \lambda ) > 0 \iff \lambda_i \neq \lambda_j \quad \forall i \neq j \,.\]
<p>Therefore, we can define the events \(A_c := \{ \Delta_n( B(t) ) \text{ hits } c \text{ before } 0 \}\). Observe that we can indeed recover the non-intersection event \(A\) in the limit</p>
\[A = \lim_{c \to \infty} A_c \,.\]
<p>Recalling the gambler’s ruin example, if \(\Delta_n( B(t) )\) is a martingale, we have a very simple formula for the h-transform</p>
\[h_c( x ) := \mathbb{P}_{x}( A_c ) = \frac{ \Delta_n(x) }{ c } \,.\]
<p>We first prove that \(\Delta_n( B(t) )\) is indeed a martingale.</p>
<hr />
<p><strong>Lemma</strong> \(\Delta_n(B(t))\) is a martingale.</p>
<hr />
<p><em>proof (of Lemma):</em> We will directly compute the SDE of \(\Delta_n(B(t))\) using Itô’s Lemma</p>
\[d \Delta_n(B(t))
= \frac{1}{2} \Delta \Delta_n(B(t)) \, dt + \cdots dB(t) \,,\]
<p>where we hide the diffusion term since the Itô integral with respect to a martingale is also a martingale. Therefore it’s sufficient to show the drift term is zero.</p>
<p>Using the identity</p>
\[\frac{1}{(a-b)(a-c)} + \frac{1}{(b-a)(b-c)} + \frac{1}{(c-a)(c-b)} = 0 \,,\]
<p>it’s a simple calculation to show that \(\Delta \Delta_n(x) = 0\) via this symmetry, and hence the desired result follows.</p>
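<p>For completeness, here is one way to organize this calculation (a sketch, valid at points with distinct coordinates and extended to all \(x\) by continuity, since \(\Delta_n\) is a polynomial). Differentiating the product twice in \(x_i\) gives</p>
\[\begin{aligned}
\partial_i \Delta_n(x)
&= \Delta_n(x) \sum_{j \neq i} \frac{1}{x_i - x_j} \,,
\\
\partial_i^2 \Delta_n(x)
&= \Delta_n(x) \left[
\left( \sum_{j \neq i} \frac{1}{x_i - x_j} \right)^2
- \sum_{j \neq i} \frac{1}{(x_i - x_j)^2}
\right]
= \Delta_n(x) \sum_{j \neq i} \sum_{k \neq i,j} \frac{1}{(x_i - x_j)(x_i - x_k)} \,.
\end{aligned}\]
<p>Summing over \(i\), the Laplacian becomes \(\Delta_n(x)\) times a sum over ordered triples of distinct indices \((i,j,k)\). Grouping the triples by their underlying set \(\{a,b,c\}\), each set contributes twice the left hand side of the identity above (with \(x_a, x_b, x_c\) in place of \(a,b,c\)), which is zero.</p>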
\[\tag*{$\Box$}\]
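<p>The harmonicity of the Vandermonde determinant can also be checked mechanically for small \(n\). The following is a minimal sketch that expands \(\Delta_n\) as a polynomial with integer coefficients and applies the Laplacian term by term; the dictionary representation and the function names are choices made for this illustration.</p>

```python
from itertools import combinations


def vandermonde_poly(n):
    """Expand prod_{i<j} (x_i - x_j) as {exponent tuple: integer coefficient}."""
    poly = {(0,) * n: 1}
    for i, j in combinations(range(n), 2):
        new = {}
        for exps, coef in poly.items():
            # Multiply the current monomial by x_i and by -x_j.
            for idx, sign in ((i, 1), (j, -1)):
                e = list(exps)
                e[idx] += 1
                key = tuple(e)
                new[key] = new.get(key, 0) + sign * coef
        poly = {k: v for k, v in new.items() if v != 0}
    return poly


def laplacian(poly):
    """Apply sum_i d^2/dx_i^2 to a polynomial in the same representation."""
    out = {}
    for exps, coef in poly.items():
        for i, e in enumerate(exps):
            if e >= 2:
                k = list(exps)
                k[i] -= 2
                key = tuple(k)
                out[key] = out.get(key, 0) + coef * e * (e - 1)
    return {k: v for k, v in out.items() if v != 0}


if __name__ == "__main__":
    for n in range(2, 6):
        print(n, laplacian(vandermonde_poly(n)) == {})
```

<p>An empty dictionary means every coefficient of the Laplacian cancels, i.e. the polynomial is harmonic. This is only a check for the tested values of \(n\), not a replacement for the proof.</p>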
<p><em>proof (of Theorem):</em> It remains to compute the h-transform dynamics, in particular the drift term</p>
\[\begin{aligned}
\partial_i \log h_c(x)
&=
\partial_i ( \log \Delta_n(x) - \log c )
\\
&=
\frac{1}{\Delta_n(x)} \sum_{j \neq i} \frac{\Delta_n(x)}{ x_i - x_j } \,,
\end{aligned}\]
<p>which implies that conditioned on the event \(A_c\), the h-transformed process satisfies the SDE</p>
\[\begin{aligned}
d \lambda_i(t)
&= \partial_i \log h_c( \lambda(t) ) \, dt + dB_i(t)
\\
&= \sum_{j \neq i} \frac{ dt }{ \lambda_i(t) - \lambda_j(t) } + dB_i(t) \,.
\end{aligned}\]
<p>Finally, to complete the proof, we observe that the dynamics of \(\lambda(t)\) do not depend on \(c>0\), so taking \(c \to \infty\) recovers the unbounded dynamics of Dyson Brownian motion.</p>
\[\tag*{$\Box$}\]
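<p>The drift computation in the proof can be sanity-checked numerically. Below is a small sketch that compares the closed form \(\sum_{j \neq i} \frac{1}{x_i - x_j}\) with a central finite difference of \(\log \Delta_n(x)\); the test point, step size, and function names are arbitrary choices for illustration.</p>

```python
import math


def vandermonde(x):
    """Vandermonde determinant prod_{i<j} (x_i - x_j)."""
    prod = 1.0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            prod *= x[i] - x[j]
    return prod


def dyson_drift(x):
    """Closed-form drift: the i-th entry is sum_{j != i} 1 / (x_i - x_j)."""
    return [
        sum(1.0 / (xi - xj) for j, xj in enumerate(x) if j != i)
        for i, xi in enumerate(x)
    ]


def numerical_grad_log_vandermonde(x, eps=1e-6):
    """Central finite-difference gradient of log(vandermonde(x)).

    Assumes x is sorted in strictly decreasing order, so the determinant
    is positive and the logarithm is defined.
    """
    grads = []
    for i in range(len(x)):
        up = list(x)
        up[i] += eps
        down = list(x)
        down[i] -= eps
        diff = math.log(vandermonde(up)) - math.log(vandermonde(down))
        grads.append(diff / (2 * eps))
    return grads


if __name__ == "__main__":
    point = [3.0, 1.5, 0.2, -1.0]
    print(dyson_drift(point))
    print(numerical_grad_log_vandermonde(point))
```

<p>The two lists should agree up to finite-difference error. Since \(\log c\) is constant in \(x\), the same drift is obtained for every \(c\).</p>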
<h2 id="final-words">Final Words</h2>
<p>That’s it! That’s the proof! Having played around with h-transforms before, and getting only ridiculously ugly expressions, it’s quite remarkable to me that this proof was able to avoid messy calculations altogether.</p>
<p>What helped simplify this proof? To quote Bálint Virág: “[t]his is because the Vandermonde [determinant] is harmonic, so this fits into the h-transform language.” Indeed, we saw above that \(\Delta \Delta_n(x) = 0\) played an important role in the calculations. So let this be a rule of thumb for future problems: try to define events using harmonic functions in h-transforms!</p>
<p>For those interested in random matrix theory, Terence Tao wrote some very nice <a href="https://terrytao.wordpress.com/2010/01/18/254a-notes-3b-brownian-motion-and-dyson-brownian-motion/">lecture notes</a> on the applications of Dyson Brownian motion, which can also be found in his book [Ta12]. In particular, we can use Dyson Brownian motion to derive the eigenvalue density of a Gaussian unitary ensemble (GUE) matrix, commonly known as the <em>Ginibre formula</em>:</p>
\[\rho(\lambda) = \frac{1}{(2\pi)^{n/2} 1! \cdots (n-1)!} e^{ -|\lambda|^2/2 } |\Delta_n(\lambda)|^2 \,,\]
<p>which can then be used to derive the famous <a href="https://mathworld.wolfram.com/WignersSemicircleLaw.html">Wigner’s semicircle law</a> in the limit \(n \to \infty\).</p>
<h2 id="references">References</h2>
<!-- - \[Bak08\] Y. Bakhtin, "Exit asymptotics for small diffusion about an unstable equilibrium." Stochastic processes and their applications 118.5 (2008): 839-851. -->
<ul>
<li>[Bl10] Bloemendal, Alex. “Doob’s h-transform: theory and examples.” Lecture notes (2010).</li>
<li>[Dy62] Dyson, Freeman J. “A Brownian‐motion model for the eigenvalues of a random matrix.” Journal of Mathematical Physics 3.6 (1962): 1191-1198.</li>
<li>[Gr99] Grabiner, David J. “Brownian motion in a Weyl chamber, non-colliding particles, and random matrices.” Annales de l’IHP Probabilités et statistiques. Vol. 35. No. 2. 1999.</li>
<li>[Ta12] Tao, Terence. Topics in random matrix theory. Vol. 132. American Mathematical Soc., 2012.</li>
</ul>
On Escape Time, Lyapunov Function, Poincaré Inequality, and the KLS Conjecture Beyond Convexity2021-01-13T00:00:00-05:002021-01-13T00:00:00-05:00https://mufan-li.github.io/lyapunov_escape<p>Nobody has time to read an <a href="https://arxiv.org/abs/2010.11176">80 page paper</a>
[LE20].
Therefore I doubt most readers realized the manifold Langevin algorithm paper
actually contains a novel technique for establishing functional inequalities.
And I really doubt anyone had time to interpret the intuitive consequences
of such results for perturbed gradient descent,
let alone to extend the Kannan-Lovász-Simonovits (KLS)
conjecture [LV18],
which is what brings me to write this blog post.</p>
<h2 id="background">Background</h2>
<p>Let us start with a potential function \(F : \mathbb{R}^d \to \mathbb{R}\),
an inverse temperature parameter \(\beta > 0\),
and we define the <strong>Gibbs density</strong> as</p>
\[\nu(x) := \frac{1}{Z} e^{ -\beta F(x) } \,,\]
<p>where \(Z = \int e^{-\beta F(x)} dx\) is the normalizing constant.</p>
<p>We say \(\nu\) satisfies the <strong>Poincaré inequality</strong>
with constant \(\kappa > 0\),
denoted \(\text{PI}(\kappa)\), if</p>
\[\int f^2 \, d\nu - \left( \int f \, d\nu \right)^2
\leq
\frac{1}{\kappa \, \beta}
\int | \nabla f |^2 \, d\nu \,,\]
<p>for all \(f \in C^1(\mathbb{R}^d) \cap L^2(\nu)\).
Note we adopt the convention of [BGL13] which adjusts
the right hand side by a factor of \(\beta\),
and the two conventions agree when \(\beta = 1\).</p>
<p>\(\text{PI}(\kappa)\) is well known to be equivalent to
exponential convergence of Langevin diffusion
[BGL13, Theorem 4.2.5],
quadratic-linear cost transport inequality [Vil08, Theorem 22.25],
and Cheeger’s isoperimetric inequality [LV18, Theorem 11].
Furthermore, \(\text{PI}(\kappa)\) also
implies dimension free exponential concentration [Vil08, Theorem 22.32],
and serves as a key tool for deriving existence, uniqueness, and smoothness
results in partial differential equations [Eva10].
Therefore a tight lower bound for the Poincaré constant
is widely desired for a large range of applications.</p>
<h3 id="interpreting-the-poincaré-constant">Interpreting the Poincaré Constant</h3>
<p>Firstly, we will recall the (overdamped) Langevin diffusion
is defined by the following stochastic differential equation (SDE)</p>
\[dX_t = \underbrace{ - \nabla F(X_t) \, dt }_{ \text{gradient flow}_{} }
+ \underbrace{ \sqrt{ 2/\beta } \, dW_t
}_{ \text{perturbation}_{} }\,,\]
<p>where \(\{W_t\}_{ t_{} \geq 0 }\) is a standard \(d\)-dimensional
Brownian motion.
Observe that when \(\beta\) becomes large,
the Brownian motion term becomes very small.
Therefore Langevin diffusion can be interpreted as
a perturbed gradient flow.</p>
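<p>To make the “perturbed gradient flow” picture concrete, here is a minimal sketch of an Euler-Maruyama discretization of the SDE above, which is exactly gradient descent with Gaussian noise added at each step. The step size, the quadratic test potential in the usage example, and the function name <code>perturbed_gradient_descent</code> are assumptions for illustration, not the algorithm analysed in [LE20].</p>

```python
import math
import random


def perturbed_gradient_descent(grad_F, x0, beta, step, n_steps, seed=0):
    """Euler-Maruyama discretization of dX = -grad F(X) dt + sqrt(2/beta) dW.

    Each iteration is a gradient descent step plus independent Gaussian
    noise with standard deviation sqrt(2 * step / beta) per coordinate.
    """
    rng = random.Random(seed)
    x = list(x0)
    noise_scale = math.sqrt(2.0 * step / beta)
    for _ in range(n_steps):
        g = grad_F(x)
        x = [
            xi - step * gi + noise_scale * rng.gauss(0.0, 1.0)
            for xi, gi in zip(x, g)
        ]
    return x


if __name__ == "__main__":
    # Illustrative potential F(x) = |x|^2 / 2, whose gradient is x itself.
    final = perturbed_gradient_descent(
        grad_F=lambda x: list(x), x0=[1.0, -2.0], beta=1e8, step=0.1, n_steps=500
    )
    print(final)
```

<p>With a large \(\beta\) the noise is tiny and the iterates behave almost like plain gradient descent, ending up very close to the minimizer at the origin in this example.</p>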
<p>Since the Gibbs density \(\nu\) concentrates around the global minimum
of \(F\) as \(\beta \to \infty\),
a dimension and temperature free Poincaré constant
implies fast convergence of Langevin diffusion to a neighbourhood of the global minimum
when \(\beta\) is large.
Therefore it is no surprise that</p>
<ol>
<li>a strongly convex \(F\) implies \(\nu\) has
a dimension and temperature free Poincaré constant,
more famously known as the Bakry-Émery criterion
[BGL13, Proposition 4.8.1];</li>
<li>a non-convex \(F\) with multiple isolated local minima leads to
a Poincaré constant with exponentially poor dependence on \(\beta\),
more famously known as the Eyring-Kramers formula [Ber11].</li>
</ol>
<p>In other words, strongly convex functions are easy to optimize,
general non-convex functions are hard, what else is new?
<strong>What is new</strong> are the cases in between:
non-strongly convex functions with a unique minimum.
<!-- -->
However, even when weakening to \(F\) being only convex,
this problem remains open -
this is an equivalent formulation of the KLS conjecture [LV18].</p>
<hr />
<p><strong>Conjecture (Kannan-Lovász-Simonovits, Poincaré version)</strong>
There exists a universal constant \(\kappa > 0\),
such that for all positive integers \(d\),
and all convex functions \(F: \mathbb{R}^d \to \mathbb{R}\)
such that the Gibbs density \(\nu(x) = \frac{1}{Z} e^{-F(x)}\)
(note \(\beta = 1\) here)
has zero mean and identity covariance matrix,
we have that \(\nu\) satisfies \(\text{PI}(\kappa)\).</p>
<hr />
<p>I should briefly mention that a recent arXiv preprint [Che20]
proposed a result equivalent to an almost constant lower bound
on the Poincaré constant of order \(O( d^{-o(1)} )\),
which in the limit of \(d\to\infty\) converges to \(0\)
slower than \(d^{-r}\) for all \(r>0\).
For the sake of staying on topic,
we will leave this extremely interesting subject for a future post.
<!-- As of the writing of this post,
\[Che20\] still awaits confirmation by the community.
An earlier draft of \[LV16\] that claimed a similar result,
and it was eventually corrected to a weaker
(but still the best known) lower bound.
Regardless, we will not dive deeper in this direction.
--></p>
<p>Furthermore, it is already known that
an adaptive perturbation of gradient descent
escapes saddle points at a dimension free rate [JGN+17].
This hints at the possibility of establishing
a dimension and temperature free Poincaré inequality
for even non-convex potential functions!
Indeed, we will discuss this next.</p>
<h2 id="non-convex-poincaré">Non-Convex Poincaré</h2>
<p>The main result of this blog post is actually
an intermediate result of [LE20, Proposition 9.11].
While the original proposition is proved for
a product manifold of spheres,
it can be easily adapted to \(\mathbb{R}^d\)
with a containment type condition,
see for example [Vil06, Theorem A.1]
and [MS14, Assumption 1.4].
Since as of writing this post,
there is no complete proof of this adaptation,
we will state it as a claim.</p>
<hr />
<p><strong>Claim (Adapting [LE20, Proposition 9.11])</strong>
Suppose \(F:\mathbb{R}^d \to \mathbb{R}\)
has a unique local (and therefore global) minimum,
and all saddle points are strict,
i.e. the minimum eigenvalue
\(\lambda_{\text{min}}( \nabla^2 F ) < - \lambda\)
for some constant \(\lambda>0\) at saddle points.
Then under appropriate containment conditions,
and choosing \(\beta\) sufficiently large,
we have that \(\nu(x) = \frac{1}{Z} e^{-\beta F(x)}\)
satisfies \(\text{PI}(\kappa)\)
for a constant \(\kappa>0\) independent of \(\beta, d\).</p>
<hr />
<p>As we discussed earlier,
perturbation helps gradient descent escape saddle points,
therefore this result is very intuitive.
However, deriving a quantitative bound is completely non-trivial.
We remark that for non-convex potentials,
most approaches to establishing a Poincaré inequality
will yield exponentially poor dependence on both \(\beta\) and \(d\).
To our best knowledge, only [MS14] has established a bound
independent of \(\beta\), and (by our calculations) exponential in \(d\).</p>
<!-- Of course, there is the condition on $$\beta$$ sufficiently large,
but we have good intuitions (later in this post) to believe
this is a rather technical constraint.
More precisely, we speculate that with a sharpened analysis,
we can show that for all $$\beta \geq 1$$,
$$\nu(x)$$ satisfies $$\text{PI}(\kappa)$$
with $$\kappa$$ independent of $$\beta, d$$! -->
<h3 id="implications">Implications</h3>
<p>Let us start by emphasizing this result <strong>does not</strong>
imply the KLS conjecture.
Indeed, the conditions of the conjecture do not
require \(F\) to have a unique minimum.
Hence \(F\) need not be strongly convex around the minimum,
whereas strong convexity near the minimum is a key requirement of the proof technique.</p>
<p>It does, however, imply that the KLS conjecture
can be extended beyond convex functions.
More precisely, if a potential \(F\) satisfies
a dimension and temperature free Poincaré inequality,
we can add saddle points to \(F\) without losing this property.
<!-- -->
Therefore, we can replace convex potentials \(F\) in the statement
of the KLS conjecture with modifications of convex potentials \(F\)
with strict saddle points.
In fact, this naturally leads us to further conjecture
the strictness of saddle points can be relaxed as well,
since it’s the parallel of relaxing strong convexity to convexity
for saddle points.</p>
<p>Additionally, notice that we can take \(\beta\)
to be as large as we want.
This implies that the amount of randomness added to gradient flow
does not affect its ability to escape saddle points.
In other words, any tiny amount of perturbation will
help escape saddle points.
Furthermore, we emphasize this implies
<strong>a discretization of Langevin diffusion</strong>,
i.e. a perturbed gradient descent,
will also escape strict saddle points -
this was the main result of [LE20].
This is in sharp contrast with (deterministic) gradient descent
taking up to exponential time to escape a saddle point [DJL+17],
implying <strong>the addition of noise, even arbitrarily small,
fundamentally changes the behaviour of gradient descent</strong>.</p>
<!-- We also note the continuous time result is not new,
as there have been asymptotic analysis of saddle point escape time
in the limit of $$\beta \to \infty$$ \[Bak08\],
this is the first non-asymptotic characterization to our best knowledge.
-->
<h2 id="proof-sketch---step-1-lyapunov-function-away-from-saddle-points">Proof Sketch - Step 1: Lyapunov Function Away From Saddle Points</h2>
<p>Now to my favourite part of this post,
where we actually describe the proof techniques.
We will see that despite the lengthy calculations in [LE20],
the proof idea is quite straightforward to explain.
We start by stating a Lyapunov criterion for
the Poincaré inequality.</p>
<hr />
<p><strong>Theorem [BBCG08, Theorem 1.4 Adapted]</strong>
Let \(U \subset \mathbb{R}^d\) be such that
\(\nu\) restricted to \(U\) satisfies \(\text{PI}(\kappa_U)\).
Suppose there exist constants \(\theta > 0, b \geq 0\)
and a function \(V \in C^2(\mathbb{R}^d)\) such that
\(V \geq 1\) and</p>
\[LV := \langle -\nabla F, \nabla V \rangle
+ \frac{1}{\beta} \Delta V
\leq
-\theta \, V + b \, \mathbf{1}_{U} \,.\]
<p>Then \(\nu\) satisfies \(\text{PI}(\kappa)\) with constant</p>
\[\kappa = \frac{ \theta }{ 1 + b / \kappa_U} \,.\]
<hr />
<p>Intuitively, we can think of the Lyapunov function \(V\)
as an energy measure of the Langevin diffusion \(\{X_t\}_{t_{} \geq 0}\),
\(LV\) as the time evolution of \(V\) via
<a href="https://en.wikipedia.org/wiki/It%C3%B4%27s_lemma">Itô’s Lemma</a>,
and the Lyapunov condition (inequality)
describes the rate of energy dissipation over time.
<!-- -->
This energy \(V\) will decrease as \(X_t\) gets closer to \(U\),
hence \(U\) behaves like an attractor.
Once \(X_t\) reaches \(U\), the process begins to “mix”
due to the Poincaré inequality on \(U\).
<!-- -->
In our case, we will choose \(U\) to be a small neighbourhood
of the global minimum,
and use the strong convexity (Bakry-Émery criterion)
to get a Poincaré constant \(\kappa_U\).
<!-- -->
For those that find this description familiar,
indeed this is the diffusion equivalent of the drift
and minorization conditions for Markov chain mixing [MT09].</p>
<p>Similar to other Lyapunov function based methods
in differential equations,
constructing such a function \(V\) is the main difficulty.
[MS14] observed that when \(F\) has only strict saddle points,
the choice of \(V = \exp\left( \frac{\beta}{2} F \right)\)
works very nicely away from saddle points.
In fact, we can directly compute the Lyapunov condition</p>
\[\frac{LV}{V} = \frac{1}{2} \Delta F - \frac{\beta}{4} |\nabla F|^2 \,,\]
<p>and observe that as long as \(|\nabla F|\) is bounded away from zero,
we can choose \(\beta\) to be large,
hence forcing \(\frac{LV}{V}\) to be negative.
In other words, excluding small neighbourhoods around saddle points,
\(\nu\) satisfies a Poincaré inequality.</p>
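<p>For completeness, the identity for \(\frac{LV}{V}\) can be verified directly from the definition of \(L\) in the Lyapunov criterion. With \(V = \exp\left( \frac{\beta}{2} F \right)\), the chain rule gives</p>
\[\begin{aligned}
\nabla V &= \frac{\beta}{2} \, V \, \nabla F \,,
\\
\Delta V &= \frac{\beta}{2} \, V \, \Delta F + \frac{\beta^2}{4} \, V \, |\nabla F|^2 \,,
\end{aligned}\]
<p>and therefore</p>
\[\begin{aligned}
\frac{LV}{V}
&= - \frac{\beta}{2} |\nabla F|^2
+ \frac{1}{\beta} \left( \frac{\beta}{2} \Delta F + \frac{\beta^2}{4} |\nabla F|^2 \right)
\\
&= \frac{1}{2} \Delta F - \frac{\beta}{4} |\nabla F|^2 \,.
\end{aligned}\]
<p>In particular, on any region where \(|\nabla F| \geq g > 0\) and \(\Delta F \leq M\) (here \(g\) and \(M\) are placeholder constants for illustration), we get \(\frac{LV}{V} \leq \frac{M}{2} - \frac{\beta g^2}{4}\), which is negative once \(\beta > 2M / g^2\).</p>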
<p>The precise version of this result can be found in
[LE20, Lemma 9.10].
In particular, we can compute this constant
to be dimension and temperature free.</p>
<h2 id="proof-sketch---step-2-an-escape-time-construction">Proof Sketch - Step 2: An Escape Time Construction</h2>
<p>Now that we have narrowed the problem down to constructing
a Lyapunov function for the neighbourhoods around saddle points,
we observe that replacing the inequality in the Lyapunov condition
with an equality gives us the Poisson equation.
Consequently, we have a stochastic representation of
the solution in terms of the escape time.</p>
<hr />
<p><strong>Theorem [BdH16, Theorem 7.15 Adapted] [LE20, Corollary 9.3]</strong>
Let \(B \subset \mathbb{R}^d\), \(\{X_t\}_{t_{}\geq 0}\)
be the Langevin diffusion,
and \(\tau_{B^c}\) be the first escape time of \(X_t\) from \(B\).
<!-- -->
Suppose there exists a constant \(\theta>0\) such that</p>
\[V(x)
:= \mathbb{E} [ \, \exp( \theta \, \tau_{B^c} ) \,
| \, X_0 = x \, ]
< \infty \,, \quad \forall x \in B \,,\]
<p>then \(V\) is the unique solution to the Poisson equation</p>
\[\begin{split}
LV &= - \theta \, V \,, \quad & x \in B \,, \\
V &= 1 \,, \quad & x \in \partial B \,.
\end{split}\]
<hr />
<p>Readers with a Markov chain background may recognize
this escape time based condition to be equivalent to
drift and minorization [DMPS18, Theorem 14.1.3].
In fact, this method was inspired by the nice connection
drawn between diffusions and Markov chains.</p>
<p>Additionally, readers familiar with concentration inequalities
may recognize the theorem’s condition is known as
exponential integrability,
and it’s one of the equivalent characterizations
for sub-exponential random variables.
Indeed, we will actually use a slightly easier equivalent form
for calculations.</p>
<hr />
<p><strong>Theorem [Wai19, Theorem 2.13]</strong>
For a zero mean random variable \(\tau\), the following are equivalent:</p>
<ol>
<li>there exists a constant \(\theta > 0\) such that
\(\mathbb{E} \, e^{ \theta \tau } < \infty\),</li>
<li>there exist constants \(c,\theta > 0\) such that
\(\mathbb{P} [ \, |\tau| > t \, ] \leq c e^{ -2\theta t }\)
for all \(t \geq 0\).</li>
</ol>
<hr />
<p>At this point, it’s then sufficient to establish
an exponentially decaying tail bound for \(\tau_{B^c}\).
To this goal, we will make several observations:</p>
<ol>
<li>When \(F\) is sufficiently smooth,
\(F\) can be locally approximated by a quadratic function.</li>
<li>We only need to consider an escape via
a direction corresponding to a negative eigenvalue
of \(\nabla^2 F\).</li>
<li>When restricted within this direction only,
\(F\) near a saddle point can be viewed as a local maximum.</li>
</ol>
<p>To illustrate this point clearly,
let us consider the quadratic function \(f(x,y) = x^2 - \frac{\lambda}{2} y^2\)
with a saddle point at \((x,y) = (0,0)\).
For the Langevin diffusion to escape a neighbourhood of radius \(r>0\),
it’s sufficient that the \(y\)-component exceeds \(r\) in absolute value.
We can therefore restrict \(f\) to its \(y\)-component only,
which makes \(y=0\) a local maximum.
Hence it’s sufficient to study the Langevin diffusion
escaping a one-dimensional local maximum, i.e.</p>
\[dX_t = \lambda X_t \, dt + \sqrt{ 2/\beta } \, dW_t \,,\]
<p>where \(-\lambda\) upper bounds the smallest eigenvalue of
\(\nabla^2 F\) at saddle points.</p>
<p>We observe that this SDE is the “negative” Ornstein-Uhlenbeck process,
and it has a closed form solution</p>
\[X_t = X_0 e^{\lambda t}
+ \sqrt{2/\beta} \, \int_0^t e^{\lambda(t-s)} dW_s \,,\]
<p>which corresponds to
\(X_t \sim N( X_0 e^{\lambda t} \,,
\frac{1}{\lambda \beta}(e^{2\lambda t} - 1) )\).
Finally, plugging in the Gaussian density
and a few calculations later,
we get the desired result of</p>
\[\mathbb{P} [ \, |\tau_{B^c}| \geq t \, ]
\leq \mathbb{P} [ \, X_t \in B \, ]
\leq c e^{ -\lambda t } \,,\]
<p>where the constant \(c\) does not depend on \(t\).
In other words, this escape time tail bound implies that
\(V(x)\) is a valid Lyapunov function,
and hence implies a Poincaré inequality.</p>
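<p>To fill in the “few calculations” in the one-dimensional setting, here is a rough sketch assuming \(B = (-r, r)\) (the radius \(r\) and the resulting constants are for illustration only and are not taken from [LE20]). The variance of \(X_t\) follows from the Itô isometry,</p>
\[\sigma_t^2 := \frac{2}{\beta} \int_0^t e^{2\lambda(t-s)} \, ds
= \frac{ e^{2\lambda t} - 1 }{ \lambda \beta } \,,\]
<p>and since a Gaussian density with variance \(\sigma_t^2\) is bounded by \((2\pi\sigma_t^2)^{-1/2}\) regardless of its mean, we have</p>
\[\mathbb{P} [ \, |X_t| < r \, ]
\leq \frac{2r}{ \sqrt{ 2\pi \sigma_t^2 } }
\leq 2r \sqrt{ \frac{\lambda \beta}{\pi} } \, e^{-\lambda t} \,,
\quad \text{ for all } t \geq \frac{\log 2}{2\lambda} \,,\]
<p>where the last step uses \(e^{2\lambda t} - 1 \geq \frac{1}{2} e^{2\lambda t}\) for such \(t\). For smaller \(t\), the trivial bound \(\mathbb{P} [ \, |X_t| < r \, ] \leq 1 \leq \sqrt{2} \, e^{-\lambda t}\) suffices. In this crude version we can therefore take \(c = \max\left( \sqrt{2}, \, 2r \sqrt{ \lambda\beta / \pi } \right)\), which depends on \(r, \lambda, \beta\) but not on \(t\) or \(X_0\).</p>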
<h2 id="final-thoughts">Final Thoughts</h2>
<p>Quite a few technical details were swept under the rug
to simplify the proof sketch, as the reader might expect.
Probably the most significant is the approximation
of \(F\) by a quadratic function -
it is actually not very straightforward to connect
an approximation bound to an escape time bound.</p>
<p>At the same time, the requirement of \(\beta\) to be
sufficiently large is quite unsatisfying.
Intuitively, why would adding noise hurt the mixing of
a Markov process?
It feels to me that this condition is merely a technical constraint,
and a more careful analysis could sharpen or remove this condition.
Hopefully the readers will have more thoughts and ideas than I do.</p>
<p>Thanks for reading up to this point,
and I wish everyone a happy new year!</p>
<h2 id="references">References</h2>
<!-- - \[Bak08\] Y. Bakhtin, "Exit asymptotics for small diffusion about an unstable equilibrium." Stochastic processes and their applications 118.5 (2008): 839-851. -->
<ul>
<li>[BBCG08] Dominique Bakry, Franck Barthe, Patrick Cattiaux, and Arnaud Guillin, A simple proof of the Poincaré inequality for a large class of probability measures, Electronic Communications in Probability 13 (2008), 60–66.</li>
<li>[Ber11] N. Berglund, Kramers’ Law: Validity, Derivations and Generalisations. arXiv preprint arXiv:1106.5799 (2011).</li>
<li>[BGL13] D. Bakry, I. Gentil, and M. LeDoux, Analysis and Geometry of Markov Diffusion Operators, Springer (2013).</li>
<li>[Che20] Y. Chen, An Almost Constant Lower Bound of the Isoperimetric Coefficient in the KLS Conjecture, arXiv preprint arXiv:2011.13661 (2020).</li>
<li>[DJL+17] S. S. Du, C. Jin, J. D. Lee, M. I. Jordan, A. Singh, B. Poczos, Gradient descent can take exponential time to escape saddle points. In Advances in neural information processing systems, pp. 1067-1077 (2017).</li>
<li>[DMPS18] Randal Douc, Eric Moulines, Pierre Priouret, and Philippe Soulier, Markov chains, Springer, 2018.</li>
<li>[Eva10] L.C. Evans, Partial differential equations, Graduate studies in mathematics, American Mathematical Society (2010).</li>
<li>[JGN+17] C. Jin, R. Ge, P. Netrapalli, S. M. Kakade, & M.I. Jordan, How to escape saddle points efficiently, arXiv preprint arXiv:1703.00887 (2017).</li>
<li>[LE20] M.B. Li, M.A. Erdogdu, Riemannian Langevin Algorithm for Solving Semidefinite Programs, arXiv preprint arXiv:2010.11176 (2020).
<!-- - \[LV16\] Y.T. Lee, and S.S. Vempala, Eldan's Stochastic Localization and the KLS Conjecture: Isoperimetry, Concentration and Mixing, arXiv preprint arXiv:1612.01507 (2016). --></li>
<li>[LV18] Y.T. Lee, and S.S. Vempala, The Kannan-Lovász-Simonovits Conjecture, arXiv preprint arXiv:1807.03465 (2018).</li>
<li>[MS14] Georg Menz and André Schlichting, Poincaré and Logarithmic Sobolev Inequalities by Decomposition of the Energy Landscape, The Annals of Probability 42 (2014), no. 5, 1809–1884.</li>
<li>[MT09] Sean Meyn and Richard L. Tweedie, Markov chains and stochastic stability, 2nd ed., Cambridge University Press, USA, 2009.</li>
<li>[Vil06] C. Villani, Hypocoercivity, arXiv preprint math/0609050 (2006).</li>
<li>[Vil08] C. Villani, Optimal Transport: Old and New, Grundlehren der Mathematischen Wissenschaften, Springer Berlin Heidelberg (2008).</li>
<li>[Wai19] Martin J Wainwright, High-dimensional statistics: A non-asymptotic viewpoint, vol. 48, Cambridge University Press, 2019.</li>
</ul>
The Auffinger-Chen Representation2019-02-13T00:00:00-05:002019-02-13T00:00:00-05:00https://mufan-li.github.io/auffinger_chen<p>Equivalent representation results contribute not only a connection between different concepts, but also a new set of proof techniques. Indeed, <a href="https://en.wikipedia.org/wiki/Stochastic_calculus">stochastic analysis</a> has offered a number of alternative proofs to many problems. Occasionally the proof can simplify drastically. In this post, we will discuss a particularly elegant application by Auffinger and Chen (2015), for an otherwise very difficult problem in spin glass.</p>
<h2 id="the-problem-statement">The Problem Statement</h2>
<p>For the sake of writing a self-contained blog post, we will not attempt to provide a description of spin glass models. Instead, we will state the problem in the most mathematically interesting form, without explaining where the quantities came from.</p>
<p>Let \(\xi:[0,1]\to \mathbb{R}\) be twice differentiable and strictly increasing and strictly convex (i.e. \(\xi', \xi'' > 0\)), also let \(\zeta:[0,1] \to [0,1]\) be a cumulative distribution function (CDF). We will consider the Parisi partial differential equation (PDE) defined as follows</p>
\[\begin{cases}
\partial_t \Phi = \frac{ - \xi''(t) }{2} \left[
\partial_{xx} \Phi +
\zeta(t) \left( \partial_x \Phi \right)^2
\right], \\
\Phi(1,x) = \log \cosh(x) \,,
\end{cases}\]
<p>where, at points where \(\zeta(t)\) is discontinuous, the time derivative is understood as the right limit.</p>
<p>It is well known that we can solve this PDE backwards in time using a <a href="https://en.wikipedia.org/wiki/Burgers%27_equation#Solution_of_viscous_Burgers'_equation">Hopf-Cole transformation</a>; in fact, we will provide a sketch in <a href="#sketch-of-strict-convexity">a later section</a>. This allows us to state an optimization objective as follows:</p>
\[\inf_{\zeta} \Phi_\zeta(0,x),\]
<p>where we are minimizing over the set of all CDFs on \([0,1]\) for each \(x \in \mathbb{R}\). Finally we can state the question as follows:</p>
<p><strong>Question</strong> Does there exist a unique minimizer to the optimization problem \(\inf_{\zeta} \Phi_\zeta(0,x)\) for each \(x\in\mathbb{R}\)?</p>
<p>The main difficulty comes from the unclear dependence on \(\zeta\), even if we can write down a closed form solution to the Parisi PDE. At the very least, it would be extremely unpleasant and tedious to work with. Additionally, we remark that the problem is already stated in a simplified form, as opposed to the original framing in spin-glass.</p>
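<p>To make the objects concrete, here is a worked special case (an illustration added for orientation, not part of the original argument). Take \(\zeta \equiv 1\) on \([0,1]\), i.e. the CDF of a point mass at \(0\). Then \(\phi := e^{\Phi}\) turns the Parisi PDE into a backward heat equation, which a Gaussian average solves explicitly:</p>
\[\begin{split}
\partial_t \phi &= - \frac{\xi''(t)}{2} \, \partial_{xx} \phi,
	\quad \phi(1,x) = \cosh(x), \\
\phi(t,x) &= \mathbb{E} \cosh\left( x + \sqrt{\xi'(1) - \xi'(t)} \, g \right)
	= \cosh(x) \, e^{(\xi'(1) - \xi'(t))/2},
	\quad g \sim N(0,1), \\
\Phi(t,x) &= \log \cosh(x) + \frac{\xi'(1) - \xi'(t)}{2}.
\end{split}\]
<p>One can check directly that \(\partial_t \Phi = -\xi''(t)/2\) and \(\partial_{xx} \Phi + (\partial_x \Phi)^2 = \operatorname{sech}^2(x) + \tanh^2(x) = 1\), so the PDE holds. For a general \(\zeta\) no such one-line formula is available, which is the difficulty described above.</p>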
<p>Before we jump into the main results, we observe that existence of a minimizer is straightforward to prove. Since we are restricted to the domain \([0,1]\), any sequence of probability measures is <a href="https://en.wikipedia.org/wiki/Tightness_of_measures">tight</a>. It is then sufficient to consider a minimizing sequence of probability measures \(\{\zeta_n\}\), i.e. one with \(\Phi_{\zeta_n}(0,x) \to \inf_\zeta \Phi_\zeta(0,x)\). Tightness implies there exists a converging subsequence such that \(\zeta_{n_k} \to \zeta^*\) weakly, and by the continuity of \(\Phi_\zeta(0,x)\) in \(\zeta\) (see the Lipschitz lemma in the next section), \(\zeta^*\) is a minimizer.</p>
<h2 id="the-auffinger-chen-representation">The Auffinger-Chen Representation</h2>
<p>To complete the proof, it is sufficient to show \(\Phi(0,x)\) is strictly convex in \(\zeta\). In this section, we will use a stochastic representation to show convexity, which is the main difficulty of the problem. Readers unfamiliar with stochastic analysis can find a brief introduction in a <a href="https://mufan-li.github.io/stone_ito/">previous blog post</a>. In particular, we will use <a href="https://en.wikipedia.org/wiki/It%C3%B4%27s_lemma">Itô’s Lemma</a> in the upcoming proofs.</p>
<p>We start by defining \(B_t := W_{\xi'(t)}\), where \(\{W_t\}\) is a standard Brownian motion. Let \(\{\mathcal{F}_t\}_{t\geq 0}\) be \(\{W_t\}\)’s canonical filtration, and then we define a collection of processes</p>
\[\mathcal{D} := \left\{ (u_t)_{0 \leq t \leq 1}
: u_t \text{ is adapted to } \mathcal{F}_t,
|u_t| \leq 1
\right\}.\]
<p>For simplicity of notation, we will write \(\sigma(t) = \sqrt{\xi''(t)}\) for this section. At this point we will state the main result.</p>
<hr />
<p><strong>Theorem (Auffinger-Chen Representation)</strong>
For all \(\zeta\) a probability distribution on \([0,1]\), we have the following</p>
\[\begin{split}
\Phi(0,x) = \max_{u \in \mathcal{D}} \bigg[
\mathbb{E} & \Phi\left(1,
x + \int_0^1 \sigma^2(s) \, \zeta(s) \, u_s \, ds
+ \int_0^1 \sigma(s) \, dW_s
\right) \\
&- \frac{1}{2} \int_0^1 \sigma^2(s) \, \zeta(s) \,
\mathbb{E} u_s^2 \, ds
\bigg].
\end{split}\]
<p>In particular, the maximizer is unique and is given by \(u_s = \partial_x \Phi(s, x + X_s)\), where \(X_s\) is the strong solution of the following stochastic differential equation (SDE)</p>
\[dX_s = \sigma^2(s) \, \zeta(s) \, \partial_x \Phi(s, x + X_s) \, ds
+ \sigma(s) \, dW_s,
\quad X_0 = 0.\]
<hr />
<p><strong>Remark</strong> Before we begin the proof, we observe that the convexity of \(\Phi(0,x)\) in \(\zeta\) <strong>follows directly from this representation</strong>. First, for each fixed \(u\), both integral terms containing \(\zeta\) are linear in \(\zeta\). Since \(\Phi(1,x) = \log \cosh (x)\) is convex in \(x\), the \(\Phi\) term is convex in \(\zeta\). Next, the expectation of a sum of two convex functions remains convex. Finally, a maximum (or supremum) over convex functions remains convex, proving the desired convexity result!</p>
<p>Before we start, we will state several technical (but not difficult to prove) Lemmas. To guarantee a <a href="https://en.wikipedia.org/wiki/Stochastic_differential_equation#Existence_and_uniqueness_of_solutions">strong solution of the SDE</a>, it is sufficient to have \(\partial_x \Phi(s, x)\) be Lipschitz in \(x\). We will omit the proof of these results as they are not important to the main goal of this blog post. Instead we will state the following Lemma containing the desired estimates.</p>
<hr />
<p><strong>Lemma (Derivative Estimates)</strong>
For all \(\zeta\) probability distributions on \([0,1]\), we have that</p>
\[|\partial_x \Phi(t, x)| \leq 1,
|\partial_{xx} \Phi(t,x)| \leq 1.\]
<hr />
<p>Another important result, whose proof we also omit, is the continuity of \(\Phi\) in \(\zeta\).</p>
<hr />
<p><strong>Lemma (Lipschitz in \(L^1\))</strong>
For any discrete distributions \(\zeta_1, \zeta_2\),
and for all \(k \in \mathbb{N}\),
we have that</p>
\[\begin{split}
\left| \Phi_{\zeta_1} - \Phi_{\zeta_2} \right|
&\leq \xi''(1) \int_0^1 |\zeta_1(t) - \zeta_2(t)| dt, \\
\left| \partial_x^k \Phi_{\zeta_1}(t,x) -
\partial_x^k \Phi_{\zeta_2}(t,x)
\right|
&\leq c_k \, \xi''(1) \int_0^1 |\zeta_1(t) - \zeta_2(t)| dt.
\end{split}\]
<hr />
<p>Since we can approximate any distribution in \(L^1\) by discrete distributions, we can extend the definition of \(\mathcal{P}(\cdot)\) and \(\Phi(t,x)\) to all distributions by continuity. Therefore it is sufficient to prove the result for only finitely supported distributions.</p>
<p><em>proof (of the Auffinger-Chen representation):</em>
The proof will be a straightforward application of Itô’s Lemma,
and the results follow almost directly from invoking the Parisi PDE.</p>
<p>We start with discrete \(\zeta\), i.e. \(\zeta\) is a piecewise constant function. Let \(u \in \mathcal{D}\), and define</p>
\[dX_s := \sigma^2(s) \, \zeta(s) \, u_s \, ds
+ \sigma(s) \, dW_s,
\quad X_0 = 0,\]
<p>and let \(Y_s := \Phi(s, x + X_s)\). Then we observe that</p>
\[X_1 = \int_0^1 \sigma^2(s) \, \zeta(s) \, u_s \, ds
+ \int_0^1 \sigma(s) \, dW_s\]
<p>appears exactly inside the first \(\Phi\) term of the Auffinger-Chen representation.</p>
<p>At this point we adopt concise notation and write
\(\Phi := \Phi(s, x + X_s)\),
and apply Itô’s Lemma to \(Y_s\) to get</p>
\[dY_s = \left[ \partial_s \Phi + \sigma^2(s) \, \zeta(s) \, u_s \,
\partial_x \Phi + \frac{1}{2} \sigma^2(s) \, \partial_{xx} \Phi
\right] ds
+ \sigma(s) \partial_x \Phi \, dW_s.\]
<p>Here we note while the time derivative \(\partial_s \Phi\)
does not exist at finitely many points,
we will eventually only use it in integral form.
Using the Parisi PDE at points of continuity,
we can make the following substitution</p>
\[\partial_s \Phi + \frac{1}{2} \sigma^2(s) \partial_{xx} \Phi
= - \frac{1}{2} \sigma^2(s) \, \zeta(s) \, (\partial_x \Phi)^2.\]
<p>We will make the substitution and complete the square to get</p>
\[\begin{split}
dY_s &= \left[ \sigma^2(s) \, \zeta(s) \, u_s \, \partial_x \Phi
- \frac{1}{2} \sigma^2(s) \, \zeta(s) \, (\partial_x \Phi)^2
\right] ds
+ \sigma(s) \, \partial_x \Phi \, dW_s \\
&= \left[ \frac{1}{2} \sigma^2(s) \, \zeta(s) \, u_s^2
- \frac{1}{2} \sigma^2(s) \, \zeta(s) \,
(u_s - \partial_x \Phi )^2
\right] ds
+ \sigma(s) \, \partial_x \Phi \, dW_s.
\end{split}\]
<p>Next we write this equation as an integral over \([0,1]\)
and take expectations to remove the martingale term, which gives</p>
\[\begin{split}
\mathbb{E} \Phi(1, x + X_1) - \Phi(0,x)
=& \int_0^1 \frac{1}{2} \sigma^2(s) \, \zeta(s) \,
\mathbb{E} u_s^2 \, ds \\
&- \int_0^1 \frac{1}{2} \sigma^2(s) \, \zeta(s) \,
\mathbb{E} (u_s - \partial_x \Phi)^2 ds.
\end{split}\]
<p>Since \(\Phi, \partial_x \Phi\) are continuous in \(\zeta\),
we can extend this equation to all \(\zeta\).
Furthermore, since the second integral is always nonnegative,
we must have the inequality</p>
\[\Phi(0,x) \geq \mathbb{E} \Phi(1, x + X_1)
- \int_0^1 \frac{1}{2} \sigma^2(s) \, \zeta(s) \,
\mathbb{E} u_s^2 \, ds,\]
<p>and the inequality must be strict unless
\(u_s = \partial_x \Phi\) almost surely.</p>
<p>Observe this proves the inequality of the representation.
Since \(|\partial_x \Phi| \leq 1\),
we have \(u_s = \partial_x \Phi \in \mathcal{D}\),
hence achieving the equality in the representation.</p>
\[\tag*{$\Box$}\]
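<p>As a sanity check, the representation can be tested numerically in one case where \(\Phi\) is known in closed form. The sketch below rests on assumptions that are not in the text above: it takes \(\xi(t) = t^2/2\) (so \(\sigma \equiv 1\)) and \(\zeta \equiv 1\), for which substituting \(\phi = e^\Phi\) gives \(\Phi(0,x) = \log\cosh(x) + 1/2\) and \(\partial_x \Phi = \tanh\). It uses a plain Euler-Maruyama discretization and only feedback controls of the form \(u_s = c(x + X_s)\); the function name <code>ac_objective</code> is hypothetical.</p>

```python
import numpy as np

def ac_objective(x, control, n_paths=20000, n_steps=200, seed=0):
    """Monte Carlo estimate of E[log cosh(x + X_1)] - (1/2) int_0^1 E[u_s^2] ds,
    assuming sigma = 1 and zeta = 1, with the feedback control u_s = control(x + X_s)."""
    rng = np.random.default_rng(seed)
    dt = 1.0 / n_steps
    X = np.zeros(n_paths)        # X_0 = 0 on every path
    cost = np.zeros(n_paths)     # running value of (1/2) int u_s^2 ds
    for _ in range(n_steps):
        u = control(x + X)
        cost += 0.5 * u**2 * dt
        # Euler-Maruyama step for dX = u ds + dW
        X += u * dt + np.sqrt(dt) * rng.standard_normal(n_paths)
    return float(np.mean(np.log(np.cosh(x + X)) - cost))

x = 0.3
exact = float(np.log(np.cosh(x)) + 0.5)      # closed form of Phi(0, x) in this special case
optimal = ac_objective(x, np.tanh)           # u = d/dx Phi = tanh, the claimed maximizer
zero_control = ac_objective(x, np.zeros_like)  # u = 0, an admissible but suboptimal control

print(exact, optimal, zero_control)
```

<p>Up to Monte Carlo and discretization error, the optimal control should reproduce the closed form, while the zero control should give a strictly smaller value, in line with the inequality proved above.</p>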
<h2 id="sketch-of-strict-convexity">Sketch of Strict Convexity</h2>
<p>At this point, the author believes the goal of the blog post is already achieved: we have demonstrated the key technique with only very basic manipulations. That being said, to complete the story, we will provide a short sketch on how to prove strict convexity - hence proving there is a unique minimizer of \(\Phi_\zeta(0,x)\).</p>
<p>We once again start with a key technical lemma.</p>
<hr />
<p><strong>Lemma (Strict Convexity in \(x\))</strong>
For all \(\zeta\) a probability distribution on \([0,1]\),
and for all \(s \in [0,1]\), we have</p>
\[\partial_{xx} \Phi(s,x) > 0.\]
<hr />
<p>Here we remind the reader that strict convexity in \(x\) does not directly imply strict convexity in \(\zeta\). We could just take this result for granted, but there is a nice proof using the Hopf-Cole transform and another stochastic representation, so why not?</p>
<p><em>sketch (of Lemma):</em> Since \(\Phi(t,x)\) is continuous in \(\zeta\), we will only consider a discrete \(\zeta\). Then using an appropriate time change and time reversal, we can get a new PDE</p>
\[\partial_t \Phi = \frac{1}{2 \widehat \zeta(t)} \partial_{xx} \Phi
+ \frac{1}{2} (\partial_x \Phi)^2,\]
<p>with <strong>initial conditions</strong> (as opposed to terminal conditions)
\(\Phi(0,x)=\log \cosh(x)\), and \(\widehat \zeta(t) = \zeta(1 - t)\) changed due to time reversal.
To simplify the PDE, we use the Hopf-Cole transformation to substitute
\(\phi = \exp\left( \widehat\zeta(t) \, \Phi \right)\) on an interval where \(\widehat \zeta\) is constant and positive, which leads to the simplified linear PDE</p>
\[\partial_t \phi = \frac{1}{2 \widehat \zeta(t)} \partial_{xx} \phi,\]
<p>with initial conditions \(\phi(0,x) = \exp\left( \widehat \zeta(t) \log \cosh(x) \right) = \cosh(x)^{\widehat \zeta(t)}\). Using another time change, we can also remove the \(\widehat \zeta(t)\) above.</p>
<p>Here we can use any of the reader’s favourite methods: the <a href="https://en.wikipedia.org/wiki/Feynman%E2%80%93Kac_formula">Feynman-Kac Representation</a>, the <a href="https://en.wikipedia.org/wiki/Kolmogorov_backward_equations_(diffusion)">Kolmogorov backward equation</a>, or the <a href="https://en.wikipedia.org/wiki/Heat_kernel">heat kernel</a> to write</p>
\[\phi(t,x) = \mathbb{E} \cosh( x + W_t )^{\widehat \zeta(t)},\]
<p>where \(W_t\) is a standard Brownian motion, and \(\widehat \zeta\) is constant in \([0,t)\). At this point it is sufficient to show strict convexity for this \(t\), since we can piece together the constant intervals later. To this end, we will write</p>
\[\Phi(t,x) = \frac{1}{\widehat \zeta(t)} \log \mathbb{E} \cosh(x + W_t)^{\widehat \zeta(t)},\]
<p>and define</p>
\[\langle f(W_t) \rangle := \frac{
	\mathbb{E} f(W_t) \cosh(x + W_t)^{\widehat \zeta(t)}
	}{
	\mathbb{E} \cosh(x + W_t)^{\widehat \zeta(t)}
	} \, ,\]
<p>where we observe that since \(\cosh(x) > 0\), \(\widehat \zeta(t) \leq 1\) and \(\mathbb{E}\cosh(x + W_t) < \infty\), \(\langle \cdot \rangle\) defines a new probability measure. In particular, we have Jensen’s inequality.</p>
<p>With this we have \(\partial_x \Phi(t,x) = \langle \tanh(x + W_t) \rangle\), and we can take the second derivative of \(\Phi\) to get</p>
\[\begin{split}
	\partial_{xx} \Phi(t,x) &=
	\left\langle 1 - \tanh(x + W_t)^2 \right\rangle
	+ \widehat \zeta(t) \left(
	\left\langle \tanh(x + W_t)^2 \right\rangle
	- \left\langle \tanh(x + W_t) \right\rangle^2
	\right) \\
	&\geq \left\langle 1 - \tanh(x + W_t)^2 \right\rangle \\
	&> 0 \, ,
\end{split}\]
<p>where we used Jensen’s inequality and the fact that \(\tanh(x)^2 < 1\).</p>
<!--
Observe that since $$\widehat \zeta(t) \in [0,1]$$, the second term in the square brackets $$[\cdots]$$ are always greater than 1. Then we can use Jensen's inequality to write
$$ \partial_{xx} \Phi(t,x) \geq
\frac{1}{\widehat \zeta(t)} \mathbb{E} \frac{1}{\widehat \zeta(t)} \log \cosh (x + W_t).
$$
Since $$\log \cosh(x) > 0$$ unless $$x = 0$$, we have the right hand side must be strictly positive, hence implying $$\partial_{xx}\Phi(t,x) > 0$$.
-->
<hr />
<p>Finally we return to strict convexity in \(\zeta\).</p>
<p><em>sketch (of Strict Convexity in \(\zeta\)):</em>
We will start by introducing quantities related to convexity.
Let \(\zeta_1 \neq \zeta_2\),
and let \(\zeta = \lambda \zeta_1 + (1-\lambda) \zeta_2\)
for some \(\lambda \in (0,1)\).
Recall \(\Phi(1,x) = \log \cosh (x)\),
and take the optimal \(u_s = \partial_x \Phi_\zeta(s, x + X_s)\),
where \(X_s\) is defined with respect to \(\zeta\).
Note this \(u_s\) is not necessarily optimal for
\(\zeta_1, \zeta_2\).</p>
<p>Since \(\log\cosh(x)\) is convex, we can write</p>
\[\Phi_\zeta(0,x) \leq \lambda A_1 + (1 - \lambda) A_2,\]
<p>where each \(A_i\) is defined as</p>
\[\begin{split}
A_i := \mathbb{E}& \, \log \cosh \left(
x + \int_0^1 \sigma^2(s) \, \zeta_i(s) \, u_s \, ds
+ \int_0^1 \sigma(s) dW_s
\right) \\
& - \frac{1}{2} \int_0^1 \sigma^2(s) \, \zeta_i(s) \,
\mathbb{E} u_s^2 \, ds.
\end{split}\]
<p>Since \(\log \cosh(x)\) is strictly convex,
the inequality is strict unless</p>
\[\int_0^1 \sigma^2(s) \, \zeta_1(s) \, u_s \, ds
= \int_0^1 \sigma^2(s) \, \zeta_2(s) \, u_s \, ds,\]
<p>almost surely.
Using the Auffinger-Chen representation, we have that
\(A_i \leq \Phi_{\zeta_i}(0,x)\).
Therefore to prove the convexity is strict,
it is sufficient to prove a gap in the first inequality,
for which it suffices to show that</p>
\[Z := \int_0^1 \sigma^2(s) \, (\zeta_1(s) - \zeta_2(s)) \, u_s \, ds\]
<p>has positive variance.
The variance can be computed as</p>
\[\text{Var}(Z) = \int_0^1 \int_0^1 \varphi(s) \, \varphi(t)
\, \text{Cov}(u_s, u_t) \, ds dt,\]
<p>where \(\varphi(s) = \sigma^2(s) \, (\zeta_1(s) - \zeta_2(s))\).</p>
<p>While we omit the technical details,
it’s not hard to believe \(u_s = \partial_x \Phi(s, x + X_s)\)
satisfies the following SDE
(from Itô’s Lemma and differentiating the Parisi PDE)</p>
\[du_s = \sigma(s) \partial_{xx} \Phi(s, x + X_s) dW_s.\]
<p>Observing that \(u_s\) is a martingale, and hence has uncorrelated increments,
we can compute \(\text{Cov}(u_s, u_t)\) as</p>
\[\text{Cov}(u_s, u_t) = \text{Var}(u_{s \wedge t})
= \int_0^{s \wedge t} \sigma^2(v) \mathbb{E}
(\partial_{xx} \Phi(v, x + X_v))^2 dv,\]
<p>where the last step followed from
<a href="https://en.wikipedia.org/wiki/It%C3%B4_isometry">Itô’s Isometry</a>.
Defining \(\tau(s) := \text{Var}(u_s)\),
we can also write \(\text{Cov}(u_s, u_t) = \tau(s) \wedge \tau(t)\).
With a bit of algebra we can derive</p>
\[\text{Var}(Z) = \int_0^1 \left( \int_v^1 \varphi(s) ds \right)^2
\tau'(v) dv.\]
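<p>The algebra can be spelled out as follows (a short derivation using only the definitions above). Since \(u_0\) is deterministic we have \(\tau(0) = 0\), and \(\tau\) is nondecreasing, so \(\tau(s) \wedge \tau(t) = \tau(s \wedge t) = \int_0^1 \mathbf{1}\{v < s\} \, \mathbf{1}\{v < t\} \, \tau'(v) \, dv\). Exchanging the order of integration then gives</p>
\[\begin{split}
\text{Var}(Z)
&= \int_0^1 \int_0^1 \varphi(s) \, \varphi(t)
	\int_0^1 \mathbf{1}\{v < s\} \, \mathbf{1}\{v < t\} \, \tau'(v) \, dv \, ds \, dt \\
&= \int_0^1 \left( \int_0^1 \varphi(s) \, \mathbf{1}\{v < s\} \, ds \right)
	\left( \int_0^1 \varphi(t) \, \mathbf{1}\{v < t\} \, dt \right)
	\tau'(v) \, dv \\
&= \int_0^1 \left( \int_v^1 \varphi(s) \, ds \right)^2 \tau'(v) \, dv.
\end{split}\]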
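<p>Since \(\tau'(v) = \sigma^2(v) \mathbb{E}
	(\partial_{xx} \Phi(v, x + X_v))^2\) and \(\sigma^2 = \xi'' > 0\),
the fact \(\partial_{xx} \Phi > 0\) gives \(\tau'(v) > 0\).
Moreover, \(\zeta_1 \neq \zeta_2\) implies \(v \mapsto \int_v^1 \varphi(s) ds\)
is not identically zero, hence \(\text{Var}(Z) > 0\), which is the desired result.</p>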
<h2 id="final-comments">Final Comments</h2>
<p>Recall the original problem of \(\inf_\zeta \Phi_\zeta(0,x)\).
We have shown that while the dependence of \(\Phi_\zeta\) on \(\zeta\) is unclear,
we are able to prove its convexity easily
using a stochastic representation.
The author would like to point out that
most techniques used here are quite basic,
which is surprising for an originally very difficult problem.</p>
<p>The author would also like to point to a more general
variational stochastic representation by
<a href="https://projecteuclid.org/euclid.aop/1022855876">Boué and Dupuis (1998)</a>,
perhaps more useful for other applications.</p>
<p>Finally the post would not be possible without attending
an excellent graduate course on spin glass taught by Dmitry Panchenko,
where he has done a much better job explaining this topic.
In particular, Dmitry has written an excellent book (Panchenko, 2013) with a bonus chapter covering this topic that can be found <a href="https://drive.google.com/file/d/0B6JeBUquZ5BwRFpLVjdVd3IwV1E/view?usp=drive_open">online</a>.
I would also highly recommend Dmitry’s
<a href="https://sites.google.com/site/panchenkomath/lecture-notes">notes on probability theory</a>,
which have been in general very helpful to the author’s
studies and research.</p>
<h2 id="references">References</h2>
<ul>
<li>Auffinger, A., & Chen, W. K. (2015). The Parisi formula has a unique minimizer. Communications in Mathematical Physics, 335(3), 1429-1444.</li>
<li>Boué, M., & Dupuis, P. (1998). A variational representation for certain functionals of Brownian motion. The Annals of Probability, 26(4), 1641-1659.</li>
<li>Panchenko, D. (2013). The Sherrington-Kirkpatrick model. Springer Science & Business Media.</li>
</ul>
Stone-Weierstrass and an Alternative Proof of Itô’s Lemma
2018-07-15
https://mufan-li.github.io/stone_ito
<p>In a similar sense to
<a href="https://en.wikipedia.org/wiki/Line_integral">line integrals</a>,
<a href="https://en.wikipedia.org/wiki/Stochastic_calculus">stochastic calculus</a>
extends the classical tools to working with
<a href="https://en.wikipedia.org/wiki/Stochastic_process">stochastic processes</a>.
One of the most elegant and useful results
is the change of variable formula
for stochastic integrals,
commonly known as <a href="https://en.wikipedia.org/wiki/It%C3%B4%27s_lemma">Itô’s Lemma</a>
(see
<a href="#an-interesting-story-to-wrap-up">end of this post</a>
for a discussion on Doeblin’s
contribution).
While this lemma is quite easy to use,
its proof usually relies heavily on
technical lemmas,
which makes it difficult to develop intuition,
especially for the first-time reader.</p>
<p>With this motivation in mind,
it was quite pleasant to discover a set of excellent
<a href="http://statslab.cam.ac.uk/~jpm205/teaching/lent2016/lecture_notes.pdf">lecture notes</a>
by Jason Miller (2016),
which contained an alternative proof
built on the idea of
<a href="https://en.wikipedia.org/wiki/Stone%E2%80%93Weierstrass_theorem">Stone-Weierstrass Theorem</a>.
We shall see that not only do we have
a more interpretable proof,
the technique is also generalizable
beyond stochastic calculus.
In particular,
this blog post intends to illustrate
the technique in detail through Itô’s Lemma.</p>
<h2 id="a-brief-background-on-stochastic-calculus">A Brief Background on Stochastic Calculus</h2>
<p>We will introduce (without too much rigour)
some basic definitions and results
to support the proofs in later sections.
The reader need not carefully analyze
the technical details here to understand the proofs to come.
Readers familiar with stochastic calculus may
<a href="#overview-of-the-alternative-approach">
skip to the next section</a>.</p>
<p>First we let
\((\Omega, \mathcal{F}, \{\mathcal{F}_t\}_{t\geq 0}, \mathbb{P})\)
be a
<a href="https://en.wikipedia.org/wiki/Probability_space">probability space</a>
equipped with a
<a href="https://en.wikipedia.org/wiki/Filtration_(probability_theory)">filtration</a>
(also satisfying the <a href="https://en.wikipedia.org/wiki/Usual_hypotheses">usual conditions</a>
to be rigorous).
With this we can define several useful objects.</p>
<p><strong>Definition</strong>
A stochastic process \(X := \{X_t\}_{t\geq 0}\)
is said to be a <strong>martingale</strong> if</p>
<p>(i) \(\forall t \geq 0\), we have \(X_t\) is measurable
with respect to \(\mathcal{F}_t\), denoted
\(X_t \in \mathcal{F}_t\);</p>
<p>(ii) \(\forall 0 \leq s \leq t\), we have
\(\mathbb{E}[ X_t | \mathcal{F}_s ] = X_s\) a.s.</p>
<p><strong>Definition</strong> We say a random variable \(\tau:\Omega \to [0,\infty]\)
is a <strong>stopping time</strong> if \(\forall t \geq 0,
\{\tau \leq t \} \in \mathcal{F}_t\).</p>
<p>An important property of stopping times is that
if \(X_t\) is a martingale and \(\tau\) a stopping time,
then \(X_{t \wedge \tau}\) is also a martingale.</p>
<p><strong>Definition</strong>
Let the interval \([0,T]\)
be partitioned using increments of \(2^{-n}\),
i.e. \(\{t_k^n\}_{k=0}^{\lceil T 2^n \rceil}\),
where \(t_k^n = k 2^{-n} \wedge T\).
Let \(X_t\) be a continuous martingale,
and \(f_t\) be a continuous (possibly stochastic) process.
We define the <strong>Itô integral</strong> as</p>
<p>\[ \int_0^T f_t \, dX_t :=
\lim_{n\to\infty} \sum_{k=0}^{\lfloor T 2^n \rfloor}
f_{t_k^n} (X_{t_{k+1}^n} - X_{t_k^n}),
\]</p>
<p>if the limit converges u.c.p.
(<a href="https://almostsure.wordpress.com/2009/12/22/u-c-p-convergence/">uniformly on compact intervals in probability</a> to be precise).</p>
<p><strong>Remark</strong> Observe the above definition uses
a <strong>left Riemann sum</strong> to define the integral,
whereas <em>other choices</em> will lead to <em>different</em> integrals.
This is in contrast to deterministic integrals,
where all choices are equivalent.</p>
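<p>The dependence on the evaluation point is easy to see numerically. The sketch below is an illustration under assumptions of its own: it simulates one Brownian path on a grid of \(2^{16}\) steps and approximates \(\int_0^T W_t \, dW_t\) in two ways, with the left endpoint (the Itô convention above) and with the average of the two endpoint values (a Stratonovich-type rule, used here in place of a true mid-point evaluation). The two sums should differ by roughly \(T/2\).</p>

```python
import numpy as np

rng = np.random.default_rng(0)
T, n = 1.0, 2**16                      # horizon and number of increments (illustrative values)
dW = np.sqrt(T / n) * rng.standard_normal(n)
W = np.concatenate(([0.0], np.cumsum(dW)))   # W[k] is the path at time k * T / n

# Left Riemann sum: integrand evaluated at the left endpoint (Ito convention).
left_sum = np.sum(W[:-1] * dW)

# Average of the two endpoint values (a Stratonovich-type convention).
avg_sum = np.sum(0.5 * (W[:-1] + W[1:]) * dW)

ito_value = 0.5 * (W[-1] ** 2 - T)     # known value of the Ito integral of W dW
strat_value = 0.5 * W[-1] ** 2         # value the averaged sum telescopes to

print(left_sum - ito_value, avg_sum - strat_value, avg_sum - left_sum)
```

<p>The averaged sum telescopes exactly to \(W_T^2/2\), the answer ordinary calculus would give, while the left sum picks up the extra \(-T/2\) that comes from the quadratic variation.</p>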
<p><strong>Definition</strong> Consider the same partition
\(\{t_k^n\}\) as above.
Let \(M,N\) be two continuous martingales,
we define the <strong>quadratic covariation</strong> as</p>
<p>\[ [M,N]_T :=
\lim_{n\to\infty} [M,N]^n_T :=
\lim_{n\to\infty}
\sum_{k=0}^{\lfloor T 2^n \rfloor}
(M_{t_{k+1}^n} - M_{t_k^n}) (N_{t_{k+1}^n} - N_{t_k^n}),
\]</p>
<p>where the limit is also u.c.p.
We also define the <strong>quadratic variation</strong> as
\([M]_T := [M,M]_T\).</p>
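<p>For intuition, the defining sums can be evaluated on simulated paths. The following sketch assumes \(M\) is a standard Brownian motion and \(N = \rho M + \sqrt{1 - \rho^2} B\) for an independent Brownian motion \(B\); the value \(\rho = 0.6\) and the grid size are arbitrary choices for illustration. In this case one expects \([M]_T \approx T\) and \([M,N]_T \approx \rho T\).</p>

```python
import numpy as np

rng = np.random.default_rng(1)
T, n, rho = 1.0, 2**16, 0.6            # rho is an assumed correlation, chosen for illustration
dt = T / n
dM = np.sqrt(dt) * rng.standard_normal(n)           # increments of M
dB = np.sqrt(dt) * rng.standard_normal(n)           # increments of an independent B
dN = rho * dM + np.sqrt(1.0 - rho**2) * dB          # increments of N, correlated with M

cov_MN = float(np.sum(dM * dN))   # approximates [M, N]_T, expected near rho * T
var_M = float(np.sum(dM * dM))    # approximates [M]_T, expected near T

print(cov_MN, var_M)
```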
<!-- **Definition** A stochastic process $$X_t$$ is said to
**locally bounded** if there exists a sequence $$\{S_n\}$$
of stopping times such that $$S_n \to \infty$$ a.s.,
and $$X_{t\wedge S_n}$$ is bounded for all $$n$$.
-->
<p>Several useful results are stated next.</p>
<hr />
<p><strong>Proposition (Finite Variation)</strong>
Let \(X,Y\) be continuous stochastic processes such that
\(X\) has finite variation, i.e.</p>
\[\lim_{n\to\infty} \sum_{k=0}^{\lfloor T 2^n \rfloor}
| X_{t_{k+1}^n} - X_{t_k^n} | < \infty,\]
<p>and \([Y]_t > 0\) a.s. Then we have</p>
\[[X,Y]_t = 0 \;\text{a.s.}\]
<p><strong>Proposition (Itô’s Product Rule)</strong>
Let \(X,Y\) be continuous martingales,
then we have</p>
\[X_t Y_t - X_0 Y_0 = \int_0^t X_s dY_s
+ \int_0^t Y_s dX_s + [X,Y]_t \,.\]
<p><strong>Proposition (Fundamental Theorem)</strong>
Let \(X,Y,Z\) be continuous martingales,
then we have</p>
\[\int_0^t X_s d\left( \int_0^s Y_u dZ_u \right)
= \int_0^t X_s Y_s dZ_s.\]
<p><strong>Proposition (Kunita-Watanabe Identity)</strong>
Let \(X,Y,Z\) be continuous martingales, then we have</p>
\[\left[ \int_0^{\cdot} X_s dY_s, Z \right]_t
	= \int_0^t X_s d[Y,Z]_s,\]
<p>where both uses of \([\;,\;]_t\) denote the covariation.</p>
<p><strong>Proposition (Itô’s Isometry)</strong></p>
<p>Let \(M\) be a continuous martingale,
and \(H\) be a continuous stochastic process.
Then we have</p>
\[\mathbb{E} \left[ \left( \int_0^t H_s dM_s
\right)^2 \right]
= \mathbb{E} \int_0^t H_s^2 d[M]_s.\]
<hr />
<h2 id="the-lemma-and-the-classical-approach">The Lemma and the Classical Approach</h2>
<p>For the purpose of the blog post,
we will only state and prove a much simpler version
of the lemma,
but it is not difficult to adapt to more general conditions.</p>
<hr />
<p><strong>Theorem (Itô’s Lemma)</strong>
Let \(X_t\) be a continuous martingale,
and \(f \in C^2(\mathbb{R})\).
Then we have</p>
<p>\[ f(X_t) = f(X_0)
+ \int_0^t \frac{\partial f}{\partial x}(X_s) dX_s
+ \frac{1}{2} \int_0^t \frac{\partial^2 f}{\partial x^2}
(X_s) d[X]_s.
\]</p>
<hr />
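<p>Before turning to the proofs, the formula can be checked on a simulated path. This is a minimal sketch under assumptions not in the statement: \(X\) is a standard Brownian motion (so \([X]_t = t\)), \(f = \cosh\) is chosen only because its derivatives are explicit, and both integrals are replaced by left Riemann sums on a grid of \(2^{16}\) steps.</p>

```python
import numpy as np

rng = np.random.default_rng(2)
T, n = 1.0, 2**16
dt = T / n
dW = np.sqrt(dt) * rng.standard_normal(n)
W = np.concatenate(([0.0], np.cumsum(dW)))   # Brownian path on the grid

f, df, d2f = np.cosh, np.sinh, np.cosh       # f, f' and f'' for the test function

lhs = float(f(W[-1]) - f(W[0]))
stochastic_integral = float(np.sum(df(W[:-1]) * dW))   # left sum for int f'(X_s) dX_s
correction = float(0.5 * np.sum(d2f(W[:-1]) * dt))     # (1/2) int f''(X_s) d[X]_s with [X]_s = s
rhs = stochastic_integral + correction

print(lhs, rhs, lhs - stochastic_integral)
```

<p>The two sides should agree up to discretization error, while dropping the second order term leaves a gap of at least \(T/2\) here, since \(\cosh \geq 1\).</p>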
<p>Here we will sketch the proof from
Karatzas and Shreve (1991).</p>
<p><em>proof sketch:</em>
We start by defining a stopping time
\(\tau_r := \inf \{t \geq 0 : |X_t| + [X]_t > r\}\),
and replace \(X_t\) with \(X_{t \wedge \tau_r}\).
This <em>localization</em> technique will allow us
to only consider the function \(f\) in
the interval \(B_r := [-r, r]\)
(or a ball in higher dimensions),
which has bounded derivatives.</p>
<p>By observing the lemma’s statement,
the reader may notice the formula appears like
the second order Taylor expansion of \(f(X_t)\).
Indeed we can write</p>
\[\begin{align*}
f(X_t) - f(X_0) =& \lim_{n\to\infty}
\sum_{k=0}^{\lfloor t 2^n \rfloor}
f(X_{t_{k+1}^n}) - f(X_{t_{k}^n}) \\
=& \lim_{n\to\infty} \sum_{k=0}^{\lfloor t 2^n \rfloor}
\Big\{
\frac{\partial f}{\partial x}(X_{t_{k}^n})
[X_{t_{k+1}^n} - X_{t_{k}^n}] \\
&+ \frac{1}{2} \frac{\partial^2 f}{\partial x^2}
(\eta_k^n) [X_{t_{k+1}^n} - X_{t_{k}^n}]^2
\Big\},
\end{align*}\]
<p>where \(\eta_k^n \in [X_{t_{k}^n}, X_{t_{k+1}^n}]\)
is chosen as part of Taylor’s theorem
to satisfy the above equality.
It’s not difficult to see the first sum converges to
the first stochastic integral,
then it remains to show the second term converges.</p>
<p>To this goal,
we will define</p>
\[\begin{align*}
J_1^n &:= \sum_{k=0}^{\lfloor t 2^n \rfloor}
\frac{\partial^2 f}{\partial x^2}
(\eta_k^n) [X_{t_{k+1}^n} - X_{t_{k}^n}]^2, \\
J_2^n &:= \sum_{k=0}^{\lfloor t 2^n \rfloor}
\frac{\partial^2 f}{\partial x^2}
(X_{t_{k}^n}) [X_{t_{k+1}^n} - X_{t_{k}^n}]^2, \\
J_3^n &:= \sum_{k=0}^{\lfloor t 2^n \rfloor}
\frac{\partial^2 f}{\partial x^2}
(X_{t_{k}^n}) \{ [X]_{t_{k+1}^n} - [X]_{t_{k}^n} \},
\end{align*}\]
<p>where we observe that \(J_3^n\) converges to the desired integral.
Next we will use the following technical inequality.
Let \(X\) be a martingale with
\(|X_s| \leq K < \infty, \forall s \leq T\),
then we have</p>
\[\mathbb{E} ([X]^n_T)^2 \leq 6 K^4.\]
<p>Without stating the details,
using this and Cauchy-Schwarz inequality,
we can show</p>
\[\lim_{n\to\infty} |J_1^n - J_2^n| = 0 \; \text{a.s.}\]
<p>To complete the proof, we will need one more technical lemma.
Let \(|X_s| \leq K < \infty, \forall s \leq T\),
then we have</p>
\[\lim_{n\to\infty} \mathbb{E} \sum_{k=0}^{\lfloor t 2^n \rfloor}
[ X_{t_{k+1}^n} - X_{t_k^n} ]^4
= 0.\]
<p>Then once again omitting the details,
we can get</p>
\[\mathbb{E} |J_2^n - J_3^n| \leq
2 \sup_{x \in B_r} \left|
\frac{\partial^2 f}{\partial x^2}(x)
\right|^2
\mathbb{E} \left[ \sum_{k=0}^{\lfloor t 2^n \rfloor}
[ X_{t_{k+1}^n} - X_{t_k^n} ]^4
+ [X]_t \max_{k} ( [X]_{t_{k+1}^n} - [X]_{t_{k}^n} )
\right],\]
<p>Combining this with the previous lemma and the
bounded convergence theorem,
we get the desired result</p>
\[\lim_{n\to\infty} |J_2^n - J_3^n| = 0 \; \text{a.s.}\]
<p>Putting everything together gives us the desired formula
as stated.</p>
<p><strong>Remark</strong> The use of the propositions
listed in the previous section is implicit
in the two technical lemmas we stated above,
which is also where most of the difficulty of the proof is hidden.</p>
<p><strong>Interpretation</strong> This proof naturally leads to
an interpretation of Itô’s Lemma as
a consequence of Taylor’s expansion.
However this proof provides no clear intuition on why
the second order approximation is the correct order,
and pushes the justification to complicated technical details.
Probably the most troubling consequence is that
a different integration scheme
(e.g. <a href="https://en.wikipedia.org/wiki/Stratonovich_integral">Stratonovich</a>,
which arises from a mid-point Riemann sum)
leads to a different change of variable formula,
therefore the Taylor expansion intuition can lead
to further confusion.</p>
<h2 id="overview-of-the-alternative-approach">Overview of the Alternative Approach</h2>
<p>At this point, we will first take a step back from
Itô’s Lemma and look at a rough sketch of the proof technique.</p>
<p>Suppose we want to prove that a collection of functions
(e.g. \(C^2([a,b])\)) satisfies a certain property \((P)\).
We will start by defining \(\mathcal{A}\)
as the subset of \(C^2([a,b])\)
that satisfies the desired property \((P)\).</p>
<p>(Step 1)
We will identify an algebraic structure
under which \(\mathcal{A}\) is closed,
e.g. for an <a href="https://en.wikipedia.org/wiki/Algebra_over_a_field">algebra (over a field)</a>
we have if \(f,g \in \mathcal{A}\), then
\(cf + g, fg \in \mathcal{A}\).
In other words, an algebra
is a vector space with an associative vector multiplication.</p>
<p>(Step 2)
Then we can say that
the collection \(\mathcal{A}\) (or a dense subset)
is generated by some very simple functions,
e.g. under an algebra,
the functions \(\{1, x\}\)
generate the entire collection of polynomials.</p>
<p>(Step 3)
At this point, we use a density argument
such as Weierstrass approximation to show
\(\mathcal{A}\) is dense in \(C^2([a,b])\).
Specifically, \(\forall f \in C^2([a,b])\),
\(\exists \{f_n\}_{n \geq 1} \subset \mathcal{A}\)
such that \(f_n \to f\) with respect to some metric \(\rho\).</p>
<p>(Step 4)
Finally, it is sufficient to show \(\mathcal{A}\)
is closed under this metric \(\rho\),
i.e. if \(\{f_n\}_{n \geq 1}\) all satisfy \((P)\)
and \(f_n \to f\) in \(\rho\),
then \(f\) also satisfies \((P)\),
hence \(f \in \mathcal{A}\).</p>
<p><strong>Remark</strong>
The reader may already recognize that
the sketch above was intentionally phrased
in a very general sense,
so we can observe the flexibility of the technique.
In fact we can even generalize beyond function spaces,
as long as we have an equivalent approximation technique.</p>
<h2 id="the-proof-in-detail">The Proof in Detail</h2>
<p>We start by stating the key theorem.</p>
<hr />
<p><strong>Theorem (Stone-Weierstrass, Real Numbers)</strong>
Let \(S\) be a compact
<a href="https://en.wikipedia.org/wiki/Hausdorff_space">Hausdorff space</a>,
and \(\mathcal{A} \subset C(S, \mathbb{R})\)
an algebra which contains a non-zero constant function.
Then \(\mathcal{A}\) is dense in \(C(S, \mathbb{R})\)
if and only if it separates points.</p>
<hr />
<p>Clearly, if we let \(S = B_r\),
we have a compact Hausdorff space,
and the collection of polynomials contains
the functions \(\{1,x\}\) and separates points.
Therefore the polynomials are dense
in \(C(B_r, \mathbb{R}), \forall r > 0\)
with respect to the sup-norm.</p>
<p>Applying the same theorem to the derivatives,
we then have the same result for \(C^2(B_r, \mathbb{R})\)
with respect to a similar norm</p>
\[\| f \|_{B_r} := \sup_{x \in B_r, \, m = 0,1,2}
\left| \frac{\partial^m f}{\partial x^m} (x) \right|.\]
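<p>To see what density in this norm looks like in practice, here is a small numerical sketch. The choices are assumptions made for illustration: \(f(x) = \sin(2x)\) on \(B_r\) with \(r = 2\), a grid of 2001 points standing in for the supremum, and a least-squares Chebyshev fit from NumPy standing in for the (non-constructive) Stone-Weierstrass approximants. The helper name <code>c2_error</code> is hypothetical.</p>

```python
import numpy as np
from numpy.polynomial import Chebyshev

r = 2.0
x = np.linspace(-r, r, 2001)                       # grid on B_r = [-r, r]
# f and its first two derivatives, evaluated on the grid
targets = [np.sin(2 * x), 2 * np.cos(2 * x), -4 * np.sin(2 * x)]

def c2_error(deg):
    """Grid approximation of ||f - p||_{B_r} for a degree-`deg` polynomial p."""
    # Least-squares fit in the Chebyshev basis; this is a stand-in, not the
    # approximating sequence produced by the Stone-Weierstrass theorem.
    p = Chebyshev.fit(x, targets[0], deg, domain=[-r, r])
    approximants = [p, p.deriv(1), p.deriv(2)]     # p, p', p''
    return max(float(np.max(np.abs(t - q(x)))) for t, q in zip(targets, approximants))

errors = [c2_error(deg) for deg in (5, 10, 20)]
print(errors)
```

<p>The error in \(\|\cdot\|_{B_r}\) should shrink as the degree grows, which is the behaviour the density statement guarantees for some sequence of polynomials.</p>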
<p><em>proof (of Itô’s Lemma):</em>
We will similarly use a localization argument, i.e. define
\(\tau_r := \inf \{t \geq 0 : |X_t| + [X]_t > r \}\),
and replace \(X_t\) with \(X_{t \wedge \tau_r}\).</p>
<p>(Step 1, 2) Let \(\mathcal{A} \subset C^2(\mathbb{R})\)
be the collection of functions where Itô’s Lemma
is satisfied.
Trivially we have that \(\{1,x\}\) are in \(\mathcal{A}\),
and \(\mathcal{A}\) forms a vector space.</p>
<p>Next we show that \(\mathcal{A}\) forms an algebra.
In particular, suppose \(f,g \in \mathcal{A}\),
and define \(F_t := f(X_t), G_t := g(X_t)\).
Using the product rule gives us</p>
\[F_t G_t - F_0 G_0 =
\int_0^t F_s dG_s + \int_0^t G_s dF_s + [F,G]_t \,.\]
<p>Using the Fundamental Theorem and Itô’s Lemma on \(g\),
we get</p>
\[\int_0^t F_s dG_s = \int_0^t f(X_s)
\frac{\partial g}{\partial x}(X_s) dX_s
+ \frac{1}{2} \int_0^t f(X_s)
\frac{\partial^2 g}{\partial x^2}(X_s) d[X]_s \,.\]
<p>and observe the same is true
switching the order of \(F,G\).
Next we use Itô’s Lemma
and expand with the Kunita-Watanabe identity to get</p>
\[[F,G]_t = \int_0^t \frac{\partial f}{\partial x}(X_s)
\frac{\partial g}{\partial x}(X_s) d[X]_s \, ,\]
<p>where the extra terms are zero because
the covariation with a finite variation process is zero,
i.e. \([ \,[X]\, ,Y ]_t = 0\) as \([X]_t\) has
finite variation.
By grouping the integrals by the integrators
(e.g. \(d[X]_t\)), we get that \(fg\)
satisfies Itô’s Lemma or simply
\(fg \in \mathcal{A}\).</p>
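<p>As a quick sanity check of this step (a worked instance added for illustration),
take \(f(x) = g(x) = x\), both of which lie in \(\mathcal{A}\).
Then \(F_t = G_t = X_t\), and the product rule reads</p>
\[X_t^2 - X_0^2 = 2 \int_0^t X_s \, dX_s + [X]_t \,,\]
<p>which is exactly Itô’s Lemma for \(h(x) = x^2\),
since \(h'(x) = 2x\) and \(\frac{1}{2} h''(x) = 1\).
Repeating the argument with \(f(x) = x^k\) and \(g(x) = x\)
puts every monomial, and hence every polynomial, in \(\mathcal{A}\).</p>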
<p>(Step 3) Here we can apply <em>the Stone-Weierstrass Theorem</em>
to get that \(\mathcal{A}\) is dense in \(C^2(B_r)\)
with respect to the norm \(\|\cdot\|_{B_r}\).</p>
<p>(Step 4) It remains to show that \(\mathcal{A}\)
is closed with respect to \(\|\cdot\|_{B_r}\).
In particular, let \((f_n)_{n \geq 1}\)
be a sequence in \(\mathcal{A}\)
such that \(f_n \to f\) in \(\|\cdot\|_{B_r}\).
Then we have</p>
\[\int_0^t \left| \frac{\partial^2 f_n}{\partial x^2}(X_s) -
\frac{\partial^2 f}{\partial x^2}(X_s)
\right| d[X]_s
\leq \|f_n - f\|_{B_r} [X]_t \, .\]
<p>At the same time, we also have
by Itô’s Isometry</p>
\[\begin{align*}
\mathbb{E} \left(
  \int_0^t \left[ \frac{\partial f_n}{\partial x}(X_s)
  - \frac{\partial f}{\partial x}(X_s) \right] dX_s
  \right)^2
&= \mathbb{E} \int_0^t
  \left(\frac{\partial f_n}{\partial x}(X_s)
  - \frac{\partial f}{\partial x}(X_s)
  \right)^2 d[X]_s \\
&\leq \|f_n - f\|_{B_r}^2 \, \mathbb{E} [X]_t \, .
\end{align*}\]
<p>Since the process is localized we have that \([X]_t \leq r\),
and therefore we can pass the limit in the Itô formula
and get</p>
\[\begin{align*}
f(X_t) - f(X_0)
&= \lim_{n\to\infty} f_n(X_t) - f_n(X_0) \\
&= \lim_{n\to\infty} \int_0^t
\frac{\partial f_n}{\partial x}(X_s) dX_s
+ \frac{1}{2} \int_0^t
\frac{\partial^2 f_n}{\partial x^2}(X_s) d[X]_s \\
&= \int_0^t
\frac{\partial f}{\partial x}(X_s) dX_s
+ \frac{1}{2} \int_0^t
\frac{\partial^2 f}{\partial x^2}(X_s) d[X]_s \,.
\end{align*}\]
<p>Finally, since Itô’s Lemma holds for all \(r>0\),
we can simply take \(r\to\infty\) to complete the proof.</p>
\[\tag*{$\Box$}\]
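<p>The formula can also be checked numerically. The sketch below is an illustration under stated assumptions:
it takes \(X\) to be a standard Brownian motion (so \(d[X]_s = ds\)), picks \(f = \sin\),
and replaces both integrals by left-endpoint Riemann sums on a fine grid.
The function name <code>ito_check</code> and all parameters are hypothetical.</p>

```python
import math
import random


def ito_check(f, df, d2f, n_steps=100_000, t_end=1.0, seed=0):
    """Simulate one Brownian path on [0, t_end] and return (lhs, rhs), where
    lhs = f(B_T) - f(B_0) and rhs is the left-endpoint Riemann sum of
    f'(B) dB + 0.5 * f''(B) dt."""
    rng = random.Random(seed)
    dt = t_end / n_steps
    sqrt_dt = math.sqrt(dt)
    b = 0.0
    rhs = 0.0
    for _ in range(n_steps):
        db = rng.gauss(0.0, sqrt_dt)  # Brownian increment over one step
        rhs += df(b) * db + 0.5 * d2f(b) * dt  # integrands at the left endpoint
        b += db
    lhs = f(b) - f(0.0)
    return lhs, rhs


lhs, rhs = ito_check(math.sin, math.cos, lambda x: -math.sin(x))
print(lhs, rhs, abs(lhs - rhs))
```

<p>The two sides agree up to a discretization error that shrinks as the grid is refined.
Evaluating the integrand at the left endpoint matters here, since that is the convention behind the Itô integral.</p>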
<p><strong>Remark</strong>
Clearly the alternative proof is <em>not necessarily easier</em>;
however, let us observe a couple of advantages.</p>
<p>Firstly, none of the steps above were very complicated,
as most steps followed directly from useful
(and well known) propositions.
Notably, a first time reader of this subject
will have a much easier time following the steps
and seeing the bigger picture,
rather than getting trapped by technical details.</p>
<p>Secondly, we now have an additional interpretation
of the second integral in the formula,
which clearly arises
as a consequence of Itô’s product rule
and the Kunita-Watanabe identity.
For readers who have not seen the proof of the product rule,
it follows almost directly from the definition,
i.e. a direct consequence of choosing <em>the left Riemann sum</em>.</p>
<!--
## Another Application
Since we have already seen a detailed proof,
for this section we shall leave out the technical details,
and only sketch the technique on a high level.
In fact the complete proofs for this section
are much more difficult,
therefore we direct the rigorous reader to Dudley (2002).
Suppose we are interested in proving the following
well known theorem in probability theory.
---
**Theorem (Prohorov)**
Let a sequence of probability measures $$\{\mathbb{P}_n\}_{n\geq 1}$$
be [uniformly tight](https://en.wikipedia.org/wiki/Tightness_of_measures),
i.e. $$\forall \epsilon > 0, \exists K \subset \Omega$$
compact such that
$$ \mathbb{P}_n(K) \geq 1 - \epsilon, \forall n \in \mathbb{N}.
$$
Then there exists a subsequence such that
$$ \mathbb{P}_{n_k} \to \mathbb{P} $$ weakly,
where $$\mathbb{P}$$ is a tight Borel probability measure.
---
Here we can define $$\forall f \in C_b(S)$$
(i.e. continuous and bounded)
$$ I(f) := \lim_{n\to\infty} \int f d\mathbb{P}_n.
$$
It is not too difficult to show that $$I$$
is a continuous linear functional on $$C_b(S)$$,
and it follows from
[Riesz Representation Theorem](https://en.wikipedia.org/wiki/Riesz%E2%80%93Markov%E2%80%93Kakutani_representation_theorem)
that there exists a regular Borel measure
$$\mathbb{P}$$ such that
$$ I(f) = \int f d\mathbb{P}. $$
Then it remains to show that $$\mathbb{P}$$
is a probability measure.
A curious reader may already suspect that
there exists a related version of
Riesz Representation specific to probability measures,
indeed we will investigate such a theorem -
and of course using the same Stone-Weierstrass type technique.
(Step 1)
We start by defining a slightly different
algebraic structure.
**Definition** A collection of functions
$$\mathcal{A} \subset \{f: S \to \mathbb{R}\} $$ is
called a **Stone vector lattice** if
$$\forall f,g \in \mathcal{A}$$ we have
(i) $$ cf + g \in \mathcal{A}, $$
(ii) $$ f\wedge g, f\vee g \in \mathcal{A}, $$
(iii) $$ f\wedge 1 \in \mathcal{A}. $$
Now we are ready to state the main theorem.
---
**Theorem (Stone-Daniell)**
If $$\mathcal{A}$$ is a Stone vector lattice
and $$I:\mathcal{A} \to \mathbb{R}$$ is
a continuous linear functional such that
$$ f \geq 0 \implies I(f) \geq 0.
$$
Then exists a unique measure $$\mu$$
on the minimal $$\sigma$$-algebra such that
all elements of $$\mathcal{A}$$ are measurable,
so that $$\forall f \in \mathcal{A}$$ we have
$$ I(f) = \int f d\mu.
$$
---
Before we dive into the technique,
observe that Stone-Daniell Theorem applies to all functions,
where as Prohorov's Theorem only needs $$C_b(S)$$.
Therefore we are taking a "detour" trying to prove
a much harder result.
The author suspects there may be a simpler
more direct approach, perhaps addressed in a future post.
Next we provide a brief sketch that *skips most of the details*
unrelated to the Stone-Weierstrass type technique.
To start we define (not quite a measure)
for $$f \leq g$$ in $$\mathcal{A}$$
$$ \nu([f,g)) := I(g - f),
$$
and eventually define the desired measure as
$$ \mu(A) := \nu([0, 1_A)).
$$
At this point, we just need to check $$\mu$$
satisfies all the property we desire
in the statement of the theorem.
(Step 2,3)
Intuitively, constructing a Lebesgue integral only requires
approximation by simple functions;
similarly for this proof,
it turns out we only need to approximate indicators
functions from sets of the type
$$ f^{-1}((1, \infty)) $$.
To this goal, we can define
$$ g_n := [ n( f - f \wedge 1) ] \wedge 1,
$$
which is also contained in $$\mathcal{A}$$
and $$g_n \to 1_{f^{-1}((1, \infty))}$$.
(Step 4)
As for closure of $$\mathcal{A}$$,
it is even easier as we have all
the limit theorems (e.g. monotone convergence)
to work with.
By these means, (the author promises)
we can construct the integral
$$ I(f) = \int f d\mu $$
with the desired properties.
Finally, to complete the proof of Prohorov's Theorem,
the reader may recognize that we can indeed
recover a regular Borel probability measure by
testing the case $$f = 1$$
and showing that $$I(f) = 1$$.
-->
<h2 id="summary">Summary</h2>
<p>We have shown the Stone-Weierstrass Theorem
is not only a strong result on its own,
but leads to a powerful technique in general.
In particular, we saw a nice alternative proof
of Itô’s Lemma with much better interpretations.
<!-- At this point, the author will have to
apologize for the length of the blog post;
this turned out to be much longer than intended. -->
Ideally, the author would have liked
to add another example,
but the post is already quite long at this point.
Hopefully readers have still enjoyed
an interesting blog post,
and added another proof technique to their arsenal.</p>
<p>Please comment below (new feature!)
for any questions or feedback!</p>
<h2 id="an-interesting-story-to-wrap-up">An Interesting Story to Wrap Up</h2>
<p>For the longest time, the lemma was credited to
<a href="https://en.wikipedia.org/wiki/Kiyosi_It%C3%B4">Kiyosi Itô</a>
alone in his
<a href="https://projecteuclid.org/euclid.nmj/1118764702">1950 paper</a>.
This was until the 1990s, with a resurgence of interest in
the late French-German mathematician
<a href="https://en.wikipedia.org/wiki/Wolfgang_Doeblin">Wolfgang Doeblin</a>,
who was well known to be quite gifted.
This interest led to a demand to open the remaining
“pli cacheté” (sealed envelope)
held by the French Academy of Sciences,
which he submitted shortly before he passed away in 1940;
he burned his notes and took his own life
so that the German soldiers could not take advantage of his work.
To everyone’s surprise,
Doeblin’s letter contained significant
research progress ahead of his time,
including a statement of the same change of variables formula!
To honour his contribution,
the result is sometimes referred to as
the Itô-Doeblin Lemma.</p>
<p>For the interested readers,
I would strongly recommend an excellent
<a href="https://link.springer.com/article/10.1007/s780-002-8399-0">commentary</a>
by Bernard Bru and Marc Yor (2002)
for further details on this topic.</p>
<h2 id="references">References</h2>
<ul>
<li>Bru, B. & Yor, M. (2002). Comments on the life and mathematical legacy of Wolfgang Doeblin. Finance and Stochastics, 6, 3-47.
<!-- - Dudley, R.M. (2002). Real Analysis and Probability. Cambridge University Press --></li>
<li>Karatzas, I. & Shreve, S.E. (1991). Brownian Motion and Stochastic Calculus. Springer New York</li>
<li>Miller, J. (2016). Stochastic Calculus, Lent 2016 Lecture Notes. Retrieved from http://statslab.cam.ac.uk/~jpm205/teaching/lent2016/lecture_notes.pdf</li>
</ul>
Connected by Poincaré Inequality
2017-12-30T00:00:00-05:00
https://mufan-li.github.io/poincare_inequality
<p>While studying two seemingly irrelevant subjects,
probability theory and partial differential equations (PDEs),
I ran into a somewhat surprising overlap:
the <a href="https://en.wikipedia.org/wiki/Poincar%C3%A9_inequality">Poincaré inequality</a>.
On one hand, it is not out of the ordinary
for analysis based subjects to share inequalities
such as <a href="https://en.wikipedia.org/wiki/Cauchy%E2%80%93Schwarz_inequality">Cauchy-Schwarz</a>
and <a href="https://en.wikipedia.org/wiki/H%C3%B6lder%27s_inequality">Hölder</a>;
on the other hand, the two forms of
Poincaré inequality have quite different applications.</p>
<p>In this blog post, I hope to put together some excellent content
I studied recently, specifically from:</p>
<ul>
<li><em>Concentration Inequalities</em> by Boucheron, Lugosi, and Massart (2013)</li>
<li><em>Partial Differential Equations</em> by Evans (2010)
<!-- - *Functional Analysis, Sobolev Spaces, and Partial Differential Equations*
by Haim Brezis (2011) --></li>
</ul>
<!-- Both books are great references, I would strongly recommend both
for an intuitive yet rigorous read in their respective topics. -->
<h2 id="a-simple-inequality">A Simple Inequality</h2>
<p>We first state a very simple version of the inequality:</p>
<hr />
<p><strong>Theorem (A Simple Poincaré Inequality)</strong>
Let \(\Omega \subset \mathbb{R}^n\) be open and bounded,
and let \(f \in C^1_c(\Omega)\) (continuously differentiable
with compact support).
Then there exists a constant \(C\) that depends only
on \(\Omega\) such that:</p>
<p>\[ \left\lVert f \right\rVert_{L^2(\Omega)}
\leq C \lVert \nabla f \rVert_{L^2(\Omega)}
\]</p>
<!-- where $$\overline{f}$$ is the average of the function
$$f$$ in domain $$\Omega$$. -->
<hr />
<p>Quick aside: we say a function \(f\) has <strong>compact support</strong>
if the set \(S = \{ x \in \Omega : f(x) \neq 0 \}\)
has compact closure contained in \(\Omega\).
This implies \(f(x) = 0\) near the boundary.</p>
<p>Observe that the inequality simply bounds
the \(L^2\)-norm of a function in terms of the \(L^2\)-norm
of its gradient instead.
Note the compact support here is an important assumption
when we are integrating with respect to the Lebesgue measure.
Consider for example a constant function,
then this inequality would fail as the gradient is zero.
The reader may be comforted that a general form requires
far fewer assumptions, and can be generalized to
all \(L^p\) norms.</p>
<p>The reason we start with this inequality
is because the proof is quite straightforward:</p>
<p><em>proof (of the Simple Poincaré Inequality):</em></p>
<p>Without loss of generality,
we let \(\Omega \subset [0,M]^n\)
for some large \(M > 0\),
and extend \(f\) by zero outside \(\Omega\).
Since \(f\) vanishes when \(x_1 = 0\),
the fundamental theorem of calculus
and the Cauchy-Schwarz inequality give</p>
<p>\[ \vert f(x) \vert^2 = \left\vert \int_0^{x_1}
  \frac{\partial f}{\partial x_1} (y_1, x_2, \ldots) dy_1 \right\vert^2
  \leq \left[ \int_0^M 1^2 dy_1 \right]
  \left[ \int_0^M \left\vert \frac{\partial f}{\partial x_1} \right\vert^2 dy_1 \right]
\]</p>
<p>Repeating the argument for each of the \(n\) coordinates,
summing, and integrating over \(\Omega\) we have</p>
<p>\[ n \int_\Omega \vert f(x) \vert^2
  \leq \int_\Omega \sum_{i=1}^n M \int_0^M \left\vert
  \frac{\partial f}{\partial x_i} \right\vert^2 dy_i
  = \sum_{i=1}^n M^2 \left\lVert \frac{\partial f}{\partial x_i}
  \right\rVert^2_{L^2(\Omega)}
\]</p>
<p>where in the last step we exchanged the order of integration,
and used the fact that the inner integral does not depend on \(x_i\),
so integrating over \(x_i \in [0,M]\) contributes another factor of \(M\).
Rewriting the above we get the desired result</p>
<p>\[ \lVert f \rVert_{L^2(\Omega)}
\leq \frac{M}{\sqrt{n}} \lVert \nabla f \rVert_{L^2(\Omega)}
\]</p>
\[\tag*{$\Box$}\]
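<p>As an illustration, the inequality can be checked numerically in one dimension.
The sketch below assumes \(\Omega = (0,1)\), so \(M = 1\) and \(n = 1\),
and uses a smooth bump supported on \([0.1, 0.9]\) as the test function;
the bump, the midpoint quadrature, and all names are choices made for this example only.</p>

```python
import math

A, B = 0.1, 0.9  # support of the bump, strictly inside Omega = (0, 1)


def bump(x):
    """Smooth function that vanishes outside (A, B)."""
    if x <= A or x >= B:
        return 0.0
    return math.exp(-1.0 / ((x - A) * (B - x)))


def bump_prime(x):
    """Derivative of bump, computed with the chain rule."""
    if x <= A or x >= B:
        return 0.0
    g = (x - A) * (B - x)
    return bump(x) * (A + B - 2.0 * x) / (g * g)


def l2_norm(func, n_cells=20_000):
    """Midpoint-rule approximation of the L^2(0, 1) norm."""
    h = 1.0 / n_cells
    total = sum(func((k + 0.5) * h) ** 2 for k in range(n_cells))
    return math.sqrt(total * h)


M, n = 1.0, 1
norm_f = l2_norm(bump)
norm_df = l2_norm(bump_prime)
print(norm_f, (M / math.sqrt(n)) * norm_df)
```

<p>The first printed number should not exceed the second.
The constant \(M / \sqrt{n}\) from the proof is far from optimal, so the gap is typically large.</p>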
<h2 id="as-a-concentration-inequality">As a Concentration Inequality</h2>
<p>We now state the inequality in a form
most useful for probability theory, see
Theorem 3.20 from Boucheron, Lugosi, Massart (2013):</p>
<hr />
<p><strong>Theorem (Gaussian-Poincaré Inequality)</strong>
Let \(X = (X_1, \ldots, X_n)\) be a vector of i.i.d.
standard Gaussian random variables.
Let \(\, f : \mathbb{R}^n \to \mathbb{R}\)
be any continuously differentiable function.
Then</p>
<p>\[ \text{Var}[f(X)] \leq
\mathbb{E}\left[ | \nabla f(X)|^2 \right] \]</p>
<hr />
<p>Observe that the inequality is slightly different.
Firstly this time the norm is centered,
although centering in this case is not an issue
since \(\text{Var}[X] \leq \mathbb{E}X^2\).
Secondly due to the measure being a probability measure,
we have a much smaller constant on the inequality \(C=1\).
In combination, we were also able to drop
the compact support assumption.</p>
<p>An immediate consequence is to consider
\(f\) Lipschitz with coefficient \(1\),
i.e. \(| f(x) - f(y) | \leq \|x - y\|\),
then we have</p>
<p>\[ \text{Var}[f(X)] \leq 1 \]</p>
<p>In other words, we just found a constant bound
on the variance for a huge class of functions of Gaussian vectors!
In general, we can consider \(f\) to be a smooth estimator
based on a dataset with noise \(X\).
The Poincaré inequality will provide
a very useful bound on estimation error.
<!-- If this doesn't excite you, I don't know what does :) --></p>
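<p>A small Monte Carlo experiment illustrates the bound \(\text{Var}[f(X)] \leq 1\).
The sketch assumes two particular 1-Lipschitz functions, the coordinate maximum and the Euclidean norm,
in dimension 10; the sample size, seed, and helper names are arbitrary choices for this example.</p>

```python
import random
import statistics


def sample_variance_of(f, dim, n_samples=20_000, seed=0):
    """Monte Carlo estimate of Var[f(X)] for X a standard Gaussian vector."""
    rng = random.Random(seed)
    values = [
        f([rng.gauss(0.0, 1.0) for _ in range(dim)])
        for _ in range(n_samples)
    ]
    return statistics.pvariance(values)


def euclidean_norm(x):
    return sum(v * v for v in x) ** 0.5


# Both functions are 1-Lipschitz with respect to the Euclidean norm.
var_max = sample_variance_of(max, dim=10)
var_norm = sample_variance_of(euclidean_norm, dim=10)
print(var_max, var_norm)
```

<p>Both estimates stay below 1 regardless of the dimension,
even though the functions themselves grow with it.</p>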
<p>To prove this inequality, we will use a famous
result from 1981 (Theorem 3.1 in
Boucheron, Lugosi, Massart (2013)):</p>
<hr />
<p><strong>Theorem (Efron-Stein Inequality)</strong>
Let \(X = (X_1, \ldots, X_n)\) be a vector of i.i.d.
random variables and let \(Z = f(X)\) be a square-integrable
function of \(X\).
Then</p>
<p>\[ \text{Var}(Z) \leq \sum_{i=1}^n \mathbb{E}
\left[ \left( Z - \mathbb{E}^{(i)}Z \right)^2 \right] \]</p>
<p>where \(\mathbb{E}^{(i)}Z = \int
f(X_1, \ldots, X_{i-1}, x_i, X_{i+1},\ldots) d\mu_i(x_i)\),
i.e. the expectation over \(X_i\) only.</p>
<hr />
<p>The Efron-Stein inequality can be proved by decomposing
the variance as a sum of telescoping differences of
conditional expectations,
and applying Jensen’s inequality to the individual terms.
While we omit the proof here,
we should remark that the simple Efron-Stein inequality
has wide ranging applications;
we will only look at one such use for the proof
of the Poincaré inequality,
taken from Theorem 3.20 in Boucheron, Lugosi, Massart (2013):</p>
<p><em>proof (of Gaussian-Poincaré Inequality):</em></p>
<p>First we observe that a direct application of
the Efron-Stein inequality can reduce the problem down
to \(n=1\), i.e. it is sufficient to show</p>
<p>\[ \mathbb{E}^{(i)} \left[ \left( Z - \mathbb{E}^{(i)}Z \right)^2 \right]
  \leq \mathbb{E}^{(i)} \left[ \left( \frac{\partial f}{\partial x_i}(X) \right)^2 \right] \]</p>
<p>From here we assume without loss of generality \(n=1\).
Then we notice that it is sufficient to prove
this inequality for compactly supported, twice differentiable
functions, i.e. \(f \in C_c^2(\mathbb{R})\),
since otherwise we can just take a limit to the original function.</p>
<p>Here we let \(\epsilon_1,\ldots,\epsilon_n\) be i.i.d.
Rademacher random variables, i.e.
\(\mathbb{P}[\epsilon_j = 1] = \mathbb{P}[\epsilon_j = -1] =
\frac{1}{2} \,\forall j \in \{ 1,2,\ldots,n \}\),
and we define</p>
<p>\[ S_n = n^{-1/2} \sum_{j=1}^n \epsilon_j \]</p>
<p>Observe that, conditionally on the other signs,
\(S_n\) takes the two values
\(S_n + \frac{1-\epsilon_i}{\sqrt{n}}\) and
\(S_n - \frac{1+\epsilon_i}{\sqrt{n}}\)
with probability \(\frac{1}{2}\) each,
so for every \(i\) we have</p>
<p>\[ \text{Var}^{(i)}[f(S_n)] =
  \frac{1}{4} \left[ f\left( S_n + \frac{1-\epsilon_i}{\sqrt{n}} \right)
  - f\left( S_n - \frac{1+\epsilon_i}{\sqrt{n}} \right) \right]^2 \]</p>
<p>Applying the Efron-Stein inequality, we get</p>
<p>\[ \text{Var}[f(S_n)] \leq \frac{1}{4} \sum_{i=1}^n
  \mathbb{E} \left[ \left(
  f\left( S_n + \frac{1-\epsilon_i}{\sqrt{n}} \right)
  - f\left( S_n - \frac{1+\epsilon_i}{\sqrt{n}} \right)
  \right)^2 \right] \]</p>
<p>Let \(K = \sup_x \vert f''(x) \vert\), then we have that</p>
<p>\[ \left|f\left( S_n + \frac{1-\epsilon_i}{\sqrt{n}} \right)
  - f\left( S_n - \frac{1+\epsilon_i}{\sqrt{n}} \right)\right|
  \leq \frac{2}{\sqrt{n}} |f'(S_n)| + \frac{2K}{n}
\]</p>
<p>which implies</p>
<p>\[ \frac{n}{4} \left(
    f\left( S_n + \frac{1-\epsilon_i}{\sqrt{n}} \right)
    - f\left( S_n - \frac{1+\epsilon_i}{\sqrt{n}} \right)
  \right)^2
  \leq f'(S_n)^2 + \frac{2K}{\sqrt{n}} | f'(S_n) |
  + \frac{K^2}{n}
\]</p>
<p>Finally, since \(f\) and \(f'\) are bounded and continuous,
the central limit theorem gives
\(\text{Var}[f(S_n)] \to \text{Var}[f(X)]\) for a standard Gaussian \(X\),
and together with the bound above it implies the desired result</p>
<p>\[ \limsup_{n\to\infty} \frac{1}{4} \sum_{i=1}^n
  \mathbb{E}
  \left[ \left(
    f\left( S_n + \frac{1-\epsilon_i}{\sqrt{n}} \right)
    - f\left( S_n - \frac{1+\epsilon_i}{\sqrt{n}} \right)
  \right)^2 \right]
  = \mathbb{E} \left[ f'(X)^2 \right]
\]</p>
\[\tag*{$\Box$}\]
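<p>For small \(n\) the Rademacher step of the proof can be verified exactly by enumerating all \(2^n\) sign patterns.
The sketch below assumes \(f = \sin\), chosen because the Gaussian limits are available in closed form
(\(\text{Var}[\sin X] = (1 - e^{-2})/2\) and \(\mathbb{E}[\cos^2 X] = (1 + e^{-2})/2\));
\(\sin\) is not compactly supported, so this is only an illustration of the inequality, and the function name is hypothetical.</p>

```python
import itertools
import math


def efron_stein_check(f, n):
    """Return (Var[f(S_n)], Efron-Stein bound), computed exactly over all
    2**n Rademacher sign patterns, with S_n = n**-0.5 * sum(signs)."""
    root = math.sqrt(n)
    weight = 1.0 / 2 ** n
    mean = second = bound = 0.0
    for signs in itertools.product((-1, 1), repeat=n):
        s = sum(signs) / root
        value = f(s)
        mean += weight * value
        second += weight * value * value
        for eps in signs:
            up = f(s + (1 - eps) / root)    # value with this sign set to +1
            down = f(s - (1 + eps) / root)  # value with this sign set to -1
            bound += weight * 0.25 * (up - down) ** 2
    return second - mean * mean, bound


variance, bound = efron_stein_check(math.sin, n=10)
gaussian_variance = (1.0 - math.exp(-2.0)) / 2.0
gaussian_bound = (1.0 + math.exp(-2.0)) / 2.0
print(variance, bound, gaussian_variance, gaussian_bound)
```

<p>The variance never exceeds the Efron-Stein bound,
and both quantities drift toward their Gaussian counterparts as \(n\) grows.</p>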
<p><strong>Remark</strong>
There are also Poincaré type inequalities for
non-Gaussian random variables,
for example if \(X\sim\)Poisson\((\mu)\):</p>
<p>\[ \text{Var}[f(X)] \leq \mu \mathbb{E}\left[
(f(X+1) - f(X))^2 \right]
\]</p>
<p>Or if \(X\) is double exponential
i.e. with density \(\frac{1}{2}e^{-\vert x \vert}\),
then we have:</p>
<p>\[ \text{Var}[f(X)] \leq 4 \mathbb{E}\left[
  (f'(X))^2 \right]
\]</p>
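<p>The Poisson version is easy to check by direct summation.
The sketch below assumes \(\mu = 3\) and truncates the series at a cutoff far in the tail;
the two test functions and the helper name <code>poisson_poincare</code> are choices made for this example.</p>

```python
import math


def poisson_poincare(f, mu, cutoff=200):
    """Return (Var[f(X)], mu * E[(f(X+1) - f(X))**2]) for X ~ Poisson(mu),
    with the series truncated at `cutoff` terms."""
    pmf = math.exp(-mu)  # P[X = 0]
    mean = second = energy = 0.0
    for k in range(cutoff):
        fk = f(k)
        mean += pmf * fk
        second += pmf * fk * fk
        energy += pmf * (f(k + 1) - fk) ** 2
        pmf *= mu / (k + 1)  # P[X = k + 1] from P[X = k]
    return second - mean * mean, mu * energy


var_sqrt, bound_sqrt = poisson_poincare(math.sqrt, mu=3.0)
var_id, bound_id = poisson_poincare(float, mu=3.0)
print(var_sqrt, bound_sqrt)
print(var_id, bound_id)
```

<p>For the identity function both sides equal \(\mu\), so the constant in the Poisson inequality cannot be improved.</p>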
<h2 id="an-application-to-pdes">An Application to PDEs</h2>
<p>To do proper justice to the theory of PDEs,
we would need a significant background in functional analysis.
In this section, we will try to side-step the technical details
and focus on one single application,
that is showing the existence and uniqueness of
a weak solution for <strong>Poisson’s equation</strong>:</p>
<p>\[ -\Delta u = f \; \text{ in } \Omega \]
\[ u = g \; \text{ on } \partial \Omega \]</p>
<p>where \(\Omega \subset \mathbb{R}^n\) is open and bounded with
smooth boundary \(\partial \Omega\).
By <strong>weak solution</strong> (taking \(g = 0\) for simplicity),
we mean a \(u \in C^1(\Omega)\)
such that \(\forall v \in C^1_c(\Omega)\) we have</p>
<p>\[ B[u,v] := \int_\Omega \nabla u \cdot \nabla v = \int_\Omega f v
\]</p>
<p>Note if \(u\) is a solution to the (original) Poisson’s equation,
then we have the above weak equation by
<a href="https://en.wikipedia.org/wiki/Green%27s_identities">Green’s identity</a>.
The main tool we will use to prove existence and uniqueness is
the following result:</p>
<hr />
<p><strong>Theorem (Lax-Milgram)</strong>
Let \(H\) be a Hilbert space, \(B: H \times H \to \mathbb{R}\)
be a continuous, coercive, bilinear form.
Then \(\forall \varphi \in H^*\),
there exists a unique \(u\in H\) such that</p>
<p>\[ B(u,v) = \langle \varphi, v \rangle
  \quad \forall v \in H
\]</p>
<p>where \(\langle \varphi, v \rangle\) is the linear functional
\(\varphi\) applied to \(v\).</p>
<hr />
<p><strong>Remark</strong> Before going into the definitions and technical details,
we observe that the Lax-Milgram Theorem gives us exactly
what we want - the existence and uniqueness!
Now we just have to fill in the blanks:</p>
<ul>
<li>define a <a href="https://en.wikipedia.org/wiki/Hilbert_space">Hilbert space</a>
\(H\) in which the solution lives</li>
<li>show that \(B(u,v)\) is continuous and coercive (we will define these later)</li>
<li>let \(\langle \varphi, v \rangle = \int_\Omega f v\)</li>
</ul>
<p><strong>Step 1</strong>
To define our Hilbert space,
we will consider the following
<a href="https://en.wikipedia.org/wiki/Inner_product_space">inner product</a>:</p>
<p>\[ (u,v) := \int_\Omega [uv + \nabla u \cdot \nabla v] \]</p>
<p>which corresponds to the following <strong>Sobolev norm</strong>:</p>
<p>\[ \lVert u \rVert_{H^{1}(\Omega)}
:= (u,u)^{1/2}
= \left[ \lVert u \rVert_{L^2(\Omega)}^2 +
\lVert \nabla u \rVert_{L^2(\Omega)}^2 \right]^{1/2}
\]</p>
<p>By equipping the space \(C^1_c(\Omega)\)
with the above inner product,
we almost have a Hilbert space!
Here we will simply take the completion of \(C^1_c(\Omega)\)
with respect to the Sobolev norm,
i.e. add all the limit points to the space.
We call this (completed) Hilbert space \(H_0^1(\Omega)\).</p>
<!-- At the same time, we should note the form on the right hand side
is just a geometric mean of two quantities related
by the Poincaré inequality!
This observation will be key to our proof. -->
<p><strong>Step 2</strong>
We now turn our attention to
\(B(u,v)\), the bilinear form
(fancy term for separately linear in both inputs).
Then we say \(B\) is <strong>continuous</strong> if</p>
<p>\[ \exists C_1 > 0 : \forall u,v \in H,
\vert B(u,v) \vert
\leq C_1 \lVert u \rVert_{H^1(\Omega)}
\lVert v \rVert_{H^1(\Omega)}
\]</p>
<p>Note this is an immediate consequence of the Cauchy-Schwarz inequality</p>
<p>\[ \vert B(u,v) \vert \leq \lVert \nabla u \rVert_{L^2(\Omega)}
\lVert \nabla v \rVert_{L^2(\Omega)}
\leq \lVert u \rVert_{H^1(\Omega)} \lVert v \rVert_{H^1(\Omega)}
\]</p>
<p>We say \(B\) is <strong>coercive</strong> if</p>
<p>\[ \exists C_2 > 0 : \forall u \in H,
B(u,u) \geq C_2 \lVert u \rVert_{H^1(\Omega)}^2
\]</p>
<p>We notice this is the only non-trivial condition left to check,
and to prove this we will finally use <em>Poincaré inequality</em>!
Start by rewriting</p>
<p>\[ B(u,u) = \int_\Omega \nabla u \cdot \nabla u
= \lVert \nabla u \rVert_{L^2(\Omega)}^2 \]</p>
<p>Applying the Poincaré inequality
\(\lVert u \rVert_{L^2(\Omega)} \leq C \lVert \nabla u \rVert_{L^2(\Omega)}\)
on half of the norm we have</p>
<p>\[ \frac{1}{2} \lVert \nabla u \rVert_{L^2(\Omega)}^2
  \geq \frac{1}{2C^2} \lVert u \rVert_{L^2(\Omega)}^2
\]</p>
<p>Therefore</p>
<p>\[ B(u,u) \geq \frac{1}{2} \lVert \nabla u \rVert_{L^2(\Omega)}^2
  + \frac{1}{2C^2} \lVert u \rVert_{L^2(\Omega)}^2
  \geq \min\left(\frac{1}{2}, \frac{1}{2C^2}\right)
    \lVert u \rVert_{H^1(\Omega)}^2
\]</p>
<p>And voilà, we have existence and uniqueness!
A rigorous and careful reader may notice that \(u\)
does not necessarily have compact support -
this is correct.
However every \(u \in H_0^1(\Omega)\)
is a limit of compactly supported functions,
therefore we just need to take a limit to get our result!</p>
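<p>The weak formulation is also what one discretizes in practice.
The sketch below is a minimal illustration under several assumptions:
one dimension with \(\Omega = (0,1)\), zero boundary values, piecewise linear hat functions on a uniform mesh,
load integrals approximated by \(h f(x_i)\), and the right-hand side \(f(x) = \pi^2 \sin(\pi x)\),
for which the exact solution is \(u(x) = \sin(\pi x)\).
The function name <code>solve_poisson_1d</code> is hypothetical.</p>

```python
import math


def solve_poisson_1d(f, n):
    """Galerkin approximation of -u'' = f on (0, 1) with u(0) = u(1) = 0.

    Uses n interior hat functions on a uniform mesh. The stiffness matrix
    B[phi_i, phi_j] is tridiagonal with 2/h on the diagonal and -1/h off it.
    """
    h = 1.0 / (n + 1)
    xs = [(i + 1) * h for i in range(n)]
    diag = 2.0 / h
    off = -1.0 / h
    rhs = [h * f(x) for x in xs]  # approximate load integrals

    # Thomas algorithm for the tridiagonal system.
    c = [0.0] * n  # modified super-diagonal
    d = [0.0] * n  # modified right-hand side
    c[0] = off / diag
    d[0] = rhs[0] / diag
    for i in range(1, n):
        denom = diag - off * c[i - 1]
        c[i] = off / denom
        d[i] = (rhs[i] - off * d[i - 1]) / denom
    u = [0.0] * n
    u[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        u[i] = d[i] - c[i] * u[i + 1]
    return xs, u


xs, u = solve_poisson_1d(lambda x: math.pi ** 2 * math.sin(math.pi * x), n=99)
max_error = max(abs(ui - math.sin(math.pi * x)) for x, ui in zip(xs, u))
print(max_error)
```

<p>Coercivity is what makes the discrete system uniquely solvable here:
restricted to the span of the hat functions, \(B\) is still coercive,
so the stiffness matrix is positive definite.</p>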
<p><strong>Remark</strong> In fact, we can use similar Lax-Milgram
based methods to show existence and uniqueness
for a large subset of
<a href="https://en.wikipedia.org/wiki/Elliptic_partial_differential_equation">elliptic PDEs</a>.
We should note that
the fact we can “convert” between \(\|u\|\) and \(\|\nabla u\|\)
is highly useful for studying Sobolev norms.
We refer curious readers to Evans (2010)
for an excellent chapter on
<a href="https://en.wikipedia.org/wiki/Sobolev_space">Sobolev spaces</a>
and related inequalities.</p>
<h2 id="final-words">Final Words</h2>
<p>I have a weak spot for connections between different fields,
probably because it’s always surprising,
and surprises are intriguing in math!
I hope to have presented a readable introduction
to the inequality and its applications in both topics,
without drowning readers in technical details.
On this note, I should remark that to study Sobolev spaces rigorously,
the reader will need to go through all the details carefully!</p>
<p>As this is my first blog post, any constructive feedback or suggestions on
future topics will be appreciated!</p>
<h2 id="references">References</h2>
<ul>
<li>Boucheron, S., Lugosi, G., & Massart, P. (2013). Concentration inequalities: A nonasymptotic theory of independence. Oxford university press.</li>
<li>Evans, L. C. (2010). Partial differential equations. Providence, R.I.: American Mathematical Society.</li>
</ul>
<!-- This nice inequality is named after the brilliant
[Henri Poincaré](https://en.wikipedia.org/wiki/Henri_Poincar%C3%A9),
who contributed much more to mathematics than a blog post can contain. -->
<!-- Observe that these inequalities now have
*different constant coefficients*!
As we will see soon, the more general PDE version of Poincaré
inequality does in fact have a constant coefficient. -->
<!-- ## Some Background in Sobolev Spaces
I will introduce a few technical definitions
so I can state the inequality rigorously.
Note these technicalities are only necessary for
applications in PDEs.
Let $$\Omega \subset \mathbb{R}^n$$ be bounded
with a smooth ($$C^1$$) boundary,
and let $$u,v \in L^1_{\text{loc}}(\Omega)$$
(i.e. integrable on compact subsets).
We say $$u$$ has a **weak derivative** $$v$$
with respect to $$x_i$$ if $$\forall \varphi \in C^\infty_c(\Omega)$$
(compact support) we have
\\[ \int_\Omega
u \frac{\partial \varphi}{\partial x_i}
= \int_\Omega v \varphi \\]
Then if $$u$$ has all weak derivatives of degree 1,
we can define a $$L^p$$ like norm as follows
\\[ \lVert u \rVert_{W^{1,p}(\Omega)} := \left[
\lVert u \rVert_{L^p(\Omega)}^p
+ \sum_{i=1}^n
\left\lVert \frac{\partial u}{\partial x_i}
\right\rVert_{L^p(\Omega)}^p \right]^{1/p}
\\]
The above **Sobolev norm** can be interpreted as a "sum" of
all the weak derivatives' $$L^p$$ norms.
We then define the normed-space of such functions
a **Sobolev space**,
denoted $$W^{1,p}(\Omega)$$.
Note we can extend the definition up to the $$k^\text{th}$$
degree weak derivative, which is denoted $$W^{k,p}$$ instead.
An important space we will look at is the $$W^{1,p}$$-closure
of $$C^\infty_c(\Omega)$$, denoted $$W^{1,p}_0(\Omega)$$.
In other words, all the limit points of $$C^\infty_c(\Omega)$$
in the norm we defined above.
## In the Context of Sobolev Spaces
Finally we can state the second form of
Poincaré inequality,
(quoting Corollary 9.19 from Brezis (2011)):
---
**Theorem (Poincaré Inequality)**
Suppose $$1 \leq p < \infty$$, $$\Omega$$ bounded open.
Then there exists a constant $$C$$ depending on
$$\Omega, p$$ such that:
\\[ \lVert u \rVert_{L^p(\Omega)}
\leq C \lVert \nabla u \rVert_{L^p(\Omega)}
\quad \forall u \in W^{1,p}_0(\Omega)
\\]
---
**Remark** To connect with the previous inequality,
we need to consider squaring both sides,
taking $$p=2$$, and change the Lebesgue measure
to a probability measure
(note the Radon-Nikodym derivative is just the density,
which is bounded for Gaussian).
Lastly, there is in fact a version of this inequality
with the left hand side centered -
i.e. $$\|u - \overline{u}\|_{L^p(\Omega)}$$
where $$\overline{u}$$ is the average. We will discuss this at the end.
The following proof is based on Proposition 9.18 from Brezis (2011):
*proof (of Poincaré's Inequality)*
Firstly since $$u \in W^{1,p}_0(\Omega)$$,
there exists a sequence
$$ u_k \in C^\infty_c(\Omega) : u_k \to u$$
in $$W^{1,p}(\Omega)$$. \\
Using Green's identity from calculus,
and the fact that $$u_k$$ have compact support we have
\\[ \left\vert \int_\Omega u_k
\frac{\partial \varphi}{\partial x_i} \right\vert
= \left\vert \int_\Omega
\frac{\partial u_k}{\partial x_i} \varphi \right\vert
\leq \left\lVert \frac{\partial u_k}{\partial x_i}
\right\rVert_{L^p(\Omega)} \lVert \varphi \rVert_{L^q{\Omega}}
\quad \forall \varphi \in C^\infty_c(\Omega)
\\]
where we used Hölder's inequality in the last step. -->