To compute the conditional dynamics of Markov processes, we will use the h-transform by Joseph Doob [Bl10]. Let us consider a time-homogeneous Markov process \(\{ X(t) \}_{t \geq 0}\) to be conditioned on a shift invariant event \(A\), i.e.
\[\mathbb{P}( \{ X(t) \}_{t\geq 0} \in A | X(0) = x ) = \mathbb{P}( \{ X(t+s) \}_{t\geq 0} \in A | X(s) = x ) \,.\]An important example of a shift invariant event comes from gambler’s ruin, where \(X(t) \in (0,c)\) is a martingale, and the event can be defined as
\[A := \left\{ X(t) \text{ hits } c \text{ before } 0 \right\} \,.\]Here \(X(t)\) is intended to model a gambler’s wealth process in a fair betting game, and it’s well known that \(\mathbb{P}(A) = \frac{X(0)}{c}\) (a direct consequence of the optional stopping theorem).
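Before moving on, here is a quick Monte Carlo sanity check of the gambler’s ruin formula (my own addition; a simple symmetric random walk stands in for the martingale \(X(t)\)):

```python
import random

def ruin_prob(x, c, trials=20_000, seed=0):
    """Estimate P(simple symmetric random walk started at x hits c before 0)."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        w = x
        while 0 < w < c:
            w += 1 if rng.random() < 0.5 else -1
        wins += (w == c)  # True counts as 1
    return wins / trials
```

With \(x = 3\) and \(c = 10\), the estimate concentrates around \(x/c = 0.3\), as the optional stopping theorem predicts.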
We will provide a simple sketch of the h-transform result (see [Bl10] for a rigorous proof). Here we introduce the following notations:
\[\begin{split} \mathbb{P}_x( \cdot ) &:= \mathbb{P}( \cdot | X(0) = x ) \,, \\ h(x) &:= \mathbb{P}_x(A) \,, \\ P^t(x, dy) &:= \mathbb{P}_x( X(t) \in dy ) \,, \\ \tilde{P}^t(x, dy) &:= \mathbb{P}_x( X(t) \in dy | A) \,, \end{split}\]where \(h(x)\) is the key transform function, and \(P^t(x,dy)\) is the transition kernel, which completely characterizes the dynamics of a Markov process. Our goal is therefore to compute \(\tilde{P}^t(x,dy)\), which we can do via Bayes’ rule
\[\begin{aligned} \tilde{P}^t(x, dy) &= \mathbb{P}_x( X(t) \in dy | A) \\ &= \frac{ \mathbb{P}_x(A | X(t) \in dy) \mathbb{P}_x( X(t) \in dy ) }{ \mathbb{P}_x(A) } \\ &= \frac{h(y)}{h(x)} P^t(x, dy) \,, \end{aligned}\]where we used the shift invariance of \(A\) to write \(\mathbb{P}_x(A \vert X(t) \in dy) = h(y)\). In other words, the Radon–Nikodym derivative for the transition kernel is simply the ratio \(h(y)/h(x)\)!
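As a quick sanity check (added here; see [Bl10] for the rigorous treatment), note that \(\tilde{P}^t\) is indeed a probability kernel: shift invariance and the tower property give \(\mathbb{E}_x h(X(t)) = \mathbb{E}_x \mathbb{P}(A \vert X(t)) = \mathbb{P}_x(A) = h(x)\), i.e. \(h(X(t))\) is a martingale, hence

\[\int \tilde{P}^t(x, dy) = \frac{1}{h(x)} \int h(y) \, P^t(x, dy) = \frac{\mathbb{E}_x h(X(t))}{h(x)} = 1 \,.\]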
To make calculations even simpler, we will also compute the effect on the infinitesimal generator, namely the operator \(L\) defined as follows:
\[L[f](x) := \lim_{t \to 0} \frac{ \mathbb{E}_x f(X(t)) - f(x) }{t} \,, \quad \text{ if it exists, }\]where we define \(\mathbb{E}_x(\cdot) := \mathbb{E}( \cdot \vert X(0) = x)\).
We want to then compute
\[\begin{aligned} \tilde{L}f &:= \lim_{t \to 0} \frac{ \mathbb{E}_x( f(X(t)) | A ) - f(x) }{t} \\ &= \lim_{t \to 0} \frac{1}{t} \left( \int f(y) \tilde{P}^t(x,dy) - f(x) \right) \\ &= \lim_{t \to 0} \frac{1}{t} \left( \int f(y) \frac{h(y)}{h(x)} P^t(x,dy) - f(x) \right) \\ &= \frac{1}{h(x)} \lim_{t \to 0} \frac{1}{t} \left( \int (f(y) h(y) - f(x) h(x)) P^t(x,dy) \right) \\ &= \frac{1}{h(x)} L[fh](x) \,, \end{aligned}\]where we used the h-transform result above and the definition of the generator.
At this point, we will recall the well known fact that \(h\) is harmonic, i.e. \(Lh = 0\) (which follows since \(h(X(t))\) is a martingale), to save some calculations (I suspect the letter “h” in h-transform stands for “harmonic”). We also recall that for a diffusion process \(dX(t) = \mu(X(t))\,dt + dB(t)\), the generator follows from Itô’s Lemma
\[L[f](x) := \langle \mu(x), \nabla f(x) \rangle + \frac{1}{2} \Delta f(x) \,.\]Using the harmonic property, we have the following clean formula for the transformed generator
\[\tilde{L} [f](x) = \frac{ L[fh](x) }{h(x)} = \langle \mu(x) + \nabla \log h(x), \nabla f(x) \rangle + \frac{1}{2} \Delta f(x) \,,\]which corresponds to the diffusion process
\[d\tilde{X}(t) = ( \mu(\tilde{X}(t)) + \nabla \log h(\tilde{X}(t)) ) \, dt + dB(t) \,.\]To summarize, the h-transform simply adds a drift term to the original process! Here we remark that although the above formula looks simple in terms of \(h(x)\), the function \(h(x)\) itself is often quite complicated, making this calculation convoluted at best, if not completely intractable. This is why it is such a surprise that the proof in the next section comes out so clean.
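To see the added drift in action, here is a rough Euler-Maruyama sketch (my own addition; the step size and path count are arbitrary). For Brownian motion on \((0,c)\) conditioned to hit \(c\) before \(0\), gambler’s ruin gives \(h(x) = x/c\), so \(\nabla \log h(x) = 1/x\) and the conditioned process is \(d\tilde{X}(t) = dt/\tilde{X}(t) + dB(t)\), a Bessel(3)-type process that almost surely never hits \(0\):

```python
import math, random

def conditioned_bm_hits_c(x0=1.0, c=2.0, dt=1e-3, paths=200, seed=1):
    """Euler-Maruyama for dX = dt / X + dB: Brownian motion h-transformed
    by h(x) = x / c, i.e. conditioned to hit c before 0.
    Returns the fraction of simulated paths reaching c before 0."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(paths):
        x = x0
        for _ in range(400_000):
            x += dt / x + math.sqrt(dt) * rng.gauss(0.0, 1.0)
            if x >= c:
                hits += 1
                break
            if x <= 0.0:
                break  # discretization artifact: the true process never hits 0
    return hits / paths
```

Nearly every simulated path reaches \(c\) before \(0\); the rare exceptions are discretization artifacts near the origin.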
Here we will first state the main result.
Theorem Let \(\{ \lambda_i(t) \}_{t\geq 0, i \in [n]}\) be the Dyson Brownian motions, i.e. \(\lambda(t)\) satisfies the following stochastic differential equation (SDE)
\[d\lambda_i(t) = dB_i(t) + \sum_{j \neq i} \frac{dt}{ \lambda_i(t) - \lambda_j(t) } \,,\]where the initial conditions satisfy \(\lambda_1(0) > \lambda_2(0) > \cdots > \lambda_n(0)\), and \(\{B_i(t)\}\) are independent standard Brownian motions. Then we have the following equality in distribution
\[\{ \lambda_i(t) \}_{t \geq 0, i \in [n]} \overset{d}{=} \{ B_i(t) \}_{t \geq 0, i \in [n]} \vert A \,,\]where we define \(A := \{ \{B_i(t)\} \text{ do not intersect } \}\).
Before we start, we note that the event that \(n\) Brownian motions never intersect has probability zero. Then what does it even mean to condition on an event of zero probability? We will consider a collection of events \(\{A_c\}_{c>0}\) converging to the null event \(A\), such that \(\mathbb{P}(A_c) > 0\) for all \(c>0\), and compute the conditioned dynamics of these Brownian motions in the limit \(c\to \infty\).
To define these events \(A_c\), we will define the Vandermonde determinant:
\[\Delta_n( \lambda ) = \prod_{1 \leq i < j \leq n} ( \lambda_i - \lambda_j ) \,.\]Here we observe that since \(\lambda\) is sorted in decreasing order, we have that
\[\Delta_n( \lambda ) > 0 \iff \lambda_i \neq \lambda_j \quad \forall i \neq j \,.\]Therefore, we can define the events \(A_c := \{ \Delta_n( B(t) ) \text{ hits } c \text{ before } 0 \}\). Observe that we can indeed recover the non-intersection event \(A\) in the limit
\[A = \lim_{c \to \infty} A_c \,.\]Recalling the gambler’s ruin example, if \(\Delta_n( B(t) )\) is a martingale, we have a very simple formula for the h-transform
\[h_c( x ) := \mathbb{P}_{x}( A_c ) = \frac{ \Delta_n(x) }{ c } \,.\]Indeed we will first prove this result.
Lemma \(\Delta_n(B(t))\) is a martingale.
proof (of Lemma): We will directly compute the SDE of \(\Delta_n(B(t))\) using Itô’s Lemma
\[d \Delta_n(B(t)) = \frac{1}{2} \Delta \Delta_n(B(t)) \, dt + \cdots dB(t) \,,\]where we hide the diffusion term, since the Itô integral with respect to Brownian motion is again a martingale. Therefore it’s sufficient to show the drift term is zero.
Using the identity
\[\frac{1}{(a-b)(a-c)} + \frac{1}{(b-a)(b-c)} + \frac{1}{(c-a)(c-b)} = 0 \,,\]it’s a simple calculation to show that \(\Delta \Delta_n(x) = 0\) via this symmetry, and hence the desired result follows.
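Since the harmonicity \(\Delta \Delta_n(x) = 0\) is the crux of the lemma, here is a small numerical check (my own addition): central second differences are exact up to roundoff for polynomials of per-variable degree at most three, which covers \(\Delta_n\) for \(n \leq 4\):

```python
def vandermonde(xs):
    """The Vandermonde determinant: product of (x_i - x_j) over i < j."""
    n = len(xs)
    out = 1.0
    for i in range(n):
        for j in range(i + 1, n):
            out *= xs[i] - xs[j]
    return out

def laplacian(f, xs, h=1e-3):
    """Sum of central second differences in each coordinate; exact up to
    roundoff when f has per-variable degree <= 3 in every variable."""
    total = 0.0
    for i in range(len(xs)):
        up = list(xs); up[i] += h
        dn = list(xs); dn[i] -= h
        total += (f(up) - 2.0 * f(xs) + f(dn)) / h ** 2
    return total
```

The computed Laplacian vanishes to roundoff for both \(n=3\) and \(n=4\).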
\[\tag*{$\Box$}\]proof (of Theorem): It remains to compute the h-transform dynamics, in particular the drift term
\[\begin{aligned} \partial_i \log h_c(x) &= \partial_i ( \log \Delta_n(x) - \log c ) \\ &= \frac{1}{\Delta_n(x)} \sum_{j \neq i} \frac{\Delta_n(x)}{ x_i - x_j } \\ &= \sum_{j \neq i} \frac{1}{ x_i - x_j } \,, \end{aligned}\]which implies that conditioned on the event \(A_c\), the h-transformed process satisfies the SDE
\[\begin{aligned} d \lambda_i(t) &= \partial_i \log h_c( \lambda(t) ) \, dt + dB_i(t) \\ &= \sum_{j \neq i} \frac{ dt }{ \lambda_i(t) - \lambda_j(t) } + dB_i(t) \,. \end{aligned}\]Finally, to complete the proof, we observe that the dynamics of \(\lambda(t)\) do not depend on \(c>0\), therefore taking \(c \to \infty\) recovers the (unbounded) dynamics of Dyson Brownian motion.
\[\tag*{$\Box$}\]That’s it! That’s the proof! Having played around with h-transforms before, and getting only ridiculously ugly expressions, it’s quite remarkable to me that this proof was able to avoid messy calculations altogether.
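For the skeptical reader, here is a rough Euler-Maruyama simulation of the Dyson SDE (my own addition; the step size, horizon, and particle count are arbitrary), where the repulsion term keeps the particles strictly ordered:

```python
import math, random

def dyson_ordered_fraction(paths=20, dt=1e-4, steps=5000, seed=2):
    """Euler-Maruyama simulation of 3-particle Dyson Brownian motion.
    Returns the fraction of paths whose strict ordering never breaks."""
    rng = random.Random(seed)
    n, ok = 3, 0
    for _ in range(paths):
        lam = [2.0, 0.0, -2.0]  # well separated initial condition
        ordered = True
        for _ in range(steps):
            drift = [sum(dt / (lam[i] - lam[j]) for j in range(n) if j != i)
                     for i in range(n)]
            for i in range(n):
                lam[i] += drift[i] + math.sqrt(dt) * rng.gauss(0.0, 1.0)
            if any(lam[i] <= lam[i + 1] for i in range(n - 1)):
                ordered = False
                break
        ok += ordered
    return ok / paths
```

Starting from well separated particles, essentially all simulated paths keep the strict ordering \(\lambda_1 > \lambda_2 > \lambda_3\).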
What helped simplify this proof? To quote Bálint Virág: “[t]his is because the Vandermonde [determinant] is harmonic, so this fits into the h-transform language.” Indeed, we saw above that \(\Delta \Delta_n(x) = 0\) played an important role in the calculations. So let this be a rule of thumb for future problems: try to define events using harmonic functions in h-transforms!
For those interested in random matrix theory, Terence Tao wrote some very nice lecture notes on the applications of Dyson Brownian motion, which can also be found in his book [Ta12]. In particular, we can use Dyson Brownian motion to derive the eigenvalue density of a Gaussian unitary ensemble (GUE) matrix, commonly known as the Ginibre formula:
\[\rho(\lambda) = \frac{1}{(2\pi)^{n/2} 1! \cdots (n-1)!} e^{ -|\lambda|^2/2 } |\Delta_n(\lambda)|^2 \,,\]which can then be used to derive the famous Wigner’s semicircle law in the limit \(n \to \infty\).
Let us start with a potential function \(F : \mathbb{R}^d \to \mathbb{R}\) and an inverse temperature parameter \(\beta > 0\), and define the Gibbs density as
\[\nu(x) := \frac{1}{Z} e^{ -\beta F(x) } \,,\]where \(Z = \int e^{-\beta F(x)} dx\) is the normalizing constant.
We say \(\nu\) satisfies the Poincaré inequality with constant \(\kappa > 0\), denoted \(\text{PI}(\kappa)\), if
\[\int f^2 \, d\nu - \left( \int f \, d\nu \right)^2 \leq \frac{1}{\kappa \, \beta} \int | \nabla f |^2 \, d\nu \,,\]for all \(f \in C^1(\mathbb{R}^d) \cap L^2(\nu)\). Note we adopt the convention of [BGL13] which adjusts the right hand side by a factor of \(\beta\), and the two conventions agree when \(\beta = 1\).
\(\text{PI}(\kappa)\) is well known to be equivalent to exponential convergence of Langevin diffusion [BGL13, Theorem 4.2.5], quadratic-linear cost transport inequality [Vil08, Theorem 22.25], and Cheeger’s isoperimetric inequality [LV18, Theorem 11]. Furthermore, \(\text{PI}(\kappa)\) also implies dimension free exponential concentration [Vil08, Theorem 22.32], and serves as a key tool for deriving existence, uniqueness, and smoothness results in partial differential equations [Eva10]. Therefore a tight lower bound for the Poincaré constant is widely desired for a large range of applications.
Firstly, we recall that the (overdamped) Langevin diffusion is defined by the following stochastic differential equation (SDE)
\[dX_t = \underbrace{ - \nabla F(X_t) \, dt }_{ \text{gradient flow} } + \underbrace{ \sqrt{ 2/\beta } \, dW_t }_{ \text{perturbation} }\,,\]where \(\{W_t\}_{ t \geq 0 }\) is a standard \(d\)-dimensional Brownian motion. Observe that when \(\beta\) becomes large, the Brownian motion term becomes very small. Therefore Langevin diffusion can be interpreted as a perturbed gradient flow.
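As an illustration (my own sketch; the step size and run length are arbitrary), the Euler-Maruyama discretization of this SDE for the quadratic potential \(F(x) = x^2/2\) produces samples from the Gibbs density \(\nu = N(0, 1/\beta)\):

```python
import math, random

def langevin_sample_variance(beta=2.0, eta=1e-2, steps=200_000, seed=3):
    """Euler-Maruyama discretization of dX = -F'(X) dt + sqrt(2/beta) dW
    for F(x) = x^2 / 2, whose Gibbs density is N(0, 1/beta).
    Returns the empirical variance of the trajectory after burn-in."""
    rng = random.Random(seed)
    x, acc, acc2, count = 0.0, 0.0, 0.0, 0
    noise = math.sqrt(2.0 * eta / beta)
    for k in range(steps):
        x += -x * eta + noise * rng.gauss(0.0, 1.0)
        if k >= steps // 10:  # discard burn-in
            acc += x
            acc2 += x * x
            count += 1
    mean = acc / count
    return acc2 / count - mean * mean
```

With \(\beta = 2\), the empirical variance settles near \(1/\beta = 0.5\), up to Monte Carlo error and a small discretization bias of order \(\eta\).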
Since the Gibbs density \(\nu\) concentrates on the global minimum of \(F\) as \(\beta \to \infty\), a dimension and temperature free Poincaré constant implies fast convergence of Langevin diffusion to the global minimum when \(\beta\) is large. Therefore it is no surprise that strongly convex potentials admit such a constant (via the Bakry-Émery criterion), while for general non-convex potentials the Poincaré constant can degrade exponentially in \(\beta\) and \(d\).
In other words, strongly convex functions are easy to optimize, general non-convex functions are hard, what else is new? What is new are the cases in between: non-strongly convex functions with a unique minimum. However, even when weakening to \(F\) being only convex, this problem remains open - this is an equivalent formulation of the KLS conjecture [LV18].
Conjecture (Kannan-Lovász-Simonovits, Poincaré version) There exists a universal constant \(\kappa > 0\), such that for every positive integer \(d\) and every convex function \(F: \mathbb{R}^d \to \mathbb{R}\) such that the Gibbs density \(\nu(x) = \frac{1}{Z} e^{-F(x)}\) (note \(\beta = 1\) here) has zero mean and identity covariance matrix, we have that \(\nu\) satisfies \(\text{PI}(\kappa)\).
I should briefly mention that a recent arXiv preprint [Che20] proposed a result equivalent to an almost constant lower bound on the Poincaré constant of order \(d^{-o(1)}\), which in the limit of \(d\to\infty\) converges to \(0\) more slowly than \(d^{-r}\) for every \(r>0\). For the sake of staying on topic, we will leave this extremely interesting subject for a future post.
Furthermore, it is already known that an adaptive perturbation of gradient descent escapes saddle points at a dimension free rate [JGN+17]. This hints at the possibility of establishing a dimension and temperature free Poincaré inequality for even non-convex potential functions! Indeed, we will discuss this next.
The main result of this blog post is actually an intermediate result of [LE20, Proposition 9.11]. While the original proposition is proved for a product manifold of spheres, it can be easily adapted to \(\mathbb{R}^d\) with a containment type condition, see for example [Vil06, Theorem A.1] and [MS14, Assumption 1.4]. Since as of writing this post, there is no complete proof of this adaptation, we will state it as a claim.
Claim (Adapting [LE20, Proposition 9.11]) Suppose \(F:\mathbb{R}^d \to \mathbb{R}\) has a unique local (and therefore global) minimum, and all saddle points are strict, i.e. the minimum eigenvalue satisfies \(\lambda_{\text{min}}( \nabla^2 F ) < - \lambda\) at every saddle point, for some constant \(\lambda>0\). Then under appropriate containment conditions, and choosing \(\beta\) sufficiently large, we have that \(\nu(x) = \frac{1}{Z} e^{-\beta F(x)}\) satisfies \(\text{PI}(\kappa)\) for a constant \(\kappa>0\) independent of \(\beta, d\).
As we discussed earlier, perturbation helps gradient descent escape saddle points, therefore this result is very intuitive. However, deriving a quantitative bound is completely non-trivial. We remark that for non-convex potentials, most approaches to establishing a Poincaré inequality will yield exponentially poor dependence on both \(\beta\) and \(d\). To our best knowledge, only [MS14] has established a bound independent of \(\beta\), and (by our calculations) exponential in \(d\).
Let us start by emphasizing this result does not imply the KLS conjecture. Indeed, the conditions of the conjecture do not require \(F\) to have a unique minimum, nor do they force \(F\) to be strongly convex around the minimum - the latter being a key requirement of the proof technique.
It does, however, imply that the KLS conjecture can be extended beyond convex functions. More precisely, if a potential \(F\) satisfies a dimension and temperature free Poincaré inequality, we can add saddle points to \(F\) without losing this property. Therefore, we can replace convex potentials \(F\) in the statement of the KLS conjecture with modifications of convex potentials \(F\) with strict saddle points. In fact, this naturally leads us to further conjecture that the strictness of saddle points can be relaxed as well, since that would parallel relaxing strong convexity to convexity.
Additionally, notice that we can take \(\beta\) to be as large as we want. This implies that the amount of randomness added to gradient flow does not affect its ability to escape saddle points. In other words, any tiny amount of perturbation will help escape saddle points. Furthermore, we emphasize this implies a discretization of Langevin diffusion, i.e. a perturbed gradient descent, will also escape strict saddle points - this was the main result of [LE20]. This is in sharp contrast with (deterministic) gradient descent taking up to exponential time to escape a saddle point [DJL+17], implying the addition of noise, even arbitrarily small, fundamentally changes the behaviour of gradient descent.
Now to my favourite part of this post, where we actually describe the proof techniques. We will see that despite the lengthy calculations in [LE20], the proof idea is quite straightforward to explain. We start by stating a Lyapunov criterion for the Poincaré inequality.
Theorem [BBCG08, Theorem 1.4 Adapted] Let \(U \subset \mathbb{R}^d\) be such that \(\nu\) restricted to \(U\) satisfies \(\text{PI}(\kappa_U)\). Suppose there exist constants \(\theta > 0, b \geq 0\) and a function \(V \in C^2(\mathbb{R}^d)\) such that \(V \geq 1\) and
\[LV := \langle -\nabla F, \nabla V \rangle + \frac{1}{\beta} \Delta V \leq -\theta \, V + b \, \mathbf{1}_{U} \,.\]Then \(\nu\) satisfies \(\text{PI}(\kappa)\) with constant
\[\kappa = \frac{ \theta }{ 1 + b / \kappa_U} \,.\]Intuitively, we can think of the Lyapunov function \(V\) as an energy measure of the Langevin diffusion \(\{X_t\}_{t \geq 0}\), \(LV\) as the time evolution of \(V\) via Itô’s Lemma, and the Lyapunov condition (inequality) describes the rate of energy dissipation over time. This energy \(V\) will decrease as \(X_t\) gets closer to \(U\), hence \(U\) behaves like an attractor. Once \(X_t\) reaches \(U\), the process begins to “mix” due to the Poincaré inequality on \(U\). In our case, we will choose \(U\) to be a small neighbourhood of the global minimum, and use the strong convexity (Bakry-Émery criterion) to get a Poincaré constant \(\kappa_U\). For those that find this description familiar, indeed this is the diffusion equivalent of the drift and minorization conditions for Markov chain mixing [MT09].
Similar to other Lyapunov function based methods in differential equations, constructing such a function \(V\) is the main difficulty. [MS14] observed that when \(F\) has only strict saddle points, the choice of \(V = \exp\left( \frac{\beta}{2} F \right)\) works very nicely away from saddle points. In fact, we can directly compute the Lyapunov condition
\[\frac{LV}{V} = \frac{1}{2} \Delta F - \frac{\beta}{4} |\nabla F|^2 \,,\]and observe that as long as \(|\nabla F|\) is bounded away from zero, we can choose \(\beta\) to be large, hence forcing \(\frac{LV}{V}\) to be negative. In other words, excluding small neighbourhoods around saddle points, \(\nu\) satisfies a Poincaré inequality.
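The identity above is easy to verify numerically (my own check, on an arbitrary smooth one-dimensional test potential \(F(x) = \sin(x) + x^2/2\)):

```python
import math

def check_lyapunov_identity(x=0.7, beta=4.0, h=1e-4):
    """Check LV / V = (1/2) F'' - (beta/4) (F')^2 for V = exp(beta F / 2)
    and L = -F' d/dx + (1/beta) d^2/dx^2, using central differences on V.
    The smooth test potential F(x) = sin(x) + x^2 / 2 is arbitrary."""
    F = lambda y: math.sin(y) + 0.5 * y * y
    V = lambda y: math.exp(0.5 * beta * F(y))
    Fp = math.cos(x) + x                       # F'(x), exact
    Fpp = -math.sin(x) + 1.0                   # F''(x), exact
    Vp = (V(x + h) - V(x - h)) / (2.0 * h)     # V'(x), numeric
    Vpp = (V(x + h) - 2.0 * V(x) + V(x - h)) / h ** 2
    lhs = (-Fp * Vp + Vpp / beta) / V(x)       # LV / V
    rhs = 0.5 * Fpp - 0.25 * beta * Fp ** 2
    return lhs, rhs
```

Both sides agree to the accuracy of the finite differences.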
The precise version of this result can be found in [LE20, Lemma 9.10]. In particular, we can compute this constant to be dimension and temperature free.
Now that we have narrowed the problem down to constructing a Lyapunov function for the neighbourhoods around saddle points, we observe that replacing the inequality in the Lyapunov condition with an equality gives us a Poisson equation. Consequently, we have a stochastic representation of the solution in terms of the escape time.
Theorem [BdH16, Theorem 7.15 Adapted] [LE20, Corollary 9.3] Let \(B \subset \mathbb{R}^d\), \(\{X_t\}_{t\geq 0}\) be the Langevin diffusion, and \(\tau_{B^c}\) be the first escape time of \(X_t\) from \(B\). Suppose there exists a constant \(\theta>0\) such that
\[V(x) := \mathbb{E} [ \, \exp( \theta \, \tau_{B^c} ) \, | \, X_0 = x \, ] < \infty \,, \quad \forall x \in B \,,\]then \(V\) is the unique solution to the Poisson equation
\[\begin{split} LV &= - \theta \, V \,, \quad & x \in B \,, \\ V &= 1 \,, \quad & x \in \partial B \,. \end{split}\]Readers with a Markov chain background may recognize this escape time based condition to be equivalent to drift and minorization [DMPS18, Theorem 14.1.3]. In fact, this method was inspired by the nice connection drawn between diffusions and Markov chains.
Additionally, readers familiar with concentration inequalities may recognize the theorem’s condition is known as exponential integrability, and it’s one of the equivalent characterizations for sub-exponential random variables. Indeed, we will actually use a slightly easier equivalent form for calculations.
Theorem [Wai19, Theorem 2.13] For a zero mean random variable \(\tau\), the following are equivalent:
At this point, it’s then sufficient to establish an exponentially decaying tail bound for \(\tau_{B^c}\). To this goal, we will make several observations:
To illustrate this point clearly, let us consider the quadratic function \(f(x,y) = x^2 - \frac{\lambda}{2} y^2\) with a saddle point at \((x,y) = (0,0)\). For the Langevin diffusion to escape a neighbourhood of radius \(r>0\), it’s sufficient to ensure the \(y\)-component exceeds \(r\). Therefore, it’s sufficient to restrict \(f\) to only its \(y\)-component, which makes \(y=0\) a local maximum. Hence it suffices to study the Langevin diffusion escaping a one-dimensional local maximum, i.e.
\[dX_t = \lambda X_t \, dt + \sqrt{ 2/\beta } \, dW_t \,,\]where \(-\lambda\) upper bounds the smallest eigenvalue of \(\nabla^2 F\) at saddle points.
We observe that this SDE is the “negative” Ornstein-Uhlenbeck process, and it has a closed form solution
\[X_t = X_0 e^{\lambda t} + \sqrt{2/\beta} \, \int_0^t e^{\lambda(t-s)} dW_s \,,\]which corresponds to \(X_t \sim N( X_0 e^{\lambda t} \,, \frac{1}{\lambda \beta}(e^{2\lambda t} - 1) )\). Finally, plugging in the Gaussian density and a few calculations later, we get the desired result of
\[\mathbb{P} [ \, \tau_{B^c} \geq t \, ] \leq \mathbb{P} [ \, X_t \in B \, ] \leq c e^{ -\lambda t } \,,\]where the constant \(c\) does not depend on \(t\). That is, this escape time tail bound implies that \(V(x)\) is a valid Lyapunov function, and hence implies a Poincaré inequality.
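We can see the claimed decay directly from the Gaussian law of \(X_t\) (my own computation; the parameters are arbitrary): the standard deviation grows like \(e^{\lambda t}\), so \(\mathbb{P}[X_t \in (-r,r)]\) decays like \(e^{-\lambda t}\) and consecutive ratios approach \(e^{-\lambda}\):

```python
import math

def stay_prob(t, lam=1.0, beta=1.0, r=1.0):
    """P(X_t in (-r, r)) for the 'negative' OU process started at X_0 = 0,
    using the closed form X_t ~ N(0, (e^{2 lam t} - 1) / (lam beta))."""
    sigma = math.sqrt((math.exp(2.0 * lam * t) - 1.0) / (lam * beta))
    return math.erf(r / (sigma * math.sqrt(2.0)))
```

For example, the ratio `stay_prob(7.0) / stay_prob(6.0)` is already essentially \(e^{-\lambda} = e^{-1}\).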
Quite a few technical details were swept under the rug to simplify the proof sketch, as the reader might expect. Probably the most significant is the approximation of \(F\) by a quadratic function - it is actually not straightforward to connect an approximation bound to an escape time bound.
At the same time, the requirement of \(\beta\) to be sufficiently large is quite unsatisfying. Intuitively, why would adding noise hurt the mixing of a Markov process? It feels to me that this condition is merely a technical constraint, and a more careful analysis could sharpen or remove this condition. Hopefully the readers will have more thoughts and ideas than I do.
Thanks for reading up to this point, and I wish everyone a happy new year!
For the sake of writing a self-contained blog post, we will not attempt to provide a description of spin glass models. Instead, we will state the problem in the most mathematically interesting form, without explaining where the quantities came from.
Let \(\xi:[0,1]\to \mathbb{R}\) be twice differentiable, strictly increasing, and strictly convex (i.e. \(\xi', \xi'' > 0\)), and let \(\zeta:[0,1] \to [0,1]\) be a cumulative distribution function (CDF). We will consider the Parisi partial differential equation (PDE) defined as follows
\[\begin{cases} \partial_t \Phi = \frac{ - \xi''(t) }{2} \left[ \partial_{xx} \Phi + \zeta(t) \left( \partial_x \Phi \right)^2 \right], \\ \Phi(1,x) = \log \cosh(x) \,, \end{cases}\]where the time derivative is defined by the right limit at points where \(\zeta(t)\) is discontinuous.
It is well known that we can solve this PDE backwards in time using a Hopf-Cole transformation; in fact, we will provide a sketch in a later section. This allows us to state an optimization objective as follows:
\[\inf_{\zeta} \Phi_\zeta(0,x),\]where we are minimizing over the set of all CDFs on \([0,1]\) for each \(x \in \mathbb{R}\). Finally we can state the question as follows:
Question Does there exist a unique minimizer to the optimization problem \(\inf_{\zeta} \Phi_\zeta(0,x)\) for each \(x\in\mathbb{R}\)?
The main difficulty comes from the unclear dependence on \(\zeta\), even if we can write down a closed form solution to the Parisi PDE. At the very least, the closed form would be extremely unpleasant and tedious to work with. Additionally, we remark that the problem is already stated in a simplified form, as opposed to the original framing in spin glass models.
Before we jump into the main results, we observe that existence of a minimizer is straightforward to prove. Since we are restricted to the domain \([0,1]\), any sequence of probability measures is tight. It is then sufficient to consider any sequence of probability measures \(\{\zeta_n\}\) that minimizes \(\Phi_\zeta(0,x)\); tightness implies there exists a subsequence \(\zeta_{n_k} \to \zeta^*\) converging weakly, and \(\zeta^*\) is a minimizer of \(\Phi(0,x)\).
To complete the proof, it is sufficient to show \(\Phi(0,x)\) is strictly convex in \(\zeta\). In this section, we will use a stochastic representation to show convexity, which is the main difficulty of the problem. Readers unfamiliar with stochastic analysis can find a brief introduction in a previous blog post, in particular we will use Itô’s Lemma in the upcoming proofs.
We start by defining \(B_t := W_{\xi'(t)}\), where \(\{W_t\}\) is a standard Brownian motion. Let \(\{\mathcal{F}_t\}_{t\geq 0}\) be \(\{W_t\}\)’s canonical filtration, and then we define a collection of processes
\[\mathcal{D} := \left\{ (u_t)_{0 \leq t \leq 1} : u_t \text{ is adapted to } \mathcal{F}_t, |u_t| \leq 1 \right\}.\]For simplicity of notation, we will write \(\sigma(t) = \sqrt{\xi''(t)}\) for this section. At this point we will state the main result.
Theorem (Auffinger-Chen Representation) For all \(\zeta\) a probability distribution on \([0,1]\), we have the following
\[\begin{split} \Phi(0,x) = \max_{u \in \mathcal{D}} \bigg[ \mathbb{E} & \Phi\left(1, x + \int_0^1 \sigma^2(s) \, \zeta(s) \, u_s \, ds + \int_0^1 \sigma(s) \, dW_s \right) \\ &- \frac{1}{2} \int_0^1 \sigma^2(s) \, \zeta(s) \, \mathbb{E} u_s^2 \, ds \bigg]. \end{split}\]In particular, the maximizer is unique, and is given by \(u_s = \partial_x \Phi(s, x + X_s)\), where \(X_s\) is the strong solution of the following stochastic differential equation (SDE)
\[dX_s = \sigma^2(s) \, \zeta(s) \, \partial_x \Phi(s, x + X_s) \, ds + \sigma(s) \, dW_s, \quad X_0 = 0.\]Remark Before we begin the proof, we will observe that \(\Phi(0,x)\)’s convexity follows directly from this representation. Firstly, both integral terms containing \(\zeta\) are linear in \(\zeta\). Since \(\Phi(1,x) = \log \cosh (x)\) is convex in \(x\), the \(\Phi\) term is convex in \(\zeta\). Next, the expectation of a sum of convex functions remains convex. Finally, a maximum (or supremum) over convex functions remains convex, proving the desired convexity result!
Before we start, we will state several technical (but not difficult to prove) Lemmas. To guarantee a strong solution of the SDE, it is sufficient to have \(\partial_x \Phi(s, x)\) be Lipschitz in \(x\). We will omit the proof of these results as they are not important to the main goal of this blog post. Instead we will state the following Lemma containing the desired estimates.
Lemma (Derivative Estimates) For all \(\zeta\) probability distributions on \([0,1]\), we have that
\[|\partial_x \Phi(t, x)| \leq 1, |\partial_{xx} \Phi(t,x)| \leq 1.\]Another important result we will omit is the continuity of \(\Phi\) in \(\zeta\).
Lemma (Lipschitz in \(L^1\)) For any discrete distributions \(\zeta_1, \zeta_2\), and for all \(k \in \mathbb{N}\), we have that
\[\begin{split} \left| \Phi_{\zeta_1} - \Phi_{\zeta_2} \right| &\leq \xi''(1) \int_0^1 |\zeta_1(t) - \zeta_2(t)| dt, \\ \left| \partial_x^k \Phi_{\zeta_1}(t,x) - \partial_x^k \Phi_{\zeta_2}(t,x) \right| &\leq c_k \, \xi''(1) \int_0^1 |\zeta_1(t) - \zeta_2(t)| dt. \end{split}\]Since we can approximate any distribution in \(L^1\) by discrete distributions, we can extend the definition of \(\mathcal{P}(\cdot)\) and \(\Phi(t,x)\) to all distributions by continuity. Therefore it is sufficient to prove the result for only finitely supported distributions.
proof (of the Auffinger-Chen representation): The proof will be a straightforward application of Itô’s Lemma, and the results follow almost directly from invoking the Parisi PDE.
We start with discrete \(\zeta\), i.e. \(\zeta\) is a piecewise constant function. Let \(u \in \mathcal{D}\), and define
\[dX_s := \sigma^2(s) \, \zeta(s) \, u_s \, ds + \sigma(s) \, dW_s, \quad X_0 = 0,\]and let \(Y_s := \Phi(s, x + X_s)\). Then we observe that
\[X_1 = \int_0^1 \sigma^2(s) \, \zeta(s) \, u_s \, ds + \int_0^1 \sigma(s) \, dW_s\]appears exactly inside the first \(\Phi\) term of the Auffinger-Chen representation.
At this point we adopt concise notation and write \(\Phi := \Phi(s, x + X_s)\), and apply Itô’s Lemma to \(Y_s\) to get
\[dY_s = \left[ \partial_s \Phi + \sigma^2(s) \, \zeta(s) \, u_s \, \partial_x \Phi + \frac{1}{2} \sigma^2(s) \, \partial_{xx} \Phi \right] ds + \sigma(s) \partial_x \Phi \, dW_s.\]Here we note that while the time derivative \(\partial_s \Phi\) does not exist at finitely many points, we will eventually only use it in integral form. Using the Parisi PDE at points of continuity, we can make the following substitution
\[\partial_s \Phi + \frac{1}{2} \sigma^2(s) \partial_{xx} \Phi = - \frac{1}{2} \sigma^2(s) \, \zeta(s) \, (\partial_x \Phi)^2.\]We will make the substitution and complete the square to get
\[\begin{split} dY_s &= \left[ \sigma^2(s) \, \zeta(s) \, u_s \, \partial_x \Phi - \frac{1}{2} \sigma^2(s) \, \zeta(s) \, (\partial_x \Phi)^2 \right] ds + \sigma(s) \, \partial_x \Phi \, dW_s \\ &= \left[ \frac{1}{2} \sigma^2(s) \, \zeta(s) \, u_s^2 - \frac{1}{2} \sigma^2(s) \, \zeta(s) \, (u_s - \partial_x \Phi )^2 \right] ds + \sigma(s) \, \partial_x \Phi \, dW_s. \end{split}\]Next we write this equation as an integral over \([0,1]\), and taking expectation to remove the martingale term we get
\[\begin{split} \mathbb{E} \Phi(1, x + X_1) - \Phi(0,x) =& \int_0^1 \frac{1}{2} \sigma^2(s) \, \zeta(s) \, \mathbb{E} u_s^2 \, ds \\ &- \int_0^1 \frac{1}{2} \sigma^2(s) \, \zeta(s) \, \mathbb{E} (u_s - \partial_x \Phi)^2 ds. \end{split}\]Since \(\Phi, \partial_x \Phi\) are continuous in \(\zeta\), we can extend this equation to all \(\zeta\). Furthermore, since the second integral is always nonnegative, we must have the inequality
\[\Phi(0,x) \geq \mathbb{E} \Phi(1, x + X_1) - \int_0^1 \frac{1}{2} \sigma^2(s) \, \zeta(s) \, \mathbb{E} u_s^2 \, ds,\]and the inequality must be strict unless \(u_s = \partial_x \Phi\) almost surely.
Observe that this proves the inequality direction of the representation. Since \(|\partial_x \Phi| \leq 1\), we have \(u_s = \partial_x \Phi \in \mathcal{D}\), hence achieving equality in the representation.
\[\tag*{$\Box$}\]At this point, the author believes the goal of the blog post is already achieved: we have demonstrated the key technique with only very basic manipulations. That being said, to complete the story, we will provide a short sketch on how to prove strict convexity - hence proving there is a unique minimizer of \(\Phi_\zeta(0,x)\).
We once again start with a key technical lemma.
Lemma (Strict Convexity in \(x\)) For all \(\zeta\) a probability distribution on \([0,1]\), and for all \(s \in [0,1]\), we have
\[\partial_{xx} \Phi(s,x) > 0.\]Here we remind the reader that strict convexity in \(x\) does not directly imply strict convexity in \(\zeta\). We could just take this result for granted, but there is a nice proof using the Hopf-Cole transform and another stochastic representation, so why not?
sketch (of Lemma): Since \(\Phi(t,x)\) is continuous in \(\zeta\), we will only consider a discrete \(\zeta\). Then using an appropriate time change and time reversal, we can get a new PDE
\[\partial_t \Phi = \frac{1}{2 \widehat \zeta(t)} \partial_{xx} \Phi + \frac{1}{2} (\partial_x \Phi)^2,\]with initial conditions (as opposed to terminal conditions) \(\Phi(0,x)=\log \cosh(x)\), and \(\widehat \zeta(t) = \zeta(1 - t)\) changed due to time reversal. To simplify the PDE, we use the Hopf-Cole transformation to substitute \(\phi = \exp\left( \widehat\zeta(t) \, \Phi \right)\), which leads to the simplified linear PDE
\[\partial_t \phi = \frac{1}{2 \widehat \zeta(t)} \partial_{xx} \phi,\]with initial condition \(\phi(0,x) = \exp\left( \widehat \zeta(t) \log \cosh(x) \right) = \cosh(x)^{\widehat \zeta(t)}\). Using another time change, we can also remove the \(\widehat \zeta(t)\) above.
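As a sanity check of the linearization (my own addition, in the special case \(\widehat\zeta \equiv 1\) so that the exponent drops out), the heat semigroup applied to \(\cosh\) has the closed form \(\mathbb{E} \cosh(x + W_t) = \cosh(x) e^{t/2}\), which Monte Carlo confirms:

```python
import math, random

def feynman_kac_cosh(t=0.5, x=0.3, samples=200_000, seed=4):
    """Monte Carlo estimate of phi(t, x) = E[cosh(x + W_t)], the solution
    of the heat equation started at cosh, versus the closed form
    cosh(x) * exp(t / 2). Returns (estimate, closed form)."""
    rng = random.Random(seed)
    s = math.sqrt(t)  # standard deviation of W_t
    acc = sum(math.cosh(x + s * rng.gauss(0.0, 1.0)) for _ in range(samples))
    return acc / samples, math.cosh(x) * math.exp(t / 2.0)
```

The two values agree up to Monte Carlo error.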
Here we can use any of the reader’s favourite methods (the Feynman-Kac representation, the Kolmogorov backward equation, or the heat kernel) to write
\[\phi(t,x) = \mathbb{E} \cosh( x + W_t )^{\widehat \zeta(t)},\]where \(W_t\) is a standard Brownian motion, and \(\widehat \zeta\) is constant in \([0,t)\). At this point it is sufficient to show strict convexity for this \(t\), since we can piece together the constant intervals later. To this end, we will write
\[\Phi(t,x) = \frac{1}{\widehat \zeta(t)} \log \mathbb{E} \cosh(x + W_t)^{\widehat \zeta(t)},\]and define
\[\langle f(W_t) \rangle := \frac{ \mathbb{E} f(W_t) \cosh(x + W_t)^{\widehat \zeta(t)} }{ \mathbb{E} \cosh(x + W_t)^{\widehat \zeta(t)} } \, ,\]where we observe since \(\cosh(x) > 0\) and \(\mathbb{E}\cosh(x + W_t)^{\widehat \zeta(t)} < \infty\), we have that \(\langle \cdot \rangle\) defines a new probability measure. In particular, Jensen’s inequality holds under \(\langle \cdot \rangle\).
With this we can take the second derivative of \(\Phi\) to get
\[\begin{split} \partial_{xx} \Phi(t,x) &= - \widehat \zeta(t) \left\langle \tanh(x + W_t) \right\rangle^2 + \left\langle \widehat \zeta(t) \tanh(x + W_t)^2 + 1 - \tanh(x + W_t)^2 \right\rangle \\ &\geq \left\langle 1 - \tanh(x + W_t)^2 \right\rangle \\ &> 0 \, , \end{split}\]where we used Jensen’s inequality \(\langle \tanh \rangle^2 \leq \langle \tanh^2 \rangle\) and the fact that \(\tanh(x)^2 < 1\).
Finally we return to strict convexity in \(\zeta\).
sketch (of Strict Convexity in \(\zeta\)): We will start by introducing quantities related to convexity. Let \(\zeta_1 \neq \zeta_2\), and let \(\zeta = \lambda \zeta_1 + (1-\lambda) \zeta_2\) for some \(\lambda \in (0,1)\). Recall \(\Phi(1,x) = \log \cosh (x)\), and use the optimal control \(u_s = \partial_x \Phi_\zeta(s, x + X_s)\), where \(X_s\) is defined with respect to \(\zeta\). Note this \(u_s\) is not necessarily optimal for \(\zeta_1, \zeta_2\).
Since \(\log\cosh(x)\) is convex, we can write
\[\Phi_\zeta(0,x) \leq \lambda A_1 + (1-\lambda) A_2,\]where each \(A_i\) is defined as
\[\begin{split} A_i := \mathbb{E}& \, \log \cosh \left( x + \int_0^1 \sigma^2(s) \, \zeta_i(s) \, u_s \, ds + \int_0^1 \sigma(s) dW_s \right) \\ & - \frac{1}{2} \int_0^1 \sigma^2(s) \, \zeta_i(s) \, \mathbb{E} u_s^2 \, ds. \end{split}\]Since \(\log \cosh(x)\) is strictly convex, the inequality is strict unless
\[\int_0^1 \sigma^2(s) \, \zeta_1(s) \, u_s \, ds = \int_0^1 \sigma^2(s) \, \zeta_2(s) \, u_s \, ds,\]almost surely. Using the Auffinger-Chen representation, we have that \(A_i \leq \Phi_{\zeta_i}(0,x)\). Therefore, to prove that the convexity is strict, it is sufficient to prove a gap in the first inequality, which is equivalent to showing that
\[Z := \int_0^1 \sigma^2(s) \, (\zeta_1(s) - \zeta_2(s)) \, u_s \, ds\]has positive variance. The variance can be computed as
\[\text{Var}(Z) = \int_0^1 \int_0^1 \varphi(s) \, \varphi(t) \, \text{Cov}(u_s, u_t) \, ds dt,\]where \(\varphi(s) = \sigma^2(s) \, (\zeta_1(s) - \zeta_2(s))\).
While we omit the technical details, it’s not hard to believe that \(u_s = \partial_x \Phi(s, x + X_s)\) satisfies the following SDE (obtained from Itô’s Lemma and differentiating the Parisi PDE)
\[du_s = \sigma(s) \partial_{xx} \Phi(s, x + X_s) dW_s.\]Observing that \(u_s\) is a martingale, we can compute \(\text{Cov}(u_s, u_t)\) as
\[\text{Cov}(u_s, u_t) = \text{Var}(u_{s \wedge t}) = \int_0^{s \wedge t} \sigma^2(v) \mathbb{E} (\partial_{xx} \Phi(v, x + X_v))^2 dv,\]where the last step followed from Itô’s Isometry. Defining \(\tau(s) := \text{Var}(u_s)\), we can also write \(\text{Cov}(u_s, u_t) = \tau(s) \wedge \tau(t)\). With a bit of algebra we can derive
\[\text{Var}(Z) = \int_0^1 \left( \int_v^1 \varphi(s) ds \right)^2 \tau'(v) dv.\]Since \(\tau'(v) = \sigma^2(v) \mathbb{E} (\partial_{xx} \Phi(v, x + X_v))^2\), the desired result follows from the fact \(\partial_{xx} \Phi > 0\).
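The “bit of algebra” is the identity \(\tau(s \wedge t) = \int_0^1 \mathbb{1}\{v \leq s\} \mathbb{1}\{v \leq t\} \, \tau'(v) \, dv\) (using \(\tau(0) = 0\) and that \(\tau\) is increasing), followed by Fubini. A numerical check of the resulting identity, with toy choices of \(\varphi\) and \(\tau\) that are purely illustrative:

```python
import numpy as np

# toy choices, purely for illustration
phi = lambda s: np.sin(3 * s) + 0.5
tau = lambda s: s**2           # increasing, tau(0) = 0
dtau = lambda s: 2 * s         # tau'(s)

n = 2000
s = (np.arange(n) + 0.5) / n   # midpoint grid on [0, 1]
ds = 1.0 / n

# left-hand side: double integral of phi(s) phi(t) tau(s ∧ t)
S, T = np.meshgrid(s, s)
lhs = np.sum(phi(S) * phi(T) * tau(np.minimum(S, T))) * ds * ds

# right-hand side: ∫ (∫_v^1 phi(u) du)^2 tau'(v) dv
tail = np.cumsum((phi(s) * ds)[::-1])[::-1]   # tail[i] ≈ ∫_{s_i}^1 phi
rhs = np.sum(tail**2 * dtau(s)) * ds

print(lhs, rhs)  # should agree up to discretization error
```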
Recall the original problem of \(\inf_\zeta \Phi_\zeta(0,x)\). We have shown that, although the dependence of \(\Phi_\zeta\) on \(\zeta\) is far from explicit, we can prove its strict convexity quite easily using a stochastic representation. The author would like to point out that most techniques used here are quite basic, which is surprising for an originally very difficult problem.
The author would also like to point to a more general variational stochastic representation by Boué and Dupuis (1998), which is perhaps more useful for other applications.
Finally, this post would not have been possible without an excellent graduate course on spin glasses taught by Dmitry Panchenko, who has done a much better job explaining this topic. In particular, Dmitry has written an excellent book (Panchenko, 2013) with a bonus chapter covering this topic that can be found online. I would also highly recommend Dmitry’s notes on probability theory, which have been very helpful to the author’s studies and research.
With this motivation in mind, it was quite pleasant to discover a set of excellent lecture notes by Jason Miller (2016), which contain an alternative proof built on the Stone-Weierstrass Theorem. We shall see that not only is this proof more interpretable, but the technique also generalizes beyond stochastic calculus. In particular, this blog post intends to illustrate the technique in detail through Itô’s Lemma.
We will introduce (without too much rigour) some basic definitions and results to support the proofs in later sections. The reader need not carefully analyze the technical details here to understand the proofs to come. Readers familiar with stochastic calculus may skip to the next section.
First we let \((\Omega, \mathcal{F}, \{\mathcal{F}_t\}_{t\geq 0}, \mathbb{P})\) be a probability space equipped with a filtration (also satisfying the usual conditions to be rigorous). With this we can define several useful objects.
Definition A stochastic process \(X := \{X_t\}_{t\geq 0}\) is said to be a martingale if
(i) \(\forall t \geq 0\), we have \(\mathbb{E}|X_t| < \infty\) and \(X_t\) is measurable with respect to \(\mathcal{F}_t\), denoted \(X_t \in \mathcal{F}_t\);
(ii) \(\forall 0 \leq s \leq t\), we have \(\mathbb{E}[ X_t | \mathcal{F}_s ] = X_s\) a.s.
Definition We say a random variable \(\tau:\Omega \to [0,\infty]\) is a stopping time if \(\forall t \geq 0, \{\tau \leq t \} \in \mathcal{F}_t\).
An important property of stopping times is that if \(X_t\) is a martingale and \(\tau\) a stopping time, then \(X_{t \wedge \tau}\) is also a martingale.
Definition Let the interval \([0,T]\) be partitioned using increments of \(2^{-n}\), i.e. \(\{t_k^n\}_{k=0}^{\lceil T 2^n \rceil}\), where \(t_k^n = k 2^{-n} \wedge T\). Let \(X_t\) be a continuous martingale, and \(f_t\) be a continuous (possibly stochastic) process. We define the Itô integral as
\[ \int_0^T f_t \, dX_t := \lim_{n\to\infty} \sum_{k=0}^{\lfloor T 2^n \rfloor} f_{t_k^n} (X_{t_{k+1}^n} - X_{t_k^n}), \]
if the limit converges u.c.p. (uniformly on compact intervals in probability to be precise).
Remark Observe the above definition uses a left Riemann sum to define the integral, whereas other choices will lead to different integrals. This is in contrast to deterministic integrals, where all choices are equivalent.
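This distinction is easy to see numerically. For \(\int_0^T W_t\,dW_t\), the left Riemann sum approaches the Itô answer \((W_T^2 - T)/2\), while an averaged (Stratonovich-type) sum telescopes to \(W_T^2/2\). A minimal Python sketch, with arbitrary simulation parameters:

```python
import numpy as np

rng = np.random.default_rng(1)
T, n = 1.0, 200_000
dW = rng.normal(0.0, np.sqrt(T / n), size=n)     # Brownian increments
W = np.concatenate(([0.0], np.cumsum(dW)))       # the sampled path

ito = np.sum(W[:-1] * dW)                        # left endpoint rule
strat = np.sum(0.5 * (W[:-1] + W[1:]) * dW)      # averaged (Stratonovich-type) rule

print(ito, (W[-1]**2 - T) / 2)    # Itô:          (W_T^2 - T)/2
print(strat, W[-1]**2 / 2)        # Stratonovich:  W_T^2 / 2
```

The two sums differ by \(\frac{1}{2}\sum (\Delta W)^2 \approx T/2\), which is exactly the quadratic variation correction.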
Definition Consider the same partition \(\{t_k^n\}\) as above. Let \(M,N\) be two continuous martingales, we define the quadratic covariation as
\[ [M,N]_T := \lim_{n\to\infty} [M,N]^n_T := \lim_{n\to\infty} \sum_{k=0}^{\lfloor T 2^n \rfloor} (M_{t_{k+1}^n} - M_{t_k^n}) (N_{t_{k+1}^n} - N_{t_k^n}), \]
where the limit is also u.c.p. We also define the quadratic variation as \([M]_T := [M,M]_T\).
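As a quick numerical illustration (with arbitrary simulation parameters): the quadratic variation of a Brownian path over \([0,T]\) is approximately \(T\), while the covariation between a smooth finite-variation process and the same path is approximately zero, previewing the finite-variation proposition stated below.

```python
import numpy as np

rng = np.random.default_rng(2)
T, n = 1.0, 100_000
dW = rng.normal(0.0, np.sqrt(T / n), size=n)
W = np.concatenate(([0.0], np.cumsum(dW)))
t = np.linspace(0.0, T, n + 1)
A = t**2                                   # a smooth finite-variation process

qv_W = np.sum(np.diff(W) ** 2)             # [W]_T, should be close to T
cov_AW = np.sum(np.diff(A) * np.diff(W))   # [A, W]_T, should be close to 0

print(qv_W, cov_AW)
```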
Several useful results are stated next.
Proposition (Finite Variation) Let \(X,Y\) be continuous stochastic processes such that \(X\) has finite variation, i.e.
\[\lim_{n\to\infty} \sum_{k=0}^{\lfloor T 2^n \rfloor} | X_{t_{k+1}^n} - X_{t_k^n} | < \infty,\]and \(Y\) has finite quadratic variation, i.e. \([Y]_T < \infty\) a.s. Then we have
\[[X,Y]_t = 0 \;\text{a.s.}\]Proposition (Itô’s Product Rule) Let \(X,Y\) be continuous martingales, then we have
\[X_t Y_t - X_0 Y_0 = \int_0^t X_s dY_s + \int_0^t Y_s dX_s + [X,Y]_t \,.\]Proposition (Fundamental Theorem) Let \(X,Y,Z\) be continuous martingales, then we have
\[\int_0^t X_s d\left( \int_0^s Y_u dZ_u \right) = \int_0^t X_s Y_s dZ_s.\]Proposition (Kunita-Watanabe Identity) Let \(X,Y,Z\) be continuous martingales, then we have
\[\left[ \int_0^\cdot X_s dY_s, Z \right]_t = \int_0^t X_s d[Y,Z]_s,\]where both uses of \([\;,\;]\) denote the covariation.
Proposition (Itô’s Isometry)
Let \(M\) be a continuous martingale, and \(H\) be a continuous stochastic process. Then we have
\[\mathbb{E} \left[ \left( \int_0^t H_s dM_s \right)^2 \right] = \mathbb{E} \int_0^t H_s^2 d[M]_s.\]For the purposes of this blog post, we will only state and prove a much simpler version of the lemma, but it is not difficult to adapt the argument to more general conditions.
Theorem (Itô’s Lemma) Let \(X_t\) be a continuous martingale, and \(f \in C^2(\mathbb{R})\). Then we have
\[ f(X_t) = f(X_0) + \int_0^t \frac{\partial f}{\partial x}(X_s) dX_s + \frac{1}{2} \int_0^t \frac{\partial^2 f}{\partial x^2} (X_s) d[X]_s. \]
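Before turning to the proof, a quick numerical illustration of the statement may help: take \(X = W\) a standard Brownian motion (so \(d[X]_s = ds\)) and the illustrative choice \(f(x) = x^3\), and discretize both integrals with left Riemann sums:

```python
import numpy as np

rng = np.random.default_rng(3)
T, n = 1.0, 200_000
dW = rng.normal(0.0, np.sqrt(T / n), size=n)
W = np.concatenate(([0.0], np.cumsum(dW)))
ds = T / n

f = lambda x: x**3
fx = lambda x: 3 * x**2     # f'
fxx = lambda x: 6 * x       # f''

lhs = f(W[-1]) - f(W[0])
# Itô's formula: ∫ f'(W) dW  +  (1/2) ∫ f''(W) d[W],  with d[W]_s = ds
rhs = np.sum(fx(W[:-1]) * dW) + 0.5 * np.sum(fxx(W[:-1]) * ds)

print(lhs, rhs)  # should agree up to discretization error
```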
Here we will sketch the proof from Karatzas and Shreve (1991).
proof sketch: We start by defining a stopping time \(\tau_r := \inf \{t \geq 0 : |X_t| + [X]_t > r\}\), and replace \(X_t\) with \(X_{t \wedge \tau_r}\). This localization technique allows us to consider \(f\) only on the interval \(B_r := [-r, r]\) (or a ball in higher dimensions), on which \(f\) has bounded derivatives.
By inspecting the lemma’s statement, the reader may notice the formula resembles a second order Taylor expansion of \(f(X_t)\). Indeed we can write
\[\begin{align*} f(X_t) - f(X_0) =& \lim_{n\to\infty} \sum_{k=0}^{\lfloor t 2^n \rfloor} f(X_{t_{k+1}^n}) - f(X_{t_{k}^n}) \\ =& \lim_{n\to\infty} \sum_{k=0}^{\lfloor t 2^n \rfloor} \Big\{ \frac{\partial f}{\partial x}(X_{t_{k}^n}) [X_{t_{k+1}^n} - X_{t_{k}^n}] \\ &+ \frac{1}{2} \frac{\partial^2 f}{\partial x^2} (\eta_k^n) [X_{t_{k+1}^n} - X_{t_{k}^n}]^2 \Big\}, \end{align*}\]where \(\eta_k^n \in [X_{t_{k}^n}, X_{t_{k+1}^n}]\) is chosen as part of Taylor’s theorem to satisfy the above equality. It’s not difficult to see the first sum converges to the first stochastic integral, then it remains to show the second term converges.
To this goal, we will define
\[\begin{align*} J_1^n &:= \sum_{k=0}^{\lfloor t 2^n \rfloor} \frac{\partial^2 f}{\partial x^2} (\eta_k^n) [X_{t_{k+1}^n} - X_{t_{k}^n}]^2, \\ J_2^n &:= \sum_{k=0}^{\lfloor t 2^n \rfloor} \frac{\partial^2 f}{\partial x^2} (X_{t_{k}^n}) [X_{t_{k+1}^n} - X_{t_{k}^n}]^2, \\ J_3^n &:= \sum_{k=0}^{\lfloor t 2^n \rfloor} \frac{\partial^2 f}{\partial x^2} (X_{t_{k}^n}) \{ [X]_{t_{k+1}^n} - [X]_{t_{k}^n} \}, \end{align*}\]where we observe \(J_3^n\) converges to the desired integral. Next we will use the following technical inequality: let \(X\) be a martingale with \(|X_s| \leq K < \infty\) for all \(s \leq T\); then we have
\[\mathbb{E} ([X]^n_T)^2 \leq 6 K^4.\]Without stating the details, using this and the Cauchy-Schwarz inequality, we can show
\[\lim_{n\to\infty} |J_1^n - J_2^n| = 0 \; \text{a.s.}\]To complete the proof, we will need one more technical lemma: let \(|X_s| \leq K < \infty\) for all \(s \leq T\); then we have
\[\lim_{n\to\infty} \mathbb{E} \sum_{k=0}^{\lfloor t 2^n \rfloor} [ X_{t_{k+1}^n} - X_{t_k^n} ]^4 = 0.\]Then once again omitting the details, we can get
\[\mathbb{E} |J_2^n - J_3^n| \leq 2 \sup_{x \in B_r} \left| \frac{\partial^2 f}{\partial x^2}(x) \right|^2 \mathbb{E} \left[ \sum_{k=0}^{\lfloor t 2^n \rfloor} [ X_{t_{k+1}^n} - X_{t_k^n} ]^4 + [X]_t \max_{k} ( [X]_{t_{k+1}^n} - [X]_{t_{k}^n} ) \right],\]which, combined with the previous lemma and the bounded convergence theorem, gives the desired result
\[\lim_{n\to\infty} |J_2^n - J_3^n| = 0 \; \text{a.s.}\]Putting everything together gives us the desired formula as stated.
Remark The use of the propositions listed in the previous section is implicit in the two technical lemmas stated above, which is also where most of the difficulty of the proof is hidden.
Interpretation This proof naturally leads to interpreting Itô’s Lemma as a consequence of Taylor expansion. However, the proof provides no clear intuition on why second order is the correct order of approximation, and pushes the justification into complicated technical details. Perhaps the most troubling consequence is that a different integration scheme (e.g. Stratonovich, which arises from a mid-point Riemann sum) leads to a different change of variables formula, so the Taylor expansion intuition can lead to further confusion.
At this point, we will first take a step back from Itô’s Lemma and look at a rough sketch of the proof technique.
Suppose we want to prove that a collection of functions (e.g. \(C^2([a,b])\)) satisfies a certain property \((P)\). We will start by defining \(\mathcal{A}\) as the subset of \(C^2([a,b])\) that satisfies the desired property \((P)\).
(Step 1) We will identify an algebraic structure under which \(\mathcal{A}\) is closed, e.g. for an algebra (over a field) we have that if \(f,g \in \mathcal{A}\), then \(cf + g, fg \in \mathcal{A}\). In other words, an algebra is a vector space with an associative vector multiplication.
(Step 2) Then we check that \(\mathcal{A}\) contains some very simple generating functions, e.g. in an algebra, the functions \(\{1, x\}\) generate the entire collection of polynomials.
(Step 3) At this point, we use a density argument such as Weierstrass approximation to show \(\mathcal{A}\) is dense in \(C^2([a,b])\). Specifically, \(\forall f \in C^2([a,b])\), \(\exists \{f_n\}_{n \geq 1} \subset \mathcal{A}\) such that \(f_n \to f\) with respect to some metric \(\rho\).
(Step 4) Finally, it is sufficient to show \(\mathcal{A}\) is closed under this metric \(\rho\). I.e. if \(\{f_n\}_{n \geq 1} \subset \mathcal{A}\) and \(f_n \to f\) in \(\rho\), then \(f\) also satisfies \((P)\), hence \(f \in \mathcal{A}\).
Remark The reader may already recognize that the sketch above was intentionally phrased in a very general sense, so we can observe the flexibility of the technique. In fact we can even generalize beyond function spaces, as long as we have an equivalent approximation technique.
We start by stating the key theorem.
Theorem (Stone-Weierstrass, Real Numbers) Let \(S\) be a compact Hausdorff space, and \(\mathcal{A} \subset C(S, \mathbb{R})\) an algebra which contains a non-zero constant function. Then \(\mathcal{A}\) is dense in \(C(S, \mathbb{R})\) if and only if it separates points.
Clearly, if we let \(S = B_r\), we have a compact Hausdorff space, and the collection of polynomials contains the functions \(\{1,x\}\) and separates points. Therefore the polynomials are dense in \(C(B_r, \mathbb{R})\) for all \(r > 0\) with respect to the sup-norm.
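This density is very concrete: Bernstein polynomials give an explicit, constructive Weierstrass approximation of any continuous function. A short Python sketch on \([0,1]\) (the interval and the target \(\cosh\) are purely illustrative):

```python
import numpy as np
from math import comb

def bernstein(f, n, x):
    # degree-n Bernstein polynomial of f on [0, 1], evaluated at points x
    k = np.arange(n + 1)
    binom = np.array([comb(n, j) for j in k], dtype=float)
    return (f(k / n) * binom * x[:, None]**k * (1 - x[:, None])**(n - k)).sum(axis=1)

f = np.cosh                        # any continuous target works
x = np.linspace(0.0, 1.0, 401)
errors = [np.max(np.abs(bernstein(f, n, x) - f(x))) for n in (5, 50, 500)]
print(errors)  # sup-norm errors shrink as the degree grows
```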
Applying the same theorem to the derivatives, we then have the same result for \(C^2(B_r, \mathbb{R})\) with respect to a similar norm
\[\| f \|_{B_r} := \sup_{x \in B_r, \, m = 0,1,2} \left| \frac{\partial^m f}{\partial x^m} (x) \right|.\]proof (of Itô’s Lemma): We will similarly use a localization argument, i.e. define \(\tau_r := \inf \{t \geq 0 : |X_t| + [X]_t > r \}\), and replace \(X_t\) with \(X_{t \wedge \tau_r}\).
(Step 1, 2) Let \(\mathcal{A} \subset C^2(\mathbb{R})\) be the collection of functions where Itô’s Lemma is satisfied. Trivially we have that \(\{1,x\}\) are in \(\mathcal{A}\), and \(\mathcal{A}\) forms a vector space.
Next we show that \(\mathcal{A}\) forms an algebra. In particular, suppose \(f,g \in \mathcal{A}\), and define \(F_t := f(X_t), G_t := g(X_t)\). Using the product rule gives us
\[F_t G_t - F_0 G_0 = \int_0^t F_s dG_s + \int_0^t G_s dF_s + [F,G]_t \,.\]Using the Fundamental Theorem and Itô’s Lemma on \(g\), we get
\[\int_0^t F_s dG_s = \int_0^t f(X_s) \frac{\partial g}{\partial x}(X_s) dX_s + \frac{1}{2} \int_0^t f(X_s) \frac{\partial^2 g}{\partial x^2}(X_s) d[X]_s \,,\]and observe the same holds with the roles of \(F,G\) switched. Next we use Itô’s Lemma and expand with the Kunita-Watanabe identity to get
\[[F,G]_t = \int_0^t \frac{\partial f}{\partial x}(X_s) \frac{\partial g}{\partial x}(X_s) d[X]_s \, ,\]where the extra terms vanish because the covariation with a finite variation process is zero, i.e. \([ \,[X]\, ,Y ]_t = 0\) as \([X]_t\) has finite variation. By grouping the integrals by their integrators (e.g. \(d[X]_t\)), we get that \(fg\) satisfies Itô’s Lemma, or simply \(fg \in \mathcal{A}\).
(Step 3) Here we can apply the Stone-Weierstrass Theorem to get that \(\mathcal{A}\) is dense in \(C^2(B_r)\) with respect to the norm \(\|\cdot\|_{B_r}\).
(Step 4) It remains to show that \(\mathcal{A}\) is closed with respect to \(\|\cdot\|_{B_r}\). In particular, let \((f_n)_{n \geq 1}\) be a sequence in \(\mathcal{A}\) such that \(f_n \to f\) in \(\|\cdot\|_{B_r}\). Then we have
\[\int_0^t \left| \frac{\partial^2 f_n}{\partial x^2}(X_s) - \frac{\partial^2 f}{\partial x^2}(X_s) \right| d[X]_s \leq \|f_n - f\|_{B_r} [X]_t \, .\]At the same time, we also have by Itô’s Isometry
\[\begin{align*} \mathbb{E} \left( \int_0^t \frac{\partial f_n}{\partial x}(X_s) - \frac{\partial f}{\partial x}(X_s) dX_s \right)^2 &= \mathbb{E} \int_0^t \left(\frac{\partial f_n}{\partial x}(X_s) - \frac{\partial f}{\partial x}(X_s) \right)^2 d[X]_s \\ &\leq \|f_n - f\|_{B_r}^2 \, \mathbb{E}[X]_t \, . \end{align*}\]Since the process is localized we have \([X]_t \leq r\), and therefore we can pass the limit in the Itô formula and get
\[\begin{align*} f(X_t) - f(X_0) &= \lim_{n\to\infty} f_n(X_t) - f_n(X_0) \\ &= \lim_{n\to\infty} \int_0^t \frac{\partial f_n}{\partial x}(X_s) dX_s + \frac{1}{2} \int_0^t \frac{\partial^2 f_n}{\partial x^2}(X_s) d[X]_s \\ &= \int_0^t \frac{\partial f}{\partial x}(X_s) dX_s + \frac{1}{2} \int_0^t \frac{\partial^2 f}{\partial x^2}(X_s) d[X]_s \,. \end{align*}\]Finally, since Itô’s Lemma holds for all \(r>0\), we can simply take \(r\to\infty\) to complete the proof.
\[\tag*{$\Box$}\]Remark Clearly the alternative proof is not necessarily easier; however, let us observe a couple of advantages.
Firstly, none of the steps above were very complicated, as most followed directly from useful (and well known) propositions. Notably, a first-time reader of this subject will have a much easier time following the steps and seeing the bigger picture, rather than getting trapped in technical details.
Secondly, we now have an additional interpretation of the second integral in the formula, which clearly arises as a consequence of Itô’s product rule and the Kunita-Watanabe identity. For readers who have not seen it, the proof of the product rule follows almost directly from the definition, i.e. it is a direct consequence of choosing the left Riemann sum.
We have shown that the Stone-Weierstrass Theorem is not only a strong result on its own, but also leads to a powerful general technique. In particular, we saw a nice alternative proof of Itô’s Lemma with a much better interpretation. Ideally, the author would have liked to add another example, but the post is already quite long at this point. Hopefully the readers will still have enjoyed an interesting blog post, and added another proof technique to their arsenal.
Please comment below (new feature!) for any questions or feedback!
For the longest time, the lemma was credited to Kiyosi Itô alone, based on his 1950 paper. This changed in the 1990s with a resurgence of interest in the late French-German mathematician Wolfgang Doeblin, who was well known to be quite gifted. The interest led to a demand to open the remaining “pli cacheté” (sealed envelope) held by the French Academy of Sciences, which Doeblin had submitted just before he passed away in 1940 - he burned his notes and took his own life so that the German soldiers could not take advantage of his work. To everyone’s surprise, Doeblin’s letter contained significant research progress ahead of his time, including a statement of the same change of variables formula! To honour his contribution, the result is sometimes referred to as the Itô-Doeblin Lemma.
For the interested readers, I would strongly recommend an excellent commentary by Bernard Bru and Marc Yor (2002) for further details on this topic.
In this blog post, I hope to put together some excellent content I studied recently, specifically from:
We first state a very simple version of the inequality:
Theorem (A Simple Poincaré Inequality) Let \(\Omega \subset \mathbb{R}^n\) be open and bounded, and let \(f \in C^1_c(\Omega)\) (continuously differentiable with compact support). Then there exists a constant \(C\) depending only on \(\Omega\) such that:
\[ \left\lVert f \right\rVert_{L^2(\Omega)} \leq C \lVert \nabla f \rVert_{L^2(\Omega)} \]
Quick aside: we say a function \(f\) has compact support if the set \(S = \{ x \in \Omega : f(x) \neq 0 \}\) has compact closure. This implies \(f(x) = 0\) near the boundary.
Observe that the inequality bounds the \(L^2\)-norm of a function in terms of the \(L^2\)-norm of its gradient. Note that compact support is an important assumption when we are integrating with respect to the Lebesgue measure: consider for example a non-zero constant function, for which the inequality fails since the gradient is zero. The reader may be comforted that more general forms require much weaker assumptions, and can be generalized to all \(L^p\) norms.
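To make the inequality concrete, take \(\Omega = (0,1)\) and \(f(x) = \sin(\pi x)\), which vanishes at the boundary. Then \(\|f\|_{L^2} = 1/\sqrt{2}\) and \(\|\nabla f\|_{L^2} = \pi/\sqrt{2}\), so the inequality holds with any \(C \geq 1/\pi\). A quick numerical check (the grid size is arbitrary):

```python
import numpy as np

# check  ||f||_{L^2} <= C ||f'||_{L^2}  on (0, 1) for f(x) = sin(pi x)
n = 100_000
x = (np.arange(n) + 0.5) / n           # midpoint grid on (0, 1)
dx = 1.0 / n
f = np.sin(np.pi * x)
df = np.pi * np.cos(np.pi * x)

norm_f = np.sqrt(np.sum(f**2) * dx)    # exact value: 1/sqrt(2)
norm_df = np.sqrt(np.sum(df**2) * dx)  # exact value: pi/sqrt(2)
print(norm_f, norm_df)                 # ratio is 1/pi
```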
The reason we start with this inequality is because the proof is quite straightforward:
proof (of the Simple Poincaré Inequality):
Without loss of generality, we let \(\Omega \subset [0,M]^n\) for some large \(M > 0\), and by the Cauchy-Schwarz inequality we have
\[ \vert f(x) \vert^2 = \left\vert \int_0^{x_1} \frac{\partial f}{\partial x_1} (y_1, x_2, \ldots) dy_1 \right\vert^2 \leq \left[ \int_0^M 1^2 dy_1 \right] \left[ \int_0^M \left\vert \frac{\partial f}{\partial x_1} \right\vert^2 dy_1 \right] \]
Summing the analogous bound over all \(n\) coordinates, and integrating over \(\Omega\), we have
\[ n \int_\Omega \vert f(x) \vert^2 \leq \int_\Omega \sum_{i=1}^n M \int_0^M \left\vert \frac{\partial f}{\partial x_i} \right\vert^2 dy_i = \sum_{i=1}^n M^2 \left\lVert \frac{\partial f}{\partial x_i} \right\rVert^2_{L^2(\Omega)} \]
where in the last step we exchanged the order of integration and used the fact that the inner integral does not depend on \(x_i\), yielding another factor of \(M\). Rewriting the above we get the desired result
\[ \lVert f \rVert_{L^2(\Omega)} \leq \frac{M}{\sqrt{n}} \lVert \nabla f \rVert_{L^2(\Omega)} \]
\[\tag*{$\Box$}\]We now state the inequality in a form most useful for probability theory, see Theorem 3.20 from Boucheron, Lugosi, Massart (2013):
Theorem (Gaussian-Poincaré Inequality) Let \(X = (X_1, \ldots, X_n)\) be a vector of i.i.d. standard Gaussian random variables. Let \(\, f : \mathbb{R}^n \to \mathbb{R}\) be any continuously differentiable function. Then
\[ \text{Var}[f(X)] \leq \mathbb{E}\left[ | \nabla f(X)|^2 \right] \]
Observe that the inequality is slightly different. Firstly, this time the norm is centered, although centering is not an issue since \(\text{Var}[f(X)] \leq \mathbb{E}[f(X)^2]\). Secondly, since the measure is a probability measure, we have a much smaller constant \(C=1\) in the inequality. In combination, we were also able to drop the compact support assumption.
An immediate consequence is to consider \(f\) Lipschitz with coefficient \(1\), i.e. \(| f(x) - f(y) | \leq \|x - y\|\), then we have
\[ \text{Var}[f(X)] \leq 1 \]
In other words, we just found a constant bound on the variance for a huge class of random functions! In general, we can consider \(f\) to be a smooth estimator based on a dataset with noise \(X\). The Poincaré inequality will provide a very useful bound on estimation error.
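A hedged Monte Carlo check of the inequality, with the arbitrary test function \(f(x_1,x_2) = \sin(x_1)\cos(x_2)\), whose gradient is \((\cos x_1 \cos x_2, -\sin x_1 \sin x_2)\):

```python
import numpy as np

rng = np.random.default_rng(4)
X1, X2 = rng.normal(size=(2, 1_000_000))   # i.i.d. standard Gaussian coordinates

# an arbitrary smooth test function and its squared gradient norm
f = np.sin(X1) * np.cos(X2)
grad_sq = (np.cos(X1) * np.cos(X2))**2 + (np.sin(X1) * np.sin(X2))**2

var_f = np.var(f)
bound = np.mean(grad_sq)
print(var_f, bound)   # Var[f(X)] should not exceed E|∇f(X)|^2
```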
To prove this inequality, we will use a famous result from 1981 (Theorem 3.1 in Boucheron, Lugosi, Massart (2013)):
Theorem (Efron-Stein Inequality) Let \(X = (X_1, \ldots, X_n)\) be a vector of i.i.d. random variables and let \(Z = f(X)\) be a square-integrable function of \(X\). Then
\[ \text{Var}(Z) \leq \sum_{i=1}^n \mathbb{E} \left[ \left( Z - \mathbb{E}^{(i)}Z \right)^2 \right] \]
where \(\mathbb{E}^{(i)}Z = \int f(X_1, \ldots, X_{i-1}, x_i, X_{i+1},\ldots) d\mu_i(x_i)\), i.e. the expectation over \(X_i\) only.
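As a quick illustration, we can check the inequality by simulation for \(Z = \max(X_1,\ldots,X_5)\) with \(X_i\) uniform on \([0,1]\), using the equivalent form \(\text{Var}(Z) \leq \frac{1}{2}\sum_{i} \mathbb{E}[(Z - Z_i')^2]\), where \(Z_i'\) recomputes \(Z\) with \(X_i\) replaced by an independent copy (this form also appears in Theorem 3.1 of Boucheron, Lugosi, Massart (2013)). All numerical choices here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(5)
m, n = 200_000, 5
X = rng.random((m, n))          # m i.i.d. samples of (X_1, ..., X_n), uniform
Z = X.max(axis=1)

bound = 0.0
for i in range(n):
    Xp = X.copy()
    Xp[:, i] = rng.random(m)    # replace coordinate i by an independent copy
    bound += 0.5 * np.mean((Z - Xp.max(axis=1))**2)

print(np.var(Z), bound)  # the variance should not exceed the Efron-Stein bound
```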
The Efron-Stein inequality can be proved by decomposing the variance as a sum of telescoping differences of conditional expectations, and applying Jensen’s inequality to the individual terms. While we omit the proof here, we should remark that the simple Efron-Stein inequality has wide ranging applications; we will only look at one such use for the proof of the Poincaré inequality, taken from Theorem 3.20 in Boucheron, Lugosi, Massart (2013):
proof (of Gaussian-Poincaré Inequality):
First we observe that a direct application of the Efron-Stein inequality can reduce the problem down to \(n=1\), i.e. it is sufficient to show
\[ \mathbb{E}^{(i)} \left[ \left( Z - \mathbb{E}^{(i)}Z \right)^2 \right] \leq \mathbb{E}^{(i)} \frac{\partial f}{\partial x_i}(X)^2 \]
From here we assume without loss of generality \(n=1\). Then we notice that it is sufficient to prove this inequality for compactly supported, twice differentiable functions, i.e. \(f \in C_c^2(\mathbb{R})\), since otherwise we can just take a limit to the original function.
Here we let \(\epsilon_1,\ldots,\epsilon_n\) be i.i.d. Rademacher random variables, i.e. \(\mathbb{P}[\epsilon_j = 1] = \mathbb{P}[\epsilon_j = -1] = \frac{1}{2} \,\forall j \in \{ 1,2,\ldots,n \}\), and we define
\[ S_n = n^{-1/2} \sum_{j=1}^n \epsilon_j \]
Observe that for every \(i\) we have
\[ \text{Var}^{(i)}[f(S_n)] = \frac{1}{4} \left[ f\left( S_n + \frac{1-\epsilon_i}{\sqrt{n}} \right) - f\left( S_n - \frac{1+\epsilon_i}{\sqrt{n}} \right) \right]^2 \]
Applying the Efron-Stein inequality, we get
\[ \text{Var}[f(S_n)] \leq \frac{1}{4} \sum_{i=1}^n \mathbb{E} \left[ \left( f\left( S_n + \frac{1-\epsilon_i}{\sqrt{n}} \right) - f\left( S_n - \frac{1+\epsilon_i}{\sqrt{n}} \right) \right)^2 \right] \]
Let \(K = \sup_x \vert f''(x) \vert\), then we have that
\[ \left|f\left( S_n + \frac{1-\epsilon_i}{\sqrt{n}} \right) - f\left( S_n - \frac{1+\epsilon_i}{\sqrt{n}} \right)\right| \leq \frac{2}{\sqrt{n}} |f'(S_n)| + \frac{2K}{n} \]
which implies
\[ \frac{n}{4} \left( f\left( S_n + \frac{1-\epsilon_i}{\sqrt{n}} \right) - f\left( S_n - \frac{1+\epsilon_i}{\sqrt{n}} \right) \right)^2 \leq f'(S_n)^2 + \frac{2K}{\sqrt{n}} | f'(S_n) | + \frac{K^2}{n} \]
Finally, the central limit theorem implies the desired result
\[ \limsup_{n\to\infty} \frac{1}{4} \sum_{i=1}^n \mathbb{E} \left[ \left( f\left( S_n + \frac{1-\epsilon_i}{\sqrt{n}} \right) - f\left( S_n - \frac{1+\epsilon_i}{\sqrt{n}} \right) \right)^2 \right] = \mathbb{E} \left[ f'(X)^2 \right] \]
\[\tag*{$\Box$}\]Remark There are also Poincaré type inequalities for non-Gaussian random variables, for example if \(X\sim\)Poisson\((\mu)\):
\[ \text{Var}[f(X)] \leq \mu \mathbb{E}\left[ (f(X+1) - f(X))^2 \right] \]
Or if \(X\) is double exponential i.e. with density \(\frac{1}{2}e^{-\vert x \vert}\), then we have:
\[ \text{Var}[f(X)] \leq 4 \mathbb{E}\left[ (f'(X))^2 \right] \]
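These discrete variants are also easy to check by simulation. For the Poisson case, with the illustrative choices \(\mu = 3\) and \(f(x) = x^2\) (for which the exact values are \(\text{Var}[f(X)] = 165\) and \(\mu\,\mathbb{E}[(f(X+1)-f(X))^2] = 183\)):

```python
import numpy as np

rng = np.random.default_rng(6)
mu = 3.0
X = rng.poisson(mu, size=1_000_000).astype(float)

f = lambda x: x**2                           # an arbitrary test function
lhs = np.var(f(X))                           # exact value: 165 for Poisson(3)
rhs = mu * np.mean((f(X + 1) - f(X))**2)     # exact value: 183
print(lhs, rhs)  # lhs should not exceed rhs
```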
To do proper justice to the theory of PDEs, we would need a significant background in functional analysis. In this section, we will try to side-step the technical details and focus on one single application, that is, showing the existence and uniqueness of a weak solution for Poisson’s equation:
\[ -\Delta u = f \; \text{ in } \Omega \] \[ u = 0 \; \text{ on } \partial \Omega \]
where \(\Omega \subset \mathbb{R}^n\) is open and bounded with smooth boundary \(\partial \Omega\). By a weak solution, we mean there exists a \(u \in C^1(\Omega)\) such that \(\forall v \in C^1_c(\Omega)\) we have
\[ B[u,v] := \int_\Omega \nabla u \cdot \nabla v = \int_\Omega f v \]
Note if \(u\) is a solution to the (original) Poisson’s equation, then we have the above weak equation by Green’s identity. The main tool we will use to prove existence and uniqueness is the following result:
Theorem (Lax-Milgram) Let \(H\) be a Hilbert space, and \(B: H \times H \to \mathbb{R}\) be a continuous, coercive, bilinear form. Then \(\forall \varphi \in H^*\), there exists a unique \(u\in H\) such that
\[ B(u,v) = \langle \varphi, v \rangle \quad \forall v \in H \]
where \(\langle \varphi,v \rangle\) is the linear functional \(\varphi\) applied to \(v\).
Remark Before going into the definitions and technical details, we observe that Lax-Milgram Theorem gives us exactly what we want - the existence and uniqueness! Now we just have to fill in the blanks:
Step 1 To define our Hilbert space, we will consider the following inner product:
\[ (u,v) := \int_\Omega [uv + \nabla u \cdot \nabla v] \]
which corresponds to the following Sobolev norm:
\[ \lVert u \rVert_{H^{1}(\Omega)} := (u,u)^{1/2} = \left[ \lVert u \rVert_{L^2(\Omega)}^2 + \lVert \nabla u \rVert_{L^2(\Omega)}^2 \right]^{1/2} \]
By equipping the space \(C^1_c(\Omega)\) with the above inner product, we almost have a Hilbert space! Here we will simply take the completion of \(C^1_c(\Omega)\) with respect to the Sobolev norm, i.e. add all the limit points to the space. We call this (completed) Hilbert space \(H_0^1(\Omega)\).
Step 2 We now turn our attention to \(B(u,v)\), the bilinear form (a fancy term for a map that is linear in each input separately). Then we say \(B\) is continuous if
\[ \exists C_1 > 0 : \forall u,v \in H, \vert B(u,v) \vert \leq C_1 \lVert u \rVert_{H^1(\Omega)} \lVert v \rVert_{H^1(\Omega)} \]
Note this is an immediate consequence of Cauchy-Schwarz inequality
\[ \vert B(u,v) \vert \leq \lVert \nabla u \rVert_{L^2(\Omega)} \lVert \nabla v \rVert_{L^2(\Omega)} \leq \lVert u \rVert_{H^1(\Omega)} \lVert v \rVert_{H^1(\Omega)} \]
We say \(B\) is coercive if
\[ \exists C_2 > 0 : \forall u \in H, B(u,u) \geq C_2 \lVert u \rVert_{H^1(\Omega)}^2 \]
We notice this is the only non-trivial condition left to check, and to prove it we will finally use the Poincaré inequality! Start by rewriting
\[ B(u,u) = \int_\Omega \nabla u \cdot \nabla u = \lVert \nabla u \rVert_{L^2(\Omega)}^2 \]
Applying the Poincaré inequality on half of the norm we have
\[ \frac{1}{2} \lVert \nabla u \rVert_{L^2(\Omega)}^2 \geq \frac{1}{2C^2} \lVert u \rVert_{L^2(\Omega)}^2 \]
Therefore
\[ B(u,u) \geq \frac{1}{2} \lVert \nabla u \rVert_{L^2(\Omega)}^2 + \frac{1}{2C^2} \lVert u \rVert_{L^2(\Omega)}^2 \geq \min\left(\frac{1}{2}, \frac{1}{2C^2}\right) \lVert u \rVert_{H^1(\Omega)}^2 \]
And voilà, we have existence and uniqueness! A rigorous and careful reader may notice that \(u\) does not necessarily have compact support - this is correct. However every \(u \in H_0^1(\Omega)\) is a limit of compactly supported functions, therefore we just need to take a limit to get our result!
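For intuition, we can compare the abstract existence result with a concrete computation. The hypothetical finite-difference sketch below (an illustration, not part of the Lax-Milgram machinery) solves \(-u'' = 1\) on \((0,1)\) with zero boundary values, whose unique solution is \(u(x) = x(1-x)/2\):

```python
import numpy as np

# finite-difference discretization of -u'' = 1 on (0, 1), u(0) = u(1) = 0
n = 1000
h = 1.0 / n
x = np.linspace(0.0, 1.0, n + 1)

# tridiagonal matrix for the second difference operator -u''
A = (np.diag(2.0 * np.ones(n - 1))
     - np.diag(np.ones(n - 2), 1)
     - np.diag(np.ones(n - 2), -1)) / h**2
f = np.ones(n - 1)

u = np.zeros(n + 1)
u[1:-1] = np.linalg.solve(A, f)   # interior values; boundary stays 0

exact = x * (1 - x) / 2           # the exact solution for f = 1
print(np.max(np.abs(u - exact)))  # discretization is exact for quadratics
```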
Remark In fact, we can use similar Lax-Milgram based methods to show existence and uniqueness for a large class of elliptic PDEs. We should note that being able to “convert” between \(\|u\|\) and \(\|\nabla u\|\) is highly useful for studying Sobolev norms. We refer curious readers to Evans (2010) for an excellent chapter on Sobolev spaces and related inequalities.
I have a weak spot for connections between different fields, probably because they are always surprising, and surprises are intriguing in math! I hope to have presented a readable introduction to the inequality and its applications in both topics, without drowning readers in technical details. On this note, I should remark that to study Sobolev spaces rigorously, the reader will need to go through all the details carefully!
As this is my first blog post, any constructive feedback or suggestions on future topics will be appreciated!