I got stuck on what I thought was a simple point while working through
a stats textbook! I figured it out, I think, but was this the most
straightforward path to understanding? You tell me! In this note, we
explore some oft-used facts about moment-generating functions, and
explain a connection with tail and concentration inequalities.
Overview and my confusion
I am currently reading through Martin Wainwright’s High-dimensional
statistics: A non-asymptotic viewpoint, to fill some gaps in my stat
theory knowledge. While I’m only partway through, I can definitely
already recommend this book for those wanting to get familiar with
some of the modern theoretical machinery needed to validate
statistical methods that we use on difficult problems!
The book starts off relatively lightly in Chapter 2 with an overview
of fundamental tail and concentration bounds. A simple but powerful
idea is that of Chernoff bounding, which combines the moment
generating function of a random variable (when it exists) with
Markov’s inequality to tightly bound tail probabilities. In reading
through this chapter, one quickly comes across the sub-exponential
class of distributions: a random variable X is sub-exponential if
$\mathbb{E}(|X|) < \infty$ and there exist constants $\nu > 0$ and
$\alpha \in [0, \infty)$ such that
$$\mathbb{E}\left(e^{t(X-\mu)}\right) \le e^{\nu^2 t^2 / 2}, \quad \text{for all } t \in \left(-\tfrac{1}{\alpha}, \tfrac{1}{\alpha}\right),$$
where $\mu := \mathbb{E}(X)$ and we use the convention that $\tfrac{1}{0} = \infty$.
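For intuition, here is a quick numerical sanity check (my own, not from the book) using the standard example $X = Z^2$ with $Z \sim N(0, 1)$: one can compute $\mathbb{E}(e^{t(X-\mu)}) = e^{-t}/\sqrt{1-2t}$ with $\mu = 1$, and the sub-exponential bound holds with $(\nu, \alpha) = (2, 4)$:

```python
import numpy as np

# X = Z^2 for Z ~ N(0, 1): mu = 1 and E[exp(t(X - mu))] = exp(-t)/sqrt(1 - 2t)
# for t < 1/2. Check the sub-exponential bound exp(nu^2 t^2 / 2) with
# (nu, alpha) = (2, 4) on a grid spanning (-1/alpha, 1/alpha).
nu, alpha = 2.0, 4.0
ts = np.linspace(-1 / alpha + 1e-3, 1 / alpha - 1e-3, 1001)
mgf_centered = np.exp(-ts) / np.sqrt(1 - 2 * ts)
assert np.all(mgf_centered <= np.exp(nu**2 * ts**2 / 2))
print("sub-exponential bound verified on the grid")
```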
Following this introduction, we are then exposed to a sufficient
condition for X to be sub-exponential, the so-called Bernstein
condition. In particular, X is said to be b-Bernstein, for some
$b > 0$, if $\mathbb{E}(|X|^2) < \infty$ and, letting $\mu := \mathbb{E}(X)$ and $\sigma^2 := \mathrm{Var}(X)$, it holds that
$$\left| \mathbb{E}\left[(X-\mu)^k\right] \right| \le \tfrac{1}{2} k! \, \sigma^2 b^{k-2}, \quad \text{for all } k \in \mathbb{N} \setminus \{1\}.$$
In
proving that this condition implies that X is sub-exponential, the
author employs the “power series decomposition”
$$\mathbb{E}\left(e^{t(X-\mu)}\right) = \sum_{n=0}^{\infty} \frac{t^n}{n!} \mathbb{E}\left[(X-\mu)^n\right],$$
for t non-zero but sufficiently small. That said, I couldn’t find
anywhere in the book where this equality was justified.
At first glance, this equality might seem obvious, as $e^z = \sum_{n=0}^{\infty} \frac{z^n}{n!}$ for all $z \in \mathbb{R}$, so one
should just be able to push the expectation through this power series
representation. But usually this sort of statement is justified by the
dominated convergence theorem (DCT): if $Y_n \to Y$ pointwise, and the sequence
$(Y_n)$ is dominated by some $Z \in L^1$, i.e., $|Y_n| \le Z$ for all $n \in \mathbb{N}$ and $\mathbb{E}(Z) < \infty$, then $\mathbb{E}(|Y_n|), \mathbb{E}(|Y|) < \infty$ and $\mathbb{E}(Y) = \lim_{n\to\infty} \mathbb{E}(Y_n)$. In our problem, can we just directly apply the DCT?
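(As a reminder that the domination hypothesis is not just decoration, here is the classic counterexample: on $(0, 1)$ with the uniform distribution, $Y_n = n \mathbf{1}_{(0, 1/n)}$ converges pointwise to $Y = 0$, yet $\mathbb{E}(Y_n) = 1 \neq 0 = \mathbb{E}(Y)$; no integrable Z dominates the whole sequence.)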
Well, we would like to say that, for small but non-zero t,
$$\mathbb{E}\left( \lim_{n\to\infty} \sum_{k=0}^{n} \frac{t^k (X-\mu)^k}{k!} \right) = \lim_{n\to\infty} \mathbb{E}\left( \sum_{k=0}^{n} \frac{t^k (X-\mu)^k}{k!} \right),$$
i.e., $Y_n = \sum_{k=0}^{n} \frac{t^k (X-\mu)^k}{k!}$ and $Y = e^{t(X-\mu)}$. But what should be the dominating variable Z? We can
easily see that $|Y_n| \le \sum_{k=0}^{n} \frac{|t|^k |X-\mu|^k}{k!} \le e^{|t| \cdot |X-\mu|}$, but do we know that the latter is integrable? It
wasn’t obvious to me, so let’s dive into the details.
Facts about moment generating functions
Let X be a random variable, and define $\Psi_X(t) = \mathbb{E}(e^{tX})$ for $t \in D$, where $D = \{t \in \mathbb{R} : \mathbb{E}(e^{tX}) < \infty\}$. Note that $e^{tX} \ge 0$, so its
expectation is always well-defined in $[0, \infty]$. Moreover, $0 \in D$ no matter what, as $\mathbb{E}(e^{0 \cdot X}) = 1 < \infty$; in fact, zero could
possibly be the only element of D. The first interesting fact about
ΨX is that its domain D is always an interval including zero.
Lemma 1: If $t \in D \cap (0, \infty)$, then $[0, t] \subseteq D$. Similarly, if $t \in D \cap (-\infty, 0)$, then $[t, 0] \subseteq D$.
Proof: Suppose $t > 0$, $\mathbb{E}(e^{tX}) < \infty$, and let $s \in [0, t]$. Observe that
$$\mathbb{E}(e^{sX}) = \mathbb{E}\left(e^{sX} \mathbf{1}_{[0,\infty)}(X)\right) + \mathbb{E}\left(e^{sX} \mathbf{1}_{(-\infty,0)}(X)\right) \le \mathbb{E}\left(e^{tX} \mathbf{1}_{[0,\infty)}(X)\right) + \mathbb{P}[X < 0] \le \Psi_X(t) + 1 < \infty,$$
so $s \in D$. Similarly, if $t < 0$ and $\mathbb{E}(e^{tX}) < \infty$, then let $s \in [t, 0]$ and note that
$$\mathbb{E}(e^{sX}) = \mathbb{E}\left(e^{sX} \mathbf{1}_{(0,\infty)}(X)\right) + \mathbb{E}\left(e^{sX} \mathbf{1}_{(-\infty,0]}(X)\right) \le \mathbb{P}[X > 0] + \mathbb{E}\left(e^{tX} \mathbf{1}_{(-\infty,0]}(X)\right) \le 1 + \Psi_X(t) < \infty,$$
so $s \in D$. ■
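To see the range of behaviors Lemma 1 allows, here are two standard examples (mine, not from the text): for $X \sim \text{Exp}(1)$,
$$\Psi_X(t) = \int_0^\infty e^{tx} e^{-x} \, dx = \frac{1}{1-t} \ \text{ for } t < 1, \qquad \Psi_X(t) = \infty \ \text{ for } t \ge 1,$$
so $D = (-\infty, 1)$, an asymmetric interval; while for X standard Cauchy, $\mathbb{E}(e^{tX}) = \infty$ for every $t \neq 0$, so $D = \{0\}$, as warned above.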
MGFs are useful when they exist on an open interval around zero — it
is in this case that they actually generate moments! This is
formalized in the next result.
Proposition 1: Suppose $t^* > 0$ is such that $(-t^*, t^*) \subseteq D$. Then $\mathbb{E}(|X|^n) < \infty$ for all $n \in \mathbb{N}$,
$$\Psi_X(t) = \sum_{n=0}^{\infty} \frac{t^n}{n!} \mathbb{E}(X^n)$$
holds, and this series is absolutely convergent, for all $t \in (-t^*, t^*)$. Consequently, $\mathbb{E}(X^n) = \frac{d^n}{dt^n} \Psi_X(t) \big|_{t=0}$.
Proof: Let $t \in (0, t^*)$. The key observation is that $e^{t|x|} \le e^{-tx} + e^{tx}$, for all $x \in \mathbb{R}$. Thus,
$\mathbb{E}(e^{t|X|}) \le \Psi_X(-t) + \Psi_X(t) < \infty$, given
that $-t, t \in D$. In other words, by the power series expansion of
the exponential,
$$\mathbb{E}\left(e^{t|X|}\right) = \mathbb{E}\left( \sum_{n=0}^{\infty} \frac{t^n |X|^n}{n!} \right) = \sum_{n=0}^{\infty} \frac{t^n}{n!} \mathbb{E}(|X|^n) < \infty,$$
where the interchange of expectation and summation is permitted by the
monotone convergence theorem (MCT). Immediately, we can deduce that
$\mathbb{E}(|X|^n) < \infty$ for all $n \in \mathbb{N}$. We further
can see by the DCT that $\Psi_X(t) = \sum_{n=0}^{\infty} \frac{t^n}{n!} \mathbb{E}(X^n)$ for any $t \in (-t^*, t^*)$, as the partial sums
$\sum_{n=0}^{N} \frac{t^n X^n}{n!}$, for $N \in \mathbb{N}$, are
uniformly bounded by $e^{|t| \cdot |X|} \in L^1$.
Notice that we have shown that the power series $\sum_{n=0}^{\infty} \frac{t^n}{n!} \mathbb{E}(X^n)$ converges absolutely on $(-t^*, t^*)$,
i.e., its radius of convergence is at least $t^*$. By basic facts
about power series, $\Psi_X$ is infinitely differentiable on
$(-t^*, t^*)$, and $\mathbb{E}(X^n) = \frac{d^n}{dt^n} \Psi_X(t) \big|_{t=0}$. ■
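Proposition 1 is easy to test symbolically; here is a small check (my own, using SymPy) for $X \sim N(0, 1)$, whose MGF $\Psi_X(t) = e^{t^2/2}$ is finite on all of $\mathbb{R}$:

```python
import sympy as sp

# For X ~ N(0, 1), Psi_X(t) = exp(t^2 / 2) on all of R, so Proposition 1
# applies with any t* > 0; its derivatives at zero should generate the
# Gaussian moments E(X^n): 0 for n odd, (n - 1)!! for n even.
t = sp.symbols('t')
Psi = sp.exp(t**2 / 2)
moments = [sp.diff(Psi, t, n).subs(t, 0) for n in range(7)]
print(moments)  # [1, 0, 1, 0, 3, 0, 15]
```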
Inspecting the proof of Proposition 1, we can see that the real power
(no pun intended) comes from $e^{|tX|}$ being integrable. The crux of
the argument was that $\{-t, t\} \subseteq D$ is equivalent to
$\mathbb{E}(e^{|tX|}) < \infty$, and the latter readily establishes
that $\Psi_X$ admits the power series representation $\Psi_X(s) = \sum_{n=0}^{\infty} \frac{s^n}{n!} \mathbb{E}(X^n)$, for $s \in \{-t, t\}$. With Lemma 1, we pretty much immediately obtain the following
nice chain of equivalences.
Theorem: For any $t \in \mathbb{R}$, the following are equivalent:
$$\{-t, t\} \subseteq D \quad \text{(or say } [-|t|, |t|] \subseteq D \text{ if you like)};$$
$$\mathbb{E}\left(e^{|sX|}\right) < \infty, \quad \text{for all } s \in [-|t|, |t|];$$
$$\sum_{n=0}^{\infty} \frac{s^n}{n!} \mathbb{E}(X^n) \text{ absolutely converges and equals } \Psi_X(s), \quad \text{for all } s \in [-|t|, |t|].$$
[Notice that this is useful only when $t \neq 0$.]
Relation to sub-exponential distributions
At this point, we are equipped to understand a more general version of
the problem we started out with (side note: are the necessary facts
about MGFs just supposed to be common knowledge!?). Namely, we will
verify that X is sub-exponential if and only if the domain D of
ΨX contains an open interval around zero [for those reading the
Wainwright text, this is part of Theorem 2.13]. Clearly, if $\nu > 0$
and $\alpha \in [0, \infty)$ satisfy $\mathbb{E}(e^{t(X-\mu)}) \le e^{\nu^2 t^2 / 2}$ (notice in particular that this is finite)
for $t \in (-\alpha^{-1}, \alpha^{-1})$, then D contains any pair
$\{-t, t\}$ for $t \in (0, \alpha^{-1})$, and we see that
$(-\alpha^{-1}, \alpha^{-1}) \subseteq D$. The following says that the
converse result also holds.
Proposition 2: The MGF of X is defined on $D \supseteq (-c, c)$ for some $c > 0$ if and only if X is sub-exponential.
Proof: We have just argued in the “if” direction. For the converse,
suppose $c > 0$ and $(-c, c) \subseteq D$. By our theorem, it is kosher
to use the power series expansion of $\Psi_{X-\mu}(t)$ for any $t \in (-c, c)$; note, by the way, that the domain D of $\Psi_X$ is
equal to that of the MGF $\Psi_{X-\mu}$ of the centered variable, as
the two differ by the deterministic, finite multiplicative factor
$e^{-t\mu}$. Further, as the radius of convergence of this power
series is at least c, we can differentiate term by term on $(-c, c)$
to obtain the derivatives of $\Psi_{X-\mu}$. Consequently, by
Taylor’s theorem with Peano remainder,
$$\Psi_{X-\mu}(t) = \Psi_{X-\mu}(0) + \Psi_{X-\mu}^{(1)}(0)\, t + \frac{\Psi_{X-\mu}^{(2)}(0)}{2}\, t^2 + o(t^2) = 1 + \frac{\sigma^2 t^2}{2} + o(t^2), \quad \text{as } t \to 0,$$
where $\sigma^2 = \mathbb{E}[(X-\mu)^2]$; recall
that all moments are finite by Proposition 1. Similarly, for any
$\nu > 0$,
$$\exp\left\{\frac{\nu^2 t^2}{2}\right\} = \left[ 1 + \left\{ \nu^2 s \exp\left\{\frac{\nu^2 s^2}{2}\right\} \right\} t + \frac{1}{2} \left\{ \nu^2 \exp\left\{\frac{\nu^2 s^2}{2}\right\} \left(1 + \nu^2 s^2\right) \right\} t^2 \right]_{s=0} + o(t^2) = 1 + \frac{\nu^2 t^2}{2} + o(t^2), \quad \text{as } t \to 0,$$
as $e^x$ is smooth. Combining these facts, we obtain, for any $\nu > 0$,
$$\exp\left\{\frac{\nu^2 t^2}{2}\right\} - \mathbb{E}\left(e^{t(X-\mu)}\right) = \frac{t^2}{2}\left(\nu^2 - \sigma^2\right) + o(t^2), \quad \text{as } t \to 0.$$
To make sure this exceeds zero, at least for small t, we
choose an arbitrary $\nu > \sigma$. Now set $\epsilon = \frac{\nu^2 - \sigma^2}{4} > 0$, and choose $\delta > 0$
such that whenever $0 < |t| < \delta$,
$$\left| \frac{1}{t^2} \left( \exp\left\{\frac{\nu^2 t^2}{2}\right\} - \mathbb{E}\left(e^{t(X-\mu)}\right) \right) - \frac{\nu^2 - \sigma^2}{2} \right| < \epsilon.$$
We then can see
that $\epsilon t^2 + \mathbb{E}(e^{t(X-\mu)}) \le \exp\{\nu^2 t^2 / 2\}$ for all $t \in (-\delta, \delta)$, so X is $(\nu, \frac{1}{\delta})$-sub-exponential. ■
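To make this concrete, here is a small numerical illustration (my own, with the constants found by inspection rather than derived): take $X \sim \text{Exp}(1)$, so that $\mu = \sigma^2 = 1$ and $\Psi_{X-\mu}(t) = e^{-t}/(1-t)$ on $(-\infty, 1)$; picking $\nu = 1.5 > \sigma$, the Gaussian-type bound indeed dominates on a small interval around zero:

```python
import numpy as np

# X ~ Exp(1): mu = 1, sigma^2 = 1, and Psi_{X - mu}(t) = exp(-t) / (1 - t)
# for t < 1. The proof of Proposition 2 promises that, for any nu > sigma,
# exp(nu^2 t^2 / 2) dominates the centered MGF on some (-delta, delta).
nu = 1.5
delta = 0.5  # found by eyeballing; the proof only guarantees existence
ts = np.linspace(-delta, delta, 1001)
mgf_centered = np.exp(-ts) / (1 - ts)
assert np.all(mgf_centered <= np.exp(nu**2 * ts**2 / 2))
print("X is (1.5, 2)-sub-exponential, numerically at least")
```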
Control of the moments and Bernstein’s condition
Viewing the formula for the MGF as a power series, we can obtain one
more useful characterization of its existence in an open interval
around zero.
Proposition 3: The MGF of X is defined on $D \supseteq (-c, c)$ for some $c > 0$ if and only if $\gamma := \limsup_{n\to\infty} \left( \frac{|\mathbb{E}(X^n)|}{n!} \right)^{1/n} < \infty$.
Proof: By our theorem, $(-c, c) \subseteq D$ for some $c > 0$ if and
only if $X^n \in L^1$ for all $n \in \mathbb{N}$ and the radius of
convergence R of the power series $\sum_{n=0}^{\infty} \frac{t^n}{n!} \mathbb{E}(X^n)$ is positive. (For the reverse implication, one also needs to control the odd absolute moments $\mathbb{E}(|X|^{2k+1})$ by the neighboring even moments; the Cauchy-Schwarz trick in the bonus section below does exactly this.) But $R = \frac{1}{\gamma}$ by the standard
Cauchy-Hadamard formula, so we conclude that $R > 0$ if and only if $\gamma < \infty$. ■
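As a quick example (mine, not from the text): for $X \sim N(0, 1)$, the odd moments vanish and $\mathbb{E}(X^{2k}) = \frac{(2k)!}{2^k k!}$, so
$$\left( \frac{\mathbb{E}(X^{2k})}{(2k)!} \right)^{1/(2k)} = \left( \frac{1}{2^k k!} \right)^{1/(2k)} \to 0,$$
giving $\gamma = 0$ and $R = \infty$, consistent with the Gaussian MGF being finite on all of $\mathbb{R}$.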
With this result in hand, we can more easily see that the Bernstein
condition implies the sub-exponential property. Indeed, if X is
$b$-Bernstein for some $b > 0$, then
$$\limsup_{n\to\infty} \left( \frac{\left| \mathbb{E}[(X-\mu)^n] \right|}{n!} \right)^{1/n} \le b \limsup_{n\to\infty} \left( \frac{\sigma^2}{2 b^2} \right)^{1/n} = b < \infty.$$
Thus, $(-\frac{1}{b}, \frac{1}{b}) \subseteq D$, again using that the domains of $\Psi_X$ and $\Psi_{X-\mu}$ are both D. By Proposition 2, we are done!
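For a concrete supply of Bernstein variables, note (a standard fact, though not spelled out above) that any bounded random variable qualifies: if $|X - \mu| \le c$ almost surely for some $c > 0$, then for all $k \in \mathbb{N} \setminus \{1\}$,
$$\left| \mathbb{E}\left[(X-\mu)^k\right] \right| \le \mathbb{E}\left( |X-\mu|^{k-2} (X-\mu)^2 \right) \le c^{k-2} \sigma^2 \le \tfrac{1}{2} k! \, \sigma^2 c^{k-2},$$
using $k!/2 \ge 1$; that is, X is c-Bernstein, and hence sub-exponential.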
BONUS: A direct analysis of the Bernstein condition
In talking through this with some peers at CMU, I learned (thanks to
Tiger Zeng) that we could also directly verify that the Bernstein
condition implies $\mathbb{E}(e^{|t| \cdot |X-\mu|}) < \infty$ for $t \in (-\frac{1}{b}, \frac{1}{b})$. At that point we could just
use the DCT to justify the interchange of expectation and sum. Since
the idea was pretty clever, I’ve decided to write it out here.
Suppose X is $b$-Bernstein for some $b > 0$. We wish to show that
$\mathbb{E}(e^{|t| \cdot |X-\mu|}) < \infty$ for any $t \in (-\frac{1}{b}, \frac{1}{b})$, and recall that, by the Taylor
series for $e^x$ and the MCT, this expectation always equals
$\sum_{n=0}^{\infty} \frac{|t|^n}{n!} \mathbb{E}(|X-\mu|^n)$. The
difficulty comes from the fact that the Bernstein condition does not
speak directly to $\mathbb{E}(|X-\mu|^n)$, but rather to
$\left| \mathbb{E}[(X-\mu)^n] \right|$. For n even, these coincide,
but for n odd, some care is needed.
Explicitly separating the even and odd terms, we have
$$\mathbb{E}\left(e^{|t| \cdot |X-\mu|}\right) = \sum_{k=0}^{\infty} \frac{|t|^{2k}}{(2k)!} \mathbb{E}\left(|X-\mu|^{2k}\right) + \sum_{k=0}^{\infty} \frac{|t|^{2k+1}}{(2k+1)!} \mathbb{E}\left(|X-\mu|^{2k+1}\right).$$
We show that these sums are both finite in turn. Starting with the even terms,
since X is $b$-Bernstein,
$$\sum_{k=0}^{\infty} \frac{t^{2k}}{(2k)!} \mathbb{E}\left[(X-\mu)^{2k}\right] = 1 + \sum_{k=1}^{\infty} \frac{t^{2k}}{(2k)!} \mathbb{E}\left[(X-\mu)^{2k}\right] \le 1 + \frac{\sigma^2}{2} \sum_{k=1}^{\infty} t^{2k} b^{2k-2} = 1 + \frac{\sigma^2 t^2 / 2}{1 - t^2 b^2},$$
by summing the geometric
series, noting that $t^2 b^2 = (|t| b)^2 < 1$ by assumption. For the
odd terms, the trick is to forcefully introduce even moments: for any $k \in \mathbb{N}$,
$$\mathbb{E}\left(|X-\mu|^{2k+1}\right) = \mathbb{E}\left(|X-\mu|^{k+1} |X-\mu|^{k}\right) \le \left\{ \mathbb{E}\left[(X-\mu)^{2k+2}\right] \mathbb{E}\left[(X-\mu)^{2k}\right] \right\}^{1/2},$$
by Cauchy-Schwarz. Thus, as X is
$b$-Bernstein,
$$\mathbb{E}\left(|X-\mu|^{2k+1}\right) \le \frac{\sigma^2}{2} b^{2k-1} \sqrt{(2k)! \, (2k+2)!} \le (2k+1)! \, \sigma^2 b^{2k-1},$$
since $\sqrt{(2k)! \, (2k+2)!} = (2k+1)! \sqrt{\frac{2k+2}{2k+1}} < 2 \, (2k+1)!$ for all $k \in \mathbb{N}$. Therefore,
$$\sum_{k=0}^{\infty} \frac{|t|^{2k+1}}{(2k+1)!} \mathbb{E}\left(|X-\mu|^{2k+1}\right) \le |t| \, \mathbb{E}(|X-\mu|) + \sigma^2 \sum_{k=1}^{\infty} |t|^{2k+1} b^{2k-1} = |t| \, \mathbb{E}(|X-\mu|) + \frac{b \sigma^2 |t|^3}{1 - t^2 b^2}.$$
Putting
everything together, we have $\mathbb{E}\left(e^{|t| \cdot |X-\mu|}\right) < \infty$. Phew!
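As one last sanity check (my own example), we can compare both bounds against a Rademacher variable X (so $X = \pm 1$ with probability $1/2$ each), which is 1-Bernstein with $\mu = 0$ and $\sigma^2 = 1$, and for which $\mathbb{E}(e^{|t| \cdot |X-\mu|}) = e^{|t|}$ exactly:

```python
import numpy as np

# Rademacher X: mu = 0, sigma^2 = 1, |E(X^k)| <= k!/2 for k >= 2, so X is
# 1-Bernstein. Compare E(exp(|t||X - mu|)) = exp(|t|) against the sum of the
# even-term and odd-term bounds derived above, on (-1/b, 1/b) = (-1, 1).
b, sigma2, E_abs = 1.0, 1.0, 1.0  # b, variance, and E(|X - mu|)
ts = np.linspace(-0.99, 0.99, 1001)
lhs = np.exp(np.abs(ts))
even = 1 + (sigma2 * ts**2 / 2) / (1 - ts**2 * b**2)
odd = np.abs(ts) * E_abs + b * sigma2 * np.abs(ts)**3 / (1 - ts**2 * b**2)
assert np.all(lhs <= even + odd)
print("the even + odd bound dominates exp(|t|) on the whole grid")
```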