Moment Generating Functions
Definition
The moment generating function (MGF) of a random variable
X is defined as the function
m_X(t) = E[e^{tX}]
(for all t ∈ R for which this expectation exists).
Why is it called “moment generating function”?
Proposition
It holds that
m_X^{(n)}(0) = E[X^n], n ∈ N,
where m_X^{(n)} is the n-th (calculus) derivative of m_X(·) w.r.t. t.
For example: m_X'(0) = E[X].
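A small numerical sanity check (a sketch, not from the slides): for X ∼ Exponential(λ) the MGF is m_X(t) = λ/(λ − t) for t < λ, and finite differences of m_X at 0 recover E[X] = 1/λ and E[X²] = 2/λ².

```python
import numpy as np

# Hypothetical check: X ~ Exponential(rate lam) has MGF m(t) = lam / (lam - t) for t < lam,
# with E[X] = 1/lam and E[X^2] = 2/lam^2.
lam = 2.0
m = lambda t: lam / (lam - t)

# Approximate the first two derivatives of m at t = 0 by central finite differences.
h = 1e-4
first_deriv = (m(h) - m(-h)) / (2 * h)               # should be close to E[X] = 0.5
second_deriv = (m(h) - 2 * m(0.0) + m(-h)) / h**2    # should be close to E[X^2] = 0.5

print(first_deriv, 1 / lam)
print(second_deriv, 2 / lam**2)
```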
Example
Compute the MGF of X ∼ Bernoulli(p), p ∈ (0, 1);
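One possible computation (a quick sketch, not spelled out on the slide): since X only takes the values 0 and 1,
m_X(t) = E[e^{tX}] = e^{t·0}·P(X = 0) + e^{t·1}·P(X = 1) = (1 − p) + p·e^t, for all t ∈ R.
Sanity check: m_X'(t) = p·e^t, so m_X'(0) = p = E[X].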
Properties
Proposition
1. If X and Y are independent, then the MGF of X + Y is the
product of their respective MGFs:
m_{X+Y}(t) = m_X(t) · m_Y(t).
2. For constants a, b ∈ R and Y = a + b·X:
m_Y(t) = e^{at} · m_X(b·t).
3. The MGF uniquely determines the (probability) distribution of a random variable.
That is, if there exists a δ > 0 such that
m_X(t) = m_Y(t) for all t ∈ (−δ, δ),
then X and Y are equal in distribution (notation: X =_d Y).
Example
Compute the MGF of
(a) X ∼ Bin(n, p);
(b) X ∼ N(µ, σ²), using the fact that m_Z(t) = e^{t²/2} for Z ∼ N(0, 1).
Observe: the MGF of the N(µ, σ²) distribution is much simpler than its cdf.
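One possible route (a sketch using the properties above, not spelled out on the slide):
(a) Write X = Y_1 + . . . + Y_n with Y_1, . . . , Y_n i.i.d. Bernoulli(p). By Property 1, m_X(t) = (1 − p + p·e^t)^n.
(b) Write X = µ + σ·Z with Z ∼ N(0, 1). By Property 2, m_X(t) = e^{µt} · m_Z(σt) = e^{µt + σ²t²/2}.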
Proposition (Continuity Theorem)
Let X_1, X_2, X_3, . . . be a sequence of random variables with
corresponding MGFs m_{X_1}, m_{X_2}, . . .. Assume that X is a
random variable such that for all t ∈ R
m_{X_n}(t) → m_X(t) as n → ∞.
Then,
lim_{n→∞} P[X_n ≤ x] = P[X ≤ x]   (⋆)
at each x for which F_X is continuous.
If (⋆) holds we say that "X_1, X_2, X_3, . . . converge in distribution to X as n goes to infinity". As a consequence:
P[X_n ∈ A] ≈ P[X ∈ A] (for n very large).
An important application of the Continuity Theorem: Proof of the Central Limit
Theorem
Moment Generating Functions
Take-Away:
▶ Moment generating functions (MGFs) are a versatile tool for identifying the distribution of
▶ sums of (independent) random variables, or
▶ the limit of a sequence of random variables.
▶ MGFs provide an alternative way to characterize a distribution, instead of looking at cdfs, pdfs, or pmfs.
▶ E.g., given a random variable X: if its MGF looks like a Gaussian MGF, then X is Gaussian!
Markov and Chebyshev
Estimating Tail Probabilities
Motivation:
Estimate
P[X ≥ a]
in the case where we do not know the distribution of the random variable X.
It’s possible to obtain upper bounds in terms of the
expectation and variance of X .
Proposition (Markov’s Inequality)
Let X be a nonnegative random variable. Then for any c > 0
P[X ≥ c] ≤ E[X]/c.
Idea of Proof:
E[X] = E[X·1_{0≤X<c} + X·1_{X≥c}] ≥ E[0·1_{0≤X<c} + c·1_{X≥c}] = c·P(X ≥ c).
Markov’s Inequality used just the first moment of X .
The next result uses also the variance.
Proposition (Chebyshev’s Inequality)
Let X be a random variable with a finite mean and finite variance. Then for any c > 0
we have
P(|X − E[X]| ≥ c) ≤ Var(X)/c².
Use c = k·StDev(X) to get that the probability that X is more than k standard deviations away
from its mean is at most 1/k².
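As an illustration (a minimal Python sketch, assuming X ∼ Exponential(1) so that E[X] = Var(X) = 1; not part of the slides), one can compare both bounds with a simulated tail probability:

```python
import numpy as np

# Hypothetical illustration: X ~ Exponential(1), so E[X] = 1 and Var(X) = 1.
rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=1_000_000)

c = 3.0
tail = np.mean(x >= c)            # simulated P(X >= c); exact value is e^{-3} ~ 0.0498
markov = 1.0 / c                  # Markov: P(X >= c) <= E[X]/c
chebyshev = 1.0 / (c - 1.0)**2    # Chebyshev: P(X >= c) <= P(|X - 1| >= c - 1) <= Var(X)/(c - 1)^2

print(f"simulated tail {tail:.4f}, Chebyshev bound {chebyshev:.4f}, Markov bound {markov:.4f}")
```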
Markov vs Chebyshev
Example
The class average on the test was 70% and the standard deviation was 10%.
Estimate the fraction of students who scored over 80%.
(a) Using Markov’s inequality
(b) Using Chebyshev’s inequality
(c) Using Gaussian approximation
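One possible back-of-the-envelope answer (a sketch, not worked out on the slide, with X the score of a randomly chosen student):
(a) Markov: P[X ≥ 80] ≤ E[X]/80 = 70/80 = 0.875.
(b) Chebyshev: P[X ≥ 80] ≤ P[|X − 70| ≥ 10] ≤ Var(X)/10² = 100/100 = 1, so here the bound carries no information.
(c) Gaussian approximation: P[X ≥ 80] ≈ P[Z ≥ (80 − 70)/10] = P[Z ≥ 1] ≈ 0.16.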
Example
An ice cream parlor has, on average, 1000 customers per day.
(a) What can be said about the probability that it will have
at least 1400 customers tomorrow?
Now, suppose that the variance of the number of customers to the ice cream parlor on
a given day is 200.
(b) What can be said about the probability that there will be more than 950 and less
than 1050 customers tomorrow?
(c) What can we now say about the probability that at least 1400 customers arrive?
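One way these might be answered (a sketch, not worked out on the slide, with X the number of customers tomorrow and E[X] = 1000):
(a) Markov: P[X ≥ 1400] ≤ 1000/1400 ≈ 0.71.
(b) Chebyshev: P[950 < X < 1050] = 1 − P[|X − 1000| ≥ 50] ≥ 1 − 200/50² = 1 − 0.08 = 0.92.
(c) Chebyshev again: P[X ≥ 1400] ≤ P[|X − 1000| ≥ 400] ≤ 200/400² = 0.00125, much sharper than the bound in (a).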
Law of Large Numbers and Monte Carlo
Laws of Large Numbers
Fixed Setup:
Sequence of i.i.d. random variables X1,X2, . . . ,Xn, . . . with
mean E[X1] = µ ∈ R and variance Var(X1) = σ2 ∈ R+.
Take their sum: S_n = X_1 + . . . + X_n = Σ_{i=1}^n X_i (for all n ≥ 1).
Empirical Mean/Average:
X̄_n := S_n/n = (1/n) Σ_{i=1}^n X_i (for all n ≥ 1).
For all n ≥ 1 we have:
E[S_n] = n·µ, Var(S_n) = n·σ²,
E[X̄_n] = µ, Var(X̄_n) = σ²/n.
Standardized Sum:
S_n^* := (S_n − E[S_n]) / √(Var(S_n)) = (S_n − n·µ) / √(n·σ²) (for all n ≥ 1).
Equivalently:
S_n^* = ((1/n)·S_n − (1/n)·n·µ) / ((1/n)·√(n·σ²)) = (X̄_n − µ) / √(σ²/n) = √n · (X̄_n − µ)/σ.
Check:
E[S_n^*] = 0, Var(S_n^*) = 1 (for all n ≥ 1).
Goal: Estimate X̄_n and S_n^* for n very large.
Proposition (Weak Law of Large Numbers (WLLN))
For any fixed ε > 0 we have
lim_{n→∞} P[|S_n/n − µ| ≥ ε] = 0.
We say that "the sequence S_n/n = X̄_n converges in probability to µ as n goes to infinity".
Proposition (Strong Law of Large Numbers (SLLN))
We have
P[ lim_{n→∞} S_n/n = µ ] = 1.
Alternative formulation:
There exists a set of scenarios A ⊂ Ω such that P[A] = 1 and
lim_{n→∞} S_n(ω)/n = lim_{n→∞} X̄_n(ω) = µ for all ω ∈ A.
We say that "the sequence S_n/n = X̄_n converges almost surely (i.e., with probability 1) to µ as n goes to infinity".
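A small simulation sketch of what the SLLN says (an assumed example, not from the slides): for fair coin flips, the running averages S_n/n settle down to µ = 0.5 as n grows.

```python
import numpy as np

# Hypothetical SLLN illustration: running averages of fair coin flips (mu = 0.5).
rng = np.random.default_rng(0)
flips = rng.integers(0, 2, size=100_000)                          # X_i in {0, 1}
running_mean = np.cumsum(flips) / np.arange(1, flips.size + 1)    # S_n / n for n = 1, ..., 100000

for n in (10, 100, 1_000, 10_000, 100_000):
    print(n, running_mean[n - 1])    # values approach 0.5 as n grows
```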
Weak vs. Strong:
▶ WLLN only guarantees that for large n the outcome of
the random variable S_n/n will be close to µ with high probability.
▶ SLLN guarantees that for every scenario (with probability 1)
the random variable S_n/n will converge to µ
(hence be very close to it for a large n).
▶ SLLN is the more relevant result in practice
▶ Most important application: Monte Carlo Simulation
Application of SLLN: Monte Carlo Method
Suppose that X is a given random variable
with unknown distribution
Goal: We want to find µ, the expected value of X :
E[X ] = µ
(or at least an approximation)
Idea: Approximate E[X] via X̄_n = (X_1 + . . . + X_n)/n for n large, by
using the Strong Law of Large Numbers! Here the X_i's are i.i.d. samples of X.
Monte Carlo approximation
Suppose we can generate a sequence of independent samples
x1, x2, x3 . . . drawn from the same distribution as X .
Then: The average (empirical mean or sample mean) satisfies
x̄_n = (x_1 + . . . + x_n)/n → E[X_1] = µ as n → ∞,
by the SLLN.
Hence: For n (= number of samples) very large,
x̄_n = (1/n)(x_1 + . . . + x_n) ≈ µ
is a good approximation.
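A minimal Monte Carlo sketch (an assumed example, not from the slides): estimating µ = E[e^Z] for Z ∼ N(0, 1), whose exact value is e^{1/2} ≈ 1.6487.

```python
import numpy as np

# Hypothetical example: estimate mu = E[exp(Z)] for Z ~ N(0, 1); exact value is exp(0.5).
rng = np.random.default_rng(0)
n = 1_000_000
samples = np.exp(rng.standard_normal(n))   # x_1, ..., x_n: i.i.d. samples of X = exp(Z)

print(samples.mean(), np.exp(0.5))         # sample mean vs. exact value ~ 1.6487
```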
Learning Probabilities
The same idea works for approximating probabilities:
Goal: Want to find:
P[X ∈ A]
We have:
P[X ∈ A] = E[1_{X∈A}] ≈ (1/n) Σ_{i=1}^n 1_{X_i∈A} (n very large)
where X_1, X_2, . . . , X_n are independent samples with the same distribution as X.
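A corresponding sketch for probabilities (an assumed example, not from the slides): estimating P[Z > 2] for Z ∼ N(0, 1), whose exact value is about 0.0228.

```python
import numpy as np

# Hypothetical example: estimate P(Z > 2) for Z ~ N(0, 1) as the average of indicators 1{Z_i > 2}.
rng = np.random.default_rng(0)
z = rng.standard_normal(1_000_000)

print(np.mean(z > 2))    # indicator average; exact value is about 0.0228
```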
Example (Coin flips)
Estimate the probability p ∈ [0, 1] of tossing heads.
Toss the coin n times. Then
(# of heads in n flips)/n → p as n → ∞
by the Strong Law of Large Numbers.
Note: This is in line with the relative frequency interpretation of probabilities:
probability of an event occurring = long-term proportion of times that the event occurs in repeated trials.
Simulating π
▶ Draw the unit square and
the quarter circle inside it.
▶ Throw a dart uniformly
onto the square
▶ If it lands inside the
quarter-circle, call it a
success. Outside it’s a
failure.
Simulating π
▶ P(success) = π/4
▶ Do this 1 million times. Then the fraction of successes will be very close to π/4.
▶ Can estimate π using probability: X ∼ Unif(0, 1), Y ∼ Unif(0, 1), independent.
▶ P(success) = P(X² + Y² ≤ 1) (see the simulation sketch after this list).
▶ This method works for ANY shape and any dimension. So for any shape A we can
estimate Area(A) with the above technique.
▶ Commonly used for very complex shapes in very high dimensions (d ≫ 1000)
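A minimal simulation of the dart-throwing scheme above (a sketch, not from the slides):

```python
import numpy as np

# Throw n "darts" uniformly onto the unit square; a success lands inside the quarter circle.
rng = np.random.default_rng(0)
n = 1_000_000
x = rng.uniform(size=n)
y = rng.uniform(size=n)
success = x**2 + y**2 <= 1.0

print(4 * success.mean())    # fraction of successes ~ pi/4, so this prints ~ 3.1416
```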
Estimating π Again
Example (Estimation of π)
The integral ∫_0^1 √(1 − x²) dx gives 1/4 of the area of the unit circle, so
4 ∫_0^1 √(1 − x²) dx = π = 3.1415926...
This gives another Monte Carlo estimator for π:
Sample X_i ∼ Unif(0, 1) i.i.d. Take
Ī_n = 4 · (1/n) Σ_{i=1}^n √(1 − X_i²).
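A possible implementation of this second estimator (a sketch, not from the slides):

```python
import numpy as np

# Estimate pi as I_n = 4 * (1/n) * sum_i sqrt(1 - X_i^2) with X_i ~ Unif(0, 1) i.i.d.
rng = np.random.default_rng(0)
x = rng.uniform(size=1_000_000)

print(4 * np.sqrt(1 - x**2).mean())    # ~ 3.1416
```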
Characteristic Function
▶ More generally, define the characteristic function of X as
φ_X(t) = E[e^{itX}],
where i = √−1 is the imaginary unit.
▶ φ_X always exists for all t ∈ R and all random variables X.
▶ It shares the same properties as the MGF; in particular,
φ_X^{(k)}(0) = i^k · E[X^k].
Central Limit Theorem
The Central Limit Theorem
Proposition (Central Limit Theorem (CLT))
Suppose that we have a sequence of i.i.d. random variables
X_1, X_2, . . . with mean µ ∈ R and variance σ² ∈ R_+.
Let S_n = X_1 + . . . + X_n for all n ≥ 1 and S_n^* be the standardized sum.