STAT - S611
Applied Statistical Computing
Section 12: Basic Optimization
Maximum Likelihood Estimation
Consider a data vector X = (X_1, . . . , X_n) with
\[
X_i \overset{iid}{\sim} p(\cdot \mid \theta), \quad i = 1, \dots, n,
\]
where θ is a p × 1 column vector. The joint density of X is
\[
p(x_1, \dots, x_n \mid \theta) = \prod_{i=1}^{n} p(x_i \mid \theta) = L_X(\theta).
\]
When seen as a function of θ, the function L_X(θ) is random (we get a different one for each realization of X). For a particular sample, we call this function the “likelihood function”.
Maximum Likelihood Estimation
A popular way of estimating θ from a sample X is to choose the value (vector) θ̂ that maximizes the likelihood:
\[
\hat{\theta} = \arg\max_{\theta} L_X(\theta).
\]
This is known as a “maximum likelihood estimator” (MLE).
In general, it’s easier to work with the log-likelihood:
\[
l_X(\theta) = \log L_X(\theta).
\]
Since log is a monotonic function, any θ that maximizes l_X(θ) also maximizes L_X(θ).
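As a small illustration (not part of the slides), the following R sketch computes the MLE of an exponential rate by maximizing the log-likelihood numerically with optimize(); the simulated data and the search interval are illustrative choices.

## Numerical MLE of an exponential rate via the log-likelihood.
set.seed(1)
x <- rexp(100, rate = 2)                               # simulated data, true rate = 2
loglik <- function(theta) sum(dexp(x, rate = theta, log = TRUE))
opt <- optimize(loglik, interval = c(1e-6, 20), maximum = TRUE)
c(numerical = opt$maximum, closed_form = 1 / mean(x))  # should agree closely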
Score and Information Matrix
▶ The function l̇_X(θ) is known as the score function.
▶ The MLE often (when l̈_X(θ)|_{θ=θ̂} < 0) satisfies the likelihood equation:
\[
\dot{l}_X(\theta)\Big|_{\theta=\hat{\theta}} = 0.
\]
▶ The Fisher information matrix in X about θ is defined as
\[
I(\theta) = -E_{\theta}\big[\ddot{l}_X(\theta)\big],
\]
where the expectation is taken over X assuming X ∼ p(⋅∣θ). Under some mild regularity conditions we have that
\[
I(\theta) = E_{\theta}\big[\dot{l}_X(\theta)^2\big].
\]
Score and Information Matrix
▶ Under regularity conditions, when the true value of θ is θ_0, the asymptotic variance of θ̂ is
\[
V_{\theta_0}(\hat{\theta}) = \Big[-E_{\theta}\,\ddot{l}_X(\theta)\Big]^{-1}\Big|_{\theta_0} = I(\theta_0)^{-1}.
\]
▶ This requires θ_0 to be known. When it is unknown, we can plug in θ̂ and estimate the corresponding standard error via (see the sketch below)
\[
\text{Estimated F.I.:}\;\; \big[I(\hat{\theta})\big]^{-1/2}; \qquad
\text{Observed F.I.:}\;\; \big[-\ddot{l}_x(\hat{\theta})\big]^{-1/2}.
\]
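A numerical sketch (not from the slides) of the observed-information standard error, using the numerical Hessian of the negative log-likelihood returned by optim(); the exponential model and sample are illustrative.

## Observed Fisher information standard error for an exponential rate.
set.seed(1)
x <- rexp(100, rate = 2)
negloglik <- function(theta) -sum(dexp(x, rate = theta, log = TRUE))
fit <- optim(par = 1, fn = negloglik, method = "BFGS", hessian = TRUE)
c(mle = fit$par, se = sqrt(1 / fit$hessian[1, 1]))     # [-l''(theta_hat)]^{-1/2}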
Score and Information Matrix
▶ For multivariate θ = (θ_1, θ_2, . . . , θ_p)^T we have
\[
\dot{l}(\theta) = \left(\frac{\partial l(\theta)}{\partial \theta_1}, \dots, \frac{\partial l(\theta)}{\partial \theta_p}\right)^T,
\qquad
\big[\ddot{l}(\theta)\big]_{ij} = \frac{\partial^2 l(\theta)}{\partial \theta_i\, \partial \theta_j}.
\]
▶ At points where l achieves its maximum we have that l̇(θ) = 0 and the Hessian l̈(θ) is negative semidefinite (negative definite at a strict local maximum). A quick numerical check follows.
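A quick numerical check (a sketch, not part of the slides, assuming the numDeriv package is available): at the closed-form normal MLE the score is approximately zero and the Hessian of the log-likelihood is negative definite.

## Score and Hessian of a normal log-likelihood in theta = (mu, log sigma^2).
library(numDeriv)
set.seed(1)
y <- rnorm(50, mean = 1, sd = 2)
loglik <- function(theta) sum(dnorm(y, theta[1], sqrt(exp(theta[2])), log = TRUE))
theta_hat <- c(mean(y), log(mean((y - mean(y))^2)))    # closed-form MLE
grad(loglik, theta_hat)       # score: approximately (0, 0) at the MLE
hessian(loglik, theta_hat)    # negative definite at the maximum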
Solving the MLE problem: Newton’s method and Scoring
Newton’s Method:
Let f : ℝ → ℝ be twice differentiable and bounded from below. We want to find the point θ* where f attains its minimum.¹
Newton’s method is based on a local quadratic approximation. Consider a small neighborhood of a current guess, θ_k. We can get a second-order Taylor approximation:
\[
f(\theta) \approx f(\theta_k) + (\theta - \theta_k)\, f'(\theta_k) + \tfrac{1}{2}(\theta - \theta_k)^2 f''(\theta_k) = f_{\theta_k}(\theta).
\]
¹ Although we really want the location of a maximum, the usual language in optimization is phrased in terms of finding minima; for this we only need to take the negative of our objective function.
Newton’s method
Instead of trying to minimize f(θ) directly, we minimize f_{θ_k}(θ). Since this is a quadratic function, its minimum exists whenever f''(θ_k) > 0 and can be located by taking the first derivative with respect to θ,
\[
\frac{d f_{\theta_k}(\theta)}{d\theta} = f'(\theta_k) + (\theta - \theta_k)\, f''(\theta_k),
\]
and finding the θ = θ_{k+1} that makes it equal to zero:
\[
f'(\theta_k) + (\theta_{k+1} - \theta_k)\, f''(\theta_k) = 0
\;\Longrightarrow\;
\theta_{k+1} - \theta_k = -\frac{f'(\theta_k)}{f''(\theta_k)}
\;\Longrightarrow\;
\theta_{k+1} = \theta_k - \frac{f'(\theta_k)}{f''(\theta_k)}.
\]
We now take θ_{k+1} as our current guess, and find a θ_{k+2} following the same scheme until convergence (a minimal R sketch follows).
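A minimal R sketch of this one-dimensional iteration (the names newton1d, f1, f2 and the test function are illustrative, not from the slides):

## One-dimensional Newton iteration: theta <- theta - f'(theta) / f''(theta).
newton1d <- function(f1, f2, theta0, tol = 1e-8, maxit = 100) {
  theta <- theta0
  for (k in 1:maxit) {
    step <- f1(theta) / f2(theta)     # f'(theta_k) / f''(theta_k)
    theta <- theta - step
    if (abs(step) < tol) break        # stop when the update is negligible
  }
  theta
}
## Example: minimize f(theta) = theta^4 - 3 theta; the minimizer is (3/4)^(1/3).
newton1d(function(x) 4 * x^3 - 3, function(x) 12 * x^2, theta0 = 1)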
Multidimensional Target Functions
When f : ℝ^p → ℝ, this scheme generalizes to
\[
\theta_{k+1} = \theta_k - \big[\nabla^2 f(\theta_k)\big]^{-1} \nabla f(\theta_k),
\]
where ∇f : ℝ^p → ℝ^p and ∇²f : ℝ^p → ℝ^{p×p} are the gradient and Hessian matrix of f, respectively.
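In R, one multivariate Newton update can be written as the sketch below (grad_f and hess_f are user-supplied functions; the names are illustrative). Using solve(H, g) avoids forming the explicit inverse of the Hessian.

## One multivariate Newton step: theta - [Hessian]^{-1} gradient.
newton_step <- function(theta, grad_f, hess_f) {
  theta - solve(hess_f(theta), grad_f(theta))
}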
Example
Consider the normal model Y_i iid∼ N(µ, σ²), i = 1, . . . , n. The likelihood is
\[
p(y_1, \dots, y_n \mid \mu, \sigma^2) = \prod_{i=1}^{n} N(y_i \mid \mu, \sigma^2),
\]
so the negative log-likelihood, up to an additive constant, is
\[
f(\mu, \sigma^2) = -\sum_{i=1}^{n} \log N(y_i \mid \mu, \sigma^2)
= -\sum_{i=1}^{n} \log\!\left(\frac{1}{\sqrt{2\pi\sigma^2}}\,
  e^{-\frac{1}{2}\left(\frac{y_i - \mu}{\sigma}\right)^{2}}\right)
\]
\[
\equiv \frac{n}{2}\log\sigma^2 + \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \mu)^2
= \frac{n}{2}\log\sigma^2 + \frac{1}{2\sigma^2}\big[(n-1)s^2 + n(\bar{y} - \mu)^2\big],
\]
where s² = (1/(n−1)) ∑_{i=1}^{n}(y_i − ȳ)² and ȳ = (1/n) ∑_{i=1}^{n} y_i. A quick numerical check of the last identity follows.
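A quick numerical check (a sketch, not part of the slides; the simulated data are illustrative): the simplified expression and the exact negative log-likelihood should differ only by the constant (n/2) log(2π), so the difference of differences below is essentially zero.

## Verify that the simplified expression matches -sum(log dnorm) up to a constant.
set.seed(1)
y <- rnorm(50, mean = 1, sd = 2)
n <- length(y); s2 <- var(y); ybar <- mean(y)
nll   <- function(mu, sig2) -sum(dnorm(y, mu, sqrt(sig2), log = TRUE))
simpl <- function(mu, sig2) n / 2 * log(sig2) + ((n - 1) * s2 + n * (ybar - mu)^2) / (2 * sig2)
(nll(0, 1) - simpl(0, 1)) - (nll(2, 3) - simpl(2, 3))  # ~ 0: constant difference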
Example
Since σ² > 0, we reparametrize to γ = log σ². Then
\[
f(\mu, \gamma) = \frac{n}{2}\gamma + \frac{e^{-\gamma}}{2}\big[(n-1)s^2 + n(\bar{y} - \mu)^2\big],
\]
\[
\nabla f(\mu, \gamma) =
\begin{bmatrix}
n(\mu - \bar{y})\, e^{-\gamma} \\[4pt]
\dfrac{n}{2} - \dfrac{e^{-\gamma}}{2}\big[(n-1)s^2 + n(\bar{y} - \mu)^2\big]
\end{bmatrix},
\qquad
\nabla^2 f(\mu, \gamma) =
\begin{bmatrix}
n e^{-\gamma} & -n(\mu - \bar{y})\, e^{-\gamma} \\[4pt]
-n(\mu - \bar{y})\, e^{-\gamma} & \dfrac{e^{-\gamma}}{2}\big[(n-1)s^2 + n(\bar{y} - \mu)^2\big]
\end{bmatrix}.
\]
(see R demo; a possible version is sketched below)
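A possible version of the R demo (a sketch; the actual demo may differ): Newton's method on f(µ, γ) with the gradient and Hessian displayed above, using simulated data and a starting guess chosen for illustration.

## Newton's method for the normal model in (mu, gamma), gamma = log sigma^2.
set.seed(1)
y <- rnorm(100, mean = 1, sd = 2)
n <- length(y); ybar <- mean(y); s2 <- var(y)
grad_f <- function(th) {                           # th = c(mu, gamma)
  mu <- th[1]; eg <- exp(-th[2])
  c(n * (mu - ybar) * eg,
    n / 2 - eg / 2 * ((n - 1) * s2 + n * (ybar - mu)^2))
}
hess_f <- function(th) {
  mu <- th[1]; eg <- exp(-th[2])
  matrix(c(n * eg,                 -n * (mu - ybar) * eg,
           -n * (mu - ybar) * eg,   eg / 2 * ((n - 1) * s2 + n * (ybar - mu)^2)),
         nrow = 2, byrow = TRUE)
}
theta <- c(0, 0)                                   # starting guess (mu, gamma)
for (k in 1:50) {
  step <- solve(hess_f(theta), grad_f(theta))
  theta <- theta - step
  if (sqrt(sum(step^2)) < 1e-10) break
}
c(mu = theta[1], sigma2 = exp(theta[2]))           # compare: mean(y), mean((y - ybar)^2)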
Issues
▶ Convergence is not guaranteed!
▶ Not a descent method: f(θ_{k+1}) < f(θ_k) is not guaranteed.
▶ May converge to a local minimum (unavoidable; all local optimization methods suffer from this problem).
▶ Unstable if ∇²f(θ) is close to singular.
Positive definite approximations
▶ We can enforce a descent algorithm by replacing ∇²f(θ_k) with a positive-definite matrix A_k:
\[
\Delta\theta_k = -A_k^{-1} \nabla f(\theta_k).
\]
Reminder: A is positive definite if for any column vector y_{p×1} ≠ 0 we have y^T A y > 0.
▶ Then a sufficiently small movement in the direction of ∆θ_k guarantees a decrease in f(θ):
\[
f(\theta_k + \alpha\,\Delta\theta_k) - f(\theta_k)
= \nabla f(\theta_k)^T (\alpha\,\Delta\theta_k) + o(\alpha)
= -\alpha \underbrace{\nabla f(\theta_k)^T A_k^{-1} \nabla f(\theta_k)}_{y^T A_k^{-1} y\, >\, 0} + o(\alpha),
\]
which is negative for sufficiently small α > 0. This requires identifying an appropriate α.
Step-halving
A simple way of obtaining the constant α is known as step-halving (an R sketch follows the algorithm). At iteration k:
1. Let ∆θ_k ← −A_k^{-1} ∇f(θ_k).
2. Set α ← 1.
3. Set θ* ← θ_k + α ∆θ_k.
4. If f(θ*) ≥ f(θ_k), set α ← α/2 and go back to step 3.
5. Take θ_{k+1} = θ*.
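An R sketch of one descent update with step-halving (all names are illustrative; A_f is assumed to return a positive-definite matrix, for example the Hessian whenever it is positive definite):

## One descent update with step-halving: accept the first alpha that decreases f.
descent_step <- function(f, grad_f, A_f, theta) {
  delta <- -solve(A_f(theta), grad_f(theta))       # Delta theta_k
  alpha <- 1
  repeat {
    theta_new <- theta + alpha * delta
    if (f(theta_new) < f(theta)) return(theta_new) # decrease achieved: accept
    alpha <- alpha / 2                             # otherwise halve and retry
    if (alpha < 1e-12) return(theta)               # safeguard: give up on a tiny step
  }
}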
Scoring: