STAT3006/STAT7305 Assignment
Weighting: 30%
Instructions
• The assignment consists of five (5) problems, each worth 30 marks, and with each
mark having equal weight.
• The mathematical elements of the assignment can be completed by hand, in LaTeX (prefer-
ably), or in Word (or other typesetting software). The mathematical derivations and ma-
nipulations should be accompanied by clear explanations in English regarding necessary
information required to interpret the mathematical exposition.
• Mathematical problems should be answered with reference to results presented in the Lecture
Slides (citing page numbers), if required. If a mathematical result is used that is not presented
in the Lecture Notes, then its common name (e.g., “Bayes’ Theorem”, “Intermediate Value
Theorem”, “Boole’s Inequality”, etc.) should be cited, or else a reference to a text containing
the result should be provided (preferably a textbook; websites such as stackexchange.com
are inadvisable).
• Please ensure that you complete this assessment using the notation and language
provided in the Lecture Slides. The mathematical concepts covered in this course
may be presented differently in various sources across the literature and the internet, using
different notations. To maintain consistency and academic integrity, and to simplify the
grading process, please stick closely to the notations and conventions introduced in the
course whenever possible.
• All submission files should be labeled with your name and student number and
compiled to a single .PDF or .DOCX document, submitted via the appropriate
submission link. What you submit should be your own work. Even where working
from sources, you should endeavor to write in your own words. You should use consistent
notation throughout your assignment and define whatever is required.
Problem 1 [30 Marks]
Let f : R → R be a convex function and let θ = (ϑ1, . . . , ϑq) ∈ T ⊂ Rq and π = (ϖ1, . . . , ϖq) ∈ Rq≥0,
where ∑qj=1 ϖj = 1. The discrete Jensen’s inequality states that
\[
f(\langle \pi, \theta \rangle) \le \sum_{j=1}^{q} \varpi_j f(\vartheta_j).
\]
(a) Prove the discrete Jensen’s inequality.
[5 Marks]
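As a numerical sanity check (illustrative only, not a substitute for the requested proof), the inequality can be verified for a particular convex f; here a minimal sketch with f = exp and a randomly drawn weight vector (all variable names are my own):

```python
import math
import random

random.seed(0)

f = math.exp  # a convex function on R

q = 5
theta = [random.uniform(-2.0, 2.0) for _ in range(q)]  # (theta_1, ..., theta_q)
w = [random.random() for _ in range(q)]
s = sum(w)
w = [wj / s for wj in w]  # weights pi, normalized so that sum_j w_j = 1

lhs = f(sum(wj * tj for wj, tj in zip(w, theta)))  # f(<pi, theta>)
rhs = sum(wj * f(tj) for wj, tj in zip(w, theta))  # sum_j w_j f(theta_j)

assert lhs <= rhs + 1e-12  # discrete Jensen's inequality
```

Repeating the check with other seeds, dimensions q, or convex functions (e.g. u ↦ u²) gives the same conclusion.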
Let g : T→ R be defined by g (θ) = f (⟨α, θ⟩), where α = (a1, . . . , aq) ∈ Rq.
(b) Show that g has the majorizer mg : T × T → R, where
\[
m_g(\theta, \tau) = \sum_{j=1}^{q} \frac{1}{q}\, f\bigl( q a_j (\vartheta_j - t_j) + \langle \alpha, \tau \rangle \bigr),
\]
for each θ ∈ T and τ = (t1, . . . , tq) ∈ T.
[10 Marks]
Let XN = (Xn)n∈N be an independent and identically distributed (IID) sequence of random variables
on (Ω, F, P), where Xn = (Wn, Yn) ∈ W × Y = X, for each n ∈ N, where W ⊂ Rd and Y = R.
Suppose that each Xn is identically distributed to X = (W, Y) with pushforward measure space
(X, B(X), PX).
Consider the loss function:
\[
(X, \theta) \mapsto \ell(X, \theta) = |Y - \alpha - \langle W, \beta \rangle|^2,
\]
where θ = (α, β) ∈ T ⊂ R1+d, with associated sample risks:
\[
r_n(\theta; X_n) = \frac{1}{n} \sum_{i=1}^{n} \ell(X_i, \theta) = \frac{1}{n} \sum_{i=1}^{n} |Y_i - \alpha - \langle W_i, \beta \rangle|^2.
\]
(c) Using the majorizer from Part (b), propose an MM algorithm for computing
\[
\hat{\theta}_n = \operatorname*{arg\,min}_{\theta \in T} r_n(\theta; X_n)
\]
that does not require the matrix inversion operation. You may assume that T is
sufficiently large such that the resulting MM algorithm sequence θN is contained in
O ⊂ T, where O is an open set.
[10 Marks]
Let
\[
r(\theta) = \mathbb{E}_X[\ell(X, \theta)] = \mathbb{E}_X\bigl[ |Y - \alpha - \langle W, \beta \rangle|^2 \bigr]
\]
denote the expected risk, and write its minimizer as
\[
\theta_0 = \operatorname*{arg\,min}_{\theta \in T} r(\theta).
\]
(d) Discuss the assumptions required to ensure that
\[
\hat{\theta}_n \xrightarrow[n \to \infty]{\mathrm{a.s.}} \theta_0.
\]
[5 Marks]
Problem 2 [30 Marks]
Let T ⊂ Rq be a convex set. A function f : T → R is said to be strongly convex with modulus
µ > 0 if for each θ, τ ∈ T and λ ∈ [0, 1]:
\[
f(\lambda \theta + (1 - \lambda) \tau) \le \lambda f(\theta) + (1 - \lambda) f(\tau) - \mu \lambda (1 - \lambda) \|\theta - \tau\|_2^2.
\]
(a) Show that if f is strongly convex on T, then f is also strictly convex on T, and that f
is strongly convex on T with modulus µ > 0 if and only if
\[
\theta \mapsto f(\theta) - \mu \|\theta\|_2^2
\]
is convex on T.
[5 Marks]
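The defining inequality can be probed numerically for a concrete function (an illustrative check, not a proof). For instance, f(u) = 2u² should satisfy the inequality with µ = 1, consistent with f(u) − µu² = u² being convex:

```python
import random

random.seed(1)

mu = 1.0

def f(u):
    return 2.0 * u * u  # f(u) = 2u^2, so f(u) - mu*u^2 = u^2 is convex

ok = True
for _ in range(1000):
    th = random.uniform(-5.0, 5.0)
    ta = random.uniform(-5.0, 5.0)
    lam = random.random()
    mix = lam * th + (1.0 - lam) * ta
    lhs = f(mix)
    # strong convexity inequality with modulus mu
    rhs = lam * f(th) + (1.0 - lam) * f(ta) - mu * lam * (1.0 - lam) * (th - ta) ** 2
    ok = ok and (lhs <= rhs + 1e-9)

assert ok
```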
Let r : T → R be a continuous objective function with continuous majorizer m : T × T → R,
where T is compact. Suppose further that θN = (θ(s))s∈N is an MM sequence, initialized at some
θ(0) ∈ T, that satisfies the SUMMA condition:
\[
m\bigl(\theta, \theta^{(s)}\bigr) - m\bigl(\theta^{(s+1)}, \theta^{(s)}\bigr) \ge m\bigl(\theta, \theta^{(s+1)}\bigr) - r(\theta), \quad \text{for each } s \in \mathbb{N}.
\]
(b) Show that the objective sequence corresponding to θN, (r(θ(s)))s∈N, converges to the
global minimal value:
\[
\lim_{s \to \infty} r\bigl(\theta^{(s)}\bigr) = \min_{\theta \in T} r(\theta).
\]
[10 Marks]
Along with compactness, make the additional assumption that T is convex and that m (·, τ) : T→
R is uniformly strongly convex in the sense that m (·, τ) is strongly convex with modulus µ > 0,
for every τ ∈ T.
(c) Further show that θN converges to a minimum of r. That is, there exists a
\[
\theta^{*} \in \operatorname*{arg\,min}_{\theta \in T} r(\theta),
\]
such that
\[
\lim_{s \to \infty} \theta^{(s)} = \theta^{*}.
\]
[5 Marks]
Let T ⊂ O, let φ : O → R be a convex and differentiable function, and define the Bregman
divergence corresponding to φ:
\[
(\theta, \tau) \mapsto B_{\varphi}(\theta \,\|\, \tau) = \varphi(\theta) - \varphi(\tau) - \langle \nabla \varphi(\tau), \theta - \tau \rangle.
\]
(d) Supposing that r : T → R is convex and differentiable, show that
\[
(\theta, \tau) \mapsto m_{\varphi}(\theta, \tau) = r(\theta) + B_{\varphi}(\theta \,\|\, \tau)
\]
is a majorizer of r, and that mφ and r together satisfy the SUMMA condition.
[10 Marks]
Problem 3 [30 Marks]
Let P and Q be probability measures on the measurable space (Ω,F), where P and Q are both
absolutely continuous with respect to some dominating measure M. In particular, we suppose that
P and Q have Radon–Nikodym derivatives p and q, with respect to M, respectively.
We define the total variation distance between the measures P and Q as
\[
\mathrm{TV}(P, Q) = \sup_{A \in \mathcal{F}} |P(A) - Q(A)|.
\]
(a) Show that
\[
\mathrm{TV}(P, Q) = \frac{1}{2} \int_{\Omega} |p(\omega) - q(\omega)|\, M(\mathrm{d}\omega) = 1 - \int_{\Omega} \min\{p(\omega), q(\omega)\}\, M(\mathrm{d}\omega).
\]
[10 Marks]
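On a finite sample space, the supremum over events can be enumerated directly, so the two integral identities in Part (a) can be sanity-checked numerically. A minimal sketch (the two pmfs are arbitrary made-up examples):

```python
import itertools

p = [0.2, 0.5, 0.1, 0.2]  # pmf of P on a 4-point space
q = [0.4, 0.1, 0.3, 0.2]  # pmf of Q on the same space

# sup over all events A (all subsets of the finite space)
n = len(p)
tv_sup = max(abs(sum(p[i] for i in A) - sum(q[i] for i in A))
             for k in range(n + 1)
             for A in itertools.combinations(range(n), k))

tv_l1 = 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))  # (1/2) * L1 distance
tv_min = 1.0 - sum(min(pi, qi) for pi, qi in zip(p, q))  # 1 - sum of minima

assert abs(tv_sup - tv_l1) < 1e-12 and abs(tv_l1 - tv_min) < 1e-12
```

All three expressions agree (here they equal 0.4), with the supremum attained at the event {ω : p(ω) > q(ω)}.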
An alternative distance between measures is the Hellinger distance, given by
\[
H(P, Q) = \left\{ \int_{\Omega} \left| \sqrt{p(\omega)} - \sqrt{q(\omega)} \right|^2 M(\mathrm{d}\omega) \right\}^{1/2}.
\]
(b) Show that
\[
\frac{1}{2} \{H(P, Q)\}^2 \le \mathrm{TV}(P, Q) \le H(P, Q) \sqrt{1 - \left\{ \frac{H(P, Q)}{2} \right\}^2}.
\]
Hint: Use the fact that min(x, y) = x/2 + y/2 − |x − y|/2, for each x, y ∈ R, and the
integral form of the Cauchy–Schwarz inequality, for f, g ∈ L2(Ω, F, M):
\[
\left| \int_{\Omega} f g \, \mathrm{d}M \right|^2 \le \int_{\Omega} f^2 \, \mathrm{d}M \int_{\Omega} g^2 \, \mathrm{d}M.
\]
[10 Marks]
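Both bounds can be checked numerically on the same kind of finite example as above (again illustrative only; the pmfs are arbitrary):

```python
import math

p = [0.2, 0.5, 0.1, 0.2]  # pmf of P
q = [0.4, 0.1, 0.3, 0.2]  # pmf of Q

tv = 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))
h = math.sqrt(sum((math.sqrt(pi) - math.sqrt(qi)) ** 2 for pi, qi in zip(p, q)))

# lower bound: (1/2) H^2 <= TV
assert 0.5 * h * h <= tv + 1e-12
# upper bound: TV <= H * sqrt(1 - (H/2)^2)
assert tv <= h * math.sqrt(1.0 - (h / 2.0) ** 2) + 1e-12
```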
Let (fn)n∈N be a sequence of (F → B(R))-measurable functions on (Ω, F, M), and suppose that fn
converges to an (F → B(R))-measurable function f, almost everywhere. Suppose further that there
exists a dominating (F → B(R))-measurable function ∆, such that for almost every ω ∈ Ω and
every n ∈ N, |f(ω)| + |fn(ω)| ≤ ∆(ω), where
\[
\int_{\Omega} \Delta \, \mathrm{d}M < \infty.
\]
(c) Using the Dominated Convergence Theorem, show that
\[
\int_{\Omega} |f_n - f| \, \mathrm{d}M \xrightarrow[n \to \infty]{} 0.
\]
[5 Marks]
Let (Pn)n∈N be a sequence of measures on (Ω,F) with sequence of Radon–Nikodym derivatives
(pn)n∈N, with respect to the dominating measure M. Further, let P be another measure on (Ω,F),
with Radon–Nikodym derivative p.
Scheffé’s Theorem states that if limn→∞ pn (ω) = p (ω), for almost every ω ∈ Ω, with respect
to the measure M, then (Pn)n∈N converges to P in total variation, in the sense that
\[
\mathrm{TV}(P_n, P) \xrightarrow[n \to \infty]{} 0.
\]
(d) Using the conclusion of Part (c), prove Scheffé’s Theorem.
[5 Marks]
Problem 4 [30 Marks]
Let C ⊂ Rq be a compact (closed and bounded) set. We define the Euclidean projection of a point
θ ∈ Rq onto the set C by
\[
\operatorname{proj}_{C}(\theta) = \operatorname*{arg\,min}_{\tau \in C} \|\theta - \tau\|_2^2.
\]
Recall that the closed Euclidean ball of radius ρ > 0, centered at υ ∈ Rq, can be written as
\[
\bar{B}_{\rho}(\upsilon) = \{\tau \in \mathbb{R}^q : \|\upsilon - \tau\|_2 \le \rho\}.
\]
(a) Show that
\[
\operatorname{proj}_{\bar{B}_{\rho}(\upsilon)}(\theta) =
\begin{cases}
\upsilon + \dfrac{\rho}{\|\theta - \upsilon\|_2} (\theta - \upsilon) & \text{if } \theta \notin \bar{B}_{\rho}(\upsilon), \\[1ex]
\theta & \text{if } \theta \in \bar{B}_{\rho}(\upsilon).
\end{cases}
\]
[5 Marks]
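The claimed formula is easy to exercise numerically (a sketch, not the requested derivation; the function name and test points are my own):

```python
import math

def proj_ball(theta, center, rho):
    """Euclidean projection onto the closed ball of radius rho about center."""
    diff = [t - c for t, c in zip(theta, center)]
    dist = math.sqrt(sum(d * d for d in diff))
    if dist <= rho:
        return list(theta)  # already inside: the projection is the point itself
    return [c + (rho / dist) * d for c, d in zip(center, diff)]

# a point outside the unit ball about the origin projects onto the boundary
x = proj_ball([3.0, 4.0], [0.0, 0.0], 1.0)
assert abs(math.hypot(*x) - 1.0) < 1e-12           # lands on the sphere
assert abs(x[0] - 0.6) < 1e-12 and abs(x[1] - 0.8) < 1e-12
```

The projected point lies on the segment from the center toward θ, at distance exactly ρ, which matches the geometric picture behind the formula.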
Consider instead the closed half-space:
\[
H(\alpha, b) = \{\tau \in \mathbb{R}^q : \langle \alpha, \tau \rangle \le b\},
\]
where α ∈ Rq \ {0} and b ∈ R.
(b) Show that
\[
\operatorname{proj}_{H(\alpha, b)}(\theta) =
\begin{cases}
\theta - \dfrac{\langle \alpha, \theta \rangle - b}{\|\alpha\|_2^2} \, \alpha & \text{if } \theta \notin H(\alpha, b), \\[1ex]
\theta & \text{if } \theta \in H(\alpha, b).
\end{cases}
\]
[10 Marks]
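As with the ball, the half-space projection can be exercised numerically (an illustrative sketch; names and test points are my own):

```python
def proj_halfspace(theta, a, b):
    """Projection onto H(a, b) = {tau : <a, tau> <= b}, with a != 0."""
    slack = sum(ai * ti for ai, ti in zip(a, theta)) - b
    if slack <= 0:
        return list(theta)  # theta is already feasible
    nrm2 = sum(ai * ai for ai in a)
    return [ti - (slack / nrm2) * ai for ti, ai in zip(theta, a)]

# constraint: first coordinate <= 1; the infeasible point (2, 2) moves to (1, 2)
x = proj_halfspace([2.0, 2.0], [1.0, 0.0], 1.0)
assert x == [1.0, 2.0]
```

When θ is infeasible, the projection moves θ along −α just far enough that the constraint holds with equality, i.e. onto the bounding hyperplane.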
Let r : T → R be a differentiable function, where T ⊂ Rq. Assume that r is L-smooth, in the sense
that
\[
\|\nabla r(\theta) - \nabla r(\tau)\|_2 \le L \|\theta - \tau\|_2,
\]
for each θ, τ ∈ T. Suppose that C ⊂ T is a closed and convex set.
We say that θN = (θ(s))s∈N is a projected gradient descent sequence (initialized at θ(0) ∈ C)
if, for each s ∈ N,
\[
\theta^{(s)} = \operatorname{proj}_{C}\!\left( \theta^{(s-1)} - \frac{1}{L} \nabla r\bigl(\theta^{(s-1)}\bigr) \right).
\]
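The iteration above can be sketched numerically. A minimal example (the objective, constraint set, and names are illustrative, not taken from the assignment) minimizes r(θ) = (θ1 − 2)² + 2θ2² over the closed unit ball, for which L = 4; the constrained minimizer is (1, 0):

```python
import math

def proj_ball(theta, rho=1.0):
    """Projection onto the closed ball of radius rho about the origin."""
    nrm = math.sqrt(sum(t * t for t in theta))
    return list(theta) if nrm <= rho else [(rho / nrm) * t for t in theta]

def grad_r(theta):  # r(theta) = (theta_1 - 2)^2 + 2 * theta_2^2
    return [2.0 * (theta[0] - 2.0), 4.0 * theta[1]]

L = 4.0             # Lipschitz constant of grad r (Hessian is diag(2, 4))
theta = [0.0, 0.5]  # theta^(0) in C
for _ in range(200):
    step = [t - g / L for t, g in zip(theta, grad_r(theta))]
    theta = proj_ball(step)  # projected gradient step with step size 1/L

# converges to the constrained minimizer (1, 0) on the boundary of the ball
assert abs(theta[0] - 1.0) < 1e-6 and abs(theta[1]) < 1e-6
```

The unconstrained minimizer (2, 0) lies outside the ball, so the iterates settle on the boundary point where the projected gradient step is a fixed point.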
(c) Show that the projected gradient descent sequence is actually an MM sequence, in the
sense that there exists a majorizer m : T × T → R of r such that, for each s ∈ N:
\[
\theta^{(s)} = \operatorname*{arg\,min}_{\theta \in C} m\bigl(\theta, \theta^{(s-1)}\bigr).
\]
[10 Marks]
(d) Discuss whether you can conclude that the sequence θN converges to the set of minima
of r:
\[
\operatorname*{arg\,min}_{\theta \in C} r(\theta).
\]
If the available information is not sufficient to establish convergence to the minima of
r, suggest some additional assumptions that will permit the conclusion of convergence.
Furthermore, suggest assumptions that would be required to quantify the rate at which
the sequence of values (r(θ(s)))s∈N converges to
\[
\min_{\theta \in C} r(\theta).
\]
[5 Marks]
Problem 5 [30 Marks]
For each δ > 0, define the error function:
\[
E_{\delta}(u) =
\begin{cases}
\delta^2 & \text{if } |u| > \delta, \\
u^2 & \text{if } |u| \le \delta.
\end{cases}
\]
(a) Show that mδ : R × R → R is a majorizer for Eδ, where
\[
m_{\delta}(u, v) =
\begin{cases}
\delta^2 + (u - v)^2 & \text{if } |v| > \delta, \\
u^2 & \text{if } |v| \le \delta.
\end{cases}
\]
[5 Marks]
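The two defining properties of a majorizer, domination (mδ(u, v) ≥ Eδ(u) for all u, v) and tangency (mδ(v, v) = Eδ(v)), can be checked on a grid (a numeric sketch only, not the requested proof):

```python
def E(u, delta):
    return delta ** 2 if abs(u) > delta else u ** 2

def m(u, v, delta):
    return delta ** 2 + (u - v) ** 2 if abs(v) > delta else u ** 2

delta = 1.0
grid = [i / 10.0 for i in range(-50, 51)]  # grid of points in [-5, 5]

# domination: m(u, v) >= E(u) for all u, v on the grid
dom = all(m(u, v, delta) >= E(u, delta) - 1e-12 for u in grid for v in grid)
# tangency: m(v, v) = E(v) for all v on the grid
tan = all(abs(m(v, v, delta) - E(v, delta)) < 1e-12 for v in grid)

assert dom and tan
```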
Let XN = (Xn)n∈N be an IID sequence of random variables on (Ω, F, P), where Xn = (Wn, Yn) ∈
W × Y = X, for each n ∈ N, where W ⊂ Rd and Y = R. Suppose that each Xn is identically
distributed to X = (W, Y) with pushforward measure space (X, B(X), PX), and define the loss
function, based on Eδ:
\[
(X, \theta) \mapsto \ell_{\delta}(X, \theta) = E_{\delta}(Y - \alpha - \langle W, \beta \rangle),
\]
where θ = (α, β) ∈ T ⊂ R1+d, with associated sample risks:
\[
r_n(\theta; X_n) = \frac{1}{n} \sum_{i=1}^{n} \ell_{\delta}(X_i, \theta) = \frac{1}{n} \sum_{i=1}^{n} E_{\delta}(Y_i - \alpha - \langle W_i, \beta \rangle).
\]
(b) Using the result from Part (a), derive a majorizer mn : T× T→ R for rn (·;Xn).
[10 Marks]
(c) Following from Part (b), produce an MM algorithm:
\[
A\bigl(\theta^{(s-1)}\bigr) = \operatorname*{arg\,min}_{\theta \in T} m_n\bigl(\theta, \theta^{(s-1)}\bigr),
\]
for the problem of finding the minima of rn(·; Xn):
\[
\operatorname*{arg\,min}_{\theta \in T} r_n(\theta; X_n).
\]
You may assume that T is sufficiently large such that A(θ(s−1)) ∈ O, for every s ∈ N,
where O ⊂ T is an open set.
[10 Marks]
(d) Define the MM algorithm sequence θN = (θ(s))s∈N (initialized at some θ(0) ∈ T) by
\[
\theta^{(s)} = A\bigl(\theta^{(s-1)}\bigr),
\]
for each s ∈ N. Discuss whether there is sufficient information to conclude that θN
converges to the minima of rn(·; Xn) or to the set of critical points
\[
K_G\bigl(r_n(\cdot\,; X_n), T\bigr).
\]
Within the framework of this course, are there additional assumptions that we can
make to guarantee the convergence of the MM algorithm sequence to either the set of
minima or the set of critical points of rn(·; Xn)? If not, what assumptions are violated
that prevent us from making such conclusions?