STAT3017/7017 - Big Data Statistics - Assessment 5 Page 1 of 3
Assessment 5
Due by Thursday 3 November 2022 09:00
[Total Marks: 28 (STAT7017) / 24 (STAT3017)]
Question 1 [6 marks]
Suppose we have p-dimensional independent samples x1, . . . , xn ∼ Np(0, Σ) where

Σ = Ip + ∆ with ∆ = diag(ν, . . . , ν, 0, . . . , 0),

the first m diagonal entries equal to ν ∈ ℝ. Let Sn be the sample covariance matrix of x1, . . . , xn and let yn := p/n be the (finite-horizon) dimension-to-sample-size ratio.
(a)[2] Suppose ν = 0. How would you expect the eigenvalues λ1 ≥ λ2 ≥ · · · ≥ λp ≥ 0 of the sample covariance Sn to behave when p and n become large? Between which values [a, b] do you expect the eigenvalues to be distributed? How will the largest eigenvalue λ1 behave with respect to b?
(b)[2] When ν ≠ 0, do the values of ν and m affect the behaviour discussed in (a)? Are there any critical transition points for ν at which the behaviour of the largest eigenvalues of Sn changes?
(c)[2] Through the use of simulations and appropriate figures, illustrate the various situations
that you have discussed in (a) and (b). Do this in the case p = 100 and n = 300.
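A minimal sketch of the simulation in (c), assuming a Gaussian model and the Marchenko–Pastur support [(1 − √y)², (1 + √y)²] for the null case. The spike size ν and count m below are illustrative choices, not values fixed by the question.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 100, 300
y = p / n

# Null case (nu = 0): Sigma = I_p; the spectrum of S_n should fill the
# Marchenko-Pastur support [a, b] = [(1 - sqrt(y))^2, (1 + sqrt(y))^2],
# with lambda_1 converging to the upper edge b.
X = rng.standard_normal((p, n))
S = X @ X.T / n
evals_null = np.sort(np.linalg.eigvalsh(S))[::-1]
a, b = (1 - np.sqrt(y)) ** 2, (1 + np.sqrt(y)) ** 2
print(f"MP support [{a:.3f}, {b:.3f}], lambda_1 = {evals_null[0]:.3f}")

# Spiked case (nu != 0): first m diagonal entries of Sigma are 1 + nu.
# nu and m here are illustrative; nu > sqrt(y) puts the spikes above the
# phase transition, so m outliers should detach from the bulk.
m_spike, nu = 5, 2.0
d = np.ones(p)
d[:m_spike] += nu
Xs = rng.standard_normal((p, n)) * np.sqrt(d)[:, None]
Ss = Xs @ Xs.T / n
evals_spike = np.sort(np.linalg.eigvalsh(Ss))[::-1]
print("top spiked eigenvalues:", np.round(evals_spike[:m_spike + 2], 3))
```

For figures, a histogram of `evals_null` against the Marchenko–Pastur density and one of `evals_spike` with the outliers marked would cover the situations in (a) and (b).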
Question 2 [18 marks]
Consider two p-dimensional populations with covariance matrices Σ1 and Σ2, and suppose we have p-dimensional random samples x1, . . . , xm+1 ∼ Np(0, Σ1) and p-dimensional random samples z1, . . . , zn+1 ∼ Np(0, Σ2). We stack these random samples to obtain the data matrices X and Z and the sample covariance matrices

S1 := (1/m) XXᵀ,  S2 := (1/n) ZZᵀ,  F := S2⁻¹ S1.
(a)[2] Assume n, m, p → ∞ such that yp := p/n → y ∈ (0, 1) and cp := p/m → c > 0. For Σ1 = Σ2 = Ip, y = 1/2, and c = 1/4, what is the upper bound b of the limiting spectral distribution of F?
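For orientation, a sketch computing a candidate edge for (a). It assumes the standard F-matrix (Wachter) support endpoints (1 ∓ h)²/(1 − y)² with h = √(c + y − cy); treat that formula as something to verify against [B], not as given.

```python
import numpy as np

def fisher_support(y, c):
    """Assumed support endpoints of the Fisher-matrix LSD:
    [(1 - h)^2 / (1 - y)^2, (1 + h)^2 / (1 - y)^2], h = sqrt(c + y - c*y).
    This formula is an assumption to check against the references."""
    h = np.sqrt(c + y - c * y)
    return (1 - h) ** 2 / (1 - y) ** 2, (1 + h) ** 2 / (1 - y) ** 2

a_edge, b_edge = fisher_support(y=1 / 2, c=1 / 4)
print(f"candidate support: [{a_edge:.4f}, {b_edge:.4f}]")

# Monte Carlo sanity check with Sigma1 = Sigma2 = I_p.
rng = np.random.default_rng(1)
p = 200
m, n = 4 * p, 2 * p                  # so c_p = 1/4 and y_p = 1/2
X = rng.standard_normal((p, m + 1))
Z = rng.standard_normal((p, n + 1))
S1 = X @ X.T / m
S2 = Z @ Z.T / n
F = np.linalg.solve(S2, S1)          # S2^{-1} S1
lam = np.sort(np.linalg.eigvals(F).real)
print(f"empirical spectrum range: [{lam[0]:.3f}, {lam[-1]:.3f}]")
```

If the support formula is right, the largest empirical eigenvalue should sit just below `b_edge`.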
(b)[2] Suppose that Σ1 = Σ2 + ∆ where ∆ = diag(a1, . . . , a1, 0, . . . , 0) with the first n1 diagonal entries equal to a1 > 0, i.e., Σ2 is perturbed by the rank-n1 diagonal matrix ∆. What is the critical value κ for which a1 > κ creates “outlier” sample eigenvalues? [1 mark]. Supposing that a1 = κ + 1, c = 2/3, and y = 1/3, around what value do you expect these outlier eigenvalues to cluster? [1 mark].
(c)[2] What would you expect to happen if a1 were only slightly larger than 1 but still less than κ?
Dale Roberts - Australian National University
Last updated: October 22, 2022
(d)[2] Perform a simulation experiment to illustrate the phenomena in (b) in the case Σ2 = Ip. That is, sample data and plot a histogram of the eigenvalues of F, compare it to the theoretical density expected when a1 = 0, and mark the location around which you expect the outlier eigenvalues to cluster. Take n = 400, yn = 1/4, and cn = 1/8.
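A sketch of the experiment in (d), assuming Gaussian data, an F-matrix LSD density of the form (1 − y)√((b − x)(x − a))/(2πx(c + yx)) on an assumed support, and the centring term from (f) as the outlier location. The code treats a1 as the resulting spiked eigenvalue of Σ1 itself (with Σ2 = Ip); whether the assignment intends a1 or 1 + a1 to play this role should be checked against [B]. Only histogram counts are computed here; the actual plotting is left to fill in.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, m = 400, 100, 800              # y_n = p/n = 1/4, c_n = p/m = 1/8
y, c = p / n, p / m

# Assumed F-matrix LSD support and density (to be checked against [B]).
h = np.sqrt(c + y - c * y)
a_edge = (1 - h) ** 2 / (1 - y) ** 2
b_edge = (1 + h) ** 2 / (1 - y) ** 2

def lsd_density(x):
    out = np.zeros_like(x)
    ok = (x > a_edge) & (x < b_edge)
    out[ok] = ((1 - y) * np.sqrt((b_edge - x[ok]) * (x[ok] - a_edge))
               / (2 * np.pi * x[ok] * (c + y * x[ok])))
    return out

# Spiked model: first n1 eigenvalues of Sigma1 set to a1 (see lead-in
# for the convention caveat), Sigma2 = I_p.
a1, n1 = 6.0, 2
d = np.ones(p)
d[:n1] = a1
X = rng.standard_normal((p, m + 1)) * np.sqrt(d)[:, None]
Z = rng.standard_normal((p, n + 1))
F = np.linalg.solve(Z @ Z.T / n, X @ X.T / m)
lam = np.sort(np.linalg.eigvals(F).real)[::-1]

# Outlier location suggested by the centring term printed in (f).
center = a1 * (a1 - 1 + c) / (a1 - 1 - a1 * y)
counts, edges = np.histogram(lam, bins=40, density=True)
print(f"bulk edge b = {b_edge:.3f}, predicted outlier location = {center:.3f}")
print("top eigenvalues:", np.round(lam[:n1 + 2], 3))
```

Plotting `counts` against `lsd_density` on a grid, with a vertical line at `center`, reproduces the requested figure.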
(e)[2] The paper [A] (see also [B]) suggests that the largest eigenvalue λ1 of F, scaled as (λ1 − b)/sp where b is from question (a) and

sp := ( (1/m)(√m + √p)(1/√m + 1/√p) )^{1/3},

behaves like a Tracy–Widom distribution of order 1. Show this using a simulation in the case n = 400, yn = 1/4, and cn = 1/8. Plot the histogram and compare it against the theoretical limiting density.
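A sketch for (e), assuming Gaussian entries, the edge b = (1 + h)²/(1 − y)² with h = √(c + y − cy) (an assumption to verify), and the scaling constant sp exactly as printed in the assignment. The Tracy–Widom density is not in NumPy/SciPy, so the sketch only collects the scaled statistics; a third-party package such as TracyWidom would supply the reference curve for the histogram.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, m = 400, 100, 800              # y_n = 1/4, c_n = 1/8
y, c = p / n, p / m
h = np.sqrt(c + y - c * y)
b = (1 + h) ** 2 / (1 - y) ** 2      # assumed LSD upper edge, as in (a)

# Scaling constant s_p as printed in the assignment.
s_p = ((1 / m) * (np.sqrt(m) + np.sqrt(p))
       * (1 / np.sqrt(m) + 1 / np.sqrt(p))) ** (1 / 3)

reps = 200
stats = np.empty(reps)
for r in range(reps):
    X = rng.standard_normal((p, m + 1))
    Z = rng.standard_normal((p, n + 1))
    F = np.linalg.solve(Z @ Z.T / n, X @ X.T / m)
    stats[r] = (np.linalg.eigvals(F).real.max() - b) / s_p

# A histogram of `stats` is what gets compared to the TW1 density
# (TW1 has mean ~ -1.21 and standard deviation ~ 1.27).
print(f"scaled lambda_1: mean {stats.mean():.3f}, sd {stats.std():.3f}")
```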
(f)[2] Proposition 4.1 of [B] suggests that when a1 > κ, outlier eigenvalues λ of F that are scaled like

√p ( λ − a1(a1 − 1 + c)/(a1 − 1 − a1y) )

are distributed like N(0, σ1²) with

σ1² = [ 2a1²(cy − c − y)(a1 − 1)²(−1 + 2a1 + c + a1²(y − 1)) + (v4 − 3) a1²(c + y)(−1 + 2a1 + c + a1²(y − 1))² ] / (1 + a1(y − 1))⁴,

where v4 is the fourth moment of the entries of the data matrices. Check whether this proposition is correct using a simulation.
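One way to probe the proposition as stated, assuming Gaussian data (so v4 = 3 and the second term of σ1² vanishes) and again treating a1 as the spiked eigenvalue of Σ1 with Σ2 = Ip; the assertions are deliberately loose because that convention, and the formulas themselves, are exactly what the question asks you to check.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, m = 400, 100, 800
y, c = p / n, p / m
a1, n1 = 6.0, 1

# Centring and variance exactly as printed in the assignment; v4 = 3
# for Gaussian entries, so the (v4 - 3) term drops out.
center = a1 * (a1 - 1 + c) / (a1 - 1 - a1 * y)
num = (2 * a1**2 * (c * y - c - y) * (a1 - 1) ** 2
       * (-1 + 2 * a1 + c + a1**2 * (y - 1)))
sigma1 = np.sqrt(num / (1 + a1 * (y - 1)) ** 4)

reps = 100
scaled = np.empty(reps)
d = np.ones(p)
d[:n1] = a1                          # spike convention: see lead-in
for r in range(reps):
    X = rng.standard_normal((p, m + 1)) * np.sqrt(d)[:, None]
    Z = rng.standard_normal((p, n + 1))
    F = np.linalg.solve(Z @ Z.T / n, X @ X.T / m)
    scaled[r] = np.sqrt(p) * (np.linalg.eigvals(F).real.max() - center)

print(f"theory: N(0, {sigma1:.2f}^2); "
      f"empirical mean {scaled.mean():.2f}, sd {scaled.std():.2f}")
```

A QQ-plot of `scaled` against N(0, σ1²) would make the comparison sharper than mean/sd alone.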
(g)[2] The paper [B] proposes a test for the equality of two high-dimensional covariance matrices. They consider two p-dimensional samples (as above) from populations with covariances Σ1 and Σ2, and are interested in testing whether the difference between Σ1 and Σ2 is a finite-rank covariance matrix. That is, the testing problem

H0 : Σ1 = Σ2 vs. H1 : Σ1 = Σ2 + ∆

where rank(∆) = M for a finite integer M. Their proposed test rejects H0 if λ1 > qαsp + b, where qα is the upper quantile at level α of the Tracy–Widom distribution of order 1.

Under the assumption that a1 > κ, as in question (b), calculate the empirical power of this test.
(h)[2] Compare the empirical power calculated in the previous question against the theoretical power

Power = 1 − Φ( (√p/σ1) sp qα + (√p/σ1) ( b − a1(a1 − 1 + c)/(a1 − 1 − a1y) ) ),

where Φ is the standard normal cumulative distribution function and the other terms are defined as in the previous questions.
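The theoretical power can be evaluated directly; a sketch with Φ built from math.erf, reusing the same assumed ingredients as before (the edge b, the printed sp and σ1 formulas with v4 = 3, qα ≈ 0.98 for TW1) and a1 ≈ κ + 1 for this (y, c) — all assumptions, not given values.

```python
import numpy as np
from math import erf, sqrt

n, p, m = 400, 100, 800
y, c = p / n, p / m
h = np.sqrt(c + y - c * y)
b = (1 + h) ** 2 / (1 - y) ** 2
s_p = ((1 / m) * (np.sqrt(m) + np.sqrt(p))
       * (1 / np.sqrt(m) + 1 / np.sqrt(p))) ** (1 / 3)
q_alpha = 0.98                       # ~TW1 upper 5% point (tabulated)
a1 = 3.115                           # ~kappa + 1 for y = 1/4, c = 1/8

center = a1 * (a1 - 1 + c) / (a1 - 1 - a1 * y)
num = (2 * a1**2 * (c * y - c - y) * (a1 - 1) ** 2
       * (-1 + 2 * a1 + c + a1**2 * (y - 1)))
sigma1 = np.sqrt(num / (1 + a1 * (y - 1)) ** 4)

def Phi(x):                          # standard normal CDF
    return 0.5 * (1 + erf(x / sqrt(2)))

power = 1 - Phi(np.sqrt(p) / sigma1 * s_p * q_alpha
                + np.sqrt(p) / sigma1 * (b - center))
print(f"theoretical power ~ {power:.3f}")
```

This value is what the empirical rejection rate from (g) should be compared against.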
(i)[2] Consider the signal detection problem where we are trying to determine the number of signals in observations of the form

xi = U si + εi ,  i = 1, . . . , m,  (SD)

where the xi are p-dimensional observations, si is a k × 1 low-dimensional signal (k ≪ p) with covariance Ik, U is a p × k mixing matrix, and (εi) is an i.i.d. noise sequence with covariance matrix Σ2. None of the quantities on the right-hand side of (SD) are observed. In [B], they propose to estimate the number of signals k by

k̂ := max{ i : λi ≥ β + log(p/p^{2/3}) },

where (λi) are the eigenvalues of F. Reproduce Table 1 in [B] for the Gaussian case for values p = 25, 75, 125, 175, 225, 275.
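A sketch of one cell of such a table, assuming Gaussian signals and noise, Σ2 = Ip, a mixing matrix with orthonormal columns scaled to give strong spikes, and reading the threshold exactly as printed — with β taken to be the assumed bulk upper edge b, which is an interpretation worth checking against [B]. The dimensions and signal strength below are illustrative, not Table-1 values.

```python
import numpy as np

rng = np.random.default_rng(6)
p, k = 100, 2
m, n = 8 * p, 4 * p                  # c_p = p/m = 1/8, y_p = p/n = 1/4
y, c = p / n, p / m
h = np.sqrt(c + y - c * y)
b = (1 + h) ** 2 / (1 - y) ** 2      # assumed bulk upper edge

# Mixing matrix with orthonormal columns, scaled so each signal is a
# strong spike well above the phase transition (illustrative choice).
U = np.linalg.qr(rng.standard_normal((p, k)))[0] * np.sqrt(20.0)

khats = []
for r in range(10):
    S = rng.standard_normal((k, m + 1))      # signals s_i
    E = rng.standard_normal((p, m + 1))      # noise with Sigma2 = I_p
    Xobs = U @ S + E                         # observations as in (SD)
    Z = rng.standard_normal((p, n + 1))      # noise-only sample for S2
    F = np.linalg.solve(Z @ Z.T / n, Xobs @ Xobs.T / m)
    lam = np.sort(np.linalg.eigvals(F).real)[::-1]
    # Threshold as printed: beta + log(p / p^{2/3}), with beta := b here.
    thr = b + np.log(p / p ** (2 / 3))
    # Since lam is sorted decreasingly, max{i : lam_i >= thr} is the count.
    khats.append(int(np.sum(lam >= thr)))
print("estimated k over replications:", khats)
```

Sweeping `p` over 25, 75, …, 275 (adjusting m, n to keep the ratios) and tabulating the frequency of each k̂ value would reproduce the structure of Table 1.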
Question 3 [4 marks]
(STAT7017 students only) Suppose we have a p-dimensional independent sample x1, . . . , xn with sample covariance matrix Sn. We denote the sample correlation matrix by

R̂n := [diag(Sn)]^{−1/2} Sn [diag(Sn)]^{−1/2},

and the (population) correlation matrix by R. Then, for example, the one-sample testing problem is concerned with testing H0 : R = R∗ versus Ha : R ≠ R∗, where R∗ is a specific correlation matrix (e.g., R∗ = Ip, the identity matrix). The asymptotic distribution of the test statistic

T := (n − 1){ log(|R∗|/|R̂n|) − p + tr(R∗⁻¹ R̂n) }

often plays a role in accepting or rejecting H0. Before 1969, it was (incorrectly) thought that, for p fixed and n → ∞, the test statistic T is asymptotically χ²-distributed with p(p − 1)/2 degrees of freedom. Fix p = 2 so that we can write R and R̂n in terms of the correlation coefficient ρ and the sample correlation coefficient ρ̂ as

R = ( 1  ρ ; ρ  1 ),  R̂n = ( 1  ρ̂ ; ρ̂  1 ),  −1 ≤ ρ, ρ̂ ≤ 1.
(a)[4] Using the delta method, show that the correct limiting distribution for T is (1 + ρ²)χ²₁ when p = 2 and n → ∞.
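Before doing the delta-method calculation, a quick Monte Carlo check of the claimed limit, assuming Gaussian pairs with illustrative correlation ρ = 0.5 and testing at the true matrix R∗ = R: the mean of T should then approach E[(1 + ρ²)χ²₁] = 1 + ρ².

```python
import numpy as np

rng = np.random.default_rng(7)
rho, n, reps = 0.5, 500, 2000
L = np.linalg.cholesky(np.array([[1.0, rho], [rho, 1.0]]))

T = np.empty(reps)
for r in range(reps):
    x = L @ rng.standard_normal((2, n))      # n bivariate samples, corr rho
    rho_hat = np.corrcoef(x)[0, 1]
    # T = (n-1){ log(|R*|/|R_hat|) - p + tr(R*^{-1} R_hat) } with p = 2,
    # using |R_hat| = 1 - rho_hat^2 and
    # tr(R^{-1} R_hat) = 2(1 - rho*rho_hat)/(1 - rho^2).
    T[r] = (n - 1) * (np.log((1 - rho**2) / (1 - rho_hat**2)) - 2
                      + 2 * (1 - rho * rho_hat) / (1 - rho**2))

print(f"mean(T) = {T.mean():.3f}  vs  1 + rho^2 = {1 + rho**2:.3f}")
```

A QQ-plot of T against (1 + ρ²)χ²₁ quantiles makes the same point graphically; the delta-method proof then explains *why* the factor 1 + ρ² appears.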
References
[A] Han, Pan, Zhang (2016). The Tracy–Widom law for the largest eigenvalue of F-type matrices. Annals of Statistics, Vol. 44.
[B] Wang, Yao (2017). Extreme eigenvalues of large-dimensional spiked Fisher matrices with application. Annals of Statistics, Vol. 45, No. 1.
Note: I have placed these references in the ‘Readings’ folder on Wattle.