Hello, dear friend, you can consult us at any time if you have any questions, add WeChat: THEend8_
STAT 2132 take-home final exam
This exam is to be completed on your own. Since no one is watching you to make sure you
follow that direction, I am trusting you to abide by the honor code and not collaborate with your
classmates. I will be available via email and will hold my regularly scheduled office hours on
Monday and Tuesday next week.
Please be concise with your answers, and do not turn in any R code (I will not read it if you
do). All numerical answers should be rounded to three decimal places. The exam contains three
questions, and all are related to the experiment described on the next page. Good luck!!
1
Anthracyclines are a class of important chemotherapy drugs used to treat many different
types of cancers. However, anthracyclines have several non-trivial side effects, one of which is
anthracycline-induced cardiotoxicity (ACT). Some patients with ACT do not manifest symptoms,
while ACT in others can lead to congestive heart failure (CHF). The likelihood of a patient develop-
ing ACT-induced CHF is dependent on the dose of anthracycline administered during chemother-
apy, and is as low as 5% for small doses and as high as 48% for larger doses (Swain et al., 2003).
Assume your collaborator treated cardiomyocytes (heart cells) derived from m = 46 unrelated
individuals with five dosages of an anthracycline to study how the expression of various genes
in the heart changes in response to anthracycline. While the full data originally included 230
samples (five samples for each of the 46 individuals), 13 were removed after an initial QC step.
The RData file “Genes.RData”, which you can load into R using the command load(file =
"Genes.RData"), contains two objects:
• Data: A 217 × 2 data frame.
– cell line: The cell line from which the sample was derived. Each cell line comes
from a unique individual.
– conc: The concentration (in micromoles) of anthracycline administered to the car-
diomyocyte sample. A concentration of 0 indicates no anthracycline was administered.
• Y: A 12, 317 × 217 matrix containing the log-transformed expression of p = 12, 317 genes
from the n = 217 samples. The columns of Y align with the rows of Data.
Throughout this problem, let Y ∈ Rp×n be the expression data matrix, and let g = 1, . . . , p index
gene and i = 1, . . . , n index sample. For each sample, define r(i) ∈ {1, . . . ,m = 46} to be the
individual used to generate sample i. The gth row, ith column and (g, i)th element of Y will be
denoted as Yg∗ ∈ Rn, Y∗i ∈ Rp and Ygi ∈ R, respectively.
1) Consider the model for a single gene g. Assume individuals are randomly drawn from a pop-
ulation and treat dose as a factor variable with five ordered levels, where level 1 corresponds
to a dose of 0 and level 5 corresponds to a dose of 5 micromoles. Suppose
Ygi =
5∑
d=1
1
{
sample i was given dose d
}
βgd + δgr(i) + egi
= xTi βg + δgr(i) + egi, i = 1, . . . , n (1)
where βg =
(
βg1 · · · βg5
)T ∈ R5, δg1, . . . , δgm i.i.d∼ N (0, τ2g), eg1, . . . , egn i.i.d∼ N (0, σ2g) and egi is
independent of δg j for all i = 1, . . . , n and j = 1, . . . ,m. Note that βg, τ2g and σ
2
g all depend
on the gene g, but that the covariates xi are shared across genes.
(a) What is the interpretation of βgd, τ2g and σ
2
g? Is βgd identifiable in (1)?
(b) Express the intraclass correlation for each gene g in terms of the parameters in Model
(1).
2
(c) For parts (c)-(f), use only the individuals with 5 dosage measurements (there should
be 37 such individuals). Let β¯g = 15
∑5
d=1 βgd be the mean effect for gene g. What is
the generalised least squares (GLS) estimate for the mean-centered coefficients ∆g =
βg−15β¯g, ∆ˆg, in terms of the Ygi’s? Your solution SHOULD NOT depend on τ2g or σ2g.
(d) Express Var
(
∆ˆg
)
in terms of the parameters in Model (1), and show that it does not
depend on τ2g.
(e) Derive an analytic, unbiased estimator for σ2g, and prove it is unbiased. Use this to
derive an expression for V̂ar
(
∆ˆg
)
, an estimate for Var
(
∆ˆg
)
. (Hint: consider each
individual’s expression Ygr∗ = (Ygi){i:r(i)=r∗} ∈ R5 for each r∗ ∈ {1, . . . ,m}, and derive
the model for Q5Ygr∗ , where Q5 ∈ R5×5 is the orthogonal projection matrix for 15’s
orthogonal compliment.)
(f) Using only the 37 individuals with 5 dosage measurements and your solutions to (c)
and (e), compute ∆ˆg and V̂ar
(
∆ˆg
)
for g = 1, . . . , p in R. Use these to derive a P-value
Pg for the null hypothesis H0,g : βg1 = · · · = βgd for each gene g = 1, . . . , p. Using
Bonferroni to control the FWER at a level α = 0.05, how many of H0,g do you reject?
What does this tell you about the relationship between of anthracycline dosage and
gene expression?
(g) Additional analyses reveal that there is substantial correlation between the p genes. If
the P-values Pg are super-uniform when H0,g is true (i.e. P(Pg ≤ u) ≤ u under H0,g),
will the Bonferroni procedure control the FWER at the desired level α? If yes, prove
it. If not, give a counter-example.
2) Use all n = 217 observations to answer all remaining questions. Your collaborator believes
that one of the mechanisms through which anthracycline induces ACT is by inhibiting apop-
tosis (i.e. cell death). You therefore decide to investigate the expression of the gene JADE1
(g = 3884), which is known to promote apoptosis in healthy patients.
(a) Using whatever method you deem most appropriate, estimate τ2g, σ
2
g and the intraclass
correlation for g = 3884.
(b) Consider the null hypothesis H0 : βg1 = · · · = βg5, and take the alternative hypothesis
to be the negation of H0. Use a likelihood ratio test to test H0 for g = 3884. What do
you conclude?
(c) An important assumption you made in the previous part was that Yg∗ for g = 3884
was normally distributed. Does that assumption seem reasonable? Use a QQ-plot to
justify your answer. (Hint: make sure that the QQ-plot is not corrupted by the possible
correlation in the data. Try multiplying the residuals with an appropriate matrix to
mitigate possible correlation.)
(d) Using whatever method you deem most appropriate, estimate and find a 95% confi-
dence interval for βg5−βg1 for g = 3884. What can you conclude about the relationship
between anthracycline dosage and JADE1 expression? Does this support your collab-
orator’s hypothesis that anthracycline induces ACT by inhibiting apoptosis?
3
(e) Now suppose you want to determine simultaneous 95% confidence intervals for all
points in
S =
cTβg : 5∑
d=1
ci = c
T15 = 0
for g = 3884.
(i) Suppose Yg∗ ∼ N
(
Xβg,Σg
)
, where Σg is a known positive definite matrix. Show
that there exists a matrixA such thatAYg∗ ∼ N
(
(AX)βg, In
)
.
(ii) By plugging in the estimate for Σg you determined in part (a) for the true Σg, use
part (i) and Scheffe´’s procedure to obtain simultaneous 95% confidence intervals
for all points in S.
(iii) Part (ii) ignores the uncertainty in the estimate for Σg when computing the simul-
taneous confidence intervals. Would you expect a method that properly accounts
for the uncertainty in said estimate to return wider or narrower confidence intervals
than those computed in part (ii)? Explain.
(f) For interpretability purposes, your collaborator would prefer a model where E(Yg∗)
is linear in dose. Use a likelihood ratio test to test the null hypothesis that E(Ygi) =
µg + dose(i)γg for g = 3884 and i = 1, . . . , n, where dose(i) ∈ {0, 0.625, 1.25, 2.5, 5} is
the anthracycline dosage applied to sample i. What is your conclusion?
3) Assume Model (1) is correct. Model (1) implies Yg∗ can be written as
Yg∗ ∼ N
(
Xβg, τ
2
gB + σ
2
gIn
)
, g = 1, . . . , p.
(a) Derive an expression forB.
(b) Estimating τ2g and σ
2
g for each gene g = 1, . . . , p is very computationally expensive.
The rate limiting steps are computing the inverse and determinant of Var
(
Yg∗
)
∈ Rn×n,
both of which require O
(
n3
)
operations for unstructured matrices. This implies naive
algorithms to estimate τ21, σ
2
1, . . . , τ
2
p, σ
2
p, like ML and REML, will have O
(
pn3
)
run-
times. Here you will construct a more computationally efficient algorithm to estimate
τ2g, σ
2
g for all g = 1, . . . , p.
(i) Show that there exists a non-random matrixQ ∈ Rn×(n−5) with orthonormal columns
such that
Y˜g∗ = QTYg∗ ∼ N
(
0, τ2gB˜ + σ
2
gIn−5
)
, g = 1, . . . , p.
What is B˜ in terms ofQ andB?
(ii) Find a non-random unitary matrix U ∈ R(n−5)×(n−5) such that U T Y˜g∗ ∈ Rn−5 is nor-
mally distributed with independent entries for all g = 1, . . . , p. Derive expressions
for E
(
U T Y˜g∗
)
and Var
(
U T Y˜g∗
)
(the latter should be a diagonal matrix).
4
(iii) If p >> n, explain why estimating τ2g, σ
2
g for all g = 1, . . . , p using U
T Y˜g∗ is more
efficient than just using Yg∗ (or Y˜g∗). When answering this question, consider how
many operations it requires to invert and calculate the determinant of Var
(
U T Y˜g∗
)
.