MTHM506/COMM511: Statistical Data Modelling
Question Sheet 2
Marks achieved in this assignment will contribute towards 25% of the final module mark. You should attempt
all questions on this sheet. Note that the questions are organised in the order we covered the topics, and not
in order of difficulty. Therefore it is advised that you read through the questions first, and start working on
those that you feel more comfortable with.
Deadline: Noon (12pm), on 25th March 2022
You should submit one pdf via eBART containing your solutions - it should be written up using word
processing software (e.g. LaTeX, R Markdown, or Word). Solutions are expected to be concise, well
structured and well presented. Commented R code (e.g. ‘model <- glm(...)’) and the outcomes/plots
should form part of your solutions. Do not display too much raw R output (e.g. don’t display the full output
of ‘summary(model)’), but edit this down to the essentials. Be sure to include justification for each step of
your analyses, providing comments alongside your R code to explain what you are doing and add appropriate
titles and labelled axes to your plots.
You are expected to work independently - strict disciplinary action will be taken for any plagiarism. Late
submissions will also be penalised.
The data required for this assignment (datasets_sheet2.RData) can be downloaded from the ELE page and
loaded into R using the load() function.
Question 1
In this question we return to Question 1 from Question Sheet 1 where we fit the following non-linear Gaussian
model:
$$Y_i \sim N\left(\frac{\theta_1 x_i}{\theta_2 + x_i},\ \sigma^2\right), \qquad i = 1, 2, \ldots, 100, \quad Y_i \text{ independent}$$
on the data frame nlmodel, which contains data on a response variable y and a single explanatory
variable x.
(a) [1 mark] Fit a Gaussian GAM with identity link using the function gam() in the R package mgcv. Use a
cubic spline basis and assume a basis dimension of q = 9.
(b) [5 marks] Determine whether the rank of 9 is enough by running the function gam.check() on the
fitted model and checking whether the effective degrees of freedom is close to the maximum possible
(q − 1). This function also produces residual plots so also comment on those (but note that it plots
deviance residuals and not standardised deviance residuals).
(c) [3 marks] Use the function predict() with se.fit=T to produce the fitted line along with 95%
confidence intervals. Given that the true relationship between x and the mean is the one given above,
i.e. (θ1x)/(θ2 + x), state what you may want to do with the model.
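The workflow in parts (a) to (c) can be sketched as follows. This is only an illustrative outline, not a model solution: the data are simulated as a stand-in for nlmodel, and the values θ1 = 10, θ2 = 2 and σ = 0.5 are assumptions chosen for the example.

```r
library(mgcv)

set.seed(1)
# Simulated stand-in for the nlmodel data frame (the real data are in
# datasets_sheet2.RData). theta1 = 10, theta2 = 2, sigma = 0.5 are made up.
n <- 100
x <- runif(n, 0, 10)
sim <- data.frame(x = x, y = rnorm(n, mean = 10 * x / (2 + x), sd = 0.5))

# (a) Gaussian GAM, identity link, cubic regression spline, basis dimension 9
fit <- gam(y ~ s(x, bs = "cr", k = 9),
           family = gaussian(link = "identity"), data = sim)

# (b) Basis-dimension diagnostics and residual plots
gam.check(fit)
edf <- sum(fit$edf) - 1  # edf of the smooth (total edf minus the intercept)

# (c) Fitted line with approximate 95% confidence intervals
newd <- data.frame(x = seq(min(sim$x), max(sim$x), length.out = 200))
pr <- predict(fit, newdata = newd, se.fit = TRUE)
plot(sim$x, sim$y, xlab = "x", ylab = "y", main = "GAM fit with 95% CI")
lines(newd$x, pr$fit)
lines(newd$x, pr$fit + 1.96 * pr$se.fit, lty = 2)
lines(newd$x, pr$fit - 1.96 * pr$se.fit, lty = 2)
```

Here edf is compared against the maximum of q − 1 = 8, since one basis coefficient is absorbed by the identifiability constraint on the smooth.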
Question 2
In this question we return to Question 2 from Question Sheet 1 where we fit a series of models using the
number of quarterly AIDS cases in the UK, yi, from January 1983 to March 1994. The data are in dataframe
aids, where the variable cases is yi and date is time, symbolised here as xi.
(a) [7 marks] Fit a Poisson GAM (using a cubic spline) with a log link, where the response is the number of
cases and date is the predictor. Plot the counts against date and add the predicted line with associated
95% confidence intervals, and comment on the fit. Make sure to use an appropriate rank and perform
relevant model checking with respect to residual plots and the deviance.
(b) [8 marks] Suggest two alternative models that would improve the fit. Implement one of these, and
perform the same model checking as in part (a). Also produce a plot of the predicted line with 95% CIs.
Comment on any differences between the predicted smooth lines of the two models and the possible
reason behind this.
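A minimal sketch of the mechanics involved in this question is given below, using simulated counts in place of the aids data. The coding of date, the mean curve and the overdispersion level are all assumptions made for illustration. The negative binomial fit at the end is shown only as one possible alternative family available in mgcv, not as the expected answer to part (b).

```r
library(mgcv)

set.seed(2)
# Simulated stand-in for the aids data frame: 45 quarterly counts.
# The date coding and all parameter values are assumptions for illustration.
sim <- data.frame(date = seq(83, 94, by = 0.25))
tt <- sim$date - 83
mu <- exp(1.5 + 0.6 * tt - 0.03 * tt^2)
sim$cases <- rnbinom(nrow(sim), mu = mu, size = 10)

# Poisson GAM with log link and a cubic regression spline in date
pois_fit <- gam(cases ~ s(date, bs = "cr", k = 10),
                family = poisson(link = "log"), data = sim, method = "REML")

# Deviance check: residual deviance against its residual degrees of freedom
dev <- deviance(pois_fit)
rdf <- df.residual(pois_fit)
p_gof <- pchisq(dev, df = rdf, lower.tail = FALSE)

# Intervals are built on the link (log) scale and then back-transformed,
# so the limits stay positive
pr <- predict(pois_fit, type = "link", se.fit = TRUE)
eta <- as.numeric(pr$fit)
se <- as.numeric(pr$se.fit)
fit_mu <- exp(eta)
lower <- exp(eta - 1.96 * se)
upper <- exp(eta + 1.96 * se)

plot(sim$date, sim$cases, xlab = "Date", ylab = "Cases",
     main = "Poisson GAM with 95% CI")
lines(sim$date, fit_mu)
lines(sim$date, lower, lty = 2)
lines(sim$date, upper, lty = 2)

# One possible alternative family: negative binomial with estimated theta
nb_fit <- gam(cases ~ s(date, bs = "cr", k = 10),
              family = nb(link = "log"), data = sim, method = "REML")
aic_tab <- AIC(pois_fit, nb_fit)
```

Computing the interval on the link scale is a design choice: adding and subtracting 1.96 standard errors on the response scale could give a negative lower limit for small counts.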
Question 3
The dataframe pupils contains language scores from Dutch schools. This is an example of a two-level
situation. Specifically, the data considers 131 schools (but only 1 class per school) for i = 1, 2, . . . , 2287
students in grades 7 and 8. The nesting therefore occurs within each school j = 1, 2, . . . , 131. Interest lies in
assessing the impact on language scores of pupil factors such as IQ (IQ) and pupil social status (ses). The
response variable is denoted as test. The categorical variable (factor) Class refers to the class that each
pupil belongs to (so Class = 1, 2, . . . , 131).
(a) [2 marks] First fit a (Gaussian) linear model using glm() with test as the response and IQ, ses and
factor Class as the covariates. Comment on the significance of the two continuous variables and perform
a likelihood ratio test of the overall significance of the factor Class.
(b) (i) [2 marks] State two reasons why one might want to treat the class effects as random.
(ii) [2 marks] Write down the mathematical formulation of a Normal random effect model (IQ and
ses as fixed effects and Class as a random effect).
(iii) [3 marks] Fit this model and comment on the significance of the fixed effects based on t-tests.
State any assumptions you are making.
(iv) [5 marks] What are the estimates of the “within-class” variance and the “between-class” variance?
What is the estimate of the marginal variance of the response? Is it different from the (marginal) variance
from the model in (a) and, if so, why?
(v) [3 marks] Test whether the variance of the random effects is zero (i.e. the significance of the
random effects) using a likelihood ratio test.
(vi) [6 marks] Comment on the validity of the Likelihood Ratio Tests in mixed effects models, suggest
an alternative way of implementing these tests, and use it to compare with results in (b)-(v).
(vii) [3 marks] Plot a density estimate of the predicted random effects and superimpose their theoretical
Normal distribution using the estimate of their variance. Use functions qqnorm() and qqline()
to produce a QQ plot of the random effects and comment on the validity of a Gaussian model for
the random effects.
(viii) [3 marks] Note that the functions fitted() and resid() in lme4 will produce the fitted values ŷ
and raw residuals y − ŷ. Use these functions, in conjunction with the two functions in (vii), to
produce a QQ plot of the residuals and a residuals vs fitted values plot. Comment on the model
assumptions using the two plots.
(c) One of the student-level covariates is IQ which may affect the test results per student. However, there
may be class level (latent) variables, such as teacher competence, which may have an effect on how
IQ relates to the test result in each class. Such a scenario may be accommodated by considering the
parameter of IQ to be random rather than it being fixed (and constant across classes).
(i) [4 marks] Extend the model in (b) to make the parameter of IQ vary with Class. Compare
(qualitatively) the overall effect of IQ on test between this model and the model in (b).
(ii) [3 marks] Test for the significance of the random slope for IQ using a likelihood ratio test.
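The lme4 calls relevant to parts (b) and (c) can be sketched as below. This is a hedged outline on simulated data, not a model solution: the class sizes, coefficients and variance components are invented for the example, and the real pupils data will behave differently. The χ² reference distribution used for the likelihood ratio statistic is the naive one; its validity is exactly what part (b)(vi) asks about.

```r
library(lme4)

set.seed(3)
# Simulated stand-in for the pupils data frame.
# Class sizes and all parameter values are assumptions for illustration.
n_class <- 131
n_per <- 15
class_id <- rep(seq_len(n_class), each = n_per)
IQ <- rnorm(n_class * n_per)
ses <- rnorm(n_class * n_per)
b0 <- rnorm(n_class, sd = 2)    # class-level random intercepts
b1 <- rnorm(n_class, sd = 0.7)  # class-level random IQ slopes
test <- 40 + (2.5 + b1[class_id]) * IQ + 1.5 * ses + b0[class_id] +
  rnorm(n_class * n_per, sd = 3)
sim <- data.frame(test, IQ, ses, Class = factor(class_id))

# Random intercept model, fitted by ML so that its log-likelihood is
# comparable with that of the ordinary linear model
m0 <- lm(test ~ IQ + ses, data = sim)
m1 <- lmer(test ~ IQ + ses + (1 | Class), data = sim, REML = FALSE)

# Variance components: between-class (random intercept) and within-class
vc <- as.data.frame(VarCorr(m1))
between <- vc$vcov[vc$grp == "Class"]
within <- vc$vcov[vc$grp == "Residual"]

# Naive likelihood ratio test for the random intercept variance
lrt <- as.numeric(2 * (logLik(m1) - logLik(m0)))
p_lrt <- pchisq(lrt, df = 1, lower.tail = FALSE)

# QQ plot of the predicted random intercepts
re <- ranef(m1)$Class[, "(Intercept)"]
qqnorm(re, main = "QQ plot of predicted random intercepts")
qqline(re)

# Residual diagnostics from fitted() and resid()
plot(fitted(m1), resid(m1), xlab = "Fitted values", ylab = "Raw residuals",
     main = "Residuals vs fitted")

# Random slope for IQ, compared with the random intercept model
m2 <- lmer(test ~ IQ + ses + (1 + IQ | Class), data = sim, REML = FALSE)
cmp <- anova(m1, m2)
```

Both mixed models are fitted with REML = FALSE here so that every comparison is between maximised likelihoods; anova() on lmer fits would otherwise refit REML models by ML before comparing them.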