Advanced Topics in Statistics
Hello, dear friend, you can consult us at any time if you have any questions, add WeChat: THEend8_
MTHM017 Advanced Topics in Statistics
Assignment
Please make sure that the submitted work is your own. This is NOT a group assignment,
therefore approaches and solutions shouldn’t be discussed with other students. Plagiarism
and collusion with other students are examples of academic misconduct and will be reported.
More information on academic honesty can be found here.
The assignment has two main parts. Part A involves (i) fitting a Poisson regression model to assess the
effect of using different priors, and (ii) fitting a fixed effects and a random effects model in order to compare
posterior distributions under different models. Part B involves using different methods for classification of
data into two groups.
A. Bayesian Inference [66 marks]
1. The first question of part A involves fitting a Poisson regression model using the Scotland dataset, which
contains the observed and expected counts of lip cancer cases for the local authority administrative areas in
Scotland between 1975 and 1986.
a. [3 marks] Calculate the Standard Mortality Ratios (SMR=observed/expected) for each administrative
area and plot the distribution of the SMRs. Then plot a map of the SMRs by administrative area. To
map the SMRs you may use the ScotlandMap function readily available on the ELE page.
The code below uses random numbers to demonstrate how the ScotlandMap function works. Note
that in order for this to run you need to add the shapefiles for Scotland (.shx, .shp, .prj, .dbf) to your
working directory. The ScotlandMap.R file and the shapefiles are available on the ELE page.
library(tidyverse)
library(RColorBrewer)
library(sf)
library(rgdal)
source("ScotlandMap.R") # need to read in the ScotlandMap function
testdat <- runif(56) # generate random numbers to use as observations
ScotlandMap(testdat,figtitle="Scotland random numbers")
We are interested in estimating the relative risk (RR) for each area. The relative risk is the parameter
that quantifies whether an area i has a higher (RRi > 1) or lower (RRi < 1) occurrence of cases than
what we would expect from the reference rates. In order to estimate the relative risk we are going to fit
a Poisson model of the following form:
Obsi ∼ Pois(µi)
log(µi) = log(Expi) + β0 + θi
RRi = exp(β0)exp(θi)
Where θi ∼ Normal(0, σ2). Here, the Exp(ected) numbers are an ‘offset’, which are treated as fixed
data (i.e. no coefficient is estimated for this term).
b. [3 marks] Describe the role (interpretation) of β0 and the set of θ’s in this model and how they contribute
to the estimation of RR. Focus on the meaning of these parameters in the estimation of the relative risk.
1
c. [14 marks] Code up this Poisson(-Lognormal) model in JAGS to analyse the Scotland data. For the
parameter β0 use a Normal prior with mean 0 and variance 1000. And for the precision parameter
τ(= 1/σ2) of θi use a prior τ ∼ Gamma(1, 0.05) (which is parametrised so as to have a mean of 20).
Initialise 2 chains and run the model with these two chains. You will have to decide on the appropriate
values of n.iter and burnin. Produce trace plots for the chains and summaries of all the model
parameters. Investigate whether the chains for all the parameters have converged.
d. [4 marks] Extract the posterior means for the RR and map them. Compare the posterior mean relative
risk to the SMR for each area. Note that we can think about the SMR as the maximum likelihood
estimate of the relative risk.
e. [7 marks] Now calculate the posterior probabilities that the relative risk in each area exceeds 1. Extract
these probabilities and map them. For which areas are you at least 95% certain that there is an excess
risk of lip cancer (i.e. relative risk > 1)?
f. [6 marks] Repeat the analysis with different priors for β0 and τ . The exact choice is yours, but explain
why you have chosen them and what they mean. Map the two sets of RRs and explain any differences
you see in the map and the summaries of the posteriors for the parameters of the model.
2. In this question we will analyse the effect different models have on the posterior distribution. We will use
the surgical.csv file, which contains data on 12 hospitals, where the column n gives the total number of
heart surgery operations carried out on children under 1 year old by each centre between April 1991 and
March 1995, and the column r gives the number of these operations where the patient died within 30 days of
the operation.
A binomial model seems reasonable for these data, so we will focus on comparing different models for the
hospital specific mortality rates.
We will first fit a logistic random effects model of the following form:
ri ∼ Binom(ni, θi),
logit(θi) ∼ N(µ, σ2).
This model assumes that failure rates (θi) across hospitals are similar in some way. This similarity is
represented by the assumption that the mortality rates for the hospitals are drawn from the same (parent)
distribution.
a. [9 marks] Code this model in JAGS to analyse the surgical data. Use vague priors for the parameters
of the shared normal distribution. In particular you can use the following model definition:
jags.mod <- function(){
for(i in 1:12){
r[i] ~ dbin(theta[i],n[i])
logit(theta[i]) <- logit.theta[i]
logit.theta[i] ~ dnorm(mu, tau)
}
mu ~ dnorm(0,1.0E-3)
tau <- 1/sigmaˆ2
sigma ~ dunif(0,100)
}
Run the model for 10,000 iterations, with 2 chains, discarding the first 5,000 as ‘burn-in’. Produce
traceplots for the chains and summaries for the fitted parameters. Comment on whether the chains for
all the parameters have converged.
b. [8 marks] We want to know whether the assumptions of the above random effects model are reasonable,
or whether there is any evidence that the mortality rates for one or more of the hospitals are not drawn
from the same parent distribution. We will assess this by comparing the observed and predicted number
of deaths in each hospital.