Advanced Topics in Statistics
MTHM017 Advanced Topics in Statistics
Assignment
The assignment has three main parts. Part A involves fitting an auto-regressive process model to time series
data using the BUGS language and assessing the effect of using different model structures on the estimation
of missing data. Part B involves using different methods to classify data into two groups. Part C
involves producing a narrated PowerPoint presentation based on question 3 of Part B.
Part A gives 50% of your final marks, Part B gives 30% of your final marks and Part C gives 20% of your
final marks. [Assignment: 160 marks in total]
A. Bayesian Inference [80 marks]
The dataset contains measurements of particulate matter (PM10) air pollution in London (measured at the
Bexley and Hounslow sites) for 2000 to 2004. The data can be found in London_Pollution.csv.
1. [4 marks] Summarise the two sets of data and calculate the number of missing data points for each
monitoring location, by year. Comment on whether the patterns of missingness have changed over time.
2. [3 marks] Plot the PM10 measurements against time for the two sites, highlighting (showing clearly)
the periods of missing data.
3. [5 marks] The Eastings and Northings of the two monitoring sites are Bexley: (551862, 176380)
and Hounslow: (521070, 178480). Plot these two monitor locations on a map of London and comment
on any differences you found in the summaries of the data in the context of the geographical location of
the monitoring sites. The necessary shapefiles are on the ELE page of the course.
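As a rough illustration of the missingness summaries asked for in questions 1 and 2, the sketch below counts missing values per calendar year and finds each unbroken run of missing days (the spans one would highlight on a time series plot). The dates and values are invented toy data, not taken from London_Pollution.csv, and the function names `missing_by_year` and `missing_runs` are hypothetical; `None` stands in for a missing measurement.

```python
from collections import Counter
from datetime import date, timedelta


def missing_by_year(dates, values):
    """Count missing (None) values per calendar year."""
    counts = Counter()
    for d, v in zip(dates, values):
        counts[d.year] += 0  # make sure years with no gaps still appear
        if v is None:
            counts[d.year] += 1
    return dict(counts)


def missing_runs(dates, values):
    """Return (first_date, last_date) for each unbroken run of missing values."""
    runs = []
    start = None
    prev = None
    for d, v in zip(dates, values):
        if v is None and start is None:
            start = d  # a new gap begins here
        elif v is not None and start is not None:
            runs.append((start, prev))  # the gap ended on the previous day
            start = None
        prev = d
    if start is not None:
        runs.append((start, prev))  # series ends inside a gap
    return runs


# Toy daily series spanning a year boundary (illustrative values only).
dates = [date(2000, 12, 29) + timedelta(days=i) for i in range(6)]
pm10 = [21.0, None, None, None, 18.5, None]

per_year = missing_by_year(dates, pm10)
gaps = missing_runs(dates, pm10)
```

Applying the same two functions to each site's column in turn gives the per-site, per-year counts, and the `gaps` list can be passed to whatever plotting routine is used to shade the missing periods.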
The Bexley data contain missing values. We are going to fit a model that allows us to estimate these
missing data by treating them as model parameters to be estimated (so that we find posterior distributions
for them). As we have time series data, we are going to use the fact that day-to-day measurements will be
correlated, i.e. today’s measurement will correlate with yesterday’s.
A random walk process of order 1, RW(1), is defined at time $t$ as
$$Y_t - Y_{t-1} = w_t$$
$$Y_t = Y_{t-1} + w_t$$
where $w_t$ are a set of realisations of random (or white) noise, e.g. $w_t \sim N(0, \sigma_w^2)$. Note the first line refers to
the differences in the values at consecutive time points being white noise.
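Unrolling the recursion makes the behaviour of the random walk more explicit. The short derivation below assumes the white noise terms are independent and conditions on a fixed starting value at the first time point; it is a worked illustration rather than part of the assignment's model definition.

```latex
Y_t = Y_{t-1} + w_t
    = Y_{t-2} + w_{t-1} + w_t
    = \dots
    = Y_1 + \sum_{s=2}^{t} w_s ,
\qquad
\operatorname{Var}(Y_t \mid Y_1) = (t-1)\,\sigma_w^2 .
```

So the process is a running sum of noise terms, and its uncertainty about a future value grows linearly with the number of steps taken.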
We are interested in fitting a random walk model to the Bexley data. The model will be of the following form:
$$\text{Bexley}_t \sim N(Y_t, \sigma_v^2)$$
$$Y_t \sim N(Y_{t-1}, \sigma_w^2)$$
where $\sigma_w^2$ is the variance of the white noise process associated with the random walk. We then make noisy
measurements of this random walk process, so $\text{Bexley}_t$, the measurement we have at time $t$, equals
the true value of the underlying process $Y_t$ plus some measurement error. In the formula above, $\sigma_v^2$ is the
variance of this measurement error.
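To get a feel for the two-layer structure, the following minimal sketch simulates a latent random walk and noisy measurements of it. The parameter values, the starting level of 20 and the function name `simulate_noisy_random_walk` are assumptions chosen for illustration. It also checks one consequence of the model: the day-to-day differences of the measurements have variance equal to the white noise variance plus twice the measurement error variance, because each difference contains one noise step and two measurement errors.

```python
import random
import statistics


def simulate_noisy_random_walk(n, sigma_w, sigma_v, y1=20.0, seed=1):
    """Simulate Y_t ~ N(Y_{t-1}, sigma_w^2) and obs_t ~ N(Y_t, sigma_v^2).

    sigma_w and sigma_v are standard deviations, as expected by random.gauss.
    """
    rng = random.Random(seed)
    latent = [y1]
    for _ in range(n - 1):
        latent.append(rng.gauss(latent[-1], sigma_w))  # random walk step
    observed = [rng.gauss(y, sigma_v) for y in latent]  # add measurement error
    return latent, observed


sigma_w, sigma_v = 2.0, 1.0  # illustrative values, not estimates
latent, observed = simulate_noisy_random_walk(5000, sigma_w, sigma_v)

# Differences of the measurements: w_t + v_t - v_{t-1}.
diffs = [b - a for a, b in zip(observed, observed[1:])]
empirical_var = statistics.variance(diffs)
theoretical_var = sigma_w**2 + 2 * sigma_v**2
```

With a long simulated series, `empirical_var` should sit close to `theoretical_var`, which is one informal way to see how the two variance parameters enter the observed data.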
4. [16 marks] Code this model in JAGS, using the model definition below, to analyse the Bexley data from
1st January 2000 to 31st December 2003 (NOTE the end year). Due to the nature of the model you
will have to explicitly specify a distribution for $Y_1$ in the model (i.e. for the first time point, as $Y_0$ doesn’t
exist). One suggestion might be Y1 ∼ dnorm(0, 0.001).
Run the model for 8,000 iterations, with 2 chains, discarding the first 4,000 as ‘burn-in’. Produce trace
plots for the chains and summaries for the fitted parameters (including the missing data). Hint: You
should initialise both chains. One suggestion might be using the mean and median to initialise the
missing values of Bexley, and using random uniforms (with a narrow interval centred around say 20) to
initialise Y.
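One possible reading of the hint is sketched below: the first chain fills the missing Bexley values with the observed mean, the second with the observed median, and both draw starting values for Y from a narrow uniform interval around 20. The toy data, the interval half-width of 1 and the function name `make_inits` are assumptions. Entries that were actually observed are left as `None`, mirroring the usual JAGS convention that initial values are supplied only for the missing elements of a partially observed node (NA elsewhere).

```python
import random
import statistics


def make_inits(bexley, fill_value, rng, centre=20.0, half_width=1.0):
    """Build one chain's initial values.

    Missing entries of bexley (None) get fill_value; observed entries get
    None, since only unobserved elements need a starting value.
    Every Y_t gets a uniform draw from a narrow interval around centre.
    """
    bexley_init = [fill_value if v is None else None for v in bexley]
    y_init = [rng.uniform(centre - half_width, centre + half_width) for _ in bexley]
    return {"Bexley": bexley_init, "Y": y_init}


# Toy series with two gaps (illustrative values only).
bexley = [21.0, None, 19.0, 30.0, None, 18.0]
seen = [v for v in bexley if v is not None]

rng = random.Random(42)
inits = [
    make_inits(bexley, statistics.mean(seen), rng),    # chain 1: mean
    make_inits(bexley, statistics.median(seen), rng),  # chain 2: median
]
```

Giving the two chains different starting points makes it easier to judge from the trace plots whether they have converged to the same region.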