Machine Learning and Statistical Methods
STK2100 - Machine Learning and Statistical Methods
for Prediction and Classification - Home exam
Day of examination: June 15, 2021
Examination hours: 09.00 – 13.00.
This problem set consists of 9 pages.
Appendices: None
Permitted aids: Anything available
Please make sure that your copy of the problem set is
complete before you attempt to answer anything.
All subquestions are counted equally!
Problem 1
An important measure to keep low during the Covid-19 pandemic has been
the number of people ending up at hospital. The figure below shows the
number of new arrivals to hospitals among Oslo citizens for each day during
the pandemic. The additional plots show the number of positive tests and
the total number of tests in the same period.
[Figure: "Hospitalization Oslo". Daily value (y-axis, 0 to 20) against dates (x-axis, Apr 2020 to Apr 2021).]
[Figure: "Positive tests Oslo". Daily value (y-axis, 0 to 400) against dates (x-axis, Apr 2020 to Apr 2021).]
[Figure: "Total tests Oslo". Daily value (y-axis, 0 to 3000) against dates (x-axis, Apr 2020 to Apr 2021).]
Our aim in this exercise will be to see if the test data can be used for
prediction of the number of hospitalizations.
We will introduce the following variables (where each variable corresponds to
citizens with residence in Oslo):
$y_t$: the number of new arrivals at hospital on day $t$
$v_t$: the number of positive tests on day $t$
$z_t$: the number of tests performed on day $t$
(a) Consider first a general setting where
\[
y_t \sim \mathrm{Binom}(N, p_t), \qquad
\operatorname{logit}(p_t) = \beta_0 + \sum_{j=1}^{p} \beta_j x_{t,j},
\]
where $x_t = (x_{t,1}, \ldots, x_{t,p})$ is the collection of all covariates involved in
the modelling of $p_t$. Here $p_t$ can be interpreted as the probability of a
random individual being hospitalized due to the Covid-19 virus on day
$t$. Further, define $\hat{y}_t = N\hat{p}_t$ where
\[
\operatorname{logit}(\hat{p}_t) = \hat{\beta}_0 + \sum_{j=1}^{p} \hat{\beta}_j x_{t,j}.
\]
Show that
\[
E[(y_t - \hat{y}_t)^2 \mid x_t]
= N p_t(1 - p_t) + E[(\hat{y}_t - N p_t)^2 \mid x_t] - 2E[(y_t - N p_t)(\hat{y}_t - N p_t) \mid x_t].
\]
Give an interpretation of each term on the right hand side.
Can we neglect the last term on the right hand side in this case?
For which value of $p_t$ is the term $N p_t(1 - p_t)$ maximized?
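The identity above comes from writing $y_t - \hat{y}_t = (y_t - N p_t) - (\hat{y}_t - N p_t)$ and expanding the square. A minimal simulation sketch for inspecting the individual terms numerically is given below. Everything in it is an assumption made for illustration and not part of the exam: a small toy value of N, an intercept-only fit in place of the full logistic regression, a new observation drawn independently of the training data, and the helper names `draw_binomial` and `decomposition_terms`.

```python
import random


def draw_binomial(n, p, rng):
    """Draw one Binom(n, p) value as a sum of n Bernoulli(p) trials."""
    return sum(1 for _ in range(n) if rng.random() < p)


def decomposition_terms(n=50, p=0.3, n_train=20, reps=2000, seed=1):
    """Monte Carlo averages of the four expectations in the identity."""
    rng = random.Random(seed)
    mean_y = n * p  # N * p_t
    totals = {"lhs": 0.0, "noise": 0.0, "estimation": 0.0, "cross": 0.0}
    for _ in range(reps):
        # Toy fit (assumption): intercept-only model, so N * p_hat is the
        # average of n_train independent training counts.
        train = [draw_binomial(n, p, rng) for _ in range(n_train)]
        y_hat = sum(train) / n_train
        # New observation, drawn independently of the training data.
        y = draw_binomial(n, p, rng)
        totals["lhs"] += (y - y_hat) ** 2
        totals["noise"] += (y - mean_y) ** 2
        totals["estimation"] += (y_hat - mean_y) ** 2
        totals["cross"] += (y - mean_y) * (y_hat - mean_y)
    return {name: total / reps for name, total in totals.items()}


terms = decomposition_terms()
rhs = terms["noise"] + terms["estimation"] - 2 * terms["cross"]
# Compare the left hand side with the recombined right hand side, and the
# simulated noise term with the theoretical value N * p * (1 - p).
print(terms, rhs, 50 * 0.3 * (1 - 0.3))
```

The sample averages satisfy the identity up to floating-point error because it holds for every single draw; only the match between the simulated noise term and $N p_t(1 - p_t)$ is subject to Monte Carlo error.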
Because we want to make predictions one week ahead, we will consider the
following model, with $N$ being the population size in Oslo (here for simplicity
assumed to be constant and equal to 681 071 over the whole period):
\[
y_t \sim \mathrm{Binom}(N, p_t), \qquad
\operatorname{logit}(p_t) = \beta_0 + \sum_{j=1}^{3} \beta_j v_{t-7-j} + \sum_{j=1}^{3} \beta_{3+j} z_{t-7-j},
\]
where we also assume all observations are independent.
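To make the indexing concrete, here is a minimal sketch of how the lagged covariates $v_{t-8}, v_{t-9}, v_{t-10}, z_{t-8}, z_{t-9}, z_{t-10}$ and the mean $N p_t$ could be computed. The function names, the toy series and the coefficient values are hypothetical and chosen only for illustration; only the population size comes from the problem text.

```python
import math

N_OSLO = 681_071  # population size stated in the problem text


def lagged_covariates(v, z, t):
    """Return [v_{t-8}, v_{t-9}, v_{t-10}, z_{t-8}, z_{t-9}, z_{t-10}]."""
    if t < 10:
        raise ValueError("need at least 10 earlier days of test data")
    lags = (8, 9, 10)  # t - 7 - j for j = 1, 2, 3
    return [v[t - k] for k in lags] + [z[t - k] for k in lags]


def expected_hospitalizations(beta, x, n=N_OSLO):
    """N * p_t with logit(p_t) = beta[0] + sum_j beta[j] * x[j - 1]."""
    eta = beta[0] + sum(b * xj for b, xj in zip(beta[1:], x))
    p_t = 1.0 / (1.0 + math.exp(-eta))  # inverse logit
    return n * p_t


# Hypothetical toy series and coefficients, purely for illustration.
v = [100 + 5 * day for day in range(30)]     # positive tests per day
z = [1500 + 10 * day for day in range(30)]   # total tests per day
beta = [-13.0, 0.002, 0.001, 0.001, 0.0002, 0.0001, 0.0001]

x_20 = lagged_covariates(v, z, t=20)
print(x_20, expected_hospitalizations(beta, x_20))
```

Because every covariate is at least eight days old, a prediction for day $t$ can be made one week ahead using only data that are already available.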
Note that using test data from some days earlier makes sense in this case because
it typically takes 10-12 days from infection until one (potentially)
becomes so sick that one needs to go to hospital. The delay from when people get
infected until they take a test typically varies between 2 and 5 days.
Note: The test data we talk about here is something different from the test
data we have talked about during the course.
When fitting the model above, we obtained many non-significant coefficients,
so two model selection procedures were considered, giving the following
regression tables (where v8 corresponds to $v_{t-8}$ and so on):
Model 1
Coefficients:
              Estimate  Std. Error   z value  Pr(>|z|)
(Intercept) -1.377e+01   1.049e-01  -131.303   < 2e-16 ***
v8           4.353e-03   6.106e-04     7.130  1.00e-12 ***
v10          3.858e-03   6.102e-04     6.322  2.59e-10 ***
z10          3.953e-04   6.477e-05     6.102  1.05e-09 ***
Model 2
Coefficients:
              Estimate  Std. Error   z value  Pr(>|z|)
(Intercept) -1.370e+01   1.101e-01  -124.415   < 2e-16 ***
v8           3.669e-03   7.469e-04     4.913  8.98e-07 ***
v9           1.786e-03   8.549e-04     2.089    0.0367 *
z9          -1.752e-04   9.867e-05    -1.775    0.0759 .
v10          2.860e-03   7.344e-04     3.894  9.85e-05 ***
z10          5.064e-04   9.487e-05     5.337  9.45e-08 ***
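As a reading aid for the tables, the sketch below recomputes a z value as estimate divided by standard error and plugs the Model 1 estimates into the inverse logit to get an expected daily count $N \hat{p}_t$. The coefficient values are copied from the Model 1 table; the input values (200 positive tests, 2000 total tests) and the function names are hypothetical and are not taken from the exam data.

```python
import math

N_OSLO = 681_071  # population size stated in the problem text

# (estimate, standard error) pairs copied from the Model 1 table.
MODEL_1 = {
    "(Intercept)": (-1.377e+01, 1.049e-01),
    "v8": (4.353e-03, 6.106e-04),
    "v10": (3.858e-03, 6.102e-04),
    "z10": (3.953e-04, 6.477e-05),
}


def z_value(term):
    """Wald z statistic: estimate divided by its standard error."""
    estimate, std_error = MODEL_1[term]
    return estimate / std_error


def predict_model_1(v8, v10, z10, n=N_OSLO):
    """Expected number of hospitalizations N * p_hat under Model 1."""
    eta = (
        MODEL_1["(Intercept)"][0]
        + MODEL_1["v8"][0] * v8
        + MODEL_1["v10"][0] * v10
        + MODEL_1["z10"][0] * z10
    )
    return n / (1.0 + math.exp(-eta))  # n * inverse_logit(eta)


# Hypothetical inputs, chosen only to be in the range shown in the plots.
pred = predict_model_1(v8=200, v10=200, z10=2000)
print(round(z_value("v8"), 3), pred)
```

Recomputed z values can differ from the table in the last digit, since the printed estimates and standard errors are rounded.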