Math6168 Machine Learning: Coursework 2
Worth: 30 marks.
(a) Coursework must be handed in via Turnitin Assignments on Blackboard by the
deadline specified above.
(b) Standard university guidelines will be followed for late coursework.
(c) All coursework must be carried out and written up independently.
Standard School of Mathematics guidelines will be used to detect
excessive collaboration and plagiarism, and appropriate penalties
will be issued if required. All suspicious cases will be referred to
the academic integrity officer.
(d) The page limit is strict and is easily sufficient to receive full credit.
All materials related to a question, including plots and any appendices,
must fall within these limits. You do not have to submit the
computer code which you used for the analysis (unless requested
in the question), but you should explain clearly what analysis has been
done. Marks will be deducted for work which exceeds the page
limits. If you have too much material then you will need to decide
what is important and what can be left out.
(e) The questions involve the modelling of real data. There is not
necessarily a single ‘correct’ answer. Submissions which demonstrate
a good appreciation of statistical modelling principles, together with
correct application of appropriate methods, will receive high marks.
Copyright 2022 © University of Southampton Number of Pages 3
2 MATH6168W1
1. [Total 10 marks, no page limit]
Suppose that $y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \varepsilon_i$, with $X_{i2} = X_{i1} + 1$,
for $i = 1, 2, \ldots, n$, where $\varepsilon_i$, $i = 1, \ldots, n$, are independent and identically
distributed errors with $E(\varepsilon_i) = 0$ and $\mathrm{Var}(\varepsilon_i) = \sigma^2$, and
$X_i = (1, X_{i1}, X_{i2})'$ is a non-random design vector. We denote by $X'$ the transpose of
a vector or matrix $X$, and let $\beta = (\beta_0, \beta_1, \beta_2)'$ be the vector of parameters
to be estimated based on a set of data $(y_i, X_{i1}, X_{i2})$, $i = 1, 2, \ldots, n$.
(a) Write out the least squares regression optimization problem. Can you find the
least squares estimator (LSE) of $\beta = (\beta_0, \beta_1, \beta_2)'$? If yes, give your LSE of
$\beta = (\beta_0, \beta_1, \beta_2)'$; if no, provide your reasoning, and specify the vector of
parameters that you can estimate and give its LSE. Hence give an estimator of $\sigma^2$.
[3 marks]
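As a starting point for part (a), one possible line of reasoning (a sketch, not a model answer) is to substitute the constraint $X_{i2} = X_{i1} + 1$ into the model and see which combinations of the coefficients the data can identify. The symbols $\gamma_0$ and $\gamma_1$ below are introduced here for illustration and do not appear in the question; the sketch also assumes the $X_{i1}$ are not all equal.

$$
y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 (X_{i1} + 1) + \varepsilon_i
    = \underbrace{(\beta_0 + \beta_2)}_{\gamma_0} + \underbrace{(\beta_1 + \beta_2)}_{\gamma_1} X_{i1} + \varepsilon_i .
$$

The third column of the design matrix equals the sum of the first two, so $X'X$ is singular and the least squares problem $\min_{\beta} \sum_{i=1}^{n} (y_i - X_i'\beta)^2$ has no unique minimiser in $\beta$. The reduced model in $(\gamma_0, \gamma_1)$ is an ordinary simple linear regression, for which

$$
\hat\gamma_1 = \frac{\sum_{i=1}^{n} (X_{i1} - \bar X_1)(y_i - \bar y)}{\sum_{i=1}^{n} (X_{i1} - \bar X_1)^2},
\qquad
\hat\gamma_0 = \bar y - \hat\gamma_1 \bar X_1,
\qquad
\hat\sigma^2 = \frac{1}{n - 2} \sum_{i=1}^{n} \bigl(y_i - \hat\gamma_0 - \hat\gamma_1 X_{i1}\bigr)^2 .
$$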
(b) Write out the ridge regression optimization problem for $\beta = (\beta_0, \beta_1, \beta_2)'$, and
derive the ridge regression estimator of $\beta = (\beta_0, \beta_1, \beta_2)'$ or the vector of
parameters that you believe can be estimated, and hence give an estimator of $\sigma^2$, with
a given ridge tuning parameter denoted by $\lambda > 0$. [4 marks]
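For orientation on part (b), the ridge problem can be sketched as follows. Whether the intercept is penalised is a convention the question does not fix; the version below leaves $\beta_0$ unpenalised, and the matrix $D$ is notation introduced here for illustration.

$$
\hat\beta^{\text{ridge}}
= \arg\min_{\beta} \; \sum_{i=1}^{n} \bigl(y_i - \beta_0 - \beta_1 X_{i1} - \beta_2 X_{i2}\bigr)^2 + \lambda \bigl(\beta_1^2 + \beta_2^2\bigr)
= (X'X + \lambda D)^{-1} X'y,
\qquad D = \mathrm{diag}(0, 1, 1),
$$

where $X$ is the $n \times 3$ matrix with rows $X_i'$ and $y = (y_1, \ldots, y_n)'$. Although $X'X$ is singular here, its null space is spanned by $v = (1, 1, -1)'$ when the $X_{i1}$ are not all equal, and $v'Dv = 2 > 0$, so $X'X + \lambda D$ is positive definite for any $\lambda > 0$ and the inverse exists. If the intercept is penalised as well, $D$ is replaced by the identity matrix and the same argument applies.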
(c) Write out the lasso regression optimization problem for $\beta = (\beta_0, \beta_1, \beta_2)'$ with a
lasso tuning parameter denoted by $\lambda > 0$. For $\beta = (\beta_0, \beta_1, \beta_2)'$ or the vector of
parameters that you believe can be estimated, describe a method for determining
$\lambda$ by leave-one-out cross validation (LOOCV) with necessary steps and details.
[3 marks]
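The LOOCV procedure asked for in part (c) can be illustrated with a minimal base-R sketch on simulated data. Several choices below are assumptions rather than part of the question: the model is written in the reduced form $y_i = a + b X_{i1} + \varepsilon_i$, the intercept is left unpenalised, the objective is scaled as $\frac{1}{2n}\sum_i (y_i - a - b X_{i1})^2 + \lambda |b|$, and the grid of candidate values is arbitrary. The helper names `soft` and `fit_lasso1` are hypothetical. With a single predictor the lasso solution is a soft-thresholded least squares slope, so no fitting package is needed.

```r
set.seed(1)

# Simulated data standing in for (y_i, X_i1); the true values are arbitrary.
n  <- 40
x1 <- rnorm(n)
y  <- 1 + 2 * x1 + rnorm(n)

# Soft-thresholding operator S(z, lambda) = sign(z) * max(|z| - lambda, 0).
soft <- function(z, lambda) sign(z) * pmax(abs(z) - lambda, 0)

# Lasso fit for one predictor with an unpenalised intercept:
# minimises (1 / (2 m)) * sum((y - a - b x)^2) + lambda * |b| over (a, b).
fit_lasso1 <- function(x, y, lambda) {
  xc <- x - mean(x)
  yc <- y - mean(y)
  b  <- soft(mean(xc * yc), lambda) / mean(xc^2)
  c(a = mean(y) - b * mean(x), b = b)
}

# Candidate tuning parameters (an arbitrary illustrative grid).
lambda_grid <- seq(0, 3, length.out = 31)

# LOOCV: for each lambda, leave out each observation in turn, refit on the
# remaining n - 1 points, and record the squared prediction error.
cv_err <- sapply(lambda_grid, function(lam) {
  mean(sapply(seq_len(n), function(i) {
    f <- fit_lasso1(x1[-i], y[-i], lam)
    (y[i] - f[["a"]] - f[["b"]] * x1[i])^2
  }))
})

# Choose the lambda with the smallest LOOCV error, then refit on all data.
lambda_hat <- lambda_grid[which.min(cv_err)]
final_fit  <- fit_lasso1(x1, y, lambda_hat)
```

The same loop structure carries over to any other fitting routine: only `fit_lasso1` and the prediction line would change.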
2. [Total 20 marks, maximum of 5 sides of A4, font size at least 10]
The data set in the file HittersPart.txt, available on Blackboard, provides a frame
of Major League Baseball data from the 1986 and 1987 seasons, with 260 observations
of major league players on the following 19 variables:
AtBat [Number of times at bat in 1986]
Hits [Number of hits in 1986]
HmRun [Number of home runs in 1986]
Runs [Number of runs in 1986]
RBI [Number of runs batted in in 1986]
Walks [Number of walks in 1986]
Years [Number of years in the major leagues]
CAtBat [Number of times at bat during his career]
CHits [Number of hits during his career]
CHmRun [Number of home runs during his career]
CRuns [Number of runs during his career]
CRBI [Number of runs batted in during his career]
CWalks [Number of walks during his career]
League [A factor with levels A and N indicating player’s league at the end of 1986]
Division [A factor with levels E and W indicating player’s division at the end of 1986]
PutOuts [Number of put outs in 1986]
Assists [Number of assists in 1986]
Errors [Number of errors in 1986]
Salary [1987 annual salary on opening day in thousands of dollars]
We will now try to predict whether a player has a high salary, that is, whether the
value of Salary [1987 annual salary on opening day in thousands of dollars] exceeds
its median in the HittersPart data set.
(a) Define in R code a new factor variable salary01 that takes the value 1 if the
value of Salary exceeds its median and 0 otherwise. Also write R code to split
the data set into a training set and a test set in proportions of 80% and 20%,
respectively.
[4 marks]
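A minimal sketch of the kind of code part (a) asks for is given below. Because the layout of HittersPart.txt is not described here, the sketch builds a small simulated data frame with a Salary column instead of reading the file; the object names `hitters`, `train` and `test`, the seed, and the use of a simple random split are all assumptions made for illustration.

```r
set.seed(6168)

# Simulated stand-in for the real data (260 rows, three of the 19 variables).
# With the real file one would read it in instead, e.g. with read.table(),
# using arguments that match the file's actual layout.
hitters <- data.frame(
  Hits   = rpois(260, 100),
  Years  = sample(1:20, 260, replace = TRUE),
  Salary = round(rlnorm(260, meanlog = 6, sdlog = 0.8))
)

# salary01 = 1 if Salary exceeds its median, 0 otherwise, stored as a factor.
hitters$salary01 <- factor(
  as.integer(hitters$Salary > median(hitters$Salary)),
  levels = c(0, 1)
)

# Random 80% / 20% split into training and test sets.
n         <- nrow(hitters)
train_idx <- sample(n, size = round(0.8 * n))
train     <- hitters[train_idx, ]
test      <- hitters[-train_idx, ]
```

Fixing the seed makes the split reproducible, which helps when the same training and test sets are reused across several models.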
In each of Parts (b)-(d) below, fit the model as required on the training set, present
the estimated coefficients, and report the misclassification error rate of prediction for
the test data set:
(b) A linear logistic model using maximum likelihood;
[4 marks]
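For part (b), a maximum likelihood logistic fit and its test misclassification rate might look like the sketch below. It runs on simulated data with two stand-in predictors; the data-generating coefficients, the 0.5 classification threshold, and all object names are assumptions made for illustration.

```r
set.seed(1)

# Simulated stand-in data: two predictors and a binary factor response.
n   <- 260
dat <- data.frame(
  Hits  = rpois(n, 100),
  Years = sample(1:20, n, replace = TRUE)
)
p_true       <- plogis(-3 + 0.02 * dat$Hits + 0.1 * dat$Years)
dat$salary01 <- factor(rbinom(n, 1, p_true), levels = c(0, 1))

# 80% / 20% split.
train_idx <- sample(n, size = round(0.8 * n))
train     <- dat[train_idx, ]
test      <- dat[-train_idx, ]

# Logistic regression by maximum likelihood on the training set.
# With the real data, Salary itself should be excluded from the predictors,
# e.g. via the formula salary01 ~ . - Salary.
fit <- glm(salary01 ~ ., data = train, family = binomial)
est_coef <- coef(fit)

# Predicted probabilities on the test set, classified at threshold 0.5.
prob     <- predict(fit, newdata = test, type = "response")
pred     <- ifelse(prob > 0.5, 1, 0)
truth    <- as.integer(as.character(test$salary01))
test_err <- mean(pred != truth)
```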
(c) A ridge logistic regression model, with λ chosen by cross-validation in
misclassification error;
[4 marks]
(d) A lasso logistic model, with λ chosen by cross-validation in misclassification error;
[4 marks]
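Parts (c) and (d) could be approached along the following lines, assuming the glmnet package is available (the question does not name a package, so this choice is an assumption). The sketch uses simulated predictors; the helper `fit_penalised` is hypothetical, and selecting `lambda.min` rather than `lambda.1se` is one of two common conventions.

```r
library(glmnet)
set.seed(1)

# Simulated stand-in data: 5 numeric predictors and a 0/1 response.
n <- 260
X <- matrix(rnorm(n * 5), n, 5, dimnames = list(NULL, paste0("x", 1:5)))
y <- rbinom(n, 1, plogis(X[, 1] - X[, 2]))

train_idx <- sample(n, size = round(0.8 * n))

# Fit a penalised logistic model with lambda chosen by cross-validated
# misclassification error; alpha = 0 gives ridge, alpha = 1 gives lasso.
fit_penalised <- function(alpha) {
  cvfit <- cv.glmnet(X[train_idx, ], y[train_idx],
                     family = "binomial", alpha = alpha,
                     type.measure = "class")
  pred <- predict(cvfit, newx = X[-train_idx, ],
                  s = "lambda.min", type = "class")
  list(
    lambda   = cvfit$lambda.min,
    coef     = coef(cvfit, s = "lambda.min"),
    test_err = mean(as.integer(pred) != y[-train_idx])
  )
}

ridge <- fit_penalised(alpha = 0)
lasso <- fit_penalised(alpha = 1)
```

With the real data, which contains the factors League and Division, the predictor matrix would first need to be made numeric, for instance with `model.matrix(salary01 ~ . - Salary, data)[, -1]`, so that Salary is not used to predict its own dichotomised version.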
(e) Compare the results obtained in (b)–(d). How accurately can you predict whether
a player has a high salary in 1987 based on the player's performance in 1986?
Is there much difference among the test errors resulting from these three
approaches? Give your reasoning.