Statistical Machine Learning
The Hong Kong University of Science and Technology
Spring EXAMINATION, 2019-2020
Course Code: MFIT5010    Section No.: 01    Total Number of Pages: 3
INSTRUCTIONS:
1. Answer ALL of the following questions.
2. The full mark for this examination is 100.
3. Answers without sufficient explanations/steps receive no or partial marks.
4. A calculator is allowed during the exam. Internet access (e.g., Google search) is NOT
allowed.
5. Open book and open notes, but each student must work independently.
1. (10 marks)
Consider a data set in which each data point $\{y_i, x_i\}_{i=1,\dots,n}$ is associated with a
weighting factor $w_i > 0$, so that the sum-of-squares error function becomes
\[
\min_{\beta} \; \frac{1}{2} \sum_{i=1}^{n} w_i \left( y_i - \beta^{T} x_i \right)^2 .
\]
Find the optimal solution of the above problem.
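As a quick numerical illustration (not part of the exam paper): setting the gradient of the objective above to zero gives the closed form $\hat{\beta} = (X^T W X)^{-1} X^T W y$ with $W = \mathrm{diag}(w_1,\dots,w_n)$. The sketch below, with simulated data and variable names of my own choosing, cross-checks this against an ordinary least-squares fit on $\sqrt{w_i}$-rescaled data.

```python
import numpy as np

# Weighted least squares: beta = (X^T W X)^{-1} X^T W y, W = diag(w).
rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.standard_normal((n, p))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(n)
w = rng.uniform(0.5, 2.0, size=n)          # positive weights

W = np.diag(w)
beta_hat = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)

# Cross-check: rescaling each row by sqrt(w_i) reduces the problem to
# ordinary least squares on (sqrt(w_i) x_i, sqrt(w_i) y_i).
sw = np.sqrt(w)
beta_ols, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
print(np.allclose(beta_hat, beta_ols))     # True
```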
2. (20 marks)
Consider a data set $D = \{x_i, y_i\}$, $i = 1, \dots, n$ for classification, where
$x_i = [x_{i1}, \dots, x_{ip}] \in \mathbb{R}^p$ and $y_i \in \{0, 1\}$, and $n$ is the number of
samples. Suppose we choose to minimize the exponential loss function
\[
L(y, F) = \frac{1}{n} \sum_{i=1}^{n} \exp\bigl( -(2y_i - 1)\, F(x_i) \bigr),
\]
where $F(x) = f_0 + \sum_{m=1}^{M} f_m(x)$ is an additive model, $f_0$ is the intercept term,
and each $f_m(x)$ is to be fitted by a regression tree.
(a) Estimate f0 and justify your result. (5 marks)
(b) Suppose we have fitted $F(x)$ as $\hat{F}_m(x)$ at the $m$-th step. Using the gradient
boosting approach, we would like to find $f_{m+1}(x)$ by minimizing
\[
\frac{1}{n} \sum_{i=1}^{n} \bigl( -g_{m,i} - f_{m+1}(x_i) \bigr)^2 ,
\]
where $g_{m,i}$ is the functional gradient evaluated at the current step. Derive the closed
form of $-g_{m,i}$. (5 marks)
(c) Suppose we have fitted a tree with $J$ terminal nodes by solving the optimization
problem in (b). Let $\hat{T}(x) = \sum_{j=1}^{J} \hat{c}_j I(x \in S_j)$ be the regression tree,
where $S_j$ is the $j$-th partition region and $I(\cdot)$ is the indicator function. To further
reduce the chosen exponential loss, re-adjust the constants $\hat{c}_j$ by solving the
following optimization problem using the Newton-Raphson method, given the fitted
$\hat{F}_m(x)$ and the partition $\{S_j\}_{j=1,\dots,J}$:
\[
\min_{c_j} \; \frac{1}{n} \sum_{i=1}^{n}
L\Bigl( y_i,\; \hat{F}_m(x_i) + \sum_{j=1}^{J} c_j I(x_i \in S_j) \Bigr).
\]
Hint: You need to derive the closed forms of the gradient and the Hessian. (10 marks)
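For intuition (not a substitute for the derivations asked above), the sketch below computes two of the quantities involved on simulated labels: the optimal intercept, which solves $\frac{d}{df_0}\frac{1}{n}\sum_i e^{-(2y_i-1)f_0} = 0$ and comes out to $f_0 = \frac{1}{2}\log(n_1/n_0)$, and the negative functional gradient $-g_{m,i} = (2y_i-1)\,e^{-(2y_i-1)\hat{F}_m(x_i)}$ (up to the constant $1/n$ factor). The variable names and the brute-force grid check are my own.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
y = rng.integers(0, 2, size=n)     # labels in {0, 1}
s = 2 * y - 1                      # labels mapped to {-1, +1}

# (a) Intercept minimizing (1/n) * sum_i exp(-s_i * f0):
#     f0 = 0.5 * log(n1 / n0), n1 = #{y_i = 1}, n0 = #{y_i = 0}.
n1, n0 = (y == 1).sum(), (y == 0).sum()
f0 = 0.5 * np.log(n1 / n0)

# Cross-check against a brute-force grid search over candidate intercepts.
grid = np.linspace(-2, 2, 20001)
loss = np.exp(-np.outer(s, grid)).mean(axis=0)
f0_grid = grid[loss.argmin()]
print(abs(f0 - f0_grid) < 1e-3)    # True

# (b) Negative functional gradient of the exponential loss at F_m
#     (the 1/n factor is a constant scaling, absorbed into the step size).
F_m = rng.standard_normal(n)       # stand-in for current fitted values
neg_grad = s * np.exp(-s * F_m)
```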
3. (10 marks)
Consider a data set $D = \{x_i, y_i\}_{i=1,\dots,n}$ for a classification problem, where $n = 100$
samples are in two equal-sized classes, $y_i \in \{0, 1\}$ and $x_i \in \mathbb{R}^{10 \times 1}$.
It is known that the design matrix $X = [x_1^T, \dots, x_n^T] \in \mathbb{R}^{100 \times 10}$ is a
full-rank matrix. Suppose we used standard software to apply logistic regression (a linear
model on the logit scale) to this data set, but the software issued the warning "algorithm
did not converge". What is the problem here? Can linear discriminant analysis (LDA) be
applied to $D$ without such a numerical problem? Justify your answer.
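One plausible numerical illustration (a sketch of the phenomenon, not the graded answer): when the two classes are perfectly linearly separable, the logistic-regression likelihood has no finite maximizer, so iterative fitting pushes the coefficient toward infinity, whereas LDA only needs class means and a pooled within-class variance, which always exist here. The one-dimensional toy data and step size below are my own choices.

```python
import numpy as np

x = np.array([-2.0, -1.5, -1.0, 1.0, 1.5, 2.0])   # separable in 1-D
y = np.array([0, 0, 0, 1, 1, 1])

def fit_logistic(x, y, iters):
    """Gradient ascent on the logistic log-likelihood (no intercept)."""
    beta = 0.0
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-beta * x))
        beta += 0.5 * np.sum((y - p) * x)          # fixed step size
    return beta

b_small, b_large = fit_logistic(x, y, 100), fit_logistic(x, y, 10000)
print(b_large > b_small)   # True: beta keeps growing, never converges

# LDA direction (p = 1): the pooled within-class variance is strictly
# positive here, so the discriminant coefficient is well defined.
mu0, mu1 = x[y == 0].mean(), x[y == 1].mean()
s2 = (np.sum((x[y == 0] - mu0) ** 2)
      + np.sum((x[y == 1] - mu1) ** 2)) / (len(x) - 2)
lda_coef = (mu1 - mu0) / s2
print(np.isfinite(lda_coef))  # True
```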
4. (10 marks)
Consider a set of $D$ binary variables $x_i$, where $i = 1, \dots, D$, each of which is governed
by a Bernoulli distribution with parameter $\mu_i$, so that
\[
p(x \mid \mu) = \prod_{i=1}^{D} \mu_i^{x_i} (1 - \mu_i)^{1 - x_i},
\]
where $x = (x_1, \dots, x_D)^T$ and $\mu = (\mu_1, \dots, \mu_D)^T$. Now let us consider a finite
mixture of these distributions given by
\[
p(x \mid \theta) = \sum_{k=1}^{K} \pi_k \, p(x \mid \mu_k),
\]
where $\theta = \{\mu_1, \dots, \mu_K, \pi_1, \dots, \pi_K\}$ and
$p(x \mid \mu_k) = \prod_{i=1}^{D} \mu_{ki}^{x_i} (1 - \mu_{ki})^{1 - x_i}$. Assuming the set
of parameters $\theta$ is known, what are the mean vector $E[x]$ and the covariance matrix
$\mathrm{Cov}[x]$?
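For a numerical sanity check (not the requested derivation): the moments follow from $E[x] = \sum_k \pi_k \mu_k$ and $E[xx^T] = \sum_k \pi_k \bigl(\mathrm{diag}(\mu_k \odot (1-\mu_k)) + \mu_k \mu_k^T\bigr)$, since within component $k$ the coordinates are independent Bernoulli$(\mu_{ki})$. The helper below, with variable names of my own, evaluates these formulas and verifies the degenerate $K = 1$ case, where the covariance must reduce to $\mathrm{diag}(\mu \odot (1-\mu))$.

```python
import numpy as np

def mixture_moments(pi, mu):
    """Mean and covariance of a finite Bernoulli mixture.

    pi: (K,) mixing weights; mu: (K, D) component parameters.
    """
    mean = pi @ mu                                   # sum_k pi_k * mu_k
    second = sum(p * (np.diag(m * (1 - m)) + np.outer(m, m))
                 for p, m in zip(pi, mu))            # E[x x^T]
    return mean, second - np.outer(mean, mean)

# Sanity check with K = 1: the mixture collapses to independent
# Bernoullis, so Cov[x] = diag(mu * (1 - mu)).
mu = np.array([[0.2, 0.7, 0.5]])
mean, cov = mixture_moments(np.array([1.0]), mu)
print(np.allclose(cov, np.diag(mu[0] * (1 - mu[0]))))  # True
print(np.allclose(mean, mu[0]))                        # True
```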
5. (10 marks)
Consider a scenario with $n = 50$ samples in two equal-sized classes and $p = 5{,}000$
quantitative predictors (standard Gaussian) that are independent of the class labels.
(a) Based on the misclassification error rate, what is the training error rate of the
1-nearest-neighbor (1-NN) classifier? What is its true test error rate? (5 marks)
(b) Suppose we carried out the following strategy: (1) select the 100 predictors having the
highest correlation with the class labels; (2) use a 1-NN classifier based on the selected
100 predictors for classification; (3) use cross-validation to estimate the misclassification
error rate of the 1-NN classifier on the selected subset $\{x_i, y_i\}_{i=1,\dots,n}$, where
$x_i \in \mathbb{R}^{100 \times 1}$. Do you think this is a correct way of cross-validation?
If yes, please justify your answer; if no, please give the correct way of cross-validation.
(5 marks)
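As an illustration of the issue at stake (a sketch, not the graded answer): the features are pure noise, so any honest error estimate should be near 50%. Screening the 100 most correlated predictors on the full data before cross-validating leaks label information into every fold; the honest procedure repeats the screening inside each training fold. The fold count, seed, and helper names below are my own choices.

```python
import numpy as np

rng = np.random.default_rng(42)
n, p, k = 50, 5000, 100
X = rng.standard_normal((n, p))          # noise, independent of labels
y = np.repeat([0, 1], n // 2)

def top_corr(X, y, k):
    """Indices of the k features most correlated with the labels."""
    yc = y - y.mean()
    c = np.abs(yc @ (X - X.mean(axis=0))) / X.std(axis=0)
    return np.argsort(c)[-k:]

def one_nn_error(Xtr, ytr, Xte, yte):
    d = ((Xte[:, None, :] - Xtr[None, :, :]) ** 2).sum(axis=2)
    return np.mean(ytr[d.argmin(axis=1)] != yte)

folds = np.array_split(rng.permutation(n), 5)

def cv_error(select_inside_fold):
    sel_all = top_corr(X, y, k)          # leaky: screening uses ALL labels
    errs = []
    for te in folds:
        tr = np.setdiff1d(np.arange(n), te)
        sel = top_corr(X[tr], y[tr], k) if select_inside_fold else sel_all
        errs.append(one_nn_error(X[np.ix_(tr, sel)], y[tr],
                                 X[np.ix_(te, sel)], y[te]))
    return np.mean(errs)

# The leaky estimate is optimistically low; the honest one hovers near 0.5.
print(cv_error(False), cv_error(True))
```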