Statistical Machine Learning
Homework 2 Statistical Machine Learning
Notes: Please upload all your code with your assignment on Canvas before 7pm on
March 31, 2022. Homework must be neatly written up or typed for submission. I reserve
the right to refuse homework that is deemed (by me) to be excessively messy.
1. Probabilistic PCA and Factor Analysis. Suppose m < p and that W is some p×m matrix.
Suppose further that Z ∼ N(0, Im) and that X|Z ∼ N(µ+WZ, σ2Ip). This is called the
probabilistic PCA model.
(a) Let Σ = Cov(X). Prove that the eigenvectors of Σ are the same as the eigenvectors
of WW>. Thus, estimating the eigenvectors of Σ is equivalent to estimating the
eigenvectors of WW>.
(b) Given training data x1, . . . , xN ∈ Rp, prove that the MLE of µ is the sample mean,
µ̂ = (1/N) ∑_{i=1}^N xi.
(c) Explain the relationship between probabilistic PCA and statistical factor analysis.
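A quick numerical sanity check (not a substitute for the proof in part (a)) can be helpful: under the model, Σ = Cov(X) = WW> + σ2Ip, so every eigenvector of WW> is also an eigenvector of Σ, with eigenvalue shifted by σ2. The dimensions and seed below are arbitrary choices.

```python
import numpy as np

# Verify numerically that Sigma = W W^T + sigma^2 I_p shares eigenvectors
# with W W^T: if W W^T u = lam u, then Sigma u = (lam + sigma^2) u.
rng = np.random.default_rng(0)
p, m, sigma2 = 5, 2, 0.5          # arbitrary dimensions with m < p
W = rng.standard_normal((p, m))

WWt = W @ W.T
Sigma = WWt + sigma2 * np.eye(p)

evals, evecs = np.linalg.eigh(WWt)   # columns of evecs are eigenvectors
for lam, u in zip(evals, evecs.T):
    # u is an eigenvector of Sigma as well, with eigenvalue lam + sigma^2
    assert np.allclose(Sigma @ u, (lam + sigma2) * u)
print("eigenvectors of W W^T are eigenvectors of Sigma")
```

Checking the eigenvector equation directly (rather than comparing the output of two separate eigendecompositions) avoids ambiguity in the degenerate eigenspace of WW>, which has eigenvalue 0 with multiplicity p − m.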
2. LARS, Lasso and Ridge. This problem uses the big8 dataset, which is available on
Canvas. The dataset contains information on 8 companies from the year 2004. The variable
RETX contains the daily simple returns. For this problem we take the return of the S&P500
index (labeled sprtrn in the data file) to be the output (i.e. y), and the returns of AIG,
C, COP, F, GE, GM, IBM, XOM (labeled RETX in the data file) on the same day to be
inputs (i.e. X). Divide the complete dataset into training and testing datasets, by date: The
training dataset should contain the data from Jan. 2, 2004 to June 30, 2004; the testing
dataset should contain the data from July 1, 2004 to Dec. 31. 2004. Run the following
regression methods on the training data:
(i) LARS
(ii) Lasso
(iii) Ridge regression
You may use the linear_model module of sklearn to fit LARS, Lasso and Ridge regression. Remember
to include the intercept term in your regression models and to standardize appropriately!
(a) For LARS and Lasso, list the sequence in which each of the predictors enter the
regression model.
(b) For ridge regression, fit the estimators for a fine grid of reasonably chosen tuning
parameter values λ (use at least 100 values of λ).
(c) For each method use (i) AIC and (ii) 5-fold cross-validation to pick a “final model”,
based only on the training data. Estimate the test error for your final models using
the test data, i.e. find the average prediction error. How does the test error for the
final models compare to the minimum test error for each method? (The minimum test
error for a given method is found by computing the test error for all tuning parameter
values, and then finding the minimum.)
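A minimal sklearn sketch of the fitting steps in (a) and (b) is below. The big8 read-in and train/test split are placeholders (synthetic returns stand in for the real data), and the ticker list simply mirrors the column names described above; substitute the actual training data from the Canvas file.

```python
import numpy as np
from sklearn.linear_model import lars_path, Ridge
from sklearn.preprocessing import StandardScaler

tickers = ["AIG", "C", "COP", "F", "GE", "GM", "IBM", "XOM"]

# Placeholder data: replace with the Jan-June 2004 training rows of big8.
rng = np.random.default_rng(0)
X_train = rng.standard_normal((120, 8))            # stand-in daily returns
y_train = X_train @ rng.standard_normal(8) + 0.1 * rng.standard_normal(120)

# Standardize the inputs; centering y plays the role of the intercept.
Xs = StandardScaler().fit_transform(X_train)
yc = y_train - y_train.mean()

# (a) Order in which predictors enter: lars_path reports the active-set
#     indices in order of entry. method="lar" gives LARS; method="lasso"
#     gives the lasso path instead.
_, active, _ = lars_path(Xs, yc, method="lar")
print("LARS entry order:", [tickers[j] for j in active])

# (b) Ridge estimators over a fine grid of at least 100 tuning parameters.
lambdas = np.logspace(-4, 2, 100)
coefs = np.array([Ridge(alpha=lam).fit(Xs, yc).coef_ for lam in lambdas])
print("ridge coefficient path shape:", coefs.shape)   # (100, 8)
```

For part (c), the AIC-based lasso choice can be made with sklearn's LassoLarsIC(criterion="aic"), and the 5-fold choice with KFold over the training rows only.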
Homework 2 Statistical Machine Learning II, Semester B 2021/2022
3. EM algorithm for variance component estimation. This problem also uses the big8 dataset.
But first, consider a linear model with inputs x1, . . . , xN ∈ Rp and corresponding outputs
y1, . . . , yN ∈ R. The inputs and outputs satisfy
y = Xβ + ε,
where y = (y1, . . . , yN)>, X = (x1, . . . , xN)>, β = (β1, . . . , βp)>, and ε = (ε1, . . . , εN)>. Further
assume that ε1, . . . , εN ∼ N(0, σ2) and β1, . . . , βp ∼ N(0, τ2) are all independent. This is
called a random effects model and σ2, τ2 are called the variance components. The goal of
the exercise is to derive a method for estimating σ2 and τ2 using the EM-algorithm.
(a) Write down the likelihood for θ = (σ2, τ2), based on the observed data (y,X).
(b) Now suppose that β is also (somehow) observed. Write down the likelihood for θ based
on the data (y,X,β).
(c) We wish to use the EM-algorithm to find the MLE of θ, which we denote by θ̂ = (σ̂2, τ̂2).
If θ̂old = (σ̂2old, τ̂2old) is the estimate of θ at a given step of the EM-algorithm, show
that the updated estimate is θ̂new = (σ̂2new, τ̂2new), where
σ̂2new = (1/N)‖y − Xβ̂old‖22 + (1/N) tr(X>XΣ̂old),
τ̂2new = (1/p)(‖β̂old‖22 + tr(Σ̂old)),
with
β̂old = (X>X + (σ̂2old/τ̂2old) Ip)−1 X>y,
Σ̂old = ((1/σ̂2old) X>X + (1/τ̂2old) Ip)−1.
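The iteration in part (c) can be sketched in a few lines of numpy. This is a hedged illustration, not a model solution: the variable names are mine, and synthetic data stand in for the big8 design matrix and response.

```python
import numpy as np

# Synthetic stand-in data: true tau^2 = 1, true sigma^2 = 0.25.
rng = np.random.default_rng(0)
N, p = 200, 8
X = rng.standard_normal((N, p))
beta_true = rng.normal(0.0, 1.0, p)
y = X @ beta_true + rng.normal(0.0, 0.5, N)

sigma2, tau2 = 1.0, 1.0                        # initial guesses
for _ in range(200):
    # E-step: posterior covariance and mean of beta given y, X and the
    # current (sigma2, tau2) -- the Sigma_hat_old and beta_hat_old above.
    Sigma_hat = np.linalg.inv(X.T @ X / sigma2 + np.eye(p) / tau2)
    beta_hat = np.linalg.solve(X.T @ X + (sigma2 / tau2) * np.eye(p), X.T @ y)
    # M-step: the variance-component updates.
    sigma2 = (np.sum((y - X @ beta_hat) ** 2)
              + np.trace(X.T @ X @ Sigma_hat)) / N
    tau2 = (beta_hat @ beta_hat + np.trace(Sigma_hat)) / p

# Part (d) preview: beta_hat is the ridge estimator with tuning parameter
# lambda = sigma2 / tau2, so EM suggests a data-driven ridge lambda.
print("sigma^2:", round(sigma2, 3), " tau^2:", round(tau2, 3),
      " suggested ridge lambda:", round(sigma2 / tau2, 3))
```

Note that the E-step quantities use the previous iteration's (σ̂2old, τ̂2old), while the M-step produces the new estimates; the loop simply alternates the two until the values stabilize.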
(d) Discuss the relationship between β̂old, which is defined in part (c), and ridge regression.
What does this approach (the random effects model) suggest regarding how to choose
the tuning parameter for ridge regression?
(e) In Problem 2, you used AIC and 5-fold cross-validation to select the tuning parameter
for ridge regression when analyzing the big8 dataset. Analyze the big8 dataset again,
this time using the method you described in part (d) to select the tuning parameter.
Compare the test error for this method to that of AIC and 5-fold cross-validation.