Problem 1.1 (30 points) LASSO (Least Absolute Shrinkage and Selection Operator) promotes sparsity.
Assume the training data points are denoted as the rows of an n × d matrix X and their corresponding responses as an n-dimensional vector y. For simplicity, assume the columns of X have been standardized to have mean 0 and variance 1 and are uncorrelated (i.e., X^T X = nI).
For LASSO regression, the optimal weight vector is
w^* = argmin_{w ∈ R^d} J_λ(w),
where J_λ(w) denotes the ℓ1-regularized least-squares objective and λ > 0 is the regularization parameter.
• First, show that the standardized training data nicely decouples the loss function, making w_i^* determined by the i-th feature and the output alone, regardless of the other features. That is, explicitly reformulate J_λ(w) as the sum of a term g that does not involve w and d per-coordinate terms f, where the i-th term depends only on w_i, λ, y, and X_i, the i-th column of X.
• Assume that w_i^* > 0. What is the value of w_i^* in this case?
• Assume that w_i^* < 0. What is the value of w_i^* in this case?
• From the previous two parts, what is the condition for w_i^* to be zero?
• Compute the closed-form solution w^* of LASSO regression given a fixed λ > 0 if X^T X is invertible. (A small numerical check of the decoupling above is sketched after this list.)
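For intuition only (this is a sketch under assumptions, not part of the required derivation), the decoupling can be checked numerically. The snippet below assumes the common objective J_λ(w) = ∥y − Xw∥_2^2 + λ∥w∥_1 (the normalization in the lecture may differ), builds an X whose columns are exactly orthogonal and rescaled so that X^T X = nI, and compares the joint minimizer of J_λ with coordinate-wise minimizers that each see only the i-th column X_i and y.

import numpy as np
from scipy.optimize import minimize, minimize_scalar

# Sketch only: assumes J_lam(w) = ||y - Xw||^2 + lam * ||w||_1 (normalization may
# differ from the lecture's definition) and checks the claimed decoupling numerically.
rng = np.random.default_rng(0)
n, d, lam = 200, 3, 50.0

Q, _ = np.linalg.qr(rng.standard_normal((n, d)))    # orthonormal columns
X = np.sqrt(n) * Q                                   # now X^T X = n * I exactly
y = X @ np.array([1.5, -2.0, 0.0]) + rng.standard_normal(n)

def J(w):
    return np.sum((y - X @ w) ** 2) + lam * np.sum(np.abs(w))

# Joint minimization with a generic derivative-free solver.
w_joint = minimize(J, np.zeros(d), method="Nelder-Mead").x

# Coordinate-wise minimization: the i-th problem uses only X_i, y, and lam.
w_coord = np.array([
    minimize_scalar(lambda v, i=i: np.sum((y - v * X[:, i]) ** 2) + lam * abs(v)).x
    for i in range(d)
])

print(np.round(w_joint, 2))   # the two vectors should agree up to solver tolerance
print(np.round(w_coord, 2))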
Problem 1.2 (30 points) Ridge Regression. The ridge regression optimization problem with parameter λ > 0 is given by
min_{w ∈ R^d} R̂_ridge(w) = ∥y − Xw∥_2^2 + λ∥w∥_2^2.   (RidgeR)
1. Show that R̂_ridge(w) is convex with respect to w. You may use the fact that a twice-differentiable function is convex if and only if its Hessian H ∈ R^{d×d} satisfies w^T H w ≥ 0 for all w ∈ R^d (i.e., H is positive semi-definite).
2. Derive the closed-form solution w_ridge^* of (RidgeR).
3. Show that (RidgeR) admits the unique solution w_ridge^* for any matrix X, even when X^T X ∈ R^{d×d} is singular (i.e., rank(X^T X) < d).
4. What is the role of the term λ∥w∥_2^2 in R̂_ridge(w)? What happens to w_ridge^* as λ → 0 and as λ → ∞? (A small numerical illustration of items 3 and 4 is sketched after this list.)
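As a purely numerical illustration of items 3 and 4 (a sketch that assumes the objective R̂_ridge(w) = ∥y − Xw∥_2^2 + λ∥w∥_2^2 and does not use any closed form), the snippet below constructs an X with linearly dependent columns, so that X^T X is singular, and minimizes the ridge objective with a generic solver for several values of λ; the minimizer remains well defined, and its norm shrinks as λ grows.

import numpy as np
from scipy.optimize import minimize

# Sketch only: no closed form is used; the ridge objective is minimized numerically.
rng = np.random.default_rng(0)
n, d = 100, 3
A = rng.standard_normal((n, 2))
X = np.column_stack([A, A[:, 0] + A[:, 1]])     # third column is a linear combination
y = A @ np.array([1.0, -2.0]) + 0.1 * rng.standard_normal(n)

print("rank(X^T X) =", np.linalg.matrix_rank(X.T @ X))   # 2 < d, so X^T X is singular

for lam in [1e-3, 1.0, 1e3]:
    ridge = lambda w: np.sum((y - X @ w) ** 2) + lam * np.sum(w ** 2)
    w_star = minimize(ridge, np.zeros(d)).x
    print(f"lambda = {lam:g}: w* = {np.round(w_star, 3)}, ||w*|| = {np.linalg.norm(w_star):.3f}")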
Problem 1.3 (40 points) Coding Problem
Before you start: please install Anaconda Python (Python 3.x is recommended) by following the instructions at https://www.anaconda.com/download/. You may also use other IDEs (such as VS Code) as long as you are familiar with them.
If you decide to install Anaconda Python, ensure that the following packages are installed: numpy, pytorch (torch), scipy, matplotlib, and jupyter notebook. You can install these packages within Anaconda by running commands like conda install numpy. Note that some of these packages may already be included, depending on the version of Anaconda you install.
The included archive contains skeleton Python code, hw1.ipynb, which you must complete. To begin, open your terminal, navigate to the directory containing this file, and run the command jupyter notebook. This will automatically open a local webpage in your web browser. Click on the hw1.ipynb file to open the notebook.
Sections you need to complete are marked with TODO comments (search for all occurrences of TODO). Complete the scripts below each TODO marker. Additionally, there are some questions that you need to answer. You can either type your answers directly into the .ipynb file by adding a Markdown block or include them in your write-up. If you choose the latter, please specify this clearly in your .ipynb file.
After completing all tasks, run all the blocks and then export your notebook as an HTML file showing the outputs. Submit both the completed .ipynb file and the corresponding HTML file.
1. First Task: Variants of Linear Regression.
In the first task, you will compare the optimal solutions of linear regression, ridge regression, and lasso regression. A dataset of 1,000 samples following a linear model will be provided:
Y = Xw + ϵ,
where X^T = (x_1 | · · · | x_n) ∈ R^{2×n} and Y^T = (y_1 | · · · | y_n) ∈ R^{1×n} with n = 1000, w ∈ R^2, and the entries of the error term ϵ ∈ R^n are i.i.d. zero-mean random variables.
In this part, you are asked to implement linear regression and its two variants:
(a) Read 1.Functions in the file and implement linear regression, ridge regression, and lasso regression in the functions lin_reg, rid_reg, and lasso_reg, respectively. Then generate the datasets and perform the numerical experiment in 2.Datasets and 3.Experiment-(a).
(b) In 4.Experiment-(b), we gradually increase the value of the parameter λ from 1.1^{-3} to 1.1^{90} and then perform ridge regression and lasso regression. Observe the plots shown in 5.Plot, which display the optimal solutions of ridge regression and lasso regression under different values of λ. [Questions: Explain the behavior of the optimal solutions as λ increases. Describe how λ affects the optimal solutions for the two proposed variants of linear regression. Why do their behaviors differ, and in what ways are their tendencies distinct from one another?] (An optional reference sketch of such a λ sweep is given below.)
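The following optional sketch shows one way such a λ sweep can be cross-checked, using scikit-learn's Ridge and Lasso as an independent reference rather than the lin_reg / rid_reg / lasso_reg functions you are asked to implement. Note that scikit-learn's Lasso minimizes (1/(2n))∥y − Xw∥_2^2 + α∥w∥_1, so its α is not on the same scale as the λ used in the notebook; the data generation below simply follows the model Y = Xw + ϵ stated above with an arbitrary choice of w.

import numpy as np
from sklearn.linear_model import Ridge, Lasso

# Sketch only: an independent reference for the lambda sweep, not the required
# implementation. Scaling conventions (alpha vs. lambda) may differ from the notebook.
rng = np.random.default_rng(0)
n, d = 1000, 2
w_true = np.array([3.0, -1.5])                   # arbitrary choice for illustration
X = rng.standard_normal((n, d))
Y = X @ w_true + rng.standard_normal(n)          # Y = Xw + eps, eps i.i.d. zero mean

lambdas = 1.1 ** np.arange(-3, 91)               # the grid 1.1^{-3}, ..., 1.1^{90}
ridge_path, lasso_path = [], []
for lam in lambdas:
    ridge_path.append(Ridge(alpha=lam, fit_intercept=False).fit(X, Y).coef_)
    lasso_path.append(Lasso(alpha=lam, fit_intercept=False, max_iter=10000).fit(X, Y).coef_)

ridge_path, lasso_path = np.array(ridge_path), np.array(lasso_path)
print(ridge_path[[0, -1]])    # ridge coefficients shrink toward 0 but stay nonzero
print(lasso_path[[0, -1]])    # lasso coefficients hit exactly 0 once lambda is large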
2. Second Task: Vanilla Gradient Descent (VGD) & Stochastic Gradient Descent (SGD) for Linear Regression.
In the second task, you are provided with a dataset of 1,000,000 samples. Its input and output also follow a linear model
Y = Xw + ϵ,
where X^T = (x_1 | · · · | x_n) ∈ R^{10×n} and Y^T = (y_1 | · · · | y_n) ∈ R^{1×n} with n = 1,000,000, and the entries of the error term ϵ ∈ R^n are i.i.d. zero-mean random variables.
(a) Read 1.Functions in the file. The functions VGD_auto and VGD_manual train the model using vanilla gradient descent (VGD), as discussed in the lecture. In VGD_auto, the gradients are computed with PyTorch's built-in automatic differentiation. Your task is to calculate the exact expression of the gradient w.grad = ∂L/∂w by hand and implement VGD in VGD_manual. [Hint: remember to include the constant factor of 2 when computing the derivative of the OLS loss function.] Next, generate the datasets and perform the numerical experiment in 2.Datasets and 3.Experiment. Check whether the losses of VGD_manual and VGD_auto are the same. (A minimal autograd-vs-manual gradient check is sketched after part (b) below.)
(b) You might find that the result is not satisfying. In this part, your task is to find proper parameters (e.g., the number of iterations, learning_rate) in 2.Datasets so that the results meet the following requirements:
• The running times are within 0.1 second for SGD and within 10 seconds for VGD (whether the gradients are computed automatically or manually).
• The losses are below 2 for all algorithms.
Observe the plots shown in 4.Plot. [Questions: Compare the differences in numerical performance (losses and running times) between VGD and SGD. Why do their behaviors differ, and in what ways are they distinct from one another?]
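For part (a), one convenient way to sanity-check a hand-derived gradient before using it in VGD_manual is to compare it against PyTorch autograd on a small batch. The sketch below assumes the OLS loss L(w) = (1/n)∥Xw − y∥_2^2 (if the notebook uses a sum instead of a mean, drop the 1/n factor); the expression 2 X^T (Xw − y)/n is the gradient of this assumed loss, not necessarily the exact formula the notebook expects.

import torch

# Sketch only: verify a manually derived OLS gradient against autograd.
torch.manual_seed(0)
n, d = 64, 10
X = torch.randn(n, d)
y = X @ torch.randn(d) + 0.1 * torch.randn(n)
w = torch.randn(d, requires_grad=True)

loss = ((X @ w - y) ** 2).mean()        # assumed loss: (1/n) * ||Xw - y||^2
loss.backward()                          # autograd gradient, as in VGD_auto

manual_grad = 2.0 * X.T @ (X @ w.detach() - y) / n    # hand-derived gradient (note the factor 2)
print(torch.allclose(w.grad, manual_grad, atol=1e-5))  # should print True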