Problem 1 (20 points). In the lectures, we showed that the MLE estimator for linear regression, when the random noise for each data point is independently and identically distributed (i.i.d.) from a Gaussian distribution N(0, σ^2), reduces to linear regression with squared loss (OLS). In this problem, we would like to change the distribution of the noise model and derive the MLE optimization problem. In particular, let us assume that for a fixed unknown linear model w∗ ∈ R^d, the response yi for each data point xi ∈ R^d in the training data S = {(x1, y1), (x2, y2), . . . , (xn, yn)} is generated by

    yi = w∗⊤xi + ϵi,
where ϵi is generated i.i.d. from a Laplace distribution ϵi ∼ Laplace(0, σ) with density function

    P(x | 0, σ) = (1/(2σ)) exp(−|x|/σ) =
        (1/(2σ)) exp(x/σ)     if x < 0,
        (1/(2σ)) exp(−x/σ)    if x ≥ 0,

for x ∼ Laplace(0, σ).
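As a quick sanity check of the density above (an editorial sketch, not part of the assignment; σ = 1.5 and the evaluation grid are arbitrary choices), the piecewise form and the |x|-based form can be compared numerically against scipy.stats.laplace:

    import numpy as np
    from scipy.stats import laplace

    sigma = 1.5                      # arbitrary scale, just for the check
    x = np.linspace(-5.0, 5.0, 1001)

    # |x|-based form of the Laplace(0, sigma) density
    pdf_abs = np.exp(-np.abs(x) / sigma) / (2.0 * sigma)

    # piecewise form: exp(x/sigma) for x < 0 and exp(-x/sigma) for x >= 0
    pdf_piecewise = np.where(x < 0.0,
                             np.exp(x / sigma),
                             np.exp(-x / sigma)) / (2.0 * sigma)

    assert np.allclose(pdf_abs, pdf_piecewise)
    assert np.allclose(pdf_abs, laplace(loc=0.0, scale=sigma).pdf(x))
    print("both forms match scipy.stats.laplace")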
Under the above assumption on the noise model,
• Show that each yi, i = 1, 2, . . . , n, is a random variable that follows the Laplace(w∗⊤xi, σ) distribution (a small simulation check of this claim is sketched below).
• Write down the MLE estimator for the training data in S and derive the final optimization problem. Note that you just need to state the final minimization problem, not its solution.
• Compare the obtained optimization problem to the one obtained under the Gaussian noise model and highlight the key differences.
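A minimal simulation sketch of the generative setup, assuming NumPy and SciPy; the dimension d, the scale σ, and the particular draws of w∗ and xi are arbitrary choices made only for illustration, not part of the assignment:

    import numpy as np
    from scipy.stats import laplace, kstest

    rng = np.random.default_rng(0)

    d, sigma = 3, 0.8                      # arbitrary dimension and noise scale
    w_star = rng.normal(size=d)            # fixed "true" linear model w_*
    x_i = rng.normal(size=d)               # one fixed data point x_i
    mu = w_star @ x_i                      # w_*^T x_i, the claimed location

    # repeated draws of y_i = w_*^T x_i + eps_i with eps_i ~ Laplace(0, sigma)
    eps = rng.laplace(loc=0.0, scale=sigma, size=100_000)
    y_i = mu + eps

    # first bullet's claim: y_i ~ Laplace(w_*^T x_i, sigma)
    print("empirical median:", np.median(y_i), "  w_*^T x_i:", mu)
    print("KS p-value:", kstest(y_i, laplace(loc=mu, scale=sigma).cdf).pvalue)
    # the p-value should show no evidence against the claimed distribution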
Problem 2 (20 points). Suppose we run a ridge regression with regularization parameter λ on training data with a single variable, S = {(x1, y1), (x2, y2), . . . , (xn, yn)}, and get coefficient w1 ∈ R (for simplicity, we assume the data are centered, so no bias (intercept) term is needed).
We now include an exact copy of the first feature to get a new training set

    S′ = {([x1, x1]⊤, y1), ([x2, x2]⊤, y2), . . . , ([xn, xn]⊤, yn)},

where each training example is a two-dimensional vector with equal coordinates, refit our ridge regression, and get the solution [w′1, w′2]⊤ ∈ R^2.
• Derive the optimal solution for [w′1, w′2]⊤ and show that w′1 = w′2 (a numerical check of this claim is sketched below).
• What is the relation between w′1 and w1?
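A small numerical check of the claims above, assuming scikit-learn's Ridge (whose objective penalizes λ‖w‖² with no 1/(2n) factor, which may differ from the lecture's exact scaling but does not affect the qualitative conclusion); the synthetic data, sample size, and λ = 1.0 are arbitrary choices for illustration:

    import numpy as np
    from sklearn.linear_model import Ridge

    rng = np.random.default_rng(0)
    n, lam = 200, 1.0                              # arbitrary sample size and lambda
    x = rng.normal(size=(n, 1))
    y = 2.5 * x[:, 0] + 0.1 * rng.normal(size=n)
    x, y = x - x.mean(axis=0), y - y.mean()        # centered data, no intercept

    # original problem: ridge on the single feature
    w1 = Ridge(alpha=lam, fit_intercept=False).fit(x, y).coef_[0]

    # duplicated feature: every example becomes [x_i, x_i]
    x_dup = np.hstack([x, x])
    w1p, w2p = Ridge(alpha=lam, fit_intercept=False).fit(x_dup, y).coef_

    print("w1  =", w1)
    print("w1' =", w1p, " w2' =", w2p)             # expect w1' == w2' up to numerics

Comparing the printed values of w1, w1', and w2' for a few choices of λ can also help you guess (and then verify) the relation asked for in the second bullet.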
Problem 3 (20 points). As we discussed in the lecture, the Perceptron algorithm will only converge if the data is linearly separable, i.e., there exists a linear classifier that can perfectly classify the training data. If the training data S = {(x1, y1), (x2, y2), . . . , (xn, yn)}, where xi ∈ R^d and yi ∈ {−1, +1}, is not linearly separable, then there is a simple trick to force the data to be linearly separable and then apply the Perceptron algorithm, as follows. If you have n data points in d dimensions, map data point xi to the (d + n)-dimensional point [xi, ei]⊤ ∈ R^(d+n), where ei ∈ {0, 1}^n is an n-dimensional vector of zeros, except for the ith position, which is 1 (e.g., e4 = [0, 0, 0, 1, 0, . . . , 0]⊤).
Show that if you apply this mapping, the data becomes linearly separable (you may wish to do so by
providing a weight vector w in (d + n)-dimensional space that successfully separates the data).
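An illustrative sketch of the mapping (editorial, not part of the assignment): it assumes a basic from-scratch Perceptron through the origin and a tiny hand-made non-separable dataset, both arbitrary choices for illustration.

    import numpy as np

    def perceptron(X, y, max_epochs=1000):
        # basic Perceptron through the origin; returns (w, converged), where
        # converged means one full pass over the data made no mistakes
        w = np.zeros(X.shape[1])
        for _ in range(max_epochs):
            mistakes = 0
            for xi, yi in zip(X, y):
                if yi * (w @ xi) <= 0:
                    w += yi * xi
                    mistakes += 1
            if mistakes == 0:
                return w, True
        return w, False

    # a non-separable toy set in d = 1: identical points with opposite labels
    X = np.array([[1.0], [1.0], [-1.0], [-1.0]])
    y = np.array([+1, -1, -1, +1])
    n = X.shape[0]

    print(perceptron(X, y)[1])                 # False: no linear separator exists

    # the mapping from the problem: x_i -> [x_i, e_i] in R^(d+n)
    X_mapped = np.hstack([X, np.eye(n)])
    w, converged = perceptron(X_mapped, y)
    print(converged, w)                        # True, plus one separating vector

Inspecting the learned w (its last n coordinates in particular) may suggest the explicit separating vector the problem asks you to construct.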