CS6140: Machine Learning
Homework Assignment # 4
Problem 1. (15 points) Consider a logistic regression problem where X = R^d and Y = {−1, +1}. Derive the weight update rule that maximizes the conditional likelihood, assuming that a data set D = {(x_i, y_i)}_{i=1}^n is given.
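For intuition about what the derivation should produce, here is a minimal numpy sketch of the resulting gradient-ascent update for y ∈ {−1, +1}; the learning rate eta, the iteration count, and the function names are illustrative choices, not part of the problem statement:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def logistic_ascent(X, y, eta=0.1, n_iters=1000):
        # X: n x d design matrix; y: labels in {-1, +1}
        n, d = X.shape
        w = np.zeros(d)
        for _ in range(n_iters):
            # sigmoid(-y_i * w^T x_i) is the probability the current
            # model assigns to the wrong label for example i
            wrong = sigmoid(-y * (X @ w))
            w += (eta / n) * (X.T @ (y * wrong))  # ascent on the log-likelihood
        return w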
Problem 2. (20 points) Consider a logistic regression problem with its initial solution obtained through OLS regression, i.e., w^(0) = (X^T X)^{-1} X^T y, in the context of the code provided in class (week 6). Recall that x was drawn from a mixture of two Gaussian distributions with dim{x} = 2 (before adding a column of ones) and that y ∈ {0, 1}. You probably noticed that the initial separation line is consistently closer to the data points of class 0.
a) (10 points) Why was this the case? Draw a picture (if possible) to support your argument.
b) (5 points) Devise a better initial solution by modifying the standard formula w^(0) = (X^T X)^{-1} X^T y.
c) (5 points) Now again consider the case where y ∈ {−1,+1}. What is the form of the modified solution
from part (b) in this case?
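For reference, the baseline initialization that part (b) asks you to modify can be computed as in the sketch below, where Xb and y01 are hypothetical names for the design matrix (with the column of ones appended) and the {0, 1} labels from the week-6 code:

    import numpy as np

    def ols_init(Xb, y01):
        # w(0) = (X^T X)^{-1} X^T y, computed with a linear solve
        # rather than an explicit matrix inverse for numerical stability
        return np.linalg.solve(Xb.T @ Xb, Xb.T @ y01)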
Problem 3. (40 points) Consider two classification concepts given in Figure 1, where x ∈ X = [−6, 6] ×
[−4, 4], y ∈ Y = {−1,+1} and p(y|x) ∈ {0, 1} is defined in the drawing.
[Figure 1 appears here: two panels, A) and B), plotted over X = [−6, 6] × [−4, 4], with axis ticks at −6, 0, 6 (horizontal) and −4, 0, 4 (vertical). The upper-left corners marked in the figure are (−2, −1), (2, 1), and (−4, 3) for the 3 × 3 squares of panel A, and (−1, −2), (2, 0), and (−4, 3) for the 1 × 1 squares of panel B; regions inside the squares are marked + and regions outside are marked −.]
Figure 1: Two concepts where examples that fall within any of the three 3 × 3 (panel A) or 1 × 1 (panel B) squares are labeled positive, and the remaining examples (outside each of the squares but within X) are labeled negative. The position of the point x = (x1, x2) in the upper left-hand corner of each square is shown in the picture. Consider the horizontal axis to be x1 and the vertical axis to be x2.
Your experiments in this question will rely on generating a data set of size n ∈ {250, 1000, 10000} drawn from a uniform distribution in X and labeled according to the rules from Figure 1; i.e., P(Y = 1|x) = 1 if the randomly drawn x falls inside any of the three squares of the panel defining the concept, and P(Y = 1|x) = 0 otherwise. The goal of the following two problems will be to train and evaluate classifiers created from the
data generated in this way. You can use any library you want in this assignment and do programming in
Python, MATLAB, R or C/C++. Your code should be easy to run for each question and sub-question below
so that we can replicate your results to the maximum extent possible.
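One possible data-generation sketch in Python is shown below; the square corners are taken from the figure placeholder above and should be verified against your copy of Figure 1, and the helper names are hypothetical:

    import numpy as np

    # Assumed upper-left corners and side lengths; verify against Figure 1.
    SQUARES_A = [((-2, -1), 3), ((2, 1), 3), ((-4, 3), 3)]  # panel A: 3x3
    SQUARES_B = [((-1, -2), 1), ((2, 0), 1), ((-4, 3), 1)]  # panel B: 1x1

    def label(x1, x2, squares):
        # y = +1 inside any square, -1 otherwise
        for (c1, c2), s in squares:  # (c1, c2) is the upper-left corner
            if c1 <= x1 <= c1 + s and c2 - s <= x2 <= c2:
                return 1
        return -1

    def make_dataset(n, squares, seed=0):
        rng = np.random.default_rng(seed)
        X = rng.uniform([-6, -4], [6, 4], size=(n, 2))  # uniform on X
        y = np.array([label(x1, x2, squares) for x1, x2 in X])
        return X, y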
Consider single-output feed-forward neural networks with one or two hidden layers such that the number of
hidden neurons in each layer is h1 ∈ {1, 4, 12} and h2 ∈ {0, 3}, respectively, with h2 = 0 meaning that there
is no second hidden layer. Consider one of the standard objective functions as your optimization criterion
and use early stopping and regularization as needed. Consider a hyperbolic tangent activation function in each hidden neuron and at the output, though you are free to experiment with others if you'd like. For each of the architectures, defined by a parameter combination (h1, h2), evaluate the performance of each model using classification accuracy, balanced accuracy, and area under the ROC curve as your three performance criteria.
To evaluate the performance of your models, use cross-validation. However, to assess the accuracy of this performance evaluation itself, generate another very large data set on a fine grid in X. Then use the predictions from your trained model on all these points to determine the “true” performance. You can threshold your predictions in the middle of your prediction range (i.e., at 0.5 if you are predicting between 0 and 1) to obtain binary predictions from your models, and then compare those with the true class labels you generated on the fine grid.
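A sketch of this evaluation loop, assuming scikit-learn, appears below. Note that MLPClassifier uses tanh in the hidden layers but a logistic output unit, so it only approximates the stated architecture, and the regularization strength, fold count, and iteration cap are illustrative:

    import numpy as np
    from sklearn.neural_network import MLPClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                                 roc_auc_score)

    def evaluate(X, y, h1, h2, X_grid, y_grid):
        layers = (h1,) if h2 == 0 else (h1, h2)
        clf = MLPClassifier(hidden_layer_sizes=layers, activation='tanh',
                            alpha=1e-4, early_stopping=True,
                            max_iter=2000, random_state=0)
        cv_acc = cross_val_score(clf, X, y, cv=5).mean()  # CV estimate
        clf.fit(X, y)
        # "true" performance on the dense grid; threshold mid-range at 0.5
        scores = clf.predict_proba(X_grid)[:, 1]
        preds = np.where(scores >= 0.5, 1, -1)
        return (cv_acc,
                accuracy_score(y_grid, preds),
                balanced_accuracy_score(y_grid, preds),
                roc_auc_score(y_grid, scores))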
Provide meaningful comments about all aspects of this exercise (performance results for different network
architectures, accuracy of cross-validation, run time, etc.). The comments should not just re-state the results
but rather capture trends and give reasoning as to why certain behavior was observed.
Problem 4. (70 points) Matrix factorization with applications. Consider an n × d real-valued data matrix X = (x_1, x_2, ..., x_n)^T; i.e., the i-th row of X is x_i^T. We will attempt to approximate this matrix using the following factorization

    X̂ = U V^T

where U = (u_1, u_2, ..., u_n)^T is an n × k matrix, V = (v_1, v_2, ..., v_d)^T is a d × k matrix of real numbers, and k < min{n, d} is a parameter to be explored and determined. Notice that the value x_ij in X can be approximated by u_i^T v_j and that x_i^T, the i-th row of X, can be approximated by u_i^T V^T, giving x̂_i = V u_i.
We will formulate the matrix factorization process as the following minimization

    min_{U,V} Σ_{i,j} (x_ij − u_i^T v_j)^2 + λ (Σ_i ||u_i||^2 + Σ_j ||v_j||^2)

which minimizes the sum-of-squared-errors between real values x_ij and reconstructed values x̂_ij = u_i^T v_j.
The regularization parameter λ ≥ 0 is user-selected. This problem can be directly solved using gradient descent, but we will attempt a slightly different approach. To do this, we can see that for a fixed V we can find optimal vectors u_i by minimizing

    ||V u_i − x_i||^2 + λ ||u_i||^2

which can be solved in closed form using OLS regression for every i. We can equivalently express the optimization for vectors v_j and find the solution for every j. Then, we can alternate these steps until
convergence. This procedure is called the Alternating Least Squares (ALS) algorithm for matrix factorization.
It has the following steps:
Initialize U and V
repeat
    for i = 1 to n
        u_i = formula #1
    end
    for j = 1 to d
        v_j = formula #2
    end
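A compact numpy sketch of the full procedure is given below. It fills the two inner solves with the standard ridge-regression closed form, which is one natural reading of formulas #1 and #2; derive and verify those formulas yourself rather than taking this sketch as the intended answer:

    import numpy as np

    def als(X, k, lam=0.1, n_iters=50, seed=0):
        n, d = X.shape
        rng = np.random.default_rng(seed)
        U = rng.normal(scale=0.1, size=(n, k))  # initialize U and V
        V = rng.normal(scale=0.1, size=(d, k))
        I = np.eye(k)
        for _ in range(n_iters):
            # V fixed: each u_i minimizes ||V u_i - x_i||^2 + lam ||u_i||^2
            A = V.T @ V + lam * I
            for i in range(n):
                U[i] = np.linalg.solve(A, V.T @ X[i])
            # U fixed: the symmetric update for each v_j
            B = U.T @ U + lam * I
            for j in range(d):
                V[j] = np.linalg.solve(B, U.T @ X[:, j])
        return U, V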