Advanced Machine Learning GR5242
Fall 2023
Homework 2
Due: Friday October 13, 7pm, NY time.
Note: All questions carry equal weight.
Collaboration: You may collaborate on this assignment with peers from any course section. However,
you must write up your own solutions individually for submission.
Homework submission: Please submit your homework by publishing a notebook that cleanly displays your code, results, and plots to PDF or HTML. For the non-coding questions, you may typeset your answers using LaTeX or Markdown, or include neatly scanned answers. Please put everything together into a single PDF file for submission. Late submissions are not accepted.
Problem 1: Loss Function Example*
Note: this is an open-ended question. You get full credit for attempting it.
Give an example of a machine learning model class together with a loss function (defining a risk to minimize), other than the following two examples: linear regression (with square loss) and neural networks (with logistic loss).
Problem 2: Differentiation
In what follows we will use the convention that the gradient is a column vector, which is the transpose of the Jacobian when the output dimension is 1. In particular, any vector x ∈ R^d is treated as a column vector.
(a) Let's consider the function g(x) = Ax, where x ∈ R^d and A ∈ R^{k×d}. In the special case k = 1, show that the gradient of g is equal to A^⊤.
(b) Now consider the case k > 1, where we might write g(x) = Ax = [g_1(x), g_2(x), ..., g_k(x)]^⊤.
Recall that the Jacobian is a generalization of the gradient to multivariate functions g. That is, the Jacobian of g is the matrix of partial derivatives whose (i, j)th entry is ∂g_i(x)/∂x_j. How does the Jacobian matrix relate to the gradients of the components g_i of g? Argue from there that the Jacobian matrix of g above is given as J_g(x) = A.
(c) Now consider the function g(x) = x^⊤Ax, where x ∈ R^d and A ∈ R^{d×d}. Show that the gradient of g is given as ∇g(x) = Ax + A^⊤x (it then follows that when A is symmetric, ∇g(x) = 2Ax).
Hint: You can write x^⊤Ax = Σ_{i,j} A_{i,j} · x_i x_j.
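Before writing the proof, it can be reassuring to check the claimed identity numerically. The sketch below is purely illustrative (it is not part of the assignment): it assumes NumPy, picks an arbitrary matrix A and point x, and compares the closed-form gradient Ax + A^⊤x against a central finite-difference approximation.

import numpy as np

# Finite-difference check of the identity grad(x^T A x) = A x + A^T x.
# A, x, and eps below are arbitrary illustrative choices, not part of the assignment.
rng = np.random.default_rng(0)
d = 4
A = rng.standard_normal((d, d))    # a generic (non-symmetric) matrix
x = rng.standard_normal(d)

g = lambda z: z @ A @ z            # g(x) = x^T A x

closed_form = A @ x + A.T @ x      # gradient claimed in part (c)

eps = 1e-6
numeric = np.zeros(d)
for j in range(d):
    e = np.zeros(d)
    e[j] = eps
    numeric[j] = (g(x + e) - g(x - e)) / (2 * eps)   # central difference

print(np.max(np.abs(closed_form - numeric)))   # should be close to 0 (roughly 1e-9 or smaller)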
Problem 3: Comparing the Gradient Descent and Newton Methods
Consider the following function over x ∈ R^2 (depicted in Figure 1):

F(x) = x^⊤ Σ x + log(1 + exp(−1^⊤ x)),  where 1 = [1, 1]^⊤ and Σ = diag(5, 1/2).
This problem asks you to implement the Gradient Descent and Newton methods in Python to minimize the function F above and, in each case, to plot F(x_t) as a function of the iteration t ∈ {1, 2, ..., 30} (all curves on the same plot for comparison).
Figure 1: Problem 3, a plot of the given function F .
Implementation details: For either method, and for any step size, start at iteration t = 1 from the same point x_1 = (0, 0), for a fair comparison.
Here one can calculate 2/β ≈ 0.195, where β is the smoothness constant of F (obtained from its Hessian). According to theory, the step size η for Gradient Descent should be set to η < 2/β. In the case of Newton's method, which uses a better local approximation, we can use a larger step size.
Therefore, on the same figure, plot F(x_t) against t ∈ {1, 2, ..., 30} for the following 4 configurations, and label each curve appropriately as given below (NM1 through GD0.2):
(NM1) Newton Method with constant step size η = 1
(GD0.1) Gradient Descent Method with constant step size η = 0.1
(GD0.19) Gradient Descent Method with constant step size η = 0.19
(GD0.2) Gradient Descent Method with constant step size η = 0.2
The code for the function F(x), its gradient, and its Hessian matrix is given below.

import math
import numpy as np

Sigma = np.array([[5, 0], [0, 0.5]])
II = np.array([[1, 1]])

# x is a 2-by-1 array; start from x_1 = np.array([[0.0], [0.0]])
def func0(x):
    # F(x) = x^T Sigma x + log(1 + exp(-1^T x)), returned as a float
    return float(np.dot(np.dot(x.transpose(), Sigma), x)) + math.log(1 + math.exp(-float(np.dot(II, x))))

def First_derivative(x):
    # Gradient of F at x, returned as a 2-by-1 array
    x1 = x[0][0]
    x2 = x[1][0]
    ex = math.exp(-(x1 + x2))
    return np.array([[10*x1 - ex/(1+ex)], [x2 - ex/(1+ex)]])

def Second_derivative(x):
    # Hessian of F at x, returned as a 2-by-2 array
    x1 = x[0][0]
    x2 = x[1][0]
    ex = math.exp(-(x1 + x2))
    ex = ex / (1 + ex)**2
    return np.array([[10 + ex, ex], [ex, 1 + ex]])
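For orientation, here is a minimal, non-authoritative sketch of how the two update rules might be wired up using the functions above; the helper name run_descent, the loop structure, and the returned list of F-values are illustrative choices, and producing and labeling the four required curves is left to you.

import numpy as np

def run_descent(eta, use_newton=False, T=30):
    # Illustrative sketch: run T iterations from x_1 = (0, 0) and record F(x_t) at each step.
    x = np.array([[0.0], [0.0]])
    values = []
    for t in range(1, T + 1):
        values.append(func0(x))                                  # F(x_t)
        grad = First_derivative(x)
        if use_newton:
            step = np.linalg.solve(Second_derivative(x), grad)   # Newton direction H(x)^{-1} grad
        else:
            step = grad                                          # plain gradient direction
        x = x - eta * step
    return values

# e.g. curves = {"NM1": run_descent(1.0, use_newton=True), "GD0.1": run_descent(0.1),
#                "GD0.19": run_descent(0.19), "GD0.2": run_descent(0.2)}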
Problem 4: Taylor’s Remainder Theorem
Consider a convex function F : R^d → R. Suppose its Hessian is positive definite, i.e., ∇²F(x) ≻ 0, and is continuous for every x. Assume that F achieves its minimum at x_*. Show that

F(x) − F(x_*) ≤ (x − x_*)^⊤ ∇F(x).

Hint: Use Taylor's Remainder Theorem (which holds, for instance, when ∇²F is continuous).
Problem 5: Regular SGD vs. Adagrad Optimization on a Regression Problem
Consider a random vector X ∼ N(0, Σ) with covariance matrix Σ = diag(σ²_{*1}, σ²_{*2}, ..., σ²_{*d}). Here we let d = 10 and σ_{*i} = 2^{−i} · 10. Let the response Y be generated as

Y = w_*^⊤ X + N(0, σ²),

for an optimal vector w_* = [1, 1, 1, ..., 1]^⊤ (supposed to be unknown), and σ = 1. Consider the following Ridge objective, defined over i.i.d. data {(X_i, Y_i)}_{i=1}^n from the above distribution:

F(w) = (1/n) Σ_{i=1}^n (Y_i − w^⊤ X_i)² + λ‖w‖²,

for a setting of λ = 0.1. We want to minimize F over choices of w to estimate w_*.
You are asked to implement the following two optimization approaches, namely SGD (Stochastic Gradient Descent) and Adagrad, given in the two pseudocodes below. The only difference between the two approaches is in the setting of the step sizes η_t (in the case of Adagrad, η_t is a diagonal matrix of step sizes).
For both procedures, we need the following definition of a stochastic gradient. First write

F(w) = (1/n) Σ_{i=1}^n f_i(w),  that is, we let f_i(w) = (Y_i − w^⊤ X_i)² + λ‖w‖².

Thus, the term f_i(w) depends only on the single data sample (X_i, Y_i). We define the stochastic gradient (evaluated at w) at time t by ∇f_t(w) (that is, at index i = t). At step t in both of the procedures below, when the current iterate is w_t, we use the common notation

∇̃F(w_t) := ∇f_t(w_t).
Algorithm 1: Regular SGD
The steps below use a parameter β to be specified;
At time t = 1: set w_1 = 0;
while t ≤ n do
    Compute ∇̃F(w_t) on the tth datapoint (X_t, Y_t);
    Set η_t = √(1 / (β · σ_t²)), where σ_t² = Σ_{s=1}^t ‖∇̃F(w_s)‖² (sum over past ∇̃'s);
    Update w_{t+1} = w_t − η_t ∇̃F(w_t);
    t = t + 1;
end
Algorithm 2: Adagrad
The steps below use a parameter β to be specified;
At time t = 1: set w_1 = 0;
while t ≤ n do
    Compute ∇̃F(w_t) on the tth datapoint (X_t, Y_t);
    For every i ∈ [d], set η_{t,i} = √(1 / (β · σ_{t,i}²)), where σ_{t,i}² = Σ_{s=1}^t (∇̃_i F(w_s))² (sum over the ith coordinate of past ∇̃'s);
    Update w_{t+1} = w_t − η_t ∇̃F(w_t), where now η_t = diag(η_{t,1}, ..., η_{t,d});
    t = t + 1;
end
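For concreteness, below is a minimal sketch of the Adagrad bookkeeping only, under the assumption that a function stoch_grad(w, x, y) returning ∇̃F(w) for a single sample is available (its exact form is what part (a) asks for). All names here are illustrative, not prescribed by the assignment; Regular SGD differs only in accumulating the full squared norm ‖∇̃F(w_s)‖² into a single scalar.

import numpy as np

def adagrad_sketch(X, Y, beta, stoch_grad):
    # Illustrative sketch of Algorithm 2, assuming stoch_grad(w, x, y) returns the
    # stochastic gradient of F at w for the single sample (x, y).
    n, d = X.shape
    w = np.zeros(d)                          # w_1 = 0
    sq_sum = np.zeros(d)                     # per-coordinate running sum of squared gradients
    iterates = [w.copy()]
    for t in range(n):
        g = stoch_grad(w, X[t], Y[t])
        sq_sum += g ** 2                                 # sigma_{t,i}^2 accumulates (tilde-grad_i)^2
        eta = np.sqrt(1.0 / (beta * sq_sum))             # eta_{t,i} = sqrt(1 / (beta * sigma_{t,i}^2))
        w = w - eta * g                                  # diagonal step size, applied coordinate-wise
        iterates.append(w.copy())
    return iterates

# Regular SGD would instead use a single scalar eta_t = sqrt(1 / (beta * sum_s ||g_s||^2)).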
(a) What is ∇̃F(w_t) in terms of (X_t, Y_t) and w_t?
(b) Implement the above two procedures in Python, using β = 2(σ²_{*1} + λ) (a smoothness measure).
(c) Experiment: Run each of the two procedures on n = 1000 datapoints generated from the above distribution, using the sampler provided below. Repeat 10 times, every time starting at w_1 = 0 as specified. Plot ‖w_t − w_*‖ against t = 1, 2, ..., 1000, averaged over the 10 repetitions. Show error bars (i.e., the standard deviation over repetitions at each t).
Report the plots in (c) for the two procedures on the same figure for comparison.
Data sampler code:

## Code for generator/sampler
import numpy as np
import random
import time
from numpy import linalg as LA
import statistics

# Initialization
sigma = 1          # noise standard deviation
d = 10             # dimension
c_square = 100
cov = np.diag([(0.25 ** i) * c_square for i in range(1, d + 1)])  # diag(sigma_{*i}^2) with sigma_{*i} = 2^{-i} * 10
mean = [0] * d

# Coefficient vector w_* given (all ones)
w = np.array([1] * d)

# Sampler function
def sampler(n):
    # Draw n i.i.d. samples from the model Y = w_*^T X + N(0, sigma^2)
    np.random.seed(int(time.time() * 100000) % 100000)
    X = np.random.multivariate_normal(mean, cov, n)       # X has shape (n, d)
    # np.random.normal expects the standard deviation, so pass sigma (here sigma = 1)
    Y = np.matmul(X, w) + np.random.normal(0, sigma, n)   # Y has shape (n,)
    return (X, Y)
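As a quick, illustrative usage check (the names n_samples, X, Y below are arbitrary), the sampler returns an (n, d) design matrix and a length-n response vector, which your SGD and Adagrad loops can then traverse one sample at a time:

# Minimal usage sketch for the sampler above (names here are illustrative).
n_samples = 1000
X, Y = sampler(n_samples)
print(X.shape, Y.shape)   # expected: (1000, 10) (1000,)

# Each SGD/Adagrad pass would then visit (X[t], Y[t]) for t = 0, ..., n_samples - 1,
# starting from w_1 = 0 as specified in the algorithms.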