Advanced Machine Learning GR5242
Homework 1
Due: Friday September 29, 7pm.
Note: All questions carry equal weight.
Collaboration: You may collaborate on this assignment with peers from any course section. However,
you must write up your own solutions individually for submission.
Homework submission: please submit your homework by publishing a notebook that cleanly displays
your code, results, and plots to PDF or HTML. For the non-coding questions, you can typeset using LaTeX or
Markdown, or you can include neatly scanned answers. Please make sure to put everything together
into a single PDF file for submission. Late submissions are not accepted.
Problem 1: EM algorithm for spherical Gaussians & Lloyd’s algorithm
This problem aims to draw links between EM for Gaussian mixtures and Lloyd’s algorithm. In
particular, we might view Lloyd’s as an approximation to EM as instantiated for mixtures of
spherical Gaussians with densities (on $\mathbb{R}^d$) of the following form:
$$f(x) \;=\; \sum_{k=1}^{K} \pi_k \cdot \mathcal{N}(x; \mu_k, I_d), \qquad \sum_{k} \pi_k = 1,$$
where $I_d$ denotes the $d \times d$ identity matrix.
This instantiation of EM (to estimate the MLE parameters $\{\pi_k, \mu_k\}_{k=1}^{K}$) takes the following form,
given data $\{x_i\}_{i=1}^{n}$:
• Choose some initial $\pi_k^{(0)}$ and $\mu_k^{(0)}$ for $k = 1, \ldots, K$.
• For t = 0, 1, 2, . . ., do:
i. Soft Cluster Assignment: We will let $\tilde{\pi}_{ik}^{(t)}$ denote the "likelihood" that $x_i$ is generated
by Gaussian component $k$. More precisely, we set:
$$\tilde{\pi}_{ik}^{(t)} \;=\; \frac{\pi_k^{(t)} \cdot \mathcal{N}(x_i; \mu_k^{(t)}, I_d)}{\sum_{k'=1}^{K} \pi_{k'}^{(t)} \cdot \mathcal{N}(x_i; \mu_{k'}^{(t)}, I_d)}.$$
ii. Update Cluster Centers: Set $\mu_k^{(t+1)} = \sum_{i=1}^{n} \frac{\tilde{\pi}_{ik}^{(t)}}{\sum_{j=1}^{n} \tilde{\pi}_{jk}^{(t)}}\, x_i$, i.e., a weighted average according to the likelihood that $x_i$ was generated by component $k$.
iii. Update Cluster Probability: Set $\pi_k^{(t+1)} = \frac{1}{n} \sum_{i=1}^{n} \tilde{\pi}_{ik}^{(t)}$, so we update our estimates of
the prior probabilities to reflect how much mass belongs to each cluster.
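As an illustration only, here is a minimal NumPy sketch of one such iteration for this spherical-Gaussian model (variable and function names are ours, not part of the assignment):

import numpy as np

def em_step(X, pi, mu):
    """One EM iteration. X: (n, d) data, pi: (K,) weights, mu: (K, d) means."""
    # E-step: soft assignments proportional to pi_k * N(x_i; mu_k, I_d).
    # With identity covariance, N(x; mu, I_d) is proportional to exp(-||x - mu||^2 / 2),
    # and the constant (2*pi)^(-d/2) cancels in the ratio.
    sq_dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # (n, K)
    log_w = np.log(pi)[None, :] - 0.5 * sq_dists
    log_w -= log_w.max(axis=1, keepdims=True)                        # numerical stability
    w = np.exp(log_w)
    w /= w.sum(axis=1, keepdims=True)                                 # soft assignments (n, K)
    # M-step: weighted means (step ii) and updated mixing weights (step iii).
    mu_new = (w.T @ X) / w.sum(axis=0)[:, None]
    pi_new = w.mean(axis=0)
    return pi_new, mu_new, w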
(a) Lloyd's uses a "hard" cluster assignment $w_{ik}^{(t)} = \mathbb{1}\{\mu_k^{(t)} \text{ is closest to } x_i\}$ instead of the $\tilde{\pi}_{ik}^{(t)}$
of EM above. Express $w_{ik}^{(t)}$ in terms of the normal densities $\mathcal{N}(x_i; \mu_l^{(t)}, I_d)$ defined over the
means $\mu_l^{(t)}$, $l = 1, \ldots, K$ at time $t$. Hint: use the $\max(\cdot)$ or $\arg\max(\cdot)$ function.
(b) Write the Lloyd's update for $\mu_k^{(t+1)}$ in terms of the $w_{ik}^{(t)}$ that you derived in the previous part.
How does your expression compare to the corresponding EM update for $\mu_k$?
Problem 2: Optimizing over K in Lloyd’s algorithm
Recall the K-means objective: given a dataset $\{X_i\}_{i=1}^{n}$, find a partition $\{C_i\}_{i=1}^{K}$ of the data,
composed of clusters $C_i$ with centers $\mu_i$, to minimize:
$$L(C) \;=\; \sum_{i=1}^{K} \sum_{X_j \in C_i} \|X_j - \mu_i\|^2.$$
Suppose we optimized this objective over $K$ as well as the $C_i$ and $\mu_i$. What would be the optimal
value of the objective function? Explain what is wrong with this approach.
Problem 3: Using L1 norm in K-means type objective
Suppose we change our clustering objective to use the $L_1$ norm instead of the $L_2$ norm, i.e., we aim to minimize
$$L(\{C_i\}) \;=\; \sum_{i=1}^{K} \sum_{X_j \in C_i} \|X_j - \mu_i\|_1 \quad \text{instead of} \quad L(\{C_i\}) \;=\; \sum_{i=1}^{K} \sum_{X_j \in C_i} \|X_j - \mu_i\|_2^2.$$
Would it still make sense, in each iteration of Lloyd's algorithm, to assign the mean of all $X_j \in C_i$
to $\mu_i$? If not, what value would you want to assign to $\mu_i$ at each iteration?
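For concreteness, here is a small NumPy helper (ours, not part of the assignment) that evaluates both objectives for a given assignment of points to centers:

import numpy as np

def clustering_costs(X, centers, assign):
    """X: (n, d) data, centers: (K, d), assign: (n,) cluster index for each point."""
    diffs = X - centers[assign]                    # residual of each point to its center
    l2_cost = np.sum(np.sum(diffs ** 2, axis=1))   # sum of squared L2 norms
    l1_cost = np.sum(np.abs(diffs))                # sum of L1 norms
    return l2_cost, l1_cost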
Problem 4: Bad Local Minima
Consider the clustering problem with K = 2 shown in the plot of Figure 1. Suppose Lloyd’s
algorithm is initialized with centers µ1 and µ2 as depicted. Why could this be problematic? Give
a brief explanation in terms of what we might converge to.
Figure 1: Problem 4
Problem 5: Projection based clustering*
Note: this is an open-ended question. You get full credit for attempting it.
In projection-based clustering, we note that, if the data is well separated into clusters with means
$\mu_1, \ldots, \mu_K$, then the top $K$ eigenvectors of the data covariance matrix, say $(v_1, \ldots, v_K)$, tend
to align with $\mathrm{Span}(\mu_1, \ldots, \mu_K)$. It follows that PCA will approximately preserve the distances
between cluster means.
This intuition (and choice of projection) implicitly assumes that the Euclidean metric is the right
way of measuring distance for our particular data. Where does that assumption come in? What
would you do otherwise, i.e., if it turns out that a different notion of distance $\mathrm{dist}(X_i, X_j)$ were
more appropriate, would you prefer some other transformation over PCA?
Problem 6: Projection-based clustering (continued)
Following the notation from the previous problem, let $V = [v_1, \ldots, v_K] \in \mathbb{R}^{d \times K}$ be the matrix of
top principal directions, with $\|v_i\| = 1$. Consider two possible transformations of the data:
(i) $\tilde{X}_i \leftarrow V^{\top} X_i$, and (ii) $\hat{X}_i \leftarrow V V^{\top} X_i$.
Would the clusters learnt from the data $\tilde{X}_i$ differ from those learnt from the data $\hat{X}_i$? If we were
to run Lloyd's on the transformed data (i) and separately on the transformed data (ii), would one
be quicker to execute than the other, i.e., is there a difference in runtime?
Problem 7: Comparing Lloyd's algorithm and Projection-based Clustering
You are to make calls to Lloyd's algorithm to cluster data, with and without projection. You
are given a function data_generator(seed, n, dim) that generates n samples from a mixture of
3 Gaussian distributions in dimension dim = 9. The parameter seed specifies the seed used for
random sampling. The function returns two numpy arrays, datapoints and labels, i.e., the generated
samples and their true labels in {1, 2, 3} indicating the Gaussian component (modeling the cluster) each sample came from.
###### Data Generator ######
import numpy as np

dim = 9

def data_generator(seed, n, dim):
    mu1 = [1, 1, 1, 0, 0, 0, 0, 0, 0]
    mu2 = [0, 0, 0, 0, 0, 0, 1, 1, 1]
    mu3 = [0, 0, 0, 1, 1, 1, 0, 0, 0]
    np.random.seed(seed)
    datapoints = np.zeros((n, dim))
    labels = np.zeros(n)
    for i in range(n):
        rand_int = np.random.choice([1, 2, 3])
        if rand_int == 1:
            datapoints[i] = np.random.multivariate_normal(
                mu1, np.diag([1, 1, 1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]))
            labels[i] = rand_int
        elif rand_int == 2:
            datapoints[i] = np.random.multivariate_normal(
                mu2, np.diag([0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 1, 1, 1]))
            labels[i] = rand_int
        elif rand_int == 3:
            datapoints[i] = np.random.multivariate_normal(
                mu3, np.diag([0.1, 0.1, 0.1, 1, 1, 1, 0.1, 0.1, 0.1]))
            labels[i] = rand_int
    return datapoints, labels
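A quick sanity check of the generator (our own snippet, assuming the cell above has been run):

datapoints, labels = data_generator(seed=1, n=100, dim=9)
print(datapoints.shape, labels.shape)   # (100, 9) (100,)
print(np.unique(labels))                # the three component labels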
In the rest of the problem, we will explore the effect of PCA on clustering performance.
(a) For each n (the number of samples) taking values in 100, 200, 300, 400, 500, generate 100
datasets of size n using data_generator(seed, n, dim), with a different seed each time (for
example, seed ∈ {1, . . . , 100}). Cluster each dataset using Lloyd's algorithm with 3
clusters (K = 3), and calculate the Rand Index. Plot the average Rand Index (over the 100
datasets for each value of n) against the different values of n, with error bars (corresponding
to the standard deviation of the Rand Index for each n).
Hint: you can use sklearn.cluster.KMeans to run Lloyd's algorithm and
sklearn.metrics.adjusted_rand_score() to calculate the Rand Index. (A minimal skeleton for parts (a)-(b) is sketched after part (c) below.)
(b) Now, perform the same analysis, but project each dataset onto its 3 principal components
before running Lloyd's algorithm. Again, plot the average Rand Index against the different
values of n with error bars.
(c) Superimpose the two plots to compare the performance of Lloyd's algorithm with and
without projecting the data. What can you say about the performance of both methods?
How do the standard deviations of performance (Rand Index) compare?
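A minimal skeleton for parts (a)-(b), assuming scikit-learn is available (this is our own sketch, not required code; names and parameter choices are illustrative):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score

def average_rand_index(n_values=(100, 200, 300, 400, 500), n_seeds=100, project=False):
    """Mean and std of the (adjusted) Rand Index over n_seeds datasets per value of n."""
    means, stds = [], []
    for n in n_values:
        scores = []
        for seed in range(1, n_seeds + 1):
            X, y = data_generator(seed, n, 9)
            if project:
                X = PCA(n_components=3).fit_transform(X)               # part (b): 3 PCs
            pred = KMeans(n_clusters=3, n_init=10).fit_predict(X)      # Lloyd's with K = 3
            scores.append(adjusted_rand_score(y, pred))
        means.append(np.mean(scores))
        stds.append(np.std(scores))
    return np.array(means), np.array(stds)

matplotlib's plt.errorbar(n_values, means, yerr=stds) can then produce the requested plots, and superimposing the project=False and project=True curves addresses part (c).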
Problem 8: Comparing K-means (Lloyd’s), Meanshift and Spectral Clustering on synthetic data
We will compare various clustering methods on the three datasets shown in the scatterplots
below. Use the scikit-learn module in Python to import the MeanShift, K-means, and Spectral Clustering
algorithms (for more information, please refer to the tutorials on sklearn.cluster.MeanShift,
sklearn.cluster.KMeans, and sklearn.cluster.SpectralClustering).
from sklearn.cluster import MeanShift , KMeans , SpectralClustering
For each of the questions below, you should display the resulting clustering as a scatter plot with
different colors for each cluster (see code below).
(a) Run K-means on the 3 datasets. Why does it perform well on some datasets and poorly on others?
(b) Run MeanShift on the 3 datasets. Why does it perform well on some datasets and poorly on others?
(c) Run Spectral clustering on the 3 datasets. Why does it perform well on some datasets and poorly on others?
Some instructions on implementation:
• Set the bandwidth h to the 20% quantile of interpoint distances when using MeanShift.
• Use the attached code for generating the data.
• Use the attached code as a reference for plotting your results; you are also free to plot in your own style.
• Set K = 2 for cases 1-2 and K = 3 for case 3 when using K-means and Spectral clustering.
(A minimal invocation sketch appears after the plotting code below.)
Figure 2: Dataset 1 Figure 3: Dataset 2 Figure 4: Dataset 3
Generating the 3 datasets:
import random
import math
import numpy as np

random.seed(0)

## Data set 1: two concentric rings
X1 = []
for i in range(1000):
    theta = random.uniform(0, 2 * math.pi)
    radius = random.gauss(0, 0.2) + random.choice([1, 3])
    X1.append([radius * math.cos(theta), radius * math.sin(theta)])
X1 = np.array(X1)

## Data Set 2: two offset arcs
X2 = []
for i in range(1000):
    theta = random.uniform(0, 2 * math.pi)
    radius = random.gauss(0, 0.1) + 2
    # NOTE: the condition below was garbled in the original handout;
    # splitting on theta < pi is an assumed reconstruction.
    if theta < math.pi:
        X2.append([radius * math.cos(theta) - 1, radius * math.sin(theta)])
    else:
        X2.append([radius * math.cos(theta) + 1, radius * math.sin(theta)])
X2 = np.array(X2)

## Data Set 3: three blobs around different centers
X3 = []
for i in range(1000):
    radius = random.gauss(0, 1)
    theta = random.uniform(0, 2 * math.pi)
    center = random.choice([[0, 1], [3, 3], [1, -3]])
    X3.append([radius * math.cos(theta) + center[0],
               radius * math.sin(theta) + center[1]])
X3 = np.array(X3)
Plotting Clusters:
## assume ’label_pred ’ is list of group labels , use following code as a
reference for plotting the result.
import matplotlib.pyplot as plt
mark = [’or’, ’ob’, ’oy’, ’g’,’c’,’m’,’k’,’w’]
j = 0
for i in label_pred:
plt.plot([X[j:j+1,0]], [X[j:j+1,1]], mark[i], markersize = 5)
j +=1
plt.show()
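For reference, a minimal sketch (ours, not prescribed by the assignment) of how the three methods could be invoked on dataset 1; sklearn.cluster.estimate_bandwidth with quantile=0.2 is one way to approximate the 20% quantile of interpoint distances, and the affinity choice for spectral clustering is our assumption:

from sklearn.cluster import MeanShift, KMeans, SpectralClustering, estimate_bandwidth

# Bandwidth for MeanShift: roughly the 20% quantile of interpoint distances.
h = estimate_bandwidth(X1, quantile=0.2)

labels_ms = MeanShift(bandwidth=h).fit_predict(X1)
labels_km = KMeans(n_clusters=2, n_init=10).fit_predict(X1)      # K = 2 for dataset 1
labels_sc = SpectralClustering(n_clusters=2,
                               affinity='nearest_neighbors').fit_predict(X1)
# Each label array can then be plotted with the snippet above (label_pred = labels_ms, etc.).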
Problem 9: Kernel Density Estimator
Suppose $K(u) \geq 0$ with $\int K(u)\,du = 1$ is to be used as a kernel in the following density estimate:
$$\hat{f}(x) \;=\; \frac{1}{n h^d} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right),$$
for a bandwidth parameter $h > 0$. Show that $\int \hat{f}(x)\,dx = 1$, i.e., that $\hat{f}$ is a valid density, whenever $K$
satisfies the above.
Hint: Remember that for a high-dimensional change of variables $x \to u$, with $x, u \in \mathbb{R}^d$, if
$x = au + b$, $a, b \in \mathbb{R}$, then $du = (a^{-1})^d\, dx$.
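For concreteness, the hint amounts to the following change-of-variables identity (stated here for a generic integrable function $g$, with $a = h$ and $b = x_i$):
$$\int_{\mathbb{R}^d} g\!\left(\frac{x - x_i}{h}\right) dx \;=\; h^d \int_{\mathbb{R}^d} g(u)\, du, \qquad u = \frac{x - x_i}{h}.$$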