MA4601/MAT061 Stochastic Search and Optimisation
Assignment 3: Clustering with Cross-Entropy
This assignment investigates the use of the Cross-Entropy algorithm for clustering.
The report should be typed in a 12 point font and should be no more than four pages long (excluding figures).
Your code must be submitted as a single file, and must be properly documented, including
clear instructions on how to run the code. Data files may be included separately. See the module
website for restrictions on the external libraries you can use, and note that Jupyter notebooks
are not accepted.
Very important: your assessment is based on the report! You must submit your code,
and I will look at it to check how well you have documented your additions, but I expect you
to explain what you have done and present your results in the report. I will usually only run
your code if it is not clear from the report what is going on, in which case you will probably
lose marks. Of course, should I choose to run your code, it must be capable of reproducing
the results presented in the report.
Read Section 5.1 of De Boer, Kroese, Mannor, Rubinstein (2005) A tutorial on the Cross-
Entropy method. Annals of Operations Research 134, pp. 19–67.
Suppose that we are given a set of n points X ⊂ R^d that we wish to partition into k disjoint
subsets X_1, ..., X_k in order to minimise or maximise some cluster measure S(X_1, ..., X_k).
Writing X = {x_1, ..., x_n}, we can represent a partition by a vector a = (a_1, ..., a_n) ∈ {1, ..., k}^n,
where a_i = j when x_i ∈ X_j. With this notation we suppose that S is a function from
Ω := {1, ..., k}^n into [0, ∞).
For the CE method we will use a family of probability distributions on Ω = {1, ..., k}^n,
indexed by matrices P = (p_{ij}), i = 1, ..., n, j = 1, ..., k, where

    f(a; P) = \prod_{i=1}^{n} p_{i, a_i}.

P is constrained to be non-negative, with row sums equal to 1. We interpret p_{ij} as the
probability that point i is in cluster j.
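As a sketch of how one might sample from this family (assuming NumPy is among the permitted libraries — check the module website — and with function names of my own choosing, labelling clusters 0, ..., k−1 rather than 1, ..., k):

```python
# Sketch: drawing one partition vector a from f(.; P), where row i of P
# gives the cluster-membership probabilities for point i (an assumption of
# this illustration: clusters are labelled 0..k-1).
import numpy as np

def sample_partition(P, rng):
    """Draw a = (a_1, ..., a_n) with P(a_i = j) = p_ij, independently per row."""
    n, k = P.shape
    return np.array([rng.choice(k, p=P[i]) for i in range(n)])

rng = np.random.default_rng(0)
P = np.ones((5, 3)) / 3          # uniform start: every cluster equally likely
a = sample_partition(P, rng)     # one random partition of 5 points into 3 clusters
```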
1. (4 marks) For l = 1, ..., M let a(l) = (a_1(l), ..., a_n(l)) be an element of Ω. Prove that
the matrix P that solves

    \max_P \sum_{l=1}^{M} \log f(a(l); P)

is given by

    p_{ij} = \frac{1}{M} \#\{l : a_i(l) = j\}.
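This maximiser can be checked numerically (a sketch assuming NumPy is permitted; the variable names are mine): the empirical-frequency matrix attains a log-likelihood at least as large as any other row-stochastic matrix, illustrated here against one alternative candidate.

```python
# Sketch: numerical check that p_ij = (1/M) #{l : a_i(l) = j} maximises
# the log-likelihood sum_l log f(a(l); P). Labels run 0..k-1 here.
import numpy as np

rng = np.random.default_rng(1)
n, k, M = 6, 3, 200
A = rng.integers(0, k, size=(M, n))   # rows are the samples a(1), ..., a(M)

# claimed maximiser: per-point empirical label frequencies
P_hat = np.stack([np.bincount(A[:, i], minlength=k) / M for i in range(n)])

def loglik(P):
    """sum_l sum_i log p_{i, a_i(l)} -- the objective from Question 1."""
    return sum(np.log(P[i, A[l, i]]) for l in range(M) for i in range(n))

P_alt = 0.9 * P_hat + 0.1 / k         # another valid row-stochastic matrix
```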
2. (3 marks) A popular cluster measure is the within-groups sum of squares

    S_1(X_1, ..., X_k) = \sum_{j=1}^{k} \sum_{x, y \in X_j} \|x - y\|_2^2.    (1)

Here ‖·‖_2 is the Euclidean norm.
Calculating S_1 using formula (1) requires O(n^2) operations, where n = |X| is the total
number of points. By using the centroids

    c_j = \frac{1}{|X_j|} \sum_{x \in X_j} x,

show how to calculate S_1 in O(n) operations.
This is one reason why the within-groups sum of squares is a popular cluster measure.
3. (3 marks) Explain why minimising S_1 is equivalent to maximising the between-groups
sum of squares

    S_2(X_1, ..., X_k) = \sum_{j=1}^{k} \sum_{x \in X_j, \, y \notin X_j} \|x - y\|_2^2.
4. (8 marks) Now consider maximising the following cluster measure:

    S_3(X_1, ..., X_k) = \sum_{j=1}^{k} \sum_{x \in X_j, \, y \notin X_j} \|x - y\|_1,

where ‖·‖_1 is the L1 norm.
Implement the CE algorithm using this objective function and apply it to the data set
jain.txt, available on Learning Central (taken from Jain & Law (2005) Data Clustering:
A User’s Dilemma). Do it for k = 2, 3, 4 and graph your solutions using different coloured
points to indicate the clusters.
Make sure that you write your code so that ‖x − y‖_1 is calculated only once for each pair
x, y ∈ X.
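One way to achieve this (a sketch assuming NumPy is permitted; the function names are mine) is to precompute the full matrix of pairwise L1 distances once, before the CE loop, so that every evaluation of S_3 is just an indexed sum:

```python
# Sketch: precompute all pairwise L1 distances once, then evaluate S_3 by
# indexing into the stored matrix rather than recomputing distances.
import numpy as np

def l1_matrix(X):
    """D[i, j] = ||x_i - x_j||_1 for all pairs, computed in a single pass."""
    # broadcasting: (n,1,d) - (1,n,d) -> (n,n,d); sum |.| over coordinates
    return np.abs(X[:, None, :] - X[None, :, :]).sum(axis=2)

def S3(D, a, k):
    """Between-groups L1 sum: for each cluster j, add D over rows in j
    and columns not in j (matching the double sum in the definition)."""
    return sum(D[np.ix_(a == j, a != j)].sum() for j in range(k))

X = np.array([[0., 0.], [1., 0.], [0., 2.]])
D = l1_matrix(X)                 # computed once, reused for every candidate a
a = np.array([0, 0, 1])          # points 0,1 in cluster 0; point 2 in cluster 1
```

Note that, as in the definition of S_3, each unordered cross-cluster pair is counted once from each side.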
Include a visualisation to check the convergence of P .
How does the quality of your solution depend on the number of solutions generated at each
iteration, the proportion of these selected for updating P , and the smoothing parameter?
Your answer should be backed up by numerical results.
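For reference, one CE iteration combining the three parameters in question might be sketched as follows (assuming NumPy is permitted; the function name, argument names, and 0-based cluster labels are my own conventions, not prescribed by the assignment):

```python
# Sketch of one CE update: keep the elite fraction rho of sampled partitions
# by score (S_3 is maximised, so larger is better), re-estimate P from the
# elite samples via empirical frequencies, then smooth with parameter alpha.
import numpy as np

def ce_update(P, samples, scores, rho, alpha):
    """samples: (M, n) array of partition vectors; scores: their S_3 values."""
    M, n = samples.shape
    k = P.shape[1]
    n_elite = max(1, int(np.ceil(rho * M)))
    elite = samples[np.argsort(scores)[-n_elite:]]        # best-scoring samples
    P_elite = np.stack([np.bincount(elite[:, i], minlength=k) / n_elite
                        for i in range(n)])
    return alpha * P_elite + (1 - alpha) * P              # smoothed update

rng = np.random.default_rng(2)
P = np.ones((4, 3)) / 3
samples = rng.integers(0, 3, size=(10, 4))
scores = rng.random(10)
P_new = ce_update(P, samples, scores, rho=0.2, alpha=0.7)
```

A convenient convergence check for the visualisation asked for above is to track, for instance, max_j p_ij for each row i across iterations: these approach 1 as the distribution degenerates onto a single partition.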
2 marks are reserved for presentation, clarity of expression, and documentation of code.