Clustering Analysis
To carry out this assessed coursework, you will need the Python notebook, clustering.ipynb,
which contains code to get you started with each of the assignments, along with several data files
required by different assignments (see the related assignments for details of those data sets).
All of these are available in a zipped file on BlackBoard alongside this document.
Clustering analysis is a classical topic in unsupervised learning that aims to learn a high-level
summary of data, providing a representation of the data from a global perspective. In this
coursework, you are asked to implement state-of-the-art clustering algorithms in Python and apply
your implementations, along with the provided Python implementations of clustering algorithms, to
several meaningful synthetic datasets for clustering analysis.
1 K-means Clustering Analysis
K-means is one of the most commonly used clustering analysis algorithms. In this work, you are
asked to apply the provided K-means functions to synthetic datasets to understand the algorithm's
behaviour and how to use a cluster validity index to decide the proper number of clusters
underlying a data set.
Assignment 1 [3 marks] Apply the built-in K-means function, sklearn.cluster.KMeans, in the
scikit-learn library with Euclidean distance to the 2-D dataset of 3 clusters, kmeans_data_1.npy,
with three different initial conditions given in the notebook, clustering.ipynb. In your answer
notebook, (a) implement the function, partition, that produces a partition for a given dataset so
that the data points in this dataset are grouped into clusters based on their closest centroids; all
the points in each cluster are assigned the same label specified by their corresponding centroid; (b)
apply the built-in K-means function to the dataset mentioned above for clustering analysis with
each of the 3 given initialisation settings; and (c) based on the clustering results achieved in (b),
use your partition function implemented in (a) to create 6 scatter plots corresponding to the 3
initial partitions and the 3 final partitions. Display the 6 plots in a 2×3 grid: the first row shows
the 3 initial partitions and the second row shows the 3 final partitions, each aligned with its
corresponding initial partition in the first row. In each scatter plot, you must mark the cluster
centroids with a red "+", all the data points in a cluster must be marked with the same colour (but
different from red so that the centroid can be seen clearly), and different clusters must be
indicated by different colours. (Hint: to produce the display format described above, you may use
the built-in function, matplotlib.pyplot.subplot.)
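The nearest-centroid assignment described in (a), and its use on initial and final centroids in (b) and (c), can be sketched as follows. This is only an illustrative sketch, not the notebook's reference code: the signature partition(data, centroids), the helper plot_partition, the synthetic blobs and the initial centroids are all assumptions made here, since the real data and initial conditions come from kmeans_data_1.npy and clustering.ipynb.

```python
import numpy as np
from sklearn.cluster import KMeans


def partition(data, centroids):
    """Label each point with the index of its nearest centroid (Euclidean).

    Assumed signature; the notebook may define the interface differently.
    """
    data = np.asarray(data, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    # Pairwise squared distances, shape (n_points, n_centroids).
    diffs = data[:, None, :] - centroids[None, :, :]
    sq_dists = np.sum(diffs ** 2, axis=2)
    return np.argmin(sq_dists, axis=1)


def plot_partition(ax, data, labels, centroids):
    """Hypothetical helper: draw one partition on a Matplotlib axes `ax`."""
    ax.scatter(data[:, 0], data[:, 1], c=labels, cmap="viridis", s=10)
    # Centroids as red "+" markers, drawn on top of the data points.
    ax.scatter(centroids[:, 0], centroids[:, 1], c="red", marker="+", s=120)


# Stand-in data: three Gaussian blobs (NOT the contents of kmeans_data_1.npy).
rng = np.random.default_rng(0)
blob_centres = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in blob_centres])

# Hypothetical initial centroids; the notebook supplies the real ones.
init_centroids = np.array([[1.0, 1.0], [4.0, 4.0], [1.0, 4.0]])

# n_init=1 because the starting centroids are given explicitly.
km = KMeans(n_clusters=3, init=init_centroids, n_init=1).fit(X)

initial_labels = partition(X, init_centroids)       # partition before any update
final_labels = partition(X, km.cluster_centers_)    # partition after convergence
```

For the 2×3 display, an axes object such as the one returned by matplotlib.pyplot.subplot(2, 3, i) could be passed to plot_partition, with i = 1..3 for the initial partitions and i = 4..6 for the final ones.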
Assignment 2 [3 marks] The K-means algorithm cannot be used until the hyperparameter K (the
number of clusters) is set. In a real application, however, the number of clusters is often
unknown. In this circumstance, the scatter-based F-ratio index¹ may be applied to decide the
number of clusters. In your answer notebook, (a) implement the scatter-based F-ratio index in Python
where Euclidean distance is used; (b) for K = 2, 3, ..., 10, run the K-means built-in function,
sklearn.cluster.KMeans, in the scikit-learn library with Euclidean distance on the dataset,
kmeans_data_2.npy (for each K, you must run K-means with 3 different random initialisation
conditions set by yourself), then plot the F-ratio index (y-axis) versus K (x-axis) (in this plot, for
each K, use only the smallest F-ratio index value measured on the 3 partitions resulting from the
different initialisation conditions), and report the optimal number of clusters in this data set;
and (c) display the final partition corresponding to the optimal number of clusters you found in (b)
using the same display format described in Assignment 1.

*Assessed Coursework: the deadline and requirements can be found at the end of this document.
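The sweep over K in (b) might be organised as in the sketch below. The definition of the scatter-based F-ratio index is given in the coursework footnote, which is not reproduced here, so the code ASSUMES one commonly used form, F = K · SSW / SSB, where SSW is the sum of squared Euclidean distances from each point to its own centroid and SSB is the size-weighted sum of squared distances from each centroid to the overall data mean. If the footnote defines the index differently, that definition takes precedence. The function name f_ratio, the synthetic data and the seeds are likewise hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans


def f_ratio(data, labels, centroids):
    """ASSUMED definition: K * SSW / SSB with squared Euclidean distances."""
    data = np.asarray(data, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    labels = np.asarray(labels)
    k = centroids.shape[0]
    grand_mean = data.mean(axis=0)
    # Within-cluster scatter: each point against its own centroid.
    ssw = np.sum((data - centroids[labels]) ** 2)
    # Between-cluster scatter: each centroid against the overall mean,
    # weighted by the number of points in that cluster.
    counts = np.bincount(labels, minlength=k)
    ssb = np.sum(counts * np.sum((centroids - grand_mean) ** 2, axis=1))
    return k * ssw / ssb


# Stand-in data: four Gaussian blobs (NOT the contents of kmeans_data_2.npy).
rng = np.random.default_rng(1)
true_centres = np.array([[0.0, 0.0], [6.0, 0.0], [0.0, 6.0], [6.0, 6.0]])
X = np.vstack([c + rng.normal(scale=0.4, size=(60, 2)) for c in true_centres])

ks = list(range(2, 11))
best_f = []  # smallest index value over the 3 initialisations, one per K
for k in ks:
    scores = []
    for seed in (0, 1, 2):  # hypothetical seeds: 3 random initialisations
        km = KMeans(n_clusters=k, init="random", n_init=1,
                    random_state=seed).fit(X)
        scores.append(f_ratio(X, km.labels_, km.cluster_centers_))
    best_f.append(min(scores))
```

Plotting best_f against ks gives the curve requested in (b). Under the assumed definition, smaller values correspond to more compact and better separated clusters, but how the optimal K is read off the curve should follow the footnote's definition.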