STATS 102B: Introduction to Computation and Optimization for
Statistics
Introduction
- The aim of cluster analysis is to create a grouping of objects such that objects within a group are similar and objects in different groups are not similar. Before we look at the method in more detail, we first motivate cluster analysis with an example.
- Customer preference: Imagine you run a large online store and would like to personalise users' shopping experience. You hope that by improving their shopping experience, users will buy more.
- One way to do this is to provide each user with a set of unique recommendations that they see when they access your site. You do not directly know each user's personal preferences and tastes, but you do have lots of data.
- Assuming that we can define a measure of similarity between customers based on their purchasing history, we could use cluster analysis to group customers into K groups.
- Within each group, customers have similar shopping patterns. Differences between customers in the same group could form the basis of a recommender system.
- A recommender system could also be created by clustering the items based on the customers who bought them. If items 1 and 2 were bought by customers A, D, E and G, then they could be considered similar. Customers could then be recommended items that were similar to items they had already bought.
Synthetic dataset for clustering examples
Consider the data shown below. It consists of 200 observations, $x_1, \ldots, x_{200}$, each represented by two attributes, $x_i^\top = [x_{i1}, x_{i2}]$.
[Figure: scatter plot of the synthetic dataset, x2 against x1; both axes range roughly from −5 to 10.]
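The slides do not state how these data were generated; the following is a minimal sketch that simulates a comparable two-dimensional dataset (the number of clusters, the cluster means, the spread, and the seed are all assumptions) and is reused in the later examples:

set.seed(1)                        # assumed seed; the true generating process is unknown
centers <- matrix(c(0, 0,          # assumed cluster means
                    6, 6,
                    8, 0), ncol = 2, byrow = TRUE)
grp <- sample(1:3, 200, replace = TRUE)
X <- centers[grp, ] + matrix(rnorm(400, sd = 1.5), ncol = 2)
colnames(X) <- c("x1", "x2")
plot(X, xlab = "x1", ylab = "x2")  # a scatter plot like the one above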
K-Means Clustering
- This method was presented by MacQueen (1967) in the Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability.
- MacQueen suggests the term K-means for an algorithm that assigns each observation to the cluster having the nearest centroid (mean).
- By clustering the data, we have implicitly defined what similar means: similar observations are those that are close to one another in terms of the Euclidean distance.
- There are other ways of defining similarity that may be more appropriate, for example, the Mahalanobis distance
  $d(x, y) = \sqrt{(x - y)^\top S^{-1} (x - y)}$,
  where $S$ is the sample covariance matrix.
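As a sketch, the Mahalanobis distance between two observations can be computed in R from the sample covariance matrix; here it is applied to the synthetic data X from the previous slide (the choice of the two points is an assumption):

S <- cov(X)                                # sample covariance of the data
x <- X[1, ]; y <- X[2, ]
sqrt(t(x - y) %*% solve(S) %*% (x - y))    # Mahalanobis distance d(x, y)
sqrt(mahalanobis(x, center = y, cov = S))  # same value; mahalanobis() returns d^2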
K-Means Clustering Procedure
1. Partition the observations into K initial clusters.
2. Scan through the list of n observations, assigning each
observation to the cluster whose centroid (mean) is nearest. If
an observation is reassigned, we recalculate the cluster mean or
centroid for the cluster receiving that item and the cluster
losing that item.
3. Repeat Step 2 until no more reassignments are made. A sketch of this procedure in R is given below.
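The following is a minimal sketch of the procedure above, not the optimised implementation in R's built-in kmeans(); the helper name simple_kmeans is an assumption, and empty clusters are not handled:

simple_kmeans <- function(X, K, max_iter = 100) {
  n <- nrow(X)
  means_of <- function(cl)                      # K x p matrix of cluster means
    apply(X, 2, tapply, factor(cl, levels = 1:K), mean)
  cluster <- sample(rep(1:K, length.out = n))   # Step 1: initial partition
  for (iter in 1:max_iter) {
    changed <- FALSE
    for (i in 1:n) {                            # Step 2: scan the n observations
      centroids <- means_of(cluster)            # refreshed after any reassignment
      d <- sqrt(rowSums(sweep(centroids, 2, X[i, ])^2))
      k <- which.min(d)                         # nearest centroid
      if (k != cluster[i]) { cluster[i] <- k; changed <- TRUE }
    }
    if (!changed) break                         # Step 3: stop when stable
  }
  list(cluster = cluster, centroids = means_of(cluster))
}

simple_kmeans(X, K = 3)                         # example call on the synthetic data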
Simple Example
Suppose we measure two variables for each of four observations.
## item X1 X2
## 1 A 3 3
## 2 B 7 9
## 3 C 4 1
## 4 D 3 4
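The printed table can be reproduced with a small data frame (the object name dat is an assumption):

dat <- data.frame(item = c("A", "B", "C", "D"),
                  X1   = c(3, 7, 4, 3),
                  X2   = c(3, 9, 1, 4))
dat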
Our objective is to partition the items into two clusters such that
the observations within a cluster are closer to one another than they
are to the observations in different clusters.
Step 1: Compute Centroid
Assume we initially decide to partition the items into two clusters
(A, B) and (C, D). The centroid for cluster (A, B) is
## X1 X2
## 5 6
The centroid for cluster (C, D) is
## X1 X2
## 3.5 2.5
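Continuing with dat from above, each centroid is just the column means over the rows in that cluster:

xs <- as.matrix(dat[, c("X1", "X2")])
rownames(xs) <- dat$item
colMeans(xs[c("A", "B"), ])   # centroid of (A, B): 5, 6
colMeans(xs[c("C", "D"), ])   # centroid of (C, D): 3.5, 2.5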
Step 2: Compute the Euclidean Distance
Compute the distance between A and the centroid of cluster (A, B)
## [1] 3.605551
Compute the distance between A and the centroid of cluster (C, D)
## [1] 0.7071068
Here, we see that A is closer to the centroid of cluster (C, D) than to that of cluster (A, B). Therefore, item A is reassigned, resulting in the new clusters (B) and (A, C, D).
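These two distances can be verified directly, using xs from the previous step:

a <- xs["A", ]
sqrt(sum((a - colMeans(xs[c("A", "B"), ]))^2))  # 3.605551
sqrt(sum((a - colMeans(xs[c("C", "D"), ]))^2))  # 0.7071068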
Recompute Centroid
The centroid of the singleton cluster (B) is simply B itself:
## [1] 7 9
The centroid of cluster (A, C, D) is
## X1 X2
## 3.333333 2.666667
We then calculate the distance between each observation and each of the cluster centroids. In the matrix below, row 1 contains the distances to the centroid of (B) and row 2 the distances to the centroid of (A, C, D); the columns correspond to items A through D.
## [,1] [,2] [,3] [,4]
## [1,] 7.211 0.00 8.54 6.40
## [2,] 0.471 7.32 1.80 1.37
Every item is nearest the centroid of its current cluster, so no further reassignments are made and the algorithm terminates with clusters (B) and (A, C, D).
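A sketch of the full distance computation, with rows of the result corresponding to the two centroids and columns to items A through D:

cents <- rbind(xs["B", ],                         # centroid of (B)
               colMeans(xs[c("A", "C", "D"), ]))  # centroid of (A, C, D)
D <- apply(xs, 1, function(x) sqrt(rowSums(sweep(cents, 2, x)^2)))
round(D, 3)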
K-Means Example
[Figure: four scatter plots of x2 against x1 showing successive iterations of K-means on the synthetic dataset.]
K-Means Example (Cont.)
[Figure: final K-means clustering of the synthetic dataset, x2 against x1.]
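Plots like these can be produced with R's built-in kmeans(); the choice of K = 3 and the plotting details below are assumptions:

fit <- kmeans(X, centers = 3)          # built-in K-means on the synthetic data
plot(X, col = fit$cluster, xlab = "x1", ylab = "x2")
points(fit$centers, pch = 8, cex = 2)  # mark the fitted centroids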
Comments
- If two or more seed points inadvertently lie within a single cluster, their resulting clusters will be poorly differentiated.
- The existence of an outlier might produce at least one group with very dispersed observations.
- Even if the population is known to consist of K groups, the sampling method may be such that data from the rarest group do not appear in the sample. Forcing the data into K groups would lead to nonsensical clusters.
- It is always a good idea to rerun the algorithm for several choices of starting partition and of K, as in the sketch below.
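In R, kmeans() automates such reruns via its nstart argument, which tries several random initialisations and keeps the solution with the smallest total within-cluster sum of squares; looping over K (the range below is an assumption) helps judge a sensible number of clusters:

for (K in 2:5) {
  fit <- kmeans(X, centers = K, nstart = 25)  # 25 random starts per K
  cat("K =", K, "total within-cluster SS =", fit$tot.withinss, "\n")
}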