A33648 Data Mining and Machine Learning
A33648 Specified calculator
Calculators may be used in this examination but must not be used to store text.
Calculators with the ability to store text should have their memories deleted prior to
the start of the examination.
Special Requirements: None
Department of Electronic, Electrical and Systems
Engineering
Level M
04 30058
Data Mining and Machine Learning
Time Allowed: 2 hours
Answer THREE questions
The allocation of marks within each question is stated in the right-hand
margin.
1. (a) Consider the following six short texts $t_1, t_2, t_3, t_4, t_5, t_6$:
$t_1$: He famously said that his priorities were education, education and
education.
$t_2$: The education secretary was Mr Gove.
$t_3$: Some parents teach their children at home.
$t_4$: Home schooling can be a challenge for parents.
$t_5$: The school education system needs reform.
$t_6$: Education is a priority.
After stop word removal and stemming these become the following
documents:
$d_1$: famous priority educate educate educate
$d_2$: educate secretary gove
$d_3$: parent teach child home
$d_4$: home school challenge parent
$d_5$: school educate system reform
$d_6$: educate priority
(i) Calculate the Inverse Document Frequency (IDF) for each term in the
documents and the TF-IDF weights for each term relative to documents $d_1$
and $d_5$.
[6]
(ii) Now consider the query “the education system”, which, after stop word
removal and stemming, has the form:
$q$: educate system
Calculate the lengths of the documents $d_1$ and $d_5$ and the query $q$.
[4]
(iii) Calculate the similarities between the query $q$ and the documents $d_1$ and
$d_5$.
[4]
(iv) How many times would the term “educate” need to appear in $q$ to make $d_1$
more similar than $d_5$ to $q$?
[4]
(v) If the terms “teach” in $d_3$ and “school” in $d_4$ are both replaced by the term
“educate”, how many times would the term “educate” need to appear in $q$
to make $d_1$ more similar than $d_5$ to $q$?
[2]
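The quantities in parts (i) to (iii) can be checked numerically. The following is a minimal sketch rather than a model answer. The question does not state its conventions, so the sketch assumes IDF(t) = ln(N / n_t) with the natural logarithm, a TF-IDF weight equal to the raw term count times the IDF (for the query as well as the documents), length as the Euclidean norm of the weight vector, and cosine similarity. If the course notes define any of these differently, those definitions take precedence. The names `docs`, `tfidf`, `length` and `similarity` are illustrative only.

```python
import math
from collections import Counter

# Stemmed documents from the question, keyed by document index.
docs = {
    1: "famous priority educate educate educate".split(),
    2: "educate secretary gove".split(),
    3: "parent teach child home".split(),
    4: "home school challenge parent".split(),
    5: "school educate system reform".split(),
    6: "educate priority".split(),
}
query = "educate system".split()

# Document frequency: the number of documents that contain each term.
df = Counter(term for words in docs.values() for term in set(words))

# Assumed convention: IDF(t) = ln(N / df(t)), natural logarithm.
idf = {term: math.log(len(docs) / n) for term, n in df.items()}


def tfidf(words):
    """TF-IDF weights (raw count times IDF) for a list of terms.

    Terms absent from the collection get weight 0 (an assumption).
    """
    return {t: c * idf.get(t, 0.0) for t, c in Counter(words).items()}


def length(weights):
    """Euclidean norm of a TF-IDF weight vector."""
    return math.sqrt(sum(w * w for w in weights.values()))


def similarity(a, b):
    """Cosine similarity between two TF-IDF weight vectors."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    return dot / (length(a) * length(b))


q = tfidf(query)
for i in (1, 5):
    d = tfidf(docs[i])
    print(f"d{i}: length={length(d):.4f} similarity={similarity(q, d):.4f}")
print(f"q: length={length(q):.4f}")
```

Parts (iv) and (v) can be explored with the same functions by changing how often "educate" appears in `query`, or by editing `docs[3]` and `docs[4]` and recomputing `df` and `idf`.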
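2. (a) Let $X = \{x_1, \dots, x_8\}$ be the following set of points in the plane:
$$x_1 = \begin{bmatrix} -3 \\ 3 \end{bmatrix},\quad x_2 = \begin{bmatrix} 1 \\ 3 \end{bmatrix},\quad x_3 = \begin{bmatrix} -1 \\ 2 \end{bmatrix},\quad x_4 = \begin{bmatrix} 2 \\ 4 \end{bmatrix},$$
$$x_5 = \begin{bmatrix} -2 \\ 0 \end{bmatrix},\quad x_6 = \begin{bmatrix} 3 \\ 4 \end{bmatrix},\quad x_7 = \begin{bmatrix} -1 \\ 3 \end{bmatrix},\quad x_8 = \begin{bmatrix} 2 \\ 2 \end{bmatrix},$$
and let $C = \{c_1, c_2\}$ be two centroids:
$$c_1 = \begin{bmatrix} 1 \\ 3.5 \end{bmatrix},\quad c_2 = \begin{bmatrix} 2 \\ 3 \end{bmatrix},$$
that represent $X$.
(i) Calculate the new values of $c_1$ and $c_2$ after one iteration of $k$-means
clustering ($k = 2$). Use the Euclidean ($L_2$) metric for distance calculations.
You must show the steps in your calculations.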
[5]
(ii) Carefully draw a scatter plot of the points $X$, the original centroids and the
new centroids. Based on this plot, would the centroids change or remain
unchanged after a further iteration of $k$-means clustering? Explain your
answer.
[4]
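One iteration of the procedure in part (i) can be sketched as follows, as a check on hand calculations rather than a substitute for the working the question asks for. The sketch assumes the usual two-step iteration: assign each point to its nearest centroid under Euclidean distance, then replace each centroid with the mean of its assigned points. The function name `kmeans_step` and the handling of an empty cluster are illustrative choices, not part of the question.

```python
import math

# Points x_1..x_8 and initial centroids c_1, c_2 from the question.
points = [(-3, 3), (1, 3), (-1, 2), (2, 4), (-2, 0), (3, 4), (-1, 3), (2, 2)]
centroids = [(1.0, 3.5), (2.0, 3.0)]


def kmeans_step(points, centroids):
    """Run one k-means iteration; return (new_centroids, clusters)."""
    clusters = [[] for _ in centroids]
    for p in points:
        # Index of the nearest centroid under Euclidean distance.
        nearest = min(range(len(centroids)),
                      key=lambda j: math.dist(p, centroids[j]))
        clusters[nearest].append(p)

    new_centroids = []
    for old, members in zip(centroids, clusters):
        if members:
            # Component-wise mean of the points assigned to this centroid.
            new_centroids.append(
                tuple(sum(coord) / len(members) for coord in zip(*members)))
        else:
            # Assumed choice: an empty cluster keeps its old centroid.
            new_centroids.append(old)
    return new_centroids, clusters


new_centroids, clusters = kmeans_step(points, centroids)
for j, (c, members) in enumerate(zip(new_centroids, clusters), start=1):
    print(f"c{j} -> {c}, assigned points: {members}")
```

Calling `kmeans_step(points, new_centroids)` once more shows whether the assignments, and hence the centroids, change in a further iteration, which is the situation part (ii) asks about.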
(b) Consider 2-dimensional data being modelled using a Gaussian Mixture
Model (GMM) with two mixture components and diagonal covariance
matrices. The parameters of the GMM components are:
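Gaussian 1: mean $\mu_1 = \begin{bmatrix} 4 \\ 6 \end{bmatrix}$, variance $\sigma_1^2 = \begin{bmatrix} 1 \\ 1 \end{bmatrix}$, and weight $w_1 = 0.6$
Gaussian 2: mean $\mu_2 = \begin{bmatrix} 8 \\ 3 \end{bmatrix}$, variance $\sigma_2^2 = \begin{bmatrix} 1 \\ 4 \end{bmatrix}$, and weight $w_2 = 0.4$
Calculate the probability of the sequence of data $\left( \begin{bmatrix} 4 \\ 5 \end{bmatrix}, \begin{bmatrix} 8 \\ 4 \end{bmatrix} \right)$ being
generated by the GMM. Show your working.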
[4]
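The calculation in part (b) can be checked with a short sketch. It assumes that the two data points are generated independently, so the probability of the sequence is the product of the per-point GMM densities, and that "probability" here means the value of the probability density. The names `diag_gaussian_pdf`, `gmm_pdf` and `components` are illustrative.

```python
import math


def diag_gaussian_pdf(y, mean, var):
    """Density at y of a Gaussian with diagonal covariance.

    `var` holds the per-dimension variances (the diagonal entries).
    """
    norm = math.prod(math.sqrt(2 * math.pi * v) for v in var)
    expo = sum((yi - mi) ** 2 / v for yi, mi, v in zip(y, mean, var))
    return math.exp(-0.5 * expo) / norm


# (weight, mean, variance) for each mixture component, from the question.
components = [
    (0.6, (4, 6), (1, 1)),
    (0.4, (8, 3), (1, 4)),
]


def gmm_pdf(y):
    """Weighted sum of the component densities at y."""
    return sum(w * diag_gaussian_pdf(y, m, v) for w, m, v in components)


data = [(4, 5), (8, 4)]

# Assumed independence: the sequence probability is the product over points.
sequence_probability = math.prod(gmm_pdf(y) for y in data)

for y in data:
    print(f"p({y}) = {gmm_pdf(y):.6f}")
print(f"p(sequence) = {sequence_probability:.6f}")
```

Each point lies close to one component mean and far from the other, so a single component dominates each per-point density. This is a useful sanity check on a hand calculation.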
(c) Latent Semantic Analysis (LSA) is applied to a set of documents with a
vocabulary of different words. The Word-Document matrix $W$ is
decomposed into $W = U S V^T$ using Singular Value Decomposition.
(i) In this context, explain what is meant by the Word-Document matrix. [2]
(ii) What are the dimensions and properties of the matrices $U$, $S$ and $V$? [3]
(iii) Explain how the latent semantic classes that emerge from LSA can be
interpreted.