STAT3888: Statistical Machine Learning
Time allowed: One hour (plus 10 minutes reading time)

Note that these are practice questions and do not represent the length of the actual quiz.

1. Short answer section.
   (a) True or false. The training error of the K-nearest neighbour (KNN) classifier with K = 1 is zero. Explain your answer.
   (b) True or false. The depth of an estimated decision tree can be larger than the number of training examples used to create the tree. Explain your answer.
   (c) Suppose you apply clustering to a data set with 5 observations. Explain whether K-means or hierarchical clustering can produce the following 3 clusters: C1 = {1, 3, 4}, C2 = {2, 4}, and C3 = {5}.
   (d) Suppose you have within-cluster correlation. Would it be more appropriate to use K-means clustering or a mixture of normals model to cluster your data?
   (e) True or false. The Bayes error rate is the lowest possible error rate, and can never be zero. Explain your answer.
   (f) True or false. Regression trees can only model constant functions. Explain your answer.
   (g) What distribution is used to model the elements of the matrix in non-negative matrix factorization?
   (h) What distribution is used to model the space of the low-dimensional projection used in t-SNE?
   (i) State the difference between technically correct and consistent data as described in the course notes.
   (j) State a potential problem with removing samples corresponding to categorical variables with a low number of instances.

2. [Unsupervised learning]
   (a) Six variables were measured on 100 genuine and 100 counterfeit old Swiss 1000-franc bank notes. The data stem from Flury and Riedwyl (1988). The columns correspond to the following 6 variables.
           X1: Length of the bank note,
           X2: Height of the bank note, measured on the left,
           X3: Height of the bank note, measured on the right,
           X4: Distance of inner frame to the lower border,
           X5: Distance of inner frame to the upper border,
           X6: Length of the diagonal.
       Suppose that the data have been centered so that the mean vector is X̄ = 0. The vector of eigenvalues of the sample covariance matrix S is

           (2.985, 0.931, 0.242, 0.194, 0.085, 0.035)^T.

       The eigenvectors of S are given by the columns of the matrix

           U =
               -0.044   0.011   0.326   0.562  -0.753   0.098
                0.112   0.071   0.259   0.455   0.347  -0.767
                0.139   0.066   0.345   0.415   0.535   0.632
                0.768  -0.563   0.218  -0.186  -0.100  -0.022
                0.202   0.659   0.557  -0.451  -0.102  -0.035
               -0.579  -0.489   0.592  -0.258   0.085  -0.046

       The first two principal components are plotted below, where 1 indicates a counterfeit bank note and 0 indicates a genuine bank note.

       [Figure: "Principal components": scatter plot of the 2nd principal component against the 1st principal component, with each note plotted as 0 (genuine) or 1 (counterfeit).]

       Using the above information, answer the questions below:
       i. What percentage of the total variability is explained by each of the first two principal components?
       ii. Write equations for the first two principal components as functions of X1, ..., X6.
       iii. Hence, interpret the first two principal components. Use the plot of the first two principal components to distinguish genuine from counterfeit bank notes in terms of the original variables.
       iv. Without knowing the variances of the measurements explicitly, would you anticipate large differences between a principal component analysis based on the covariance matrix and one based on the correlation matrix for this particular dataset?
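       For Question 2(a)i., a minimal R sketch of the calculation, using only the eigenvalues quoted above:

       # Eigenvalues of S from Question 2(a)
       lambda <- c(2.985, 0.931, 0.242, 0.194, 0.085, 0.035)

       # Each principal component explains lambda_k / sum(lambda) of the total variance
       prop <- lambda / sum(lambda)
       round(100 * prop[1:2], 1)        # percentages for the first two PCs
       round(100 * sum(prop[1:2]), 1)   # combined percentage for PC1 and PC2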
   (b) Suppose that you have the following 6 observations and want to apply K-means with K = 2.

           Obs.   X1   X2
            1      2    5
            2      1    2
            3      0    3
            4      4    1
            5      5    2
            6      3   -1

       i. Suppose the initial cluster assignment is C1 = {1, 3, 4} and C2 = {2, 5, 6}. Calculate the cluster means.
       ii. Given the cluster means from part i., recalculate the cluster assignments.
   (c) Consider the following dissimilarity matrix for use in hierarchical clustering.

           D =
                0     0.4   0.2   0.9
                0.4   0     0.9   0.7
                0.2   0.9   0     0.5
                0.9   0.7   0.5   0

       i. Merge the two sets containing the two points which are most similar.
       ii. Recalculate the dissimilarity matrix using average linkage after the merge in part i.
       iii. What are the next two sets which should be merged?
       iv. Use these to construct a hierarchical clustering dendrogram.

3. [Classification]
   (a) Suppose that we have two classes Y = 0 and Y = 1 with densities f0(x) and f1(x) respectively, where

           f0(x) = 1/10   if x ∈ [0, 2]
                   2/5    if x ∈ [4, 6]
                   0      otherwise

       and f1(x) is a uniform density on [0, 6].
       i. Find the Bayes classifier and the corresponding Bayes error when P(Y = 0) = 1/2 and P(Y = 1) = 1/2.
       ii. Find the Bayes classifier and the corresponding Bayes error when P(Y = 0) = 1/6 and P(Y = 1) = 5/6.
   (b) Two colleagues are arguing over who has built the best classifier for a particular task, and they ask you to intervene. The first colleague shows you their code and you notice that they have performed variable selection outside the cross-validation loop, used a neural network classifier, and obtained a cross-validation error of 20%. The second colleague shows you their code and you notice that they have performed variable selection inside the cross-validation loop, used a diagonal discriminant analysis classifier, and obtained a cross-validation error of 30%. Which colleague is right? Why?
   (c) Consider again the bank notes example in Question 2(a). Use the R code below to answer the following questions.
       i. Suppose that you observe a bank note with measurements

              X = (215.0, 130.0, 129.6, 7.7, 10.5, 140.7).

          How would you classify such a bank note using the classification tree depicted in the figure below?
       ii. Based on the code below, what is the cross-validation error?

       > library(rpart)
       > library(rpart.plot)
       > library(cvTools)
       > dat = read.table("BankNotes.txt")
       > X = data.matrix(dat)
       > colnames(X) = c("Length","HeightLeft",
       +                 "HeightRight","Lower","Upper","Diagonal")
       > X = X[,-6]
       > # The first 100 notes are genuine, the second
       > # 100 notes are counterfeit
       > y = as.factor(c(rep(0,100),rep(1,100)))
       > df = data.frame(y,X)
       > res.rpart <- rpart(y ~ X, data = df)
       > rpart.plot(res.rpart, type=4, extra=1,
       +            main="Tree of Bank Data", cex.main=2)
       > n = nrow(X)
       > V = 10
       > cvSets <- cvFolds(n, V)
       > testErrors = c()
       > for (cv in 1:V) {
       +   testInds <- cvSets$subsets[which(cvSets$which==cv)]
       +   trainInds <- (1:n)[-testInds]
       +   df.train = data.frame(y=y[trainInds], X=X[trainInds,])
       +   df.test = data.frame(y=y[testInds], X=X[testInds,])
       +   res.rpart <- rpart(y~., data = df.train)
       +   y.hat = predict(res.rpart, newdata = df.test, type="class")
       +   testErrors[cv] = sum(y.hat!=y[testInds])
       + }
       > testErrors
        [1] 1 0 0 2 0 1 1 1 1 1

       [Figure: "Tree of Bank Data", the fitted classification tree. Node counts are shown as (genuine, counterfeit):
           Root (100, 100)
             XLower < 9.6  (98, 14)
               XUpper < 11:  predict 0  (94, 1)
               XUpper >= 11: predict 1  (4, 13)
             XLower >= 9.6:  predict 1  (2, 86)]
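       For Question 3(c)ii., a minimal R sketch of how the cross-validation error follows from the printed testErrors vector; it assumes n = 200 observations (100 genuine and 100 counterfeit notes), as in the code above:

       # Each entry of testErrors is the number of misclassified test notes in one fold
       testErrors <- c(1, 0, 0, 2, 0, 1, 1, 1, 1, 1)
       n <- 200
       sum(testErrors) / n   # overall 10-fold cross-validation error rate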
4. (8 marks) Consider the following table containing 4 observations.

           j         1    2    3    4
           Class_j   A    B    A    B
           x_j       0    1    0    1
           y_j       1    1   -1   -1

   The second row represents a class taking two values 'A' and 'B', and the third and fourth rows contain a two-dimensional predictor (x_j, y_j) related to the class values.
   (a) (6 marks) Use leave-one-out cross-validation to estimate the classification error of a k = 1 nearest neighbour classifier. Answer this question by completing the following table in your exam booklet.

           point left out                  j = 1   j = 2   j = 3   j = 4
           k closest point to this point                           j = 3
           point classified as                                     A

   (b) (2 marks) What is the leave-one-out cross-validation error for k = 1?

5. (10 marks) A retrospective sample of males in a coronary heart disease (CHD) high-risk region of the Western Cape, South Africa was collected. Samples were divided into cases (of CHD) and controls. Many of the CHD-positive men have undergone blood pressure reduction treatment and other programs to reduce their risk factors after their CHD event. The variables are:

           Variable   Description
           sbp        systolic blood pressure
           tobacco    cumulative tobacco (kg)
           ldl        low density lipoprotein cholesterol
           famhist    family history of heart disease (Present, Absent)
           age        age at onset
           CHD        response

   Assume that the variables sbp, tobacco, ldl, famhist, bmi, alcohol, age and CHD have already been entered as vectors in R. Using the R code below, answer the following questions.
   (a) (4 marks) State the model corresponding to glmResults.
   (b) (6 marks) Using model glmResults, what is the probability of CHD = 1 for a patient with measurements tobacco = 2, ldl = 4.34, famhist = "Absent" and age = 45? If we were to use this model as a classifier, what class would we predict for this case?

   Consider the R code:

   > glmResults <- glm(CHD ~ tobacco + ldl + famhist + age, family = binomial)
   > summary(glmResults)

   Call:
   glm(formula = CHD ~ tobacco + ldl + famhist + age, family = binomial)

   Deviance Residuals:
       Min       1Q   Median       3Q      Max
   -1.7559  -0.8632  -0.4545   0.9457   2.4904

   Coefficients:
                   Estimate Std. Error z value Pr(>|z|)
   (Intercept)    -4.204275   0.498315  -8.437  < 2e-16 ***
   tobacco         0.080701   0.025514   3.163  0.00156 **
   ldl             0.167584   0.054189   3.093  0.00198 **
   famhistPresent  0.924117   0.223178   4.141 3.46e-05 ***
   age             0.044042   0.009743   4.521 6.17e-06 ***
   ---
   Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

   (Dispersion parameter for binomial family taken to be 1)

       Null deviance: 596.11  on 461  degrees of freedom
   Residual deviance: 485.44  on 457  degrees of freedom
   AIC: 495.44

   Number of Fisher Scoring iterations: 4
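   For Question 4, a minimal R sketch of the same leave-one-out calculation, here using knn.cv() from the class package:

   library(class)

   # Predictors and classes from the table in Question 4
   X  <- cbind(x = c(0, 1, 0, 1),
               y = c(1, 1, -1, -1))
   cl <- factor(c("A", "B", "A", "B"))

   # knn.cv() classifies each observation using all of the others (leave-one-out)
   pred <- knn.cv(X, cl, k = 1)
   pred                 # class assigned to each left-out point
   mean(pred != cl)     # leave-one-out cross-validation error rate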
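   For Question 5(b), a minimal R sketch of how the predicted probability can be computed from the fitted coefficients in the summary output above (the linear predictor is mapped to a probability by the logistic function):

   # Coefficients from summary(glmResults): intercept, tobacco, ldl, famhistPresent, age
   beta <- c(-4.204275, 0.080701, 0.167584, 0.924117, 0.044042)

   # New patient: tobacco = 2, ldl = 4.34, famhist = "Absent" (dummy = 0), age = 45
   x.new <- c(1, 2, 4.34, 0, 45)

   eta   <- sum(beta * x.new)   # linear predictor (log-odds of CHD = 1)
   p.hat <- plogis(eta)         # P(CHD = 1 | x) = exp(eta) / (1 + exp(eta))
   p.hat
   as.numeric(p.hat > 0.5)      # predicted class with a 0.5 threshold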