Unsupervised Learning HW1
Due: Wed Oct 13, 2021 at 11:59pm

All homeworks (including this one) must be typeset and submitted in PDF format. Late homeworks and handwritten solutions will not be accepted. You must include your name and UNI in your homework submission. To receive credit, a typeset copy of the homework PDF must be uploaded to Gradescope by the due date. You must show your work to receive full credit. Discussing possible solutions for homework questions is encouraged on the course discussion board and with your peers, but everyone must write their own individual solutions. You must cite all external references you used (including the names of individuals you discussed the solutions with) to complete the homework.

1 Readings

Let x be the last digit of your UNI, and define y := (x mod 5) + 1. Read the yth paper from the list below (yes, we'll check). Summarize the main results of your assigned paper, discuss their significance, and provide a short proof sketch of their technical results.

(i) "Clustering with Interactive Feedback" by Balcan and Blum
(ii) "Incremental clustering: the case for extra clusters" by Ackerman and Dasgupta
(iii) "Comparing Clusterings – An Axiomatic View" by Meila
(iv) "Hartigan's Method: k-means Clustering without Voronoi" by Telgarsky and Vattani
(v) "Clustering large graphs via the singular value decomposition" by Drineas, Kannan, Frieze, Vempala, and Vinay

2 Hardness of k-center

Show that the k-center problem is NP-hard.

(Hint: You can use the fact that the following variation of the Vertex-Cover problem is NP-hard.

Vertex-Cover*
- Input: An undirected, unweighted graph $G = (V, E)$, with vertices $V$ and edges $E$.
- Output: $V' \subseteq V$ such that $V' \cup \bigl( \bigcup_{v' \in V'} \bigcup_{e=(v',v) \in E} \{v\} \bigr) = V$.
- Goal: minimize $|V'|$. )

3 Coverings and Packings

Given a metric space $(X, d)$ and an $\epsilon > 0$, a set $C \subset X$ is called an $\epsilon$-cover if $\forall x \in X, \exists c \in C$ such that $d(x, c) \le \epsilon$; a set $P \subset X$ is called an $\epsilon$-packing if for all distinct $p, p' \in P$, $d(p, p') > \epsilon$.

(i) Show that for any compact metric space $(X, d)$ and any $\epsilon > 0$, there exists $Y \subset X$ such that $Y$ is both an $\epsilon$-cover and an $\epsilon$-packing.

The $\epsilon$-covering number of $X$, denoted $N_\epsilon(X)$, is defined to be the size of the smallest $\epsilon$-cover of $X$; similarly, the $\epsilon$-packing number of $X$, denoted $P_\epsilon(X)$, is defined to be the size of the largest $\epsilon$-packing of $X$.

(ii) Show that for any metric space $(X, d)$ and any $\epsilon > 0$,
$$P_\epsilon(X) \le N_{\epsilon/2}(X) \le P_{\epsilon/2}(X).$$

(iii) Let $B_d(x, r)$ denote the closed Euclidean ball of radius $r$ centered at $x$ in $\mathbb{R}^d$, that is, $B_d(x, r) := \{p \in \mathbb{R}^d \mid \|x - p\|_2 \le r\}$. Give an estimate of the $\epsilon$-covering number of $B_d(0, 1)$ (i.e., the closed Euclidean unit ball in $\mathbb{R}^d$, centered at the origin).
(Hint: consider a maximal-size packing of $B_d(0, 1)$ and compare the relative $d$-dimensional volumes of balls centered at the elements of the packing with a ball containing all of them, using the fact that $\operatorname{vol}(B_d(x, r)) = r^d \operatorname{vol}(B_d(x, 1))$.)

(iv) [Application: approximating the maximum singular value using a cover] Given an $m \times n$ matrix $A$, recall that the maximum singular value of $A$ is defined as
$$\sigma_{\max}(A) := \max_{x \in \mathbb{R}^n,\, \|x\| = 1} \|Ax\|.$$
An (impractical) solution to finding $\sigma_{\max}(A)$ is to test the value $\|Ax\|$ on each $x \in S^{n-1} := \{x \in \mathbb{R}^n : \|x\| = 1\}$ and return the maximum value. The obvious problem with this approach is of course that the set $S^{n-1}$ is of infinite size, and thus one cannot return $\sigma_{\max}$ in finite time using such an approach. One obvious remedy is to approximate $\sigma_{\max}$ by testing the value $\|Ax\|$ on a finite (albeit large) collection of $x$ that approximates $S^{n-1}$ well.

More concretely, one can construct a cover $C$ of $S^{n-1}$, find the maximum of $\|Ax\|$ over $x \in C$ (let's denote that as $\sigma_C$), and relate it to $\sigma_{\max}$. Give a tight upper bound on $\sigma_{\max}$ as a function of $\sigma_C$ and the quality of the cover $C$.
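The bound itself is the exercise, but the cover-based approach is easy to try numerically. Below is a minimal sketch, assuming Python with NumPy, for the special case $n = 2$, where an explicit $\epsilon$-cover of $S^{1}$ is just a grid of angles; the helper name `make_circle_cover` is ours, not part of the assignment. It compares $\sigma_C$ with the true $\sigma_{\max}$ computed via the SVD.

```python
# Numerical sketch of problem 3(iv) for n = 2, where an epsilon-cover of the
# unit circle S^1 can be written down explicitly as an angular grid.
import numpy as np

def make_circle_cover(num_points):
    """Return unit vectors (rows) forming an epsilon-cover of S^1, and epsilon.

    Grid points are spaced by theta = 2*pi/num_points, so every unit vector
    lies within angle theta/2 of some grid point, i.e. within chord distance
    epsilon = 2*sin(theta/4)."""
    theta = 2 * np.pi / num_points
    angles = theta * np.arange(num_points)
    cover = np.stack([np.cos(angles), np.sin(angles)], axis=1)
    eps = 2 * np.sin(theta / 4)
    return cover, eps

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 2))                        # arbitrary m x n matrix, n = 2

cover, eps = make_circle_cover(2000)
sigma_C = max(np.linalg.norm(A @ x) for x in cover)    # max of ||Ax|| over the cover
sigma_max = np.linalg.svd(A, compute_uv=False)[0]      # true largest singular value

print(f"epsilon of cover : {eps:.2e}")
print(f"sigma_C          : {sigma_C:.6f}")
print(f"sigma_max        : {sigma_max:.6f}")
# sigma_C <= sigma_max always holds, since C is a subset of S^{n-1};
# the exercise asks how much larger sigma_max can be, in terms of epsilon.
```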
4 Low-Dimensional Embeddings from Dissimilarity Data

We often encounter problems in machine learning where we do not have access to the data directly, but instead to dissimilarity ratings or comparisons between our datapoints. For example, in a series of medical trials for drug development, we do not have access to a Euclidean representation of the drugs in question, but we may have access to differential trial data which compared the performance of different drugs across a population. For another, more concrete example, consider distances between cities: given the interpoint distances between $n$ cities, we'd like to be able to reproduce the 2-dimensional global positions of the cities in question. As a third example, at a wine tasting you may know only the relative quality or character of each wine, represented as a set of ratings, but you may want to find an embedding of the wines for clustering or visualization purposes, according to these ratings. We will explore how this can be done.

Mathematically, we are given dissimilarity ratings in an $n \times n$ matrix $D \in \mathbb{R}^{n \times n}$ where $D_{ij} = d(\alpha_i, \alpha_j)^2$ for some data $\alpha_1, \ldots, \alpha_n$ (which we do not have access to). We'd like to find a $k$-dimensional Euclidean representation $x_1, \ldots, x_n \in \mathbb{R}^k$ such that
$$\sum_{i \neq j}^{n} \left( D_{ij} - \|x_i - x_j\|^2 \right)^2$$
is minimized, i.e. the learned (squared) Euclidean distance is as close as possible to the given distance $D_{ij}$.

(i) First, we will show that, if the underlying data $\alpha_1, \ldots, \alpha_n$ is Euclidean in $\mathbb{R}^n$ (i.e. there exists a perfect embedding such that $\|\alpha_i - \alpha_j\|^2 = D_{ij}$ for all $i, j$), we can recover this embedding exactly from the matrix $D$ alone. First, we would like to transform the data matrix $D$ into a matrix of inner products of the form $B_{ij} = \langle \alpha_i - \bar{\alpha}, \alpha_j - \bar{\alpha} \rangle$, where $\bar{\alpha}$ represents the data average. Let $H = I - \frac{1}{n}\mathbf{1}\mathbf{1}^T$. Show that $-\frac{1}{2} H^T D H$ has the desired form. This is called a Gram matrix.
Hint: $\|\alpha_i - \alpha_j\|^2 = \langle \alpha_i, \alpha_i \rangle + \langle \alpha_j, \alpha_j \rangle - 2\langle \alpha_i, \alpha_j \rangle$. Also try expanding both sides and matching up terms.

(ii) Assume the matrix $B$ is in this form. Let $Q$ be the matrix whose columns are the eigenvectors of $B$, and $\Lambda^{1/2}$ the diagonal matrix whose diagonal entries are the square roots of the corresponding eigenvalues. Show that the rows of the matrix $Q\Lambda^{1/2} \in \mathbb{R}^{n \times n}$ are a perfect (isometric) embedding of the original data into $\mathbb{R}^n$.
Hint: First prove that $B$ is positive semi-definite. What does this imply?
It turns out that the data matrix is in fact isometrically embeddable in $\mathbb{R}^n$ if and only if the Gram matrix is positive semi-definite.

(iii) What if we want a lower-dimensional embedding instead? Show that if we take the top $k$ eigenvectors $Q_k$ and corresponding eigenvalues $\Lambda_k$ of the centered matrix $B$, the rows of $Q_k \Lambda_k^{1/2}$ minimize the loss
$$\sum_{i \neq j}^{n} \left( D_{ij} - \|x_i - x_j\|^2 \right)^2$$
over all possible $k$-dimensional embeddings $x_i \in \mathbb{R}^k$ (this is the same as PCA on the original matrix $X$).
Hint: You may wish to use the SVD and the Eckart-Young theorem, or (requiring slightly more work) rewrite this equation as the PCA objective function, rewriting the objective using the Frobenius norm as $\min_{\operatorname{rank} Q} \ldots$
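The derivations in (i)-(iii) are the exercise, but the pipeline they describe (classical multidimensional scaling) is short enough to sanity-check numerically. Below is a minimal sketch, assuming Python with NumPy; the function names `gram_from_squared_distances` and `classical_mds` are ours, not part of the assignment. It double-centers the squared-distance matrix to form the Gram matrix, embeds with the top-$k$ eigenpairs, and checks that a genuinely Euclidean $D$ is reproduced exactly.

```python
# Sketch of the embedding procedure in parts (i)-(iii): classical MDS.
import numpy as np

def gram_from_squared_distances(D):
    """Part (i): turn squared distances D_ij = ||a_i - a_j||^2 into the
    Gram matrix B_ij = <a_i - abar, a_j - abar> via B = -0.5 * H D H."""
    n = D.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n   # centering matrix H = I - (1/n) 1 1^T
    return -0.5 * H @ D @ H

def classical_mds(D, k):
    """Parts (ii)-(iii): rows of Q_k Lambda_k^{1/2} give a k-dim embedding."""
    B = gram_from_squared_distances(D)
    evals, evecs = np.linalg.eigh(B)      # eigenvalues in ascending order
    idx = np.argsort(evals)[::-1][:k]     # indices of the top-k eigenpairs
    Lk = np.clip(evals[idx], 0, None)     # clip tiny negative values (round-off)
    return evecs[:, idx] * np.sqrt(Lk)    # rows are the embedded points

# Sanity check: recover points that really are Euclidean.
rng = np.random.default_rng(0)
alpha = rng.standard_normal((10, 3))                        # hidden data in R^3
D = ((alpha[:, None, :] - alpha[None, :, :]) ** 2).sum(-1)  # squared distances
X = classical_mds(D, k=3)
D_hat = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
print(np.allclose(D, D_hat))   # True: interpoint distances are reproduced
```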