F19 17601 Data-driven Software Engineering
This HW includes both theory and implementation problems. Please note,
SVM & Kernel Methods
Problem 1 (10 points) In this problem we would like to compare the solutions of hard and soft SVMs on a linearly separable dataset. Let n > 1 be a fixed number. Is it true that there exists C > 0 such that for every sample S of n training examples with binary labels, which are linearly separable, the hard-SVM and the soft-SVM (with parameter C) solutions will return exactly the same weight vector? Justify your answer. (Hint: consider n = 2, d = 1 and S = {(x1, y1), (x2, y2)}. Let a > 0 and consider x1 = a, y1 = 1, x2 = −a, y2 = −1. Derive the optimal solutions for hard and soft SVM and compare the results.)
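A minimal numerical sketch of the hint's setting (not part of the required solution), assuming scikit-learn is available: the soft-SVM is fit with a fixed C, and the hard-SVM is approximated by a very large C. The values a = 0.1 and C = 1.0 are illustrative choices, not given in the problem.

```python
# Illustrative sketch only: contrasts a soft-SVM weight for a moderate C with a
# hard-SVM approximation (very large C) on the hint's dataset
# x1 = a, y1 = +1, x2 = -a, y2 = -1. The values a = 0.1 and C = 1.0 are assumptions.
import numpy as np
from sklearn.svm import SVC

a = 0.1
X = np.array([[a], [-a]])   # d = 1, n = 2
y = np.array([1, -1])

soft = SVC(kernel="linear", C=1.0).fit(X, y)    # soft-SVM with a fixed C
hard = SVC(kernel="linear", C=1e8).fit(X, y)    # very large C mimics hard-SVM

print("soft-SVM weight:", soft.coef_.ravel())   # ~ 2*C*a = 0.2 for this data
print("hard-SVM weight:", hard.coef_.ravel())   # ~ 1/a  = 10 for this data
```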
Problem 2 (10 points) In this problem you are asked to find the explicit mapping function Φ(·) for the Gaussian kernel k(x, z) = exp(−∥x − z∥²/2) by showing that it can be expressed as an inner product in an infinite-dimensional feature space under a mapping Φ.
a) Assume x, z ∈ R1 (only one feature per training example). Use the Taylor expansion of the kernel function k (to be precise, of the exp(·) function) and derive the explicit mapping Φ that the kernel corresponds to.
b) Answer the above question in the general case where x, z ∈ Rd. Assume ∥x∥2 = ∥z∥2 = 1.
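As a sanity check for part (a), the following sketch (assuming the kernel takes the form k(x, z) = exp(−(x − z)²/2), i.e. σ = 1) compares the kernel value with the inner product of a truncated candidate feature map obtained from the Taylor series of exp(xz); the truncation length N = 30 is an illustrative choice.

```python
# Numerical check only, not a derivation. Assumes the 1-D Gaussian kernel
# k(x, z) = exp(-(x - z)^2 / 2). Expanding exp(xz) with its Taylor series
# suggests the candidate feature map
#   Phi(x) = exp(-x^2/2) * (1, x/sqrt(1!), x^2/sqrt(2!), x^3/sqrt(3!), ...),
# truncated here to N coordinates.
import math
import numpy as np

def phi(x, N=30):
    # First N coordinates of the (infinite-dimensional) candidate feature map.
    return np.array([math.exp(-x**2 / 2) * x**k / math.sqrt(math.factorial(k))
                     for k in range(N)])

def gaussian_kernel(x, z):
    return math.exp(-(x - z)**2 / 2)

x, z = 0.7, -1.3
print("kernel value     :", gaussian_kernel(x, z))
print("<Phi(x), Phi(z)> :", float(phi(x) @ phi(z)))  # should agree closely
```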
Problem 3 (20 points) In this problem we aim at utilizing the kernel trick in Ridge regression and propose its kernelized version. Recall the Ridge regression training objective function:
f(w) = ∥Xw − y∥² + λ∥w∥², for λ > 0.
a) Show that for w to be a minimizer of f(w), we must have X⊤Xw + λIw = X⊤y, where X ∈ Rn×d is the data matrix with n samples each with d features, and I is the identity matrix (please check lectures for more details). Show that the minimizer of f(w) is w = (X⊤X + λI)−1X⊤y. Justify that the matrix X⊤X + λI is invertible, for λ > 0.
(Hint: use the SVD decomposition of the data matrix X = UΣV⊤ and show that all the eigenvalues of X⊤X + λI are larger than zero.)
b) Rewrite X⊤Xw + λIw = X⊤y as w = (1/λ)(X⊤y − X⊤Xw). Based on this, show that we can write w = X⊤α for some α ∈ Rn, and give an expression for α.
c) Based on the fact that w = X⊤α, explain why we say w is "in the span of the data."
d) Show that α = (λI + XX⊤)−1y. Note that XX⊤ is the n × n Gram (kernel) matrix for the standard vector dot product. (Hint: Replace w by X⊤α in the expression for α, and then solve for α.)
e) Give a kernelized expression for Xw, the predicted values on the training points. (Hint: Replace w by X⊤α and α by its expression in terms of the kernel matrix XX⊤.)
f) Give an expression for the prediction w∗⊤x for a test sample x not in the training set, where w∗ is the optimal solution. The expression should only involve x via inner products with the training data samples xi, i = 1, . . . , n.
g) Based on (f), propose a kernelized version of Ridge regression.
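A minimal numerical sketch of the quantities appearing in parts (a)-(g), not a prescribed solution: the synthetic data, λ = 0.5, and the RBF bandwidth γ = 1 are illustrative assumptions. It checks that the dual coefficients α = (λI + XX⊤)−1y reproduce the primal solution and shows the kernelized predictor.

```python
# Illustrative sketch only; data, lambda, and gamma below are assumptions.
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 50, 3, 0.5
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

# (a) Primal closed form: w = (X^T X + lam I)^{-1} X^T y.
w = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# (d) Dual coefficients: alpha = (lam I + X X^T)^{-1} y, with K = X X^T.
K = X @ X.T                       # Gram matrix of the linear kernel
alpha = np.linalg.solve(lam * np.eye(n) + K, y)

# (b)-(c) w lies in the span of the data: w = X^T alpha.
assert np.allclose(X.T @ alpha, w)

# (e) Predictions on the training points use only the kernel matrix: X w = K alpha.
assert np.allclose(X @ w, K @ alpha)

# (f) Prediction for a new point x uses only inner products <x_i, x>.
x_new = rng.normal(size=d)
assert np.isclose((X @ x_new) @ alpha, w @ x_new)

# (g) Kernelized ridge regression: replace the dot product by any kernel k.
def rbf(A, B, gamma=1.0):
    # k(a, b) = exp(-gamma * ||a - b||^2); gamma is an illustrative choice.
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

K_rbf = rbf(X, X)
alpha_rbf = np.linalg.solve(lam * np.eye(n) + K_rbf, y)
y_hat_new = rbf(x_new[None, :], X) @ alpha_rbf   # prediction for x_new
print("kernelized prediction at x_new:", float(y_hat_new))
```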