CSE 5523: Machine Learning
Instructions (Important)

• Total points = 55. Full mark = 50.
• Read each problem statement carefully. Make sure that you answer exactly what is required and that your solution is expressed clearly.
• Some notation introduced in class, e.g., the notation we used for true risk, empirical risk, the output of a learning algorithm given a training set, parameter vectors, etc., will be used here directly without being defined again. Please refer to the notes if you have any confusion about the notation. Also, here ‖·‖ will be used to denote the Euclidean (i.e., L2) norm.
• You may use any result covered in class, but please cite the result that you use.

Problem 1: Linear Hard-Margin SVMs (15 points)

In this problem, refer to the figure attached on the last page. In the attached figure, consider the constellation of feature vectors x^(i) = (x_1^(i), x_2^(i)) ∈ R^2 and their assigned labels y^(i), i = 1, ..., 7 (see the color/label code in the top-right corner of the figure):

1. Find the margin-maximizing linear classifier; that is, find (ω, b) = ((w_1, w_2), b) describing the equation of the plane w_1 x_1 + w_2 x_2 + b = 0 of the largest margin with respect to the given constellation. Express (ω, b) after performing the proper normalization such that

   min_{i ∈ {1,...,7}} y^(i) (⟨ω, x^(i)⟩ + b) = 1.

[Note: You do not need to solve the optimization problem for the SVM or prove formally that your plane is margin-maximizing. It is only required to find the parameters of the margin-maximizing classifier (ω, b) as instructed above. You may want to provide an intuitive explanation of why you think your solution is optimal, for which you may get partial credit in case your solution is incorrect.]

2. Given your solution in Part 1:

(a) Identify the set Q ⊆ {x^(1), ..., x^(7)} of points that are closest to your plane.

(b) Find a set of scalars {α_i : i ∈ Q} that satisfy all of the following (showing all details of your derivation):

   α_i ≥ 0 for all i ∈ Q,    Σ_{i ∈ Q} y^(i) α_i = 0,    ω = Σ_{i ∈ Q} α_i y^(i) x^(i).

(c) What are the support vectors of your classifier?

Problem 2: Solving a Kernel-Based SVM with SGD (18 points)

This problem illustrates a simple example of solving a kernel-based SVM with a Gaussian kernel via SGD. Consider a training set of 3 data points (x^(i), y^(i)) ∈ R^2 × {−1, +1}, i = 1, 2, 3, where

   x^(1) = (0, 0), y^(1) = +1;    x^(2) = (2, 1), y^(2) = 1;    x^(3) = (2, 1), y^(3) = 1.

Let K : R^2 × R^2 → R be a Gaussian kernel defined as

   K(x, x') = exp(−‖x − x'‖^2 / 8)    for any x, x' ∈ R^2.

Suppose we want to solve the kernel-based SVM as described in class with respect to this training set:

1. Obtain the kernel matrix with respect to the given training set. Hence, write down the optimization problem corresponding to solving the soft-margin kernel-based SVM, i.e., write down an expression for the objective function (the regularized loss) that needs to be minimized over the variable coefficients α = (α_1, α_2, α_3) as discussed in class.
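The kernel matrix and the regularized loss in Part 1 are to be derived by hand, but if you want to sanity-check your arithmetic numerically, the short Python sketch below computes a Gaussian kernel (Gram) matrix and evaluates one common form of the soft-margin kernel-SVM objective, Λ αᵀGα + (1/m) Σ_i max{0, 1 − y^(i) (Gα)_i}. The function names and the exact form of the objective are illustrative assumptions, not the required class formulation; also double-check the training values, which are entered here exactly as printed above.

import numpy as np

def gaussian_kernel_matrix(X, denom=8.0):
    # Gram matrix with entries G[i, j] = exp(-||x_i - x_j||^2 / denom).
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / denom)

def objective(alpha, G, y, Lambda=1.0):
    # One common form of the regularized hinge loss over the coefficients alpha;
    # the formulation used in class may differ (e.g., in how Lambda enters).
    margins = y * (G @ alpha)
    hinge_loss = np.maximum(0.0, 1.0 - margins).mean()
    return Lambda * alpha @ G @ alpha + hinge_loss

# Training set entered as printed in the problem statement above;
# verify the values/signs against the original statement before relying on the output.
X = np.array([[0.0, 0.0], [2.0, 1.0], [2.0, 1.0]])
y = np.array([1.0, 1.0, 1.0])

G = gaussian_kernel_matrix(X)
print(np.round(G, 4))                  # 3x3 kernel matrix
print(objective(np.zeros(3), G, y))    # objective value at alpha = 0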
2. Set the regularization parameter Λ in the objective function of Part 1 to Λ = 1. Now, suppose we want to solve this optimization problem via SGD as discussed in class. For simplicity, consider SGD with a total number of updates T = 6, i.e., 5 update steps not including the initialization, which is assumed to be at α^(1) = 0. Showing all intermediate steps, perform those updates and obtain the final coefficient vector α̂ (i.e., show all updates α^(t), t = 1, ..., 6, as well as the final output α̂). Note that in the actual execution, in each step, SGD samples one data point uniformly at random from the training set. For the sake of this problem, assume that over the course of the 5 update steps after initialization, SGD samples the following data points in the following order:

   at t = 1, (x^(1), y^(1)) is sampled;
   at t = 2, (x^(2), y^(2)) is sampled;
   at t = 3, (x^(3), y^(3)) is sampled;
   at t = 4, (x^(2), y^(2)) is sampled;
   at t = 5, (x^(3), y^(3)) is sampled.

3. Given the coefficient vector α̂ obtained in Part 2,

(a) give an expression for the resulting kernel-based classifier;

(b) suppose you are given a new feature vector x = (0, 1) ∈ R^2. What label will your kernel classifier generate for x? Show your derivation.

Problem 3: Stability (10 points)

We say that a learning algorithm A is an AERM (Asymptotic Empirical Risk Minimizer) for the class H with excess empirical risk ε(n) if for every distribution D it holds that

   E_{S∼D^n} [ L̂(A(S); S) − min_{h∈H} L̂(h; S) ] ≤ ε(n).

We say that a learning algorithm A learns a class H with excess true risk ε(n) if for every distribution D it holds that

   E_{S∼D^n} [ L(A(S); D) ] − min_{h∈H} L(h; D) ≤ ε(n).

Prove that if a learning algorithm A is ε_1(n)-On-Average-Replace-One stable and is an AERM with excess empirical risk ε_2(n) for the class H, then it learns H with excess true risk ε_1(n) + ε_2(n).

Problem 4: Noisy Gradient Descent (12 points)

Let M, ρ > 0. Let C ⊂ R^d be a closed convex set where max_{w∈C} ‖w‖ ≤ M. Let f : R^d → R be a convex, ρ-Lipschitz function. Consider a variant of the basic (non-stochastic) gradient descent algorithm discussed in class, where at each update step we get a noisy version of the gradient. Namely, at each iteration t = 1, ..., T − 1, the update step is given by

   w_{t+1} = Π_C( w_t − α (∇f(w_t) + v_t) ),

where v_t ∼ N(0, σ^2 I_d), t = 1, ..., T − 1, are independent Gaussian random variables (I_d is the identity matrix of size d), and Π_C is the Euclidean projection onto C discussed in class. You may assume the standard initialization w_1 = 0.

Show that if α = M / √((ρ^2 + dσ^2) T), then the output of the algorithm ŵ = (1/T) Σ_{t=1}^T w_t after T iterations satisfies

   E[f(ŵ)] − min_{w∈C} f(w) ≤ M √(ρ^2 + dσ^2) / √T,

where the expectation is taken w.r.t. the Gaussian noise variables v_t, t = 1, ..., T − 1.

[Note that the problem here is an optimization problem, not a learning problem; i.e., there is no training data or notion of excess risk. The goal is to minimize an objective function f over C ⊂ R^d via this noisy version of the basic gradient descent algorithm.]

[Note that the bound to be proved is on the expectation of the optimization error. In your derivation, you need to take the expectation w.r.t. the randomness due to the Gaussian noise.]

[Figure (attached on the last page, referenced in Problem 1): a plot on the x_1–x_2 axes of the seven labeled feature vectors x^(1), ..., x^(7), with the +1 / −1 label (color) code shown in the top-right corner. The coordinates appearing in the plot are (1, 5), (4, −2), (−1, 1), (−1, 4), (−2, −5), (1, −1), (−5, 3).]
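Although Problem 4 asks for an analytical proof, it can help to see the noisy projected gradient descent update in executable form. The Python sketch below simulates w_{t+1} = Π_C(w_t − α(∇f(w_t) + v_t)) with the averaged output ŵ, taking C to be the Euclidean ball of radius M and f(w) = ‖w − c‖ as a simple convex, 1-Lipschitz test function; these concrete choices of f, C, and the constants d, M, ρ, σ, T are illustrative assumptions and not part of the problem.

import numpy as np

def project_ball(w, M):
    # Euclidean projection onto C = {w : ||w|| <= M}.
    norm = np.linalg.norm(w)
    return w if norm <= M else (M / norm) * w

def noisy_pgd(grad_f, d, M, rho, sigma, T, rng):
    # Run T-1 updates of w_{t+1} = Proj_C(w_t - alpha*(grad f(w_t) + v_t)); return the average iterate.
    alpha = M / np.sqrt((rho ** 2 + d * sigma ** 2) * T)  # step size from the problem statement
    w = np.zeros(d)                                       # standard initialization w_1 = 0
    iterates = [w]
    for _ in range(T - 1):
        v = rng.normal(0.0, sigma, size=d)                # v_t ~ N(0, sigma^2 I_d)
        w = project_ball(w - alpha * (grad_f(w) + v), M)
        iterates.append(w)
    return np.mean(iterates, axis=0)                      # w_hat = (1/T) * sum_t w_t

# Example test function: f(w) = ||w - c||, convex and 1-Lipschitz (rho = 1),
# whose minimizer c lies inside C, so min_{w in C} f(w) = 0.
d, M, rho, sigma, T = 5, 2.0, 1.0, 0.5, 10_000
c = np.full(d, 1.0 / np.sqrt(d))
grad_f = lambda w: (w - c) / max(np.linalg.norm(w - c), 1e-12)

w_hat = noisy_pgd(grad_f, d, M, rho, sigma, T, np.random.default_rng(0))
# The bound in the problem evaluates to M*sqrt(rho^2 + d*sigma^2)/sqrt(T) = 0.03 for these constants.
print("f(w_hat) - min f =", np.linalg.norm(w_hat - c))

Keep in mind that the guarantee you are asked to prove bounds the expectation over the Gaussian noise; any single simulated run is only a rough illustration of the behavior, not a verification of the bound.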