ELEN 4720: Machine Learning for Signals
Please read these instructions to ensure you receive full credit on your homework. Submit the written portion of your homework as a single PDF file through Courseworks (less than 5MB). In addition to your PDF write-up, submit all code written by you in its original extension through Courseworks (e.g., .m, .r, .py, .c). Any coding language is acceptable, but do not submit notebooks. Also, do not wrap your files in .rar, .zip, or .tar, and do not submit your write-up in .doc or any other file type. When resubmitting homeworks, please be sure to resubmit all files. Your grade will be based on the contents of one PDF file and the original source code. Additional files will be ignored. We will not run your code, so everything you are asked to show should be put in the PDF file. Show all work for full credit.

Late submission policy: Late homeworks will have 0.1% deducted from the final grade for each minute late. Your homework submission time will be based on the time of your last submission to Courseworks. I will not revert to an earlier submission time! Therefore, do not re-submit after midnight on the due date unless you are confident the new submission is significantly better, enough to overcompensate for the points lost. Submission time is non-negotiable and will be based on the time you submitted your last file to Courseworks. The number of points deducted will be rounded to the nearest integer.

Problem 1 – 20 points

In this problem you will derive a naive Bayes classifier. For a labeled set of data $(y_1, x_1), \dots, (y_n, x_n)$, where for this problem $y \in \{0, 1\}$ and $x$ is a $D$-dimensional vector of counts, the Bayes classifier observes a new $x_0$ and predicts $y_0$ as
$$y_0 = \arg\max_y \; p(y_0 = y \mid \pi) \prod_{d=1}^{D} p(x_{0,d} \mid \lambda_{y,d}).$$
The distribution $p(y_0 = y \mid \pi) = \mathrm{Bernoulli}(y \mid \pi)$. What is "naive" about this classifier is the assumption that all $D$ dimensions of $x$ are independent. Assume that each dimension of $x$ is Poisson distributed with a Gamma prior. The full generative process is

Data: $y_i \stackrel{iid}{\sim} \mathrm{Bern}(\pi), \qquad x_{i,d} \mid y_i \sim \mathrm{Pois}(\lambda_{y_i,d}), \quad d = 1, \dots, D$
Prior: $\lambda_{y,d} \stackrel{iid}{\sim} \mathrm{Gamma}(2, 1)$

Derive the solution for $\pi$ and each $\lambda_{y,d}$ by maximizing
$$\hat{\pi}, \hat{\lambda}_{0,1:D}, \hat{\lambda}_{1,1:D} = \arg\max_{\pi,\, \lambda_{0,1:D},\, \lambda_{1,1:D}} \; \sum_{i=1}^{n} \ln p(y_i \mid \pi) + \sum_{d=1}^{D} \Big( \ln p(\lambda_{0,d}) + \ln p(\lambda_{1,d}) + \sum_{i=1}^{n} \ln p(x_{i,d} \mid \lambda_{y_i,d}) \Big).$$

Please separate your derivations as follows:

(a) Derive $\hat{\pi}$ using the objective above.

(b) Derive $\hat{\lambda}_{y,d}$ using the objective above, leaving $y$ and $d$ arbitrary in your notation.

Problem 2 – 30 points

In this problem you will implement the naive Bayes classifier derived in Problem 1 and the logistic regression algorithm. The data consists of examples of spam and non-spam emails, of which there are 4600 labeled examples. The feature vector $x$ is a 54-dimensional vector extracted from the email, and $y = 1$ indicates a spam email.¹ In every experiment below, randomly partition the data into 10 groups and run the algorithm 10 different times so that each group is held out as a test set one time. The final result you show should be the cumulative result across these 10 groups.

(a) Implement the naive Bayes classifier described above. In a $2 \times 2$ table, write the number of times that you predicted a class $y$ data point (ground truth) as a class $y'$ data point (model prediction) in the $(y, y')$-th cell of the table, where $y$ and $y'$ can each be either 0 or 1. There should be four values written in the table in your PDF. Next to your table, write the prediction accuracy, i.e., the sum of the diagonal of the table divided by 4600. (The sum of all entries in the table should be 4600.)
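For part (a), the sketch below shows one way to organize the naive Bayes prediction rule from Problem 1 and the $2 \times 2$ evaluation table in Python. It assumes the estimates $\hat{\pi}$ and $\hat{\lambda}_{y,d}$ have already been derived and computed; the function names and array layouts are illustrative choices, not requirements of the assignment.

    # Sketch only: pi_hat (scalar) and lam_hat (2 x D array of Poisson rates,
    # one row per class) are assumed to come from your Problem 1 derivation.
    import numpy as np
    from scipy.stats import poisson

    def predict_naive_bayes(X_test, pi_hat, lam_hat):
        """Predict 0/1 labels for the m x D count matrix X_test."""
        # Log class prior: Bernoulli(pi) evaluated at y = 0 and y = 1.
        log_prior = np.log(np.array([1.0 - pi_hat, pi_hat]))
        # Naive Bayes: sum the per-dimension Poisson log-likelihoods.
        log_lik = np.stack(
            [poisson.logpmf(X_test, lam_hat[y]).sum(axis=1) for y in (0, 1)],
            axis=1)                                  # shape (m, 2)
        return np.argmax(log_prior + log_lik, axis=1)

    def confusion_table(y_true, y_pred):
        """2 x 2 table whose (y, y') cell counts truth y predicted as y'."""
        table = np.zeros((2, 2), dtype=int)
        for t, p in zip(y_true, y_pred):
            table[t, p] += 1
        return table

Accumulating confusion_table over the 10 held-out groups and computing np.trace(table) / 4600 gives the accuracy requested above.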
(b) In one figure, show a stem plot (stem() in Matlab) of the 54 Poisson parameters for each class, averaged across the 10 runs. (This average is only used for plotting purposes on this homework. In practice you would relearn these parameters using the entire data set to find their final values.) Use the README file to make an observation about dimensions 16 and 52.

Next, run logistic regression on the same data set. Set every $y_i = 0$ to $y_i = -1$ for this part. Also, be sure to add a dimension equal to $+1$ to each data point.

(c) Implement the steepest ascent algorithm discussed in class. Use a step size $\eta = 0.01/4600$. Run your algorithm for 1,000 iterations and plot the logistic regression training objective function $\mathcal{L}$ per iteration for each of the 10 training runs. Plot these in the same figure. (A minimal illustrative sketch of this update is given at the end of this assignment.)

(d) Finally, implement an algorithm called "Newton's method" for logistic regression as follows. At iteration $t$, approximate the function
$$\mathcal{L}(w) \approx \mathcal{L}'(w) \equiv \mathcal{L}(w_t) + (w - w_t)^T \nabla \mathcal{L}(w_t) + \frac{1}{2}(w - w_t)^T \nabla^2 \mathcal{L}(w_t)(w - w_t).$$
Then set $w_{t+1} = \arg\max_w \mathcal{L}'(w)$. Derive the update for $w_{t+1}$ for the logistic regression problem, then implement and run this algorithm. Plot the objective function $\mathcal{L}$ on the training data as a function of $t = 1, \dots, 100$ for each of the 10 training runs. Plot these in the same figure. (A sketch of the resulting Newton step is also given at the end of this assignment.)

(e) In a $2 \times 2$ table, show the testing results using Newton's method in the same way as shown in Problem 2(a).

¹ I've preprocessed the data already. Please use the data as-is for all of Problem 2. More information about the meanings of the 54 dimensions of the data is provided in two accompanying files.

Problem 3 – 25 points

In this problem you will implement the Gaussian process model for regression. You will use the same data used for Homework 1, which is again provided in the data zip file for this homework. Recall that the Gaussian process treats a set of $N$ observations $(x_1, y_1), \dots, (x_N, y_N)$, with $x_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}$, as being generated from a multivariate Gaussian distribution as follows,
$$y \sim \mathrm{Normal}(0, \sigma^2 I + K), \qquad K_{ij} = K(x_i, x_j) \quad \Big(\text{use: } \exp\Big\{-\frac{1}{b}\|x_i - x_j\|^2\Big\}\Big).$$
Here, $y$ is an $N$-dimensional vector of outputs and $K$ is an $N \times N$ kernel matrix. For this problem, use the Gaussian kernel indicated above. In the lecture slides we discuss making predictions for a new $y'$ given $x'$, which was Gaussian with mean $\mu(x')$ and variance $\Sigma(x')$; the equations are shown in the slides. There are two parameters that need to be set for this model as given above, $\sigma^2$ and $b$.

(a) Write code to implement the Gaussian process and to make predictions on test data. For $b \in \{5, 7, 9, 11, 13, 15\}$ and $\sigma^2 \in \{0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1\}$, i.e., 60 total pairs $(b, \sigma^2)$, calculate the RMSE on the 42 test points as you did in the first homework. Use the mean of the Gaussian process at the test point as your prediction. Show your results in a table. (A sketch of the predictive-mean computation is given at the end of this assignment.)

(b) Which value was the best, and how does this compare with the first homework? What might be a drawback of using the approach in this homework compared with Homework 1?

(c) To better understand what the Gaussian process is doing through visualization, re-run the algorithm using only the 4th dimension of $x_i$ (car weight). Set $b = 5$ and $\sigma^2 = 2$. Show a scatter plot of the data ($x_i[4]$ versus $y_i$ for each point). Also plot, as a solid line, the predictive mean of the Gaussian process at each point in the training set. You can think of this problem as asking you to create a test set by duplicating $x_i[4]$ for each $i$ in the training set and then predicting that test set.
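The following is a minimal sketch of the steepest ascent updates for Problem 2(c), assuming the training objective $\mathcal{L}(w) = \sum_{i=1}^n \ln \sigma(y_i x_i^T w)$ with $\sigma(z) = 1/(1 + e^{-z})$ and labels $y_i \in \{-1, +1\}$, consistent with the setup above; verify the gradient against your class notes before using it.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def steepest_ascent(X, y, eta=0.01 / 4600, iters=1000):
        """Gradient ascent on L(w) = sum_i ln sigmoid(y_i x_i^T w).

        X : n x (d+1) data matrix (with the +1 column appended)
        y : length-n vector with entries in {-1, +1}
        """
        w = np.zeros(X.shape[1])
        objective = []
        for _ in range(iters):
            s = sigmoid(y * (X @ w))                 # sigma(y_i x_i^T w) for all i
            w = w + eta * (X.T @ ((1.0 - s) * y))    # gradient step
            objective.append(np.sum(np.log(sigmoid(y * (X @ w)))))
        return w, objective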
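For Problem 2(d), maximizing the quadratic approximation $\mathcal{L}'(w)$ gives $w_{t+1} = w_t - [\nabla^2 \mathcal{L}(w_t)]^{-1} \nabla \mathcal{L}(w_t)$. The sketch below implements one standard form of this step under the same conventions as the previous sketch; it is not a substitute for the derivation you are asked to show.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def newton_step(X, y, w):
        """One Newton update for logistic regression with y in {-1, +1}."""
        s = sigmoid(y * (X @ w))
        grad = X.T @ ((1.0 - s) * y)
        # Hessian: -sum_i s_i (1 - s_i) x_i x_i^T  (the y_i^2 factors equal 1).
        H = -(X.T * (s * (1.0 - s))) @ X
        return w - np.linalg.solve(H, grad)

If $\nabla^2 \mathcal{L}$ is nearly singular on some folds, subtracting a small multiple of the identity from H before solving is a common safeguard.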
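Finally, a sketch of the Gaussian process predictions for Problem 3, assuming the standard predictive mean $\mu(x') = K(x', X)(\sigma^2 I + K)^{-1} y$ implied by the joint Gaussian model above; check this against the equations in the lecture slides. All names are illustrative.

    import numpy as np

    def gauss_kernel(A, B, b):
        """Kernel matrix with entries exp(-||a_i - b_j||^2 / b)."""
        sq = ((A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :]
              - 2.0 * (A @ B.T))
        return np.exp(-sq / b)

    def gp_predict_mean(X_train, y_train, X_test, b, sigma2):
        """Predictive mean of the GP at each row of X_test."""
        K = gauss_kernel(X_train, X_train, b)        # N x N
        K_star = gauss_kernel(X_test, X_train, b)    # M x N
        alpha = np.linalg.solve(sigma2 * np.eye(len(y_train)) + K, y_train)
        return K_star @ alpha

    def rmse(y_true, y_pred):
        return np.sqrt(np.mean((y_true - y_pred) ** 2))

For Problem 3(c), pass the $N \times 1$ matrix of fourth-dimension values as both X_train and X_test, so that the solid line is the predictive mean evaluated at the training inputs.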