Instructions: Compile all solutions to the written problems on this assignment in a single PDF file. Show your work by writing down relevant equations or expressions, and/or by explaining any logic that you use to bypass known equations. For coding problems, we recommend that you write, label, and comment all of your code in one Jupyter notebook file. Follow the submission instructions to submit all files to Gradescope. Please be mindful of the deadline, as late submissions are not accepted, and of our course policies on academic honesty.
You will be working with the following dataset. There are three ternary features x1, x2, and x3, each taking a value in {−1, 0, +1}, and two classes, 0 and 1.
Sample   x1   x2   x3   y
  1      +1   −1   −1   0
  2      +1    0   +1   0
  3       0    0   −1   0
  4       0   −1   +1   0
  5      +1   +1   +1   1
  6       0   +1    0   1
  7       0    0   +1   1
  8      −1    0   +1   1
Problem 1: Naive Bayes (25 points)
(a) Write a small program “training” a naive Bayes model using the provided data. You can simply hardcode the parameters since they can be computed by inspection, but we recommend that you store them neatly and also allow for smoothing to be added in a later part. Which parameters are 0 with no smoothing (α = 0)?
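One minimal sketch of such a trainer is below. It assumes maximum-likelihood estimates of the priors Pr(Y = c) and the per-feature likelihoods Pr(x_j = v | Y = c), with a smoothing parameter alpha folded in for use in part (d); the names (train_nb, VALUES, etc.) are our own, not required by the assignment.

```python
import numpy as np

# Training data from the table above: columns are x1, x2, x3.
X = np.array([
    [+1, -1, -1],
    [+1,  0, +1],
    [ 0,  0, -1],
    [ 0, -1, +1],
    [+1, +1, +1],
    [ 0, +1,  0],
    [ 0,  0, +1],
    [-1,  0, +1],
])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

VALUES = (-1, 0, +1)  # each ternary feature takes one of these values

def train_nb(X, y, alpha=0.0):
    """Estimate Pr(Y = c) and Pr(x_j = v | Y = c) with Laplace smoothing alpha."""
    priors, likelihoods = {}, {}
    for c in (0, 1):
        Xc = X[y == c]
        priors[c] = len(Xc) / len(X)
        for j in range(X.shape[1]):
            for v in VALUES:
                count = np.sum(Xc[:, j] == v)
                likelihoods[(c, j, v)] = (count + alpha) / (len(Xc) + alpha * len(VALUES))
    return priors, likelihoods

priors, likelihoods = train_nb(X, y, alpha=0.0)
# The parameters that are exactly zero with no smoothing:
print([k for k, p in likelihoods.items() if p == 0.0])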
(b) Using the parameters learned above, identify all combinations of feature inputs for which our model will predict zero likelihood for both classes.
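Rather than reasoning entirely by hand, one can brute-force all 3^3 = 27 possible inputs against the unsmoothed model; this snippet reuses the priors and likelihoods dictionaries from the sketch in part (a).

```python
from itertools import product

# Print every input whose joint score is zero under both classes.
for combo in product(VALUES, repeat=3):
    joint = [priors[c] * np.prod([likelihoods[(c, j, v)] for j, v in enumerate(combo)])
             for c in (0, 1)]
    if joint[0] == 0.0 and joint[1] == 0.0:
        print(combo)
```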
(c) Compute the distribution Pr(Y | x1, x2, x3) for each training sample. Show a plot of the eight distributions (e.g., sample on x-axis, Pr(Y = 0) and Pr(Y = 1) as two separate lines on the y-axis).
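A sketch of the posterior computation and plot follows, again reusing the part (a) objects. When both joint scores are zero the posterior is 0/0; the uniform fallback below is our own convention, not part of the assignment.

```python
import matplotlib.pyplot as plt

def posterior(x, priors, likelihoods):
    """Return [Pr(Y=0 | x), Pr(Y=1 | x)]; falls back to uniform when both
    joint scores are zero (an arbitrary convention -- see the note above)."""
    joint = [priors[c] * np.prod([likelihoods[(c, j, v)] for j, v in enumerate(x)])
             for c in (0, 1)]
    total = sum(joint)
    return [0.5, 0.5] if total == 0 else [s / total for s in joint]

post = np.array([posterior(x, priors, likelihoods) for x in X])
plt.plot(range(1, 9), post[:, 0], marker="o", label="Pr(Y = 0)")
plt.plot(range(1, 9), post[:, 1], marker="s", label="Pr(Y = 1)")
plt.xlabel("Sample")
plt.ylabel("Posterior probability")
plt.legend()
plt.show()
```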
(d) Retrain the naive Bayes parameters using Laplace smoothing with α = 1. Repeat part (c) with the new smoothed parameters. Briefly compare and contrast your observations about the distributions with and without smoothing.
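With the trainer sketched in part (a), the smoothed retraining is a one-argument change:

```python
# alpha = 1 adds one pseudo-count per (class, feature, value) cell, so no
# likelihood can be exactly zero and every posterior is well defined.
priors_s, likelihoods_s = train_nb(X, y, alpha=1.0)
post_s = np.array([posterior(x, priors_s, likelihoods_s) for x in X])
```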
(e) For either model, on which two training samples would we feel the most uncertain about our predictions? Briefly explain, referencing the specific feature combinations.
Problem 2: Decision Tree (20 points)
Suppose we train a decision tree model using the provided training data.
(a) Show all information gain computations that would be considered when learning the model. You do not have to show all of the numerical calculations, but please show the expressions that would be used. Sketch the resulting decision tree (ties may be broken arbitrarily).
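For reference, the quantities involved take the standard form below (entropy in bits; S_c is the subset of S with class c, and S_v the subset where the candidate feature x_j takes value v):

```latex
H(S) = -\sum_{c \in \{0,1\}} \frac{|S_c|}{|S|} \log_2 \frac{|S_c|}{|S|},
\qquad
\mathrm{Gain}(S, x_j) = H(S) - \sum_{v \in \{-1,\,0,\,+1\}} \frac{|S_v|}{|S|}\, H(S_v)
```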
(b) Now consider training a “random” forest of three decision trees, where each tree is trained on all data using the three different subsets of two features, i.e., (x1, x2) only, (x1, x3) only, and (x2, x3) only. Sketch each of the learned trees (no need to show calculations). What is the training accuracy of each tree individually? What is the highest attainable training accuracy of the forest if we classify using a majority vote?
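Once you have each tree's 0/1 predictions on the eight samples, the vote reduces to a tiny helper; the zero vectors below are placeholders to be replaced with your trees' actual outputs, not the answer.

```python
# Class 1 wins the vote when at least two of the three trees predict 1.
def majority_vote(tree_preds):
    return (np.asarray(tree_preds).sum(axis=0) >= 2).astype(int)

tree_preds = [np.zeros(8, dtype=int)] * 3  # placeholder predictions
forest_acc = np.mean(majority_vote(tree_preds) == y)
```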
Problem 3: Linear Classifier (25 points)
Suppose we train a linear classifier with a hard threshold activation function using the perceptron algorithm on the provided training data. We predict y = 0 when w · x < 0 and y = 1 when w · x > 0. We will look at the implications of assigning the tiebreaker decision (w · x = 0) to each class.
(a) Plot the data as points in x1-x2-x3 space, using different colors and markers to indicate the two different classes. Are the data linearly separable? You may have to try different viewpoints to get a full perspective of the space.
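A minimal 3-D scatter sketch, reusing X and y from Problem 1:

```python
import matplotlib.pyplot as plt

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
for c, marker in [(0, "o"), (1, "^")]:
    pts = X[y == c]
    ax.scatter(pts[:, 0], pts[:, 1], pts[:, 2], marker=marker, label=f"y = {c}")
ax.set_xlabel("x1")
ax.set_ylabel("x2")
ax.set_zlabel("x3")
ax.legend()
plt.show()  # rotate the interactive view to judge separability
```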
(b) Write a small program implementing the perceptron, starting with weights w = (0, 0, 0, 0) and using learning rate α = 1. Learn two models, one in which the tiebreaker w · x = 0 predicts y = 0, and one in which it predicts y = 1. Track the weights over each iteration until convergence. Generate two plots, one for each model, of the four weight components (you can plot them as four separate lines in one graph). Remember that convergence is achieved only when the classifier is correct on a full pass through the training data.
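A sketch of one possible implementation follows. It assumes the four weights are a bias plus one weight per feature (with a constant input x0 = 1 prepended), and max_epochs is a safety cap of our own choosing; X, y, and plt come from the earlier blocks.

```python
def train_perceptron(X, y, alpha=1.0, tie_class=0, max_epochs=100):
    """Perceptron with 0/1 labels; tie_class is the prediction when w . x = 0.
    Returns the final weights and a snapshot of w after every sample visit."""
    Xb = np.hstack([np.ones((len(X), 1)), X])  # prepend bias input x0 = 1
    w = np.zeros(Xb.shape[1])
    history = [w.copy()]
    for _ in range(max_epochs):
        converged = True
        for xi, yi in zip(Xb, y):
            a = np.dot(w, xi)
            pred = tie_class if a == 0 else int(a > 0)
            if pred != yi:
                w = w + alpha * (yi - pred) * xi  # standard 0/1 perceptron update
                converged = False
            history.append(w.copy())
        if converged:
            break
    return w, np.array(history)

w_tie0, hist = train_perceptron(X, y, alpha=1.0, tie_class=0)
plt.plot(hist)  # four lines, one per weight component
plt.xlabel("Sample visit")
plt.ylabel("Weight value")
plt.legend(["bias", "w1", "w2", "w3"])
plt.show()
```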
(c) Experiment with increasing and decreasing the learning rate α for each model. How does it affect the final learned weights and convergence speed?
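One way to run the experiment is a simple sweep; the particular α values below are arbitrary illustrative choices.

```python
for a in (0.1, 0.5, 1.0, 2.0, 10.0):  # arbitrary sweep values
    w_a, hist_a = train_perceptron(X, y, alpha=a, tie_class=0)
    print(f"alpha={a}: {len(hist_a)} snapshots until convergence, final w = {w_a}")
```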
(d) Compute and plot the sigmoid values for each of the training samples using the two learned models (i.e., sample on x-axis and value on y-axis, one plot for each model). Which data predictions do we feel the most uncertain about, and what are their associated probabilities?
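A short sketch for one model (repeat with the other tiebreaker's weights); reading sigmoid(w · x) as Pr(Y = 1 | x) is the usual logistic interpretation.

```python
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

Xb = np.hstack([np.ones((len(X), 1)), X])  # same bias convention as above
probs = sigmoid(Xb @ w_tie0)
plt.plot(range(1, 9), probs, marker="o")
plt.xlabel("Sample")
plt.ylabel("sigmoid(w . x)")
plt.show()
# Values nearest 0.5 mark the most uncertain predictions.
```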