Midterm, CSCI 5525 2021
Paul Schrater
1 General Knowledge (15 pts)
1. (3 pts) Derive an expression for the expected loss involving bias, variance, and noise.
2. ( 3 pts) Explain how to use cross-validation to estimate each of the terms above.
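Questions 1 and 2 refer to the standard squared-error decomposition; as a reminder, one common form is sketched below, assuming squared loss, a target y = f(x) + ε with zero-mean noise ε, and a predictor h(x; D) trained on a random dataset D:

```latex
% Expected squared loss at a point x, averaging over datasets D and noise eps
\mathbb{E}_{D,\varepsilon}\big[(y - h(x;D))^2\big]
  = \underbrace{\big(\mathbb{E}_D[h(x;D)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}_D\big[(h(x;D) - \mathbb{E}_D[h(x;D)])^2\big]}_{\text{variance}}
  + \underbrace{\mathbb{E}_\varepsilon[\varepsilon^2]}_{\text{noise}}
```

Cross-validation gives a direct estimate of the left-hand side; estimating the bias and variance terms separately requires repeated fits (for example, across folds or bootstrap resamples) so that the expectation over D can be approximated.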
3. (4 pts) Bias in a classifier means that, when classifying new data points drawn from the same distribution as the training data, one category is predicted more often than another.
(a) What aspects of the training data affect classifier bias?
(b) How does the hinge loss function in an SVM handle bias?
(c) Which parameters of an SVM affect bias on test data? How does increasing or decreasing these parameters affect bias?
4. (5 pts) Consider a naive-Bayes generative model for the problem of classifying samples {(x1, y1), …, (xn, yn)}, xi ∈ Rp and yi ∈ {1, …, K}, where the marginal distribution of each feature is modeled as a univariate Gaussian, i.e., p(xij | yi = k) ∼ N(µjk, σ²jk),
where k represents the class label. Assuming all parameters have been estimated, clearly describe how such a naive-Bayes model classifies a test point xtest.
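For question 4, here is a minimal sketch of how such a model scores a test point, assuming the class priors and the per-class, per-feature Gaussian parameters have already been estimated; the array names prior, mu, and var are illustrative, not from the exam:

```python
import numpy as np

def nb_predict(x_test, prior, mu, var):
    """Classify one test point with a Gaussian naive-Bayes model.

    x_test: (p,) feature vector
    prior:  (K,) class priors p(y = k)
    mu:     (K, p) per-class, per-feature Gaussian means
    var:    (K, p) per-class, per-feature Gaussian variances
    Returns the class index maximizing the (log) posterior.
    """
    # log p(x | y=k) = sum_j log N(x_j; mu_kj, var_kj)  (naive independence assumption)
    log_lik = -0.5 * np.sum(
        np.log(2 * np.pi * var) + (x_test - mu) ** 2 / var, axis=1
    )
    log_post = np.log(prior) + log_lik   # unnormalized log posterior per class
    return int(np.argmax(log_post))      # argmax_k p(y = k | x_test)
```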
Imagine we are using 10-fold cross-validation to tune a parameter θ of a machine learning algorithm, using the training folds for parameter estimation and the held-out fold to evaluate test performance of different values of θ. This produces 10 models, {h1, …, h10}; each model hi has its own value θi for that parameter and a corresponding error ei. Let k = arg mini ei be the index of the model with the lowest error. What is the best procedure for going from these 10 individual models to a single model that we can apply to the test data? (An illustrative sketch of this setup follows the answer options below.)
a) Choose the model hk?
b) Weight the predictions of each model by wi = exp(−ei)?
c) Set θ = θk, then update by training on the held-out data.
Clearly explain your choice and reasoning.
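A minimal sketch of the tuning setup described in this question, using closed-form ridge regression on hypothetical synthetic data purely for illustration; it produces one (θi, ei) pair per fold and the index k, mirroring the notation above:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                       # hypothetical synthetic features
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=200)

def ridge_fit(X, y, lam):
    """Closed-form ridge solution w = (X^T X + lam I)^{-1} X^T y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

thetas = [0.01, 0.1, 1.0, 10.0]                     # candidate values of theta
folds = np.array_split(rng.permutation(len(y)), 10) # 10 held-out folds

theta_i, e_i = [], []
for held_out in folds:
    train = np.setdiff1d(np.arange(len(y)), held_out)
    # evaluate each candidate theta on this fold's held-out data
    errors = []
    for lam in thetas:
        w = ridge_fit(X[train], y[train], lam)
        errors.append(np.mean((X[held_out] @ w - y[held_out]) ** 2))
    best = int(np.argmin(errors))
    theta_i.append(thetas[best])                    # this fold's theta_i
    e_i.append(errors[best])                        # corresponding error e_i

k = int(np.argmin(e_i))                             # index of the lowest-error model
print(theta_i, e_i, "k =", k)
```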
Let {(x1,y1),…,(xn,yn)} be a dataset of n samples for regression with xi ∈ Rp and yi ∈ R.
Consider a regularized regression approach to the problem:
min_w Σ_{i=1:n} (yi − wᵀxi)² + λ∥w∥², with λ > 0.
This problem is known as ridge regression. Using a kernel trick we can write the solution to this problem in a form where we can use weights on either the feature dimensions or the data points. Rewriting the expression in terms of data points allows us to generalize the regression solution to nonlinear forms. To better understand the problem, remember that a data matrix X can be viewed in either row or column form, where rows are data points and columns feature dimensions. A regression solution weighting rows is suitable for a kernel form of a solution.
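As a brief sketch of what the row-weighted (data-point) form looks like in the linear case, using my own notation: if the weight vector is expressed as a combination of the data points, w = Σ_{i=1:n} αi xi = Xᵀα, then predictions depend on the data only through inner products:

```latex
w^\top x = \Big(\sum_{i=1}^{n} \alpha_i x_i\Big)^{\!\top} x
         = \sum_{i=1}^{n} \alpha_i \, x_i^\top x
         = \sum_{i=1}^{n} \alpha_i \, k(x_i, x),
\qquad \text{with } k(x_i, x) = x_i^\top x \ \text{(linear kernel).}
```

Replacing the inner product xiᵀx with a nonlinear kernel k(xi, x) is what the polynomial-kernel question below exploits.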
1. (5 pts) Using the notation y ∈ Rn for the vector of responses obtained by stacking the labels into a vector, and X ∈ Rn×p for the matrix of features, rewrite the objective function above in matrix-vector form and find a closed-form solution for w∗. Is the solution valid for n < p?
2. (5 pts) Show that the solution can be kernelized, i.e. that w∗ = Σ_{i=1:n} αi k(xi, ·) for some function k(x, ·) that you need to derive. The trick in the derivation is a matrix inverse identity: (A⁻¹ + BᵀC⁻¹B)⁻¹BᵀC⁻¹ = ABᵀ(BABᵀ + C)⁻¹. In your application, B = X, C = I, and A = λI. The point of using the identity is to convert your solution from its standard form into one that uses XXᵀ, an inner product matrix of size n × n (data point by data point). By applying the resulting solution to a new feature vector x, show that wᵀx can be written in kernel form as above.
3. (5 pts) Use the kernelization result to derive ridge regression for fitting polynomials to order m using a polynomial kernel function.
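A minimal numerical sketch of the kernelized solution these questions point toward, assuming the standard kernel-ridge form α = (K + λI)⁻¹y and a polynomial kernel k(x, z) = (1 + xᵀz)^m; the function names and toy data are mine:

```python
import numpy as np

def poly_kernel(A, B, m):
    """Polynomial kernel of order m: k(a, b) = (1 + a.b)^m, for rows of A and B."""
    return (1.0 + A @ B.T) ** m

def kernel_ridge_fit(X, y, lam, m):
    """Solve alpha = (K + lam I)^{-1} y with a polynomial kernel."""
    K = poly_kernel(X, X, m)
    return np.linalg.solve(K + lam * np.eye(len(y)), y)

def kernel_ridge_predict(X_train, alpha, X_new, m):
    """f(x) = sum_i alpha_i k(x_i, x) for each new point."""
    return poly_kernel(X_new, X_train, m) @ alpha

# toy 1-D example: fit a smooth curve with an order-3 polynomial kernel
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(50, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=50)

alpha = kernel_ridge_fit(X, y, lam=0.1, m=3)
y_hat = kernel_ridge_predict(X, alpha, X, m=3)
print("train MSE:", np.mean((y_hat - y) ** 2))
```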
1. (5 points) Which features are predictive? How can you tell? Explain your reasoning.
2. (10 points) Recode the features so that they could be used with naive Bayes or logistic regression. Using only the single best feature with naive Bayes, what should I do with a new data point, a cute puppy named Joey? Joey has 3 weeks of potty training, costs 0 dollars, has 0 instances of previous carpet damage, and is Brown/White. Adopt or no?
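A minimal sketch of one possible recoding for question 2, using hypothetical field names taken from Joey's description (weeks of potty training, cost, prior carpet damage, color); the thresholds and category levels are illustrative assumptions, not the intended answer:

```python
import numpy as np

# one raw record in the format the question describes (hypothetical field names)
joey = {"potty_weeks": 3, "cost_dollars": 0, "carpet_damage": 0, "color": "Brown/White"}

def recode(record, color_levels=("Brown/White", "Black", "Golden")):
    """Recode mixed-type features into numeric ones usable by naive Bayes
    or logistic regression: binarize the counts/amounts, one-hot the color."""
    x = [
        1.0 if record["potty_weeks"] >= 2 else 0.0,   # "house-trained" indicator (assumed threshold)
        1.0 if record["cost_dollars"] == 0 else 0.0,  # free-to-adopt indicator
        1.0 if record["carpet_damage"] > 0 else 0.0,  # any prior carpet damage
    ]
    x += [1.0 if record["color"] == c else 0.0 for c in color_levels]  # one-hot color
    return np.array(x)

print(recode(joey))   # recoded feature vector for Joey
```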