MSIN0010 Data Analytics I
Homework 3
1. Use the Credit data set from the ISLR package in R to answer the following
questions.
a. (2 points) Estimate a linear regression model with Balance as the response
variable and Income, Cards, Age, Student, and Education as
predictors. Include the model output below.
b. (2 points) Which variables significantly affect credit card balance at the 5%
level? (Note: you can ignore the intercept.)
c. (2 points) The p-value on Cards is equal to 0.0141. Explain what this number
represents.
d. (2 points) Use the rpart() function to estimate a regression tree that
relates credit card balance to the same set of predictors as part (a). Include a
picture of the estimated tree below.
e. (2 points) Based on your results in part (d), what variable is most important in
predicting credit card balance? Which variable(s), if any, were NOT used at all
in building the tree?
f. (2 points) Do the two models tend to agree or disagree on which variables
matter the most in predicting credit card balance?
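A minimal R sketch of the workflow in parts (a) and (d) might look like the following. It assumes the ISLR and rpart packages are installed. The object names (credit_formula, lm_fit, tree_fit) and the plotting options are illustrative choices, not something the assignment prescribes.

```r
# Same response and predictors for both models (parts a and d).
credit_formula <- Balance ~ Income + Cards + Age + Student + Education

# Guarded so the script does not fail if either package is missing.
if (requireNamespace("ISLR", quietly = TRUE) &&
    requireNamespace("rpart", quietly = TRUE)) {
  credit <- ISLR::Credit

  # Part (a): linear regression; summary() prints coefficients and p-values.
  lm_fit <- lm(credit_formula, data = credit)
  print(summary(lm_fit))

  # Part (d): regression tree on the same predictors (rpart defaults).
  tree_fit <- rpart::rpart(credit_formula, data = credit)
  plot(tree_fit, margin = 0.1)   # draw the tree skeleton
  text(tree_fit, use.n = TRUE)   # add split labels and node sizes
}
```

Printing tree_fit also lists the splits in text form, which can help when reading off which variables the tree used.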
2. In this question you will use your own simulated data to compare the flexibility of
linear regression and regression trees.
a. (3 points) Generate n=1000 observations from a linear regression model with
two variables (Y and X) where the true relationship between Y and X is
nonlinear. You can generate nonlinearities in a variety of ways, but maybe
the easiest way is to include an X^2 term in the model used to generate the
data (but feel free to use your own ideas here). It is up to you to choose true
parameter values and the distribution of the X variable. Assume that the
error term in your model follows a standard normal distribution. Include your
R code used to generate the data below.
b. (2 points) Make a scatterplot of Y against X to illustrate this nonlinear relationship
in your data.
c. (5 points) Fit a linear model (without any higher-order terms) and a
regression tree to your data. Then calculate the in-sample RMSE for
each model. Comment on which model performs better and why.
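One possible way to set up question 2 in R is sketched below. The seed, the Uniform(-3, 3) distribution for X, and the coefficients 1, 2 and -1.5 are arbitrary assumptions made for illustration, since the assignment leaves these choices open. The helper rmse() is a hypothetical name, and the tree uses rpart's default settings.

```r
library(rpart)

set.seed(1)                      # assumed seed, for reproducibility only
n <- 1000
x <- runif(n, min = -3, max = 3) # assumed distribution of X
y <- 1 + 2 * x - 1.5 * x^2 + rnorm(n)  # assumed coefficients; standard normal error
sim <- data.frame(x = x, y = y)

# Part (b): scatterplot of Y against X.
plot(sim$x, sim$y, xlab = "X", ylab = "Y")

# Part (c): linear model without higher-order terms, and a regression tree.
lin_fit  <- lm(y ~ x, data = sim)
tree_fit <- rpart(y ~ x, data = sim)

# In-sample root mean squared error.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
rmse_lin  <- rmse(sim$y, predict(lin_fit,  newdata = sim))
rmse_tree <- rmse(sim$y, predict(tree_fit, newdata = sim))
print(c(linear = rmse_lin, tree = rmse_tree))
```

Different parameter values or a different source of nonlinearity will change the size of the gap between the two RMSE values, so the comparison should be interpreted for the data you actually generate.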
3. Consider the following data set on spam email classification. The spam variable
indicates whether the email is actually spam or not, n_foreign reports the
number of foreign characters used in the message, and cap_run_length counts
the longest string of capital letters used in the message.
ID  spam  n_foreign  cap_run_length
1   yes   10         15
2   no    2          4
3   no    5          8
4   yes   4          10
5   no    4          2
a. (6 points) Suppose you observed the following data on a new email.
ID  spam  n_foreign  cap_run_length
6   ?     6          10
Use the 1-nearest-neighbor algorithm (WITHOUT rescaling the data) to
classify this email as spam or not. Show your work.
b. (2 points) Now use your work from part (a) to find the 3-nearest-neighbors
classification of the new email (using a majority voting function). What is the
predicted probability that the new email is spam? Is your prediction different
from the 1-nearest-neighbor classification in part (a)?
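The hand calculation in question 3 can be cross-checked with a short R sketch. It assumes Euclidean distance on the raw, unscaled features; the assignment does not name a distance metric, so confirm which one your course uses. The function knn_spam_share() is a hypothetical helper that returns the share of "yes" votes among the k nearest emails.

```r
# Training emails from the table in question 3.
emails <- data.frame(
  id = 1:5,
  spam = c("yes", "no", "no", "yes", "no"),
  n_foreign = c(10, 2, 5, 4, 4),
  cap_run_length = c(15, 4, 8, 10, 2),
  stringsAsFactors = FALSE
)
new_email <- c(n_foreign = 6, cap_run_length = 10)

# Euclidean distance from each training email to the new email (assumed metric).
emails$distance <- sqrt(
  (emails$n_foreign - new_email[["n_foreign"]])^2 +
  (emails$cap_run_length - new_email[["cap_run_length"]])^2
)

# Share of spam votes among the k nearest neighbors.
knn_spam_share <- function(data, k) {
  nearest <- data[order(data$distance), ][seq_len(k), ]
  mean(nearest$spam == "yes")
}

# Emails ranked from nearest to farthest.
ranked <- emails[order(emails$distance), c("id", "spam", "distance")]
print(ranked)
print(knn_spam_share(emails, k = 1))
print(knn_spam_share(emails, k = 3))
```

Under majority voting, the email is classified as spam when the returned share exceeds 0.5.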
4. Consider the following credit card transaction data set. There are three variables:
fraud indicates whether the transaction was fraudulent or not, amount indicates
whether the transaction amount was high or low, and location indicates whether
the transaction was made domestically or in a foreign country.
ID  fraud  amount  location
1   no     high    domestic
2   yes    high    foreign
3   no     low     foreign
4   no     high    domestic
5   yes    low     foreign
6   no     high    domestic
7   no     low     foreign
8   no     low     domestic
9   no     high    foreign
10  yes    high    foreign
a. (6 points) Determine whether a classification tree should first split on
amount or location.
b. (1 point) Draw the decision tree that corresponds to your analysis in part (a).
Make sure that each terminal node includes a prediction of whether fraud
equals yes or no.
c. (3 points) Suppose that the final model is the classification tree you defined
above in parts (a) and (b). What are the model’s in-sample error rate and its
in-sample false positive and false negative rates?
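For part (a) of question 4, one way to compare candidate first splits is the weighted impurity of the resulting child nodes. The R sketch below assumes the Gini index as the impurity measure; the assignment does not state a criterion, and your course may use entropy or the misclassification rate instead. The helpers gini() and weighted_gini() are hypothetical names.

```r
# Transactions from the table in question 4.
transactions <- data.frame(
  id = 1:10,
  fraud = c("no", "yes", "no", "no", "yes", "no", "no", "no", "no", "yes"),
  amount = c("high", "high", "low", "high", "low",
             "high", "low", "low", "high", "high"),
  location = c("domestic", "foreign", "foreign", "domestic", "foreign",
               "domestic", "foreign", "domestic", "foreign", "foreign"),
  stringsAsFactors = FALSE
)

# Gini impurity of a vector of class labels (assumed criterion).
gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

# Impurity of a split: child-node Gini values weighted by node size.
weighted_gini <- function(data, split_var, target = "fraud") {
  groups  <- split(data[[target]], data[[split_var]])
  weights <- sapply(groups, length) / nrow(data)
  sum(weights * sapply(groups, gini))
}

gini_amount   <- weighted_gini(transactions, "amount")
gini_location <- weighted_gini(transactions, "location")
print(c(amount = gini_amount, location = gini_location))
```

Under this criterion, the split with the lower weighted impurity is preferred.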