ISE529 HW3
Note: Submit the .Rmd file and the output PDF file on Blackboard before the deadline. We will grade the
last version you submit before the deadline.
Question 1
In this exercise, you will explore logistic regression. You will work with a synthetic dataset that simulates
customer data for a subscription service.
The goal is to predict whether a customer will subscribe (1) or not subscribe (0) to a service based on their
age and whether they are a student. The “student” variable is categorical (Yes/No) and will be used as an
indicator variable in your logistic regression model.
Tasks:
Note: For reproducibility, set the seed deliberately in your code for task 1. You can choose any number you
like but try one that is unique to you, such as the last four digits of your student ID.
1. Data Generation:
Create a synthetic dataset with 200 observations that include the following variables:
• age: An integer variable representing the customer’s age, randomly generated from a uniform distribution
between 18 and 80.
• student: A categorical variable (label: Yes/No) indicating whether the customer is a student, randomly
assigned.
• subscribe: A binary outcome variable (1/0) indicating whether the customer subscribes to the service.
You will generate this variable based on a logistic function that includes the effects of age and student
status, with some added random noise to simulate a realistic scenario.
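The logistic function has the form $\frac{1}{1+e^{-z}}$, as discussed in class, where
$z = \beta_0 + \beta_{\text{age}} \cdot \text{age} + \beta_{\text{student}} \cdot \text{student}$ is a linear
combination of the predictor variables, with $\beta_0 = -9$, $\beta_{\text{age}} = 0.02$, and
$\beta_{\text{student}} = 3.5$. Next, you need to generate the binary outcome for the logistic regression.
Hint: rbinom() may be useful.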
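A minimal sketch of the data-generation step, offered as one possible reading of the task rather than the expected solution. The seed value, the 50/50 student assignment, and the names `customers`, `z`, and `p` are assumptions made for illustration.

```r
set.seed(1234)  # assumption: placeholder seed; pick one unique to you

n <- 200
beta0 <- -9
beta_age <- 0.02
beta_student <- 3.5

# Integer ages drawn uniformly from 18 to 80 (inclusive)
age <- sample(18:80, n, replace = TRUE)

# Student status assigned at random; equal probabilities are an assumption
student <- factor(sample(c("No", "Yes"), n, replace = TRUE),
                  levels = c("No", "Yes"))

# Linear predictor, with student entering as a 0/1 indicator
z <- beta0 + beta_age * age + beta_student * (student == "Yes")

# Logistic transform; plogis(z) would give the same values
p <- 1 / (1 + exp(-z))

# One Bernoulli draw per customer supplies the random noise
subscribe <- rbinom(n, size = 1, prob = p)

customers <- data.frame(age, student, subscribe)
str(customers)
```

With the stated coefficients, z is at most -9 + 0.02 * 80 + 3.5 = -3.9, so every probability is below about 0.02 and most simulated outcomes will be 0.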
2. Logistic Regression Model:
Fit a logistic regression model to predict the probability of subscription (subscribe) using age and
student as predictors. Note: make sure you handle student as an indicator variable.
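As a hedged illustration of the fitting step, the sketch below regenerates a dataset under the same assumptions as Task 1 (placeholder seed, equal student probabilities) and fits the model with `glm()`. Because `student` is stored as a factor, `glm()` creates the indicator column itself, with "No" as the reference level.

```r
set.seed(1234)  # assumption: placeholder seed

# Compact re-creation of the Task 1 data (same assumptions as above)
n <- 200
age <- sample(18:80, n, replace = TRUE)
student <- factor(sample(c("No", "Yes"), n, replace = TRUE),
                  levels = c("No", "Yes"))
subscribe <- rbinom(n, 1, plogis(-9 + 0.02 * age + 3.5 * (student == "Yes")))
customers <- data.frame(age, student, subscribe)

# Logistic regression; the factor is dummy-coded as the column "studentYes"
fit <- glm(subscribe ~ age + student, data = customers, family = binomial)

print(coef(fit))       # coefficients on the log-odds scale
print(exp(coef(fit)))  # the same coefficients expressed as odds ratios
```

Since subscriptions are rare under the stated coefficients, `glm()` may print convergence or fitted-probability warnings for some seeds; those are worth noting in the interpretation rather than suppressing.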
3. Model Interpretation:
Interpret the coefficients of your model: discuss how age and student status influence the subscription
likelihood.
4. Model Evaluation:
Evaluate your model using at least two metrics: accuracy, confusion matrix, ROC curve, etc. Interpret
your model performance based on these metrics.
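The sketch below shows one way to compute two of the listed metrics, a confusion matrix and accuracy, using base R only. The 0.5 classification cutoff, the seed, and the object names are assumptions; an ROC curve would need additional code or a package and is not shown.

```r
set.seed(1234)  # assumption: placeholder seed

# Compact re-creation of the Task 1 data and Task 2 model
n <- 200
age <- sample(18:80, n, replace = TRUE)
student <- factor(sample(c("No", "Yes"), n, replace = TRUE),
                  levels = c("No", "Yes"))
subscribe <- rbinom(n, 1, plogis(-9 + 0.02 * age + 3.5 * (student == "Yes")))
customers <- data.frame(age, student, subscribe)
fit <- glm(subscribe ~ age + student, data = customers, family = binomial)

# Predicted probabilities on the data used for fitting
prob_hat <- predict(fit, type = "response")

# Convert probabilities to class labels; the 0.5 cutoff is an assumption
pred <- as.integer(prob_hat > 0.5)

# Fixing the levels keeps the table 2 x 2 even if one class is never predicted
conf_mat <- table(Predicted = factor(pred, levels = 0:1),
                  Actual = factor(customers$subscribe, levels = 0:1))
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)

print(conf_mat)
print(accuracy)
```

Because the outcome is rare here, a model that predicts 0 for every customer can still reach high accuracy, so the confusion matrix is likely to be more informative than accuracy alone.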
5. Discussion:
Discuss the implications of using indicator variables in logistic regression models. How do they affect
the interpretation of model coefficients?
Question 2
In this exercise, you will explore the K-Nearest Neighbors (KNN) algorithm, a fundamental technique in
statistical Machine Learning for classification problems. You will generate synthetic data to simulate a 2-class
classification problem, apply KNN using different values of K, and analyze how the choice of K affects the
model’s decision boundary, misclassification rate, and the concepts of bias and variance. Finally, you will split
your data into training and test sets to identify an optimal K and discuss the implications of this method for
model selection.
Tasks:
Note: For reproducibility, set the seed as in the previous question.
1. Data Generation:
• Generate 100 random data points to simulate a 2-class classification problem, with 50 points for each
class. Use a normal distribution N(0,1) for the first class and N(2,1) for the second class.
• Note: each random data point generated should be (x,y) indicating the position of a point in the 2D
coordinate plane. Both classes share the same standard deviation but differ in their means.
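One possible way to generate these points is sketched below. It assumes that both coordinates of a point are drawn independently from the class distribution, and the class labels "A" and "B", the seed, and the name `points2d` are hypothetical.

```r
set.seed(1234)  # assumption: placeholder seed

n_per_class <- 50

# First class: both coordinates from N(0, 1)
class_a <- data.frame(x = rnorm(n_per_class, mean = 0, sd = 1),
                      y = rnorm(n_per_class, mean = 0, sd = 1),
                      label = "A")

# Second class: both coordinates from N(2, 1)
class_b <- data.frame(x = rnorm(n_per_class, mean = 2, sd = 1),
                      y = rnorm(n_per_class, mean = 2, sd = 1),
                      label = "B")

points2d <- rbind(class_a, class_b)
points2d$label <- factor(points2d$label)

print(table(points2d$label))
```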
2. KNN Fitting and Misclassification Rate Calculation:
• Fit the KNN classifier to your data using several different values of K = (3, 5, 7, 9).
• For each K, calculate the misclassification rate by comparing the predicted class labels against the true
labels.
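A minimal sketch of this step using `knn()` from the `class` package, which ships with standard R installations. Choosing that package is an assumption, as are the seed and the labels "A" and "B". The rates below are computed on the same points used for fitting, so they are training error rates.

```r
library(class)

set.seed(1234)  # assumption: placeholder seed

# 50 points per class; each row is an (x, y) pair
xy <- rbind(matrix(rnorm(100, mean = 0, sd = 1), ncol = 2),
            matrix(rnorm(100, mean = 2, sd = 1), ncol = 2))
label <- factor(rep(c("A", "B"), each = 50))

ks <- c(3, 5, 7, 9)

# Predict the training points themselves and compare with the true labels
misclass <- sapply(ks, function(k) {
  pred <- knn(train = xy, test = xy, cl = label, k = k)
  mean(pred != label)
})
names(misclass) <- paste0("K=", ks)

print(misclass)
```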
3. Decision Boundary Plotting:
• Using base R plotting functions, plot the decision boundary for each K. Use different colors to represent
the different classes.
• Evaluate how the decision boundary changes with different values of K.
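One common way to draw a KNN decision boundary in base R is to classify a dense grid of points and contour the result. The sketch below follows that approach; the grid resolution, colors, 0.5 margin, and the use of `class::knn()` are assumptions.

```r
library(class)

set.seed(1234)  # assumption: placeholder seed

xy <- rbind(matrix(rnorm(100, mean = 0, sd = 1), ncol = 2),
            matrix(rnorm(100, mean = 2, sd = 1), ncol = 2))
label <- factor(rep(c("A", "B"), each = 50))

# Grid covering the data with a small margin
gx <- seq(min(xy[, 1]) - 0.5, max(xy[, 1]) + 0.5, length.out = 100)
gy <- seq(min(xy[, 2]) - 0.5, max(xy[, 2]) + 0.5, length.out = 100)
grid <- expand.grid(x = gx, y = gy)  # x varies fastest

op <- par(mfrow = c(2, 2))
for (k in c(3, 5, 7, 9)) {
  grid_pred <- knn(train = xy, test = grid, cl = label, k = k)

  # Rows index gx and columns index gy, as contour() expects
  zmat <- matrix(as.integer(grid_pred == "B"), nrow = length(gx))

  plot(xy, col = ifelse(label == "A", "blue", "red"), pch = 19,
       xlab = "x", ylab = "y", main = paste("K =", k))

  # The 0.5 contour separates the region predicted "A" from the region predicted "B"
  contour(gx, gy, zmat, levels = 0.5, add = TRUE, drawlabels = FALSE)
}
par(op)
```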
4. Variance and Bias Discussion:
• Discuss the concepts of variance and bias in the context of KNN. How do these concepts relate to the
choice of K?
5. Model Selection with Test/Train Split:
• Randomly split your data into a training set (70%) and a test set (30%).
• Refit the KNN model using the training data and evaluate the performance on the test data to choose
an optimal K. Carefully explain your steps.
• Discuss any potential issues with this method of selecting K.
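The split-and-select procedure could be sketched as follows. The candidate values of K are taken from Task 2, while the seed, the labels, and the use of `class::knn()` are assumptions; ties in test error are broken in favor of the smallest K by `which.min()`.

```r
library(class)

set.seed(1234)  # assumption: placeholder seed

xy <- rbind(matrix(rnorm(100, mean = 0, sd = 1), ncol = 2),
            matrix(rnorm(100, mean = 2, sd = 1), ncol = 2))
label <- factor(rep(c("A", "B"), each = 50))

# Random 70/30 split of the row indices
n <- nrow(xy)
train_idx <- sample(seq_len(n), size = round(0.7 * n))

ks <- c(3, 5, 7, 9)

# Fit on the training rows, score on the held-out rows
test_err <- sapply(ks, function(k) {
  pred <- knn(train = xy[train_idx, ], test = xy[-train_idx, ],
              cl = label[train_idx], k = k)
  mean(pred != label[-train_idx])
})
names(test_err) <- paste0("K=", ks)

best_k <- ks[which.min(test_err)]

print(test_err)
print(best_k)
```

With only 30 test points, the selected K can change from one seed to another, which is worth checking by rerunning the split a few times.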