ISE529 HW3
Note: Submit the .Rmd file and the output PDF file on Blackboard before the deadline. We will grade the last work you submitted before the deadline.

Question 1

In this exercise, you will explore logistic regression. You will work with a synthetic dataset that simulates customer data for a subscription service. The goal is to predict whether a customer will subscribe (1) or not subscribe (0) to a service based on their age and whether they are a student. The “student” variable is categorical (Yes/No) and will be used as an indicator variable in your logistic regression model.

Tasks:

Note: For reproducibility, set the seed explicitly in your code for task 1. You can choose any number you like, but try one that is unique to you, such as the last four digits of your student ID.

1. Data Generation: Create a synthetic dataset with 200 observations that includes the following variables:
• age: An integer variable representing the customer’s age, randomly generated from a uniform distribution between 18 and 80.
• student: A categorical variable (label: Yes/No) indicating whether the customer is a student, randomly assigned.
• subscribe: A binary outcome variable (1/0) indicating whether the customer subscribes to the service. You will generate this variable based on a logistic function that includes the effects of age and student status, with some added random noise to simulate a realistic scenario. The logistic function is of the form $\frac{1}{1 + e^{-z}}$, as discussed in class, where $z$ is a linear combination of the predictor variables with $\beta_0 = -9$, $\beta_{\text{age}} = 0.02$, and $\beta_{\text{student}} = 3.5$. Next, generate the binary outcome for the logistic regression. Hint: rbinom() may be useful.

2. Logistic Regression Model: Fit a logistic regression model to predict the probability of subscription (subscribe) using age and student as predictors. Note: make sure you handle student as an indicator variable.

3. Model Interpretation: Interpret the coefficients of your model: discuss how age and student status influence the subscription likelihood.

4. Model Evaluation: Evaluate your model using at least two metrics, such as accuracy, a confusion matrix, or an ROC curve. Interpret your model's performance based on these metrics.

5. Discussion: Discuss the implications of using indicator variables in logistic regression models. How do they affect the interpretation of model coefficients?

Question 2

In this exercise, you will explore the K-Nearest Neighbors (KNN) algorithm, a fundamental technique in statistical machine learning for classification problems. You will generate synthetic data to simulate a 2-class classification problem, apply KNN using different values of K, and analyze how the choice of K affects the model’s decision boundary, misclassification rate, and the concepts of bias and variance. Finally, you will split your data into training and test sets to identify an optimal K and discuss the implications of this method for model selection.

Tasks:

Note: For reproducibility, set the seed as in the previous question.

1. Data Generation:
• Generate 100 random data points to simulate a 2-class classification problem, with 50 points for each class. Use a normal distribution N(0,1) for the first class and N(2,1) for the second class.
• Note: each random data point generated should be a pair (x,y) giving the position of a point in the 2D coordinate plane. Both classes share the same standard deviation but differ in their means.

2. KNN Fitting and Misclassification Rate Calculation:
• Fit the KNN classifier to your data using several different values of K = (3, 5, 7, 9).
• For each K, calculate the misclassification rate by comparing the predicted class labels against the true labels.

3. Decision Boundary Plotting:
• Using base R plotting functions, plot the decision boundary for each K. Use different colors to represent the different classes.
• Evaluate how the decision boundary changes with different values of K.

4. Variance and Bias Discussion:
• Discuss the concepts of variance and bias in the context of KNN. How do these concepts relate to the choice of K?

5. Model Selection with Test/Train Split:
• Randomly split your data into a training set (70%) and a test set (30%).
• Refit the KNN model using the training data and evaluate the performance on the test data to choose an optimal K. Carefully explain your steps.
• Discuss any potential issues with this method of selecting K.
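
For the data-generation step in Question 1 (task 1), the following is one possible sketch, not a reference solution. The seed value, the object names (`customers`, `beta_age`, and so on), and the 50/50 chance of being a student are assumptions made for illustration; only the sample size, the age range, and the three coefficients come from the assignment text.

```r
# Sketch only: seed, names, and the 50/50 student split are illustrative assumptions.
set.seed(1234)  # assumed seed; the assignment asks for one unique to you

n <- 200

# Integer ages drawn uniformly from 18..80 (inclusive)
age <- sample(18:80, size = n, replace = TRUE)

# Student status assigned at random, stored as a factor with "No" as baseline
student <- factor(sample(c("No", "Yes"), size = n, replace = TRUE),
                  levels = c("No", "Yes"))

# Linear predictor z using the coefficients stated in the assignment
beta0 <- -9
beta_age <- 0.02
beta_student <- 3.5
z <- beta0 + beta_age * age + beta_student * (student == "Yes")

# Logistic function maps z to a probability strictly between 0 and 1
p <- 1 / (1 + exp(-z))

# One Bernoulli draw per customer supplies the random noise
subscribe <- rbinom(n, size = 1, prob = p)

customers <- data.frame(age, student, subscribe)
```

Keeping `student` as a factor means a later call such as `glm(subscribe ~ age + student, family = binomial, data = customers)` builds the Yes/No indicator automatically. Note that with the stated coefficients the largest possible z is -9 + 0.02 × 80 + 3.5 = -3.9, so p stays below about 0.02 and subscriptions will be rare in the simulated data.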
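
For the KNN fitting and train/test steps in Question 2 (tasks 2 and 5), a minimal sketch could look like the following. It assumes the `class` package for `knn()` (the assignment does not name a package), an arbitrary seed, and that the stated means apply to both the x and y coordinates; the object names are hypothetical.

```r
# Sketch only: package choice, seed, and object names are illustrative assumptions.
library(class)  # provides knn()

set.seed(1234)  # assumed seed; reuse the one chosen for Question 1

# 50 points per class in 2D; both coordinates use mean 0 (class 0) or mean 2 (class 1)
n_per_class <- 50
x <- c(rnorm(n_per_class, mean = 0, sd = 1), rnorm(n_per_class, mean = 2, sd = 1))
y <- c(rnorm(n_per_class, mean = 0, sd = 1), rnorm(n_per_class, mean = 2, sd = 1))
points_xy <- cbind(x = x, y = y)
label <- factor(rep(c(0, 1), each = n_per_class))

# Random 70/30 split into training and test indices
train_idx <- sample(seq_len(nrow(points_xy)), size = 70)
test_idx <- setdiff(seq_len(nrow(points_xy)), train_idx)

# Test-set misclassification rate for each candidate K
k_values <- c(3, 5, 7, 9)
test_error <- sapply(k_values, function(k) {
  pred <- knn(train = points_xy[train_idx, ], test = points_xy[test_idx, ],
              cl = label[train_idx], k = k)
  mean(pred != label[test_idx])
})
names(test_error) <- paste0("K=", k_values)
print(test_error)
```

Passing the full data as both `train` and `test` would instead give the training misclassification rate asked for in task 2. With only 30 test points, each misclassified point moves the error rate by about 3.3 percentage points, so the ranking of K values can change from one random split to another.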
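
For the decision-boundary plot in Question 2 (task 3), one common approach is to classify every point on a fine grid and draw the contour where the predicted class changes. The sketch below assumes the `class` package, an arbitrary seed, a 100 × 100 grid, and K = 5 as a single example value; all names are hypothetical.

```r
# Sketch only: package choice, seed, grid size, and K are illustrative assumptions.
library(class)  # provides knn()

set.seed(1234)  # assumed seed

# Same synthetic setup as described in task 1: 50 points per class in 2D
train_xy <- cbind(x = c(rnorm(50, 0, 1), rnorm(50, 2, 1)),
                  y = c(rnorm(50, 0, 1), rnorm(50, 2, 1)))
label <- factor(rep(c(0, 1), each = 50))

# Regular grid covering the data with a small margin
gx <- seq(min(train_xy[, "x"]) - 0.5, max(train_xy[, "x"]) + 0.5, length.out = 100)
gy <- seq(min(train_xy[, "y"]) - 0.5, max(train_xy[, "y"]) + 0.5, length.out = 100)
grid <- as.matrix(expand.grid(x = gx, y = gy))

# Predict the class of every grid point
k <- 5
grid_pred <- knn(train = train_xy, test = grid, cl = label, k = k)

# expand.grid varies x fastest, so rows of this matrix follow gx and columns follow gy
z <- matrix(as.integer(grid_pred == "1"), nrow = length(gx), ncol = length(gy))

# Base R plot: points colored by class, boundary drawn where the prediction flips
plot(train_xy, col = ifelse(label == "1", "red", "blue"), pch = 19,
     main = paste("KNN decision boundary, K =", k))
contour(gx, gy, z, levels = 0.5, add = TRUE, drawlabels = FALSE)
```

Wrapping the prediction and plotting lines in a loop over K = 3, 5, 7, 9 (for example with `par(mfrow = c(2, 2))`) would place the four boundaries side by side for comparison.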