STATS 369
Campus: City

STATISTICS
Data Science Practice
(Time allowed: TWO hours)

NOTE:
o Answer ALL questions.
o There are 105 marks in total.
o Calculators are permitted.
o This is a restricted book exam.
o Answer in clear and concise English.
o Good luck!

1. Suppose we have fit a regression tree model using the rpart library in R and get the output below from the plotcp function.

              CP  nsplit  rel error   xerror      xstd
     1  0.3361423       0    1.00000  1.00141  0.043826
     2  0.0620010       1    0.66386  0.67426  0.030258
     3  0.0489730       2    0.60186  0.61822  0.027830
     4  0.0210710       3    0.55288  0.59130  0.026281
     5  0.0097994       4    0.53181  0.57484  0.024669
     6  0.0085756       5    0.52201  0.57056  0.024497
     7  0.0069582       6    0.51344  0.56902  0.024453
     8  0.0037373       7    0.50648  0.55886  0.024046
     9  0.0026202       8    0.50274  0.56162  0.024111
    10  0.0025871      10    0.49750  0.57228  0.024891
    11  0.0024155      15    0.48457  0.58008  0.025106

(a) Briefly explain what each of the columns in the output refers to. (5 marks)

(b) Based on this output, what size model would you choose for prediction and why? (5 marks)

(c) Explain how random forests combine multiple trees to make predictions and list three advantages of using random forests over single trees. (5 marks)

(d) How do single trees deal with missing data and why does this not transfer directly to random forests? (3 marks)

(e) The XGBoost library also produces predictive models based on trees. Explain what the trees used in this framework are and how they are combined into a single predictor. (7 marks)

[Total: 25 marks]

2. Suppose we build a multilayer neural network on a data set of 28 (height) x 28 (width) x 3 (colours) images. The following R code describes the model built in keras.
    model <- keras_model_sequential() %>%
      layer_conv_2d(filters = 32, kernel_size = c(3,3), activation = 'relu',
                    input_shape = input_shape) %>%
      layer_conv_2d(filters = 64, kernel_size = c(3,3), activation = 'relu') %>%
      layer_max_pooling_2d(pool_size = c(2, 2)) %>%
      layer_dropout(rate = 0.25) %>%
      layer_flatten() %>%
      layer_dense(units = 128, activation = 'relu') %>%
      layer_dropout(rate = 0.5) %>%
      layer_dense(units = num_classes, activation = 'softmax')

(a) How many convolutional layers does the model have? What do the arguments filters = 64, kernel_size = c(3,3) mean? (5 marks)

(b) How many trainable parameters does the first layer have? Show your working. (5 marks)

(c) How many trainable parameters does the second layer have? Note that by default no padding is applied. Show your working. (10 marks)

(d) What does layer_max_pooling_2d do, and what does pool_size = c(2,2) mean? (3 marks)

(e) What does layer_dropout(rate = 0.5) do? Why do we need a dropout layer? (3 marks)

(f) What are the advantages of ReLU over sigmoid or hyperbolic tangent (tanh)? What are the potential issues with ReLU? (4 marks)

[Total: 30 marks]

3. Suppose a medical diagnostic test with sensitivity 90% and specificity 95% is applied to a sample of 10,000 patients to identify an associated fatal condition that occurs in only 1% of the target population.

(a) Show a complete confusion matrix based on the information provided. (5 marks)

(b) What are the positive predictive value (PPV) and negative predictive value (NPV) for this test? Show your working. (5 marks)

(c) Suppose you are a medical doctor. Based on all the numerical results here, how would you describe the accuracy of the test to your patient in one or two non-technical sentences? (5 marks)

(d) Would you use AUC (i.e. area under the ROC curve) as a metric of success here? Why or why not? (5 marks)

[Total: 20 marks]

4. You are tasked by the human resources (HR) manager of an artificial intelligence start-up to build a predictive model that automatically ranks job candidates on a scale of 1 to 5 stars. The available data set is a collection of cover letters and résumés submitted by applicants.

(a) Briefly describe how you would build such a model. Your response should cover the following aspects.

    (i) Thinking about the information that is commonly included in a résumé, describe what attributes can be generated from the text as predictors. (5 marks)

    (ii) Describe one predictive modelling approach we have discussed in the course that would be suitable for this task. Justify your choice. (10 marks)

    (iii) Briefly discuss the technical limitations of the model you chose in part (a)(ii), and the strategies to overcome them. (5 marks)

(b) Amazon, the e-commerce giant, has actually built an experimental AI hiring tool for technical positions such as software engineers and system architects. It was shut down after the company discovered that it consistently discriminated against women.

    (i) Why do you think the model produced a gender bias? (5 marks)

    (ii) How would you detect such a bias and reduce the risk of unethical use of this AI tool? (5 marks)

[Total: 30 marks]
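
Study note for Question 1(b): the sketch below, in base R, shows two common conventions for reading a tree size off a CP table. It is only an illustration, not a marking scheme. The data frame cp_table is a hypothetical name holding the figures transcribed from the question (the rel error column is left out because neither rule uses it), and the one-standard-error rule is an assumed convention that the question itself does not prescribe.

```r
# CP table transcribed from Question 1 (rows ordered by increasing nsplit).
cp_table <- data.frame(
  CP     = c(0.3361423, 0.0620010, 0.0489730, 0.0210710, 0.0097994, 0.0085756,
             0.0069582, 0.0037373, 0.0026202, 0.0025871, 0.0024155),
  nsplit = c(0, 1, 2, 3, 4, 5, 6, 7, 8, 10, 15),
  xerror = c(1.00141, 0.67426, 0.61822, 0.59130, 0.57484, 0.57056,
             0.56902, 0.55886, 0.56162, 0.57228, 0.58008),
  xstd   = c(0.043826, 0.030258, 0.027830, 0.026281, 0.024669, 0.024497,
             0.024453, 0.024046, 0.024111, 0.024891, 0.025106)
)

# Convention 1: the tree with the smallest cross-validated error.
best <- which.min(cp_table$xerror)

# Convention 2 (one-standard-error rule, an assumption here): the smallest
# tree whose xerror is within one xstd of that minimum.
threshold <- cp_table$xerror[best] + cp_table$xstd[best]
one_se <- min(which(cp_table$xerror <= threshold))

cat("Minimum xerror: nsplit =", cp_table$nsplit[best], "\n")
cat("One-SE rule:    nsplit =", cp_table$nsplit[one_se], "\n")
```

The two conventions can disagree, so an answer should state which one it uses and why.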
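
Study note for Question 2(b) and (c): the working can be checked with a few lines of base R. This is a minimal sketch that assumes input_shape is c(28, 28, 3) as described in the question, and that the convolutions use one bias per filter and a stride of 1 (the keras defaults). The helper names conv_params and valid_out are hypothetical and are not part of keras.

```r
# Trainable parameters of a 2D convolution: each filter has one weight per
# kernel position per input channel, plus one bias (assumes use_bias = TRUE).
conv_params <- function(kernel_h, kernel_w, channels_in, filters) {
  (kernel_h * kernel_w * channels_in + 1) * filters
}

# Output side length of an unpadded convolution with stride 1 (assumed).
valid_out <- function(size_in, kernel) size_in - kernel + 1

p1 <- conv_params(3, 3, channels_in = 3,  filters = 32)  # first conv layer
p2 <- conv_params(3, 3, channels_in = 32, filters = 64)  # second conv layer

s1 <- valid_out(28, 3)  # side length after the first conv layer
s2 <- valid_out(s1, 3)  # side length after the second conv layer

cat("Layer 1 parameters:", p1, " output:", s1, "x", s1, "x 32\n")
cat("Layer 2 parameters:", p2, " output:", s2, "x", s2, "x 64\n")
```

Note that the parameter count of a convolutional layer depends on the number of input channels and filters, not on the spatial size of the image; padding only affects the output dimensions.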
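
Study note for Question 3(a) and (b): the counts follow directly from the prevalence, sensitivity and specificity stated in the question. The base R sketch below is one way to lay out the arithmetic; the variable names are hypothetical, and the counts are rounded to whole patients.

```r
n           <- 10000  # patients tested
prevalence  <- 0.01   # proportion with the condition
sensitivity <- 0.90   # P(test positive | condition present)
specificity <- 0.95   # P(test negative | condition absent)

diseased <- round(n * prevalence)
healthy  <- n - diseased

tp <- round(sensitivity * diseased)  # true positives
fn <- diseased - tp                  # false negatives
tn <- round(specificity * healthy)   # true negatives
fp <- healthy - tn                   # false positives

# matrix() fills column by column: first the "present" column, then "absent".
confusion <- matrix(
  c(tp, fn, fp, tn), nrow = 2,
  dimnames = list(test = c("positive", "negative"),
                  condition = c("present", "absent"))
)
print(confusion)

ppv <- tp / (tp + fp)  # P(condition present | test positive)
npv <- tn / (tn + fn)  # P(condition absent  | test negative)
cat("PPV =", round(ppv, 3), " NPV =", round(npv, 3), "\n")
```

Because the condition is rare, false positives from the large healthy group outnumber true positives, which is why the PPV is far lower than the sensitivity.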