Department of Statistics
STATS 330: Advanced Statistical Modelling Assignment 4
Total: 60 marks    Due: 11:59pm, 14th October, 2021

Notes:
(i) Write your assignment using R Markdown. Knit your report to either a Word or PDF document.
(ii) Create a section for each question. Include all relevant code and output in the final document.
(iii) Five presentation marks are available. Please think of your markers - keep your code and plots neat and remember to check your spelling. (R Markdown has an inbuilt spellchecker!)
(iv) Please remember to upload your Word or PDF document to CANVAS by the due date.

In this assignment, we will examine modelling from a number of perspectives, all of which should be considered when deciding how to model data:
• inspecting the data and making allowances for other background events pertinent to these data,
• looking at how well these models fit your data,
• seeing how well the model predicts,
• using all of the above ideas to decide which model you are most comfortable with.

Question 1

In this question, we will revisit the caffeine data from Assignment 3 - Question 3 via bootstrap re-sampling.

Caffeine peak data

Re-read the section describing these data in Assignment 3 - Question 3, if necessary.

(a) Obtain 1000 non-parametric bootstrap estimates of xpeak from the data provided in Assignment 3 - Question 3 (referred to as Caffeine2.df). Plot a histogram of these estimates. Comment briefly. [5 marks]

Note: You will need the original estimated xpeak value calculated in Assignment 3 - Question 3. To do this non-parametrically, instead of using modelled probabilities, use the observed probabilities from your data. An explanation by example: at the 0mg caffeine data point we observed y = 109 successes out of 300 trials. That is a grouping of a vector containing 109 1s and 191 0s. A random sample (with replacement) of 300 from a vector like this is equivalent to a random binomial sample Y ~ Binomial(n = 300, p = p̂ = 109/300).
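The resampling described in the note might be sketched as below. This is a minimal sketch only: the column names (`dose`, `y`, `n`) and the quadratic-logit form giving xpeak = -b1/(2*b2) are assumptions based on Assignment 3 - Question 3. Only the 0mg row of the toy data frame comes from the text above; the other rows are placeholders, so substitute your real Caffeine2.df.

```r
# Toy stand-in for Caffeine2.df: only the 0mg row (109 successes out of 300)
# is from the assignment text; the other rows are placeholders.
Caffeine2.df <- data.frame(dose = c(0, 100, 200, 300),
                           y    = c(109, 165, 180, 135),
                           n    = rep(300, 4))

set.seed(330)
xpeak.boot <- replicate(1000, {
  # Resample each dose group using its *observed* proportion y/n,
  # i.e. Y* ~ Binomial(n, y/n) for each row
  y.star <- rbinom(nrow(Caffeine2.df), Caffeine2.df$n,
                   Caffeine2.df$y / Caffeine2.df$n)
  # Refit the (assumed) quadratic logistic model to the resampled counts
  fit <- glm(cbind(y.star, Caffeine2.df$n - y.star) ~ dose + I(dose^2),
             family = binomial, data = Caffeine2.df)
  -coef(fit)[2] / (2 * coef(fit)[3])   # xpeak = -b1/(2*b2)
})

hist(xpeak.boot, xlab = "xpeak", main = "Bootstrap estimates of xpeak")
quantile(xpeak.boot, c(0.025, 0.975))  # percentile CI, useful for part (b)
```

The final `quantile()` call gives a simple percentile bootstrap CI, one common choice for part (b).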
(b) Calculate a bootstrap confidence interval (CI) for xpeak based on these bootstrap samples. [5 marks]

(c) How does this CI compare to the CI you obtained using simulation in Assignment 3 - Question 3? [5 marks]

Question 2

The following example is based on a classic machine learning data set. The US post office wants to be able to electronically scan hand-written numbers and predict what was written for the ZIP code. A ZIP code tells them where a mail item needs to be delivered. We will look at the numbers 3 and 7 only, to see if we can distinguish these two numbers, and so illustrate ideas about prediction based on logistic regression.

In this assignment, you will be using logistic regression in the context of optical character recognition. Each line of the data set is a digital representation of a scanned hand-written digit, originally part of a ZIP code handwritten on a US letter. Each hand-written digit is represented as a 16 x 16 array of pixels, as in the examples below.

[Figure: two example 16 x 16 pixel images, one of a hand-written 3 and one of a hand-written 7.]

Each pixel is given a grey-scale value in the range [-1, 1], with -1 representing white and 1 representing black. There are thus 16 x 16 = 256 numbers representing a particular digit, which we can take as the values of 256 variables, v1, . . . , v256, say.
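A small toy illustration (our own, not assignment code) of how a 16 x 16 grey-scale matrix, flattened row by row, yields the 256 variables v1, . . . , v256:

```r
# Fake 16 x 16 "digit" with grey-scale values in [-1, 1]
pix <- matrix(runif(256, -1, 1), nrow = 16, byrow = TRUE)
# Flatten row by row: v[1] is pixel (1,1), v[17] is pixel (2,1), etc.
v <- as.vector(t(pix))
length(v)   # 256 values, all in [-1, 1]
```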
The relationship between the pixels and the variables is as follows (row by row across the 16 x 16 grid):

v1 v2 v3 v4 v5 v6 v7 v8 v9 v10 v11 v12 v13 v14 v15 v16
v17 v18 v19 v20 v21 v22 v23 v24 v25 v26 v27 v28 v29 v30 v31 v32
v33 v34 v35 v36 v37 v38 v39 v40 v41 v42 v43 v44 v45 v46 v47 v48
v49 v50 v51 v52 v53 v54 v55 v56 v57 v58 v59 v60 v61 v62 v63 v64
v65 v66 v67 v68 v69 v70 v71 v72 v73 v74 v75 v76 v77 v78 v79 v80
v81 v82 v83 v84 v85 v86 v87 v88 v89 v90 v91 v92 v93 v94 v95 v96
v97 v98 v99 v100 v101 v102 v103 v104 v105 v106 v107 v108 v109 v110 v111 v112
v113 v114 v115 v116 v117 v118 v119 v120 v121 v122 v123 v124 v125 v126 v127 v128
v129 v130 v131 v132 v133 v134 v135 v136 v137 v138 v139 v140 v141 v142 v143 v144
v145 v146 v147 v148 v149 v150 v151 v152 v153 v154 v155 v156 v157 v158 v159 v160
v161 v162 v163 v164 v165 v166 v167 v168 v169 v170 v171 v172 v173 v174 v175 v176
v177 v178 v179 v180 v181 v182 v183 v184 v185 v186 v187 v188 v189 v190 v191 v192
v193 v194 v195 v196 v197 v198 v199 v200 v201 v202 v203 v204 v205 v206 v207 v208
v209 v210 v211 v212 v213 v214 v215 v216 v217 v218 v219 v220 v221 v222 v223 v224
v225 v226 v227 v228 v229 v230 v231 v232 v233 v234 v235 v236 v237 v238 v239 v240
v241 v242 v243 v244 v245 v246 v247 v248 v249 v250 v251 v252 v253 v254 v255 v256

(a) Use the code below to import this training data. Check that there are no missing values, that the variable D only has the values 3 and 7, and that the variables V1, V2, . . . , V256 have values between -1 and 1. [5 marks]

train.df = read.table("train.txt")
names(train.df) = c("D", paste("V", 1:256, sep=""))

(b) The code below lets us look at the first 25 samples of handwritten 3s and 7s. Submit this code and comment on which of the 256 cells you think would be best at discriminating between a 3 and a 7. Comment briefly.
[5 marks]

par(mfrow=c(5,5), mar = c(1,1,1,1))
for(k in 1:25){
  z = matrix(unlist(train.df[k,-1]), 16, 16)
  zz = z
  for(j in 16:1) zz[,j] = z[,17-j]
  image(zz, col = gray((32:0)/32))
  box()
  text(0.1, 0.9, train.df$D[k], cex=1.5)
}

In order to make the number of variables manageable, we will get you to select those of the variables V1, V2, . . . , V256 that have the highest correlation (in absolute terms - large negative or large positive values) with the response variable D.

(c) Compute the correlation between D and each of V1, V2, . . . , V256 and identify which of these variables have the 20 highest absolute correlations (i.e. either large and negative or large and positive). [5 marks]

Hint: The R functions names, sort and abs may be useful here.

(d) Fit a logistic model to the data, using the 20 variables you identified above. Call the model object Full.mod for future reference. (The regression will not converge if you use all 256 variables.) Calculate the fitted logits for each of these hand-written digits. [5 marks]

Hint: Append a binary variable Y to the data frame train.df as follows:

# recode 7 as 1, 3 as 0
train.df$Y = ifelse(train.df$D==7, 1, 0)

(e) Use your fitted model to predict whether a digit is a 3 or a 7 on the basis of the 20 variables. (Predict a digit to be a 7 if the fitted probability of a 7 is more than 0.5 or, equivalently, if the corresponding logit is positive.) Evaluate the in-sample prediction error (using the same data set both to fit the model and to evaluate the error). [5 marks]

(f) Use the following code to do a step-wise variable selection process to choose a more parsimonious submodel of the 20-variable model. What is the in-sample prediction error (PE) for this more parsimonious model, and how does it compare to the PE for Full.mod calculated above?
[5 marks]

null.model = glm(Y~1, data=train.df, family=binomial)
#step(Full.mod, direction = "backward")
step(null.model, formula(Full.mod), direction="both", trace=0)

(g) Use cross-validation to estimate the prediction error rate (as a %) for this reduced model. Comment briefly.
Reminder: You will need to create a function called, say, PE.fun to do this. [5 marks]

(h) Use the data set test.txt to calculate an estimate of the "out-of-sample" prediction error for the model used above. Here, we use the original data set to fit the model and the new data to calculate the prediction error. Comment briefly on what this may indicate. [5 marks]

(i) Challenge: Can you improve upon the prediction errors discussed above? [Only glory - 0 marks]
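As a starting point, the selection, fitting and cross-validation steps in parts (c), (e) and (g) might be sketched as below. PE.fun is the name suggested in (g); the helper names select.top and cv.PE are our own, and we assume train.df carries the binary response Y and the pixel variables V1, . . . , V256 created above.

```r
# Pick the k pixel variables with the highest absolute correlation with Y
select.top <- function(df, k = 20) {
  vnames <- grep("^V", names(df), value = TRUE)
  cors   <- sapply(df[vnames], function(v) cor(df$Y, v))
  names(sort(abs(cors), decreasing = TRUE))[1:k]
}

# Fit on a training fold and return the misclassification rate on a test fold
PE.fun <- function(formula, train, test) {
  mod <- glm(formula, family = binomial, data = train)
  p   <- predict(mod, newdata = test, type = "response")
  mean((p > 0.5) != test$Y)          # predict a 7 when fitted P(7) > 0.5
}

# K-fold cross-validated prediction error
cv.PE <- function(formula, df, K = 10) {
  fold <- sample(rep(1:K, length.out = nrow(df)))
  mean(sapply(1:K, function(k)
    PE.fun(formula, df[fold != k, ], df[fold == k, ])))
}

# Usage, e.g.:
# top20    <- select.top(train.df)
# Full.mod <- glm(as.formula(paste("Y ~", paste(top20, collapse = "+"))),
#                 family = binomial, data = train.df)
# in.PE    <- mean((predict(Full.mod) > 0) != train.df$Y)  # logit > 0 => 7
# 100 * cv.PE(formula(Full.mod), train.df)                 # CV error as a %
```

The in-sample error in the usage comment reuses the full data for both fitting and evaluation, which is exactly why the cross-validated and test.txt estimates in (g) and (h) are usually larger.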