STAT 4620/5620
Assignment

1. The following questions explore the fundamentals of nonparametric statistics:
   (a) [3] Describe smoothing and give two examples of popular smoothers.
   (b) [2] Consider the generalized additive model (GAM) framework. What is the most significant departure from the GLM framework?
   (c) [3] Explain how model estimation proceeds for GAMs.
   (d) [4] Suppose that you find yourself in a situation where both a GLM and a GAM initially seem appropriate for your data. Explain the criteria you would use to determine which of the two methods to recommend.

2. This question re-examines the hubble data.
   (a) [6] Fit the model $V_i = f(D_i) + \epsilon_i$ to the Hubble data, where $f$ is a smooth function and the $\epsilon_i$ are i.i.d. $N(0, \sigma^2)$. Does a straight line model appear to be most appropriate? How would you interpret the best fitting model?
   (b) [4] Examine appropriate residual plots and refit the model with more appropriate distributional assumptions. How are your conclusions from part (a) modified?

3. [5] Read and provide a one-page summary of the lme4 documentation.

4. The data frame Gun (library nlme) is from a trial examining methods for firing naval guns. Two firing methods were compared, with each of a number of teams of 3 gunners; the gunners in each team were matched to have similar physique (Slight, Average, Heavy). The response variable rounds is rounds fired per minute, and there are 3 explanatory factor variables: Physique (levels Slight, Average and Heavy), Method (levels M1 and M2) and Team with 9 levels. The main interest is in determining which method and/or physique results in the highest firing rate and in quantifying team-to-team variability in firing rate.
   (a) [2] Identify which factors should be treated as random and which as fixed in the analysis of these data.
   (b) [4] Write out a suitable mixed model as a starting point for the analysis of these data.
   (c) [6] Analyse the data using lme in order to answer the main questions of interest and report your conclusions.

5. The Carseats dataset from the R package ISLR is a simulated dataset of carseat sales at 400 different stores. Full information on the variables in this dataset can be found using help(Carseats) after loading the package.
   (a) [4] Create a new factor variable for the Carseats data representing whether or not Sales is greater than 8. Randomly split the dataset into a testing and a training set. On the training set, grow a classification tree using the R rpart package to classify whether a store had high carseat sales or not (Hint: Remove the Sales variable). Report the classification accuracy you got on the testing set and on the training set.
   (b) [4] Prune the tree you grew in part (a). Report the pruned tree's classification accuracy on the testing set and on the training set. Why might pruning have improved the classification accuracy on the testing set? Why might it have reduced accuracy on the training set?
   (c) [4] Grow a random forest using the randomForest package the same way you did the tree. Is performance on the testing set better than that of the classification trees? Why might that be the case?
   (d) [4] Briefly outline the similarities and differences between CARTs and random forests.
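
As a possible starting point for the data preparation described in question 5(a), the sketch below builds the binary factor and a random train/test split. It is only an illustration: the helper names prepare_split and accuracy, the new column name High, the 70/30 split and the seed are all assumptions, since the assignment does not specify them. Fitting the tree and the forest is left to you.

```r
# Sketch only: prepare_split(), accuracy(), the column name High,
# the 70/30 split and the seed are illustrative assumptions.

# Add a factor High ("Yes" if Sales > cutoff), drop Sales, and split at random.
prepare_split <- function(data, cutoff = 8, train_prop = 0.7, seed = 4620) {
  data$High <- factor(ifelse(data$Sales > cutoff, "Yes", "No"),
                      levels = c("No", "Yes"))
  data$Sales <- NULL                      # the hint: remove Sales before fitting
  set.seed(seed)                          # makes the random split reproducible
  n <- nrow(data)
  idx <- sample(n, size = round(train_prop * n))
  list(train = data[idx, , drop = FALSE],
       test  = data[-idx, , drop = FALSE])
}

# Proportion of correct classifications.
accuracy <- function(predicted, actual) mean(predicted == actual)

# Usage with the assignment's data, if the ISLR package is installed.
if (requireNamespace("ISLR", quietly = TRUE)) {
  parts <- prepare_split(ISLR::Carseats)
  print(table(parts$train$High))
}
```

The same accuracy helper can be applied to predictions on both parts$train and parts$test, which keeps the training and testing figures asked for in parts (a) to (c) directly comparable.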