STAT 310
1 Reading and Material

• Read Chapter 7 of the text.

For full credit, all plots need titles, and the x and y axes, as well as any color or fill variables, must be clearly labeled with words, not variable names. See the in-class code for syntax help. For this week, you should use MA county variables 2020.RData and MA COVID19 22 03 01.RData.

Hint: because you are using LaTeX to create a .pdf output, you can use LaTeX mathematical expressions in the text parts of your .Rmd file and have them render as mathematical expressions. A math expression must be preceded and followed by $. For example, $\hat{y}=3$ renders as ŷ = 3.

2 Homework Questions

For this week’s homework, you will use several of the variables you created last week. Before starting these parts, I suggest writing a block of R code that recreates all the new continuous and categorical variables you created last week. Save them all in ma extra so you can use them for the later parts. You will probably want to keep this block of code and re-run it every week. Note that some of these values are stable week-to-week and could be saved in a file (e.g. percent over 85), but some of them will change each week with the new COVID numbers (e.g. deaths per 100000).

1. In last week’s homework, you modeled y = deaths per 100000 for Massachusetts counties as a function of x = people per Housing. You looked at the residuals and thought about why there might be some extreme values. This week, you will expand this model with additional covariates. In this question, we will add a categorical predictor, region.
   (a) Conduct EDA with the predictors people per Housing and region (look at the variables, summary statistics, and a plot). For the plot, color the points by region and include the interaction regression lines. Do you think the two predictor variables are related to each other?
   (b) Fit an interaction model with people per Housing and region as predictors.
       Write out the expression for the fitted model, using PPH to represent people per Housing and 1( ) to represent an indicator variable. Interpret each coefficient of the fitted model.
   (c) Write out the linear expression for the fitted line for each region. (It is OK to round to the nearest whole number for this and the following regression lines.)
   (d) One of the intercepts is much higher than the others. Does that mean this region has many more COVID deaths? If not, why is the intercept so large? Be sure to look at the plot as you answer this.
   (e) Now plot the scatter plot with parallel slopes fitted lines. Does it look much different from the previous plot?
   (f) Fit a parallel slopes model for the same data. Write out the expression for the fitted model.
   (g) Write out the linear expression for the fitted line for each region. Do you notice any major changes from the previous fit?
   (h) (EXTRA CREDIT points) Use the residuals from each of the model fits above to compute the sum of squared errors (SSE) for each model fit, as well as for the fits from last week with each of these two predictors alone (last week’s questions 3d and 4c). Use the knitr::kable function to neatly display these 4 numbers.
   (i) (EXTRA CREDIT points) A smaller SSE means that more of the variation in the data is explained by the model (i.e., less variation is left unexplained as error; the residuals are smaller). Which single variable explains more variation, region only or people per housing? Does adding the second variable substantially improve the model fit? Does the interaction model substantially improve the model fit beyond the parallel slopes model fit? Based on the principle of Occam’s Razor (simpler is better if it works), which model would you prefer for these data?

2. In this question, we will add other predictors to last week’s model of deaths per 100000 by people per Housing.
   (a) Look at the raw data and summary statistics for deaths per 100000 as well as the predictors percent poverty, people per housing, median income, and median age.
   (b) Use the cor() function to examine the pairwise correlations between all pairs of these variables in one line of code. Which predictors are most strongly correlated with each other?
   (c) Plot deaths per 100000 against each of the predictors in the previous part. Show the fitted regression line for each. Display the plots with grid.arrange().
   (d) Fit a regression model for deaths per 100000 as a function of median age. Interpret the slope of this line. Is the intercept meaningful in this model?
   (e) Now fit a regression model for deaths per 100000 as a function of median age and people per housing. Interpret the three parameters of this model.
   (f) In the previous two parts, the fitted coefficients of median age have opposite signs. How is this possible?
   (g) Now fit a model for deaths per 100000 as a function of median age, people per housing, and percent poverty. Interpret the coefficient of median age in this model.

3. Your book claims that the correlation coefficient is invariant to transformation. Verify this by computing the correlation coefficients between people per Housing and each of deaths per capita and deaths per 100000. Are these values the same?

4. Your book discusses Simpson’s Paradox, an important apparent paradox in statistics.
   (a) Briefly describe Simpson’s Paradox.
   (b) Find an example of Simpson’s Paradox in your homework above.
   (c) Search the internet and find an example of Simpson’s Paradox. Show the data here, and briefly describe the situation and why it is a paradox.

3 Pre-Lecture Check

• Complete this week’s pre-lecture check about the new reading directly on Gradescope.

4 Project Work

In this week’s project work, you will gather data on your state. Keep track of key references and use them in your report as well as in your homework.
Also, keep links to helpful websites. You may want to visit them again to grab data or check your findings.

1. Download the New York Times daily county-level COVID-19 data for your state and create tibbles for your state like the ones we used for Massachusetts. Use glimpse() to verify that your data are reasonable. How many counties are there in your state?

2. Compile an initial set of county-level covariates of interest for your state. You may find it useful to consult the file on Moodle in which I summarized the data sources for the Massachusetts data we are using. You may find as many covariates as you like, as long as you satisfy the following:
   • Find at least 5 continuous and 1 categorical covariates, and at least 7 total covariates.
   • Find at least one covariate that I did not include in the Massachusetts data we have been using.
   Your covariates should include the most important features of differences across your state’s counties as revealed in last week’s homework. You may create your own variables too, for example by assigning or selecting state regions that make sense based on a map of state features, or based on which counties contain a large city, an airport, or a certain type of industry. Note that making new variables will probably take a bit more time than you think.

For this assignment, provide univariate plots (histograms or bar charts) and summary statistics for each variable using skim_without_charts(), and a brief description of the meaning of each variable. Turn in one submission per team, including the names of all team members, on Gradescope as a group assignment.
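
As a starting point for step 1 of the project work, here is a minimal sketch of how the state tibbles might be built. It is not the in-class code. It assumes the readr and dplyr packages, and it assumes the county file in the public nytimes/covid-19-data GitHub repository with columns date, county, state, fips, cases, and deaths; the course may point you to a different file. The names nyt_url, toy_csv, covid_all, my_state, state_daily, and state_latest are hypothetical, and the toy rows are made up so that the sketch runs without a download.

```r
library(readr)
library(dplyr)

# Assumed location of the NYT county-level file (check it before relying on it).
nyt_url <- "https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv"

# Made-up rows in the same column layout, used only so the sketch runs offline.
toy_csv <- I("date,county,state,fips,cases,deaths
2022-02-28,Addison,Vermont,50001,10,1
2022-03-01,Addison,Vermont,50001,12,1
2022-03-01,Bennington,Vermont,50003,20,2
2022-03-01,Berkshire,Massachusetts,25003,30,3
")

# Read fips as text so that leading zeros are not dropped.
covid_all <- read_csv(toy_csv, col_types = cols(fips = col_character()))
# For the real data, read from the URL instead:
# covid_all <- read_csv(nyt_url, col_types = cols(fips = col_character()))

my_state <- "Vermont"  # hypothetical choice; replace with your state

# Daily cumulative counts for every county in the chosen state.
state_daily <- covid_all %>%
  filter(state == my_state)

# One row per county: the most recent cumulative counts.
state_latest <- state_daily %>%
  group_by(county) %>%
  filter(date == max(date)) %>%
  ungroup()

glimpse(state_daily)              # check column types and values
n_distinct(state_daily$county)    # number of distinct county names
```

When you run this on the real file, compare the county count against an official list for your state, since the data may contain rows that are not true counties (for example, a county recorded as Unknown).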