STAT 310
1 Reading and Material
• Read Chapter 7 of the text.
For full credit, all plots need titles and labeled x and y axes, and any color or fill variables must be clearly
named with words, not variable names. See the in-class code for syntax help.
For this week, you should use MA county variables 2020.RData and MA COVID19 22 03 01.RData.
Hint: because you are using LaTeX to create a .pdf output, you can use LaTeX mathematical expressions in the text parts
of your .Rmd file and have them render as mathematical expressions. A math expression must be preceded and followed by $.
For example $\hat{y}=3$ renders as ŷ = 3.
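As a further illustration of the hint, a fitted model with an indicator variable could be typeset roughly as follows. This is only a sketch of the notation: the symbols b_0 through b_3 and the level name A are placeholders, not values or levels taken from the homework data.

```latex
$$\hat{y} = b_0 + b_1\,\mathrm{PPH} + b_2\,\mathbf{1}(\mathrm{region} = \mathrm{A})
          + b_3\,\mathrm{PPH} \times \mathbf{1}(\mathrm{region} = \mathrm{A})$$
```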
2 Homework Questions
For this week’s homework, you will use several of the variables you created last week. Before executing these parts, I suggest
creating a block of R code to create all the new continuous and categorical variables you created last week. Save them all
in ma extra so you can use them for the later parts. You will probably want to keep this block of code and re-run it every
week. Note that some of these values are stable week-to-week and could be saved in a file (e.g. percent over 85), but some
of them will change each week with the new COVID numbers (e.g. deaths per 100000).
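The suggested setup block might look something like the sketch below. The data frame ma_toy, its column names, the cutoff of 100000, and every number in it are made up for illustration; they are not the course data, so substitute your own objects and names.

```r
# Toy stand-in for the county data: column names and numbers are made up
# for illustration and are NOT the course data.
ma_toy <- data.frame(
  county     = c("A", "B", "C"),
  population = c(200000, 50000, 800000),
  deaths     = c(400, 150, 2000),
  housing    = c(90000, 30000, 320000)
)

# Derived continuous variables.
ma_toy$deaths_per_100000  <- ma_toy$deaths / ma_toy$population * 100000
ma_toy$people_per_housing <- ma_toy$population / ma_toy$housing

# Derived categorical variable (the cutoff of 100000 is arbitrary).
ma_toy$size <- ifelse(ma_toy$population > 100000, "large", "small")

# Variables that are stable week to week could be saved once and reloaded.
stable_file <- tempfile(fileext = ".rds")
saveRDS(ma_toy[, c("county", "people_per_housing", "size")], stable_file)
stable <- readRDS(stable_file)
```

Variables that depend on the weekly COVID counts, such as the death rate here, would still be recomputed each week rather than saved.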
1. In last week’s homework, you modeled y = deaths per 100000 for Massachusetts counties as a function of x = people per Housing.
You looked at the residuals and thought about why there might be some extreme values. This week, you will expand
this model with additional covariates. In this question, we will add a categorical predictor, region.
(a) Conduct EDA with the predictors people per Housing and region (look at the variables, summary statistics, and a
plot). For the plot, color the points by region and overlay the interaction regression lines.
Do you think the two predictor variables are related to each other?
(b) Fit an interaction model with people per Housing and region as predictors. Write out the expression for the fitted
model using PPH to represent people per Housing, and 1( ) to represent an indicator variable. Interpret each
coefficient of the fitted model.
(c) Write out the linear expression for the fitted line for each region. (It is OK to round to the nearest whole number
for this and the following regression lines.)
(d) One of the intercepts is much higher than the others. Does that mean this region has many more COVID deaths?
If not, why is the intercept so large? Be sure to look at the plot as you answer this.
(e) Now draw the scatter plot with the fitted parallel slopes lines. Does it look much different from the previous plot?
(f) Fit a parallel slopes model for the same data. Write out the expression for the fitted model.
(g) Write out the linear expression for the fitted line for each region. Do you notice any major changes from the previous
fit?
(h) (EXTRA CREDIT points) Use the residuals from each of the model fits above to compute the sum of squared
errors (SSE) for each model fit, as well as for the fits from last week with each of these two predictors alone (last
week’s questions 3d and 4c). Use the knitr::kable function to neatly display these 4 numbers.
(i) (EXTRA CREDIT points) A smaller SSE means that more of the variation in the data is explained by the model (i.e.,
less variation is left unexplained as error, so the residuals are smaller). Which single variable explains more variation,
Region only or people per housing? Does adding the second variable substantially improve the model fit? Does the
interaction model substantially improve the model fit beyond the parallel slopes model fit? Based on the principle
of Occam’s Razor (simpler is better if it works), which model would you prefer for these data?
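The four model fits in question 1 and their SSE values could be set up along the lines of the sketch below. It uses simulated stand-in data: the names toy, PPH, deaths, and the three region labels are assumptions for illustration, and the numbers say nothing about the real Massachusetts counties.

```r
set.seed(310)
# Simulated stand-in data: names and numbers are made up, not the course data.
n <- 30
toy <- data.frame(
  PPH    = runif(n, 2, 3),
  region = factor(rep(c("East", "Central", "West"), each = n / 3))
)
toy$deaths <- 100 + 50 * toy$PPH + 20 * (toy$region == "West") + rnorm(n, sd = 10)

fit_pph      <- lm(deaths ~ PPH, data = toy)           # continuous predictor only
fit_region   <- lm(deaths ~ region, data = toy)        # categorical predictor only
fit_parallel <- lm(deaths ~ PPH + region, data = toy)  # parallel slopes
fit_inter    <- lm(deaths ~ PPH * region, data = toy)  # interaction

# Sum of squared errors computed from the residuals of a fitted model.
sse <- function(fit) sum(residuals(fit)^2)

sse_table <- data.frame(
  model = c("PPH only", "region only", "parallel slopes", "interaction"),
  SSE   = c(sse(fit_pph), sse(fit_region), sse(fit_parallel), sse(fit_inter))
)

# In an .Rmd file, knitr::kable(sse_table, digits = 1) would display this neatly.
print(sse_table)
```

Because each simpler model is nested inside the more complex ones, the SSE can only stay the same or decrease as terms are added; the question is whether the decrease is large enough to justify the extra terms.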
2. In this question, we will add other predictors to last week’s model of deaths per 100000 by people per Housing.
(a) Look at the raw data and summary statistics for deaths per 100000 as well as predictors percent poverty, people
per housing, median income, and median age.
(b) Use the cor() function to examine the pairwise correlations between all pairs of these variables in one
line of code. Which predictors are most strongly correlated with each other?
(c) Plot deaths per 100000 against each of the predictors in the previous part. Show the fitted regression lines for
each. Display the plots with grid.arrange().
(d) Fit a regression model modeling deaths per 100000 as a function of median age. Interpret the slope of this line. Is
the intercept meaningful in this model?
(e) Now fit a regression model modeling deaths per 100000 as a function of median age and people per housing.
Interpret the three parameters of this model.
(f) In the previous two parts, the fitted coefficients of median age have opposite signs. How is this possible?
(g) Now fit a model for deaths per 100000 as a function of median age, people per housing, and percent poverty.
Interpret the coefficient of median age in this model.
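For question 2, the one-line correlation matrix and the sequence of regression fits might be sketched as follows. The data are simulated, and the data frame toy and its column names are assumptions chosen to resemble the variables named above; the output has no bearing on the real answers.

```r
set.seed(310)
# Simulated stand-in data: names and numbers are made up, not the course data.
n <- 14
toy <- data.frame(
  median_age         = rnorm(n, 42, 4),
  people_per_housing = rnorm(n, 2.4, 0.2),
  percent_poverty    = rnorm(n, 10, 3)
)
toy$deaths_per_100000 <- 300 + 2 * toy$median_age +
  40 * toy$people_per_housing + rnorm(n, sd = 15)

# All pairwise correlations in one call: cor() accepts a numeric data frame.
cor_mat <- cor(toy)
print(round(cor_mat, 2))

# One predictor, then the same response with a second predictor added.
fit_one <- lm(deaths_per_100000 ~ median_age, data = toy)
fit_two <- lm(deaths_per_100000 ~ median_age + people_per_housing, data = toy)

# The median_age coefficient generally changes once another predictor is included.
print(coef(fit_one))
print(coef(fit_two))
```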
3. Your book claims that the correlation coefficient is invariant to a change of units (a linear rescaling). Verify this by computing the correlation
coefficients between people per Housing and each of deaths per capita and deaths per 100000. Are these values the
same?
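The kind of check asked for in question 3 can be sketched with a few made-up numbers. The vectors below are illustrative only; the same comparison would be run on the actual county variables.

```r
# Made-up numbers, not the course data.
x <- c(2.1, 2.4, 2.6, 2.3, 2.8)
deaths_per_capita <- c(0.0021, 0.0030, 0.0027, 0.0019, 0.0035)
deaths_per_100000 <- deaths_per_capita * 100000  # same quantity, different units

r_capita <- cor(x, deaths_per_capita)
r_100000 <- cor(x, deaths_per_100000)

# all.equal() compares the two values up to floating-point tolerance.
print(isTRUE(all.equal(r_capita, r_100000)))
```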
4. Your book discusses Simpson’s Paradox, an important apparent paradox in statistics.
(a) Briefly describe Simpson’s Paradox.
(b) Find an example of Simpson’s Paradox in your homework above.
(c) Search the internet. Find an example of Simpson’s paradox. Show the data here, and briefly describe the situation
and why it is a paradox.
3 Pre-Lecture Check
• Complete this week’s pre-lecture check about the new reading directly on Gradescope.
4 Project Work
In this week’s project work, you will gather data on your state.
Keep track of key references and use them in your report as well as in your homework. Also, keep links to helpful
websites. You may want to visit them again to grab data or check your findings.
1. Download the New York Times daily county-level COVID-19 data for your state and create tibbles like the ones we
used for Massachusetts for your state. Use glimpse() to verify that your data are reasonable. How many counties
are there in your state?
2. Compile an initial set of county-level covariates of interest for your state. The file on Moodle in which
I summarized the data sources for the Massachusetts data we are using may be useful. You may find as many covariates
as you like, as long as you satisfy the following:
• Find at least 5 continuous and 1 categorical covariates, and at least 7 total covariates.
• Find at least one covariate that I did not include in the Massachusetts data we have been using.
Your covariates should include the most important features of differences across your state’s counties as revealed in
last week’s homework. You may create your own variables too, for example by assigning or selecting state regions
that make sense based on a map of state features, or based on which counties contain a large city, an airport, or a
certain type of industry. Note that making new variables will probably take a bit more time than you think. For
this assignment, provide univariate plots (histograms or bar charts) and summary statistics for each variable using
skim_without_charts(), and a brief description of the meaning of each variable.
Turn in one submission per team, including the names of all team members, on Gradescope as a group assignment.
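The first project step, reading the county-level file and keeping one state, might be sketched as below. The inline text is a made-up stand-in that assumes the column layout date, county, state, fips, cases, deaths; "StateX" and the county names are hypothetical. With the real download, read.csv() would be pointed at the saved file instead of the text argument.

```r
# Made-up stand-in rows; the column layout is an assumption for illustration.
csv_text <- "date,county,state,fips,cases,deaths
2022-03-01,Alpha,StateX,1,100,2
2022-03-01,Beta,StateX,2,250,5
2022-03-02,Alpha,StateX,1,104,2
2022-03-01,Gamma,StateY,3,300,6"

covid <- read.csv(text = csv_text, stringsAsFactors = FALSE)
covid$date <- as.Date(covid$date)

# Keep only the rows for the state of interest.
my_state <- covid[covid$state == "StateX", ]

# str() is the base-R analogue of glimpse() for checking column types.
str(my_state)

# Number of distinct counties in the state.
n_counties <- length(unique(my_state$county))
print(n_counties)
```

Converting my_state with tibble::as_tibble() and calling glimpse() would give the tidyverse view used for the Massachusetts data.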