STAT 8310 - Data Analysis I
Final - Fall 2020
Instructions: Provide answers to each of the four questions below. Points for each problem are specified.
• For each part of each problem, write your answers in complete sentences, providing context and
statistical justification.
• Do not include R code in your solutions; rather, use R as a means for doing your analysis.
• You are not allowed to discuss any part of these analyses with anyone other than me. Contact me if
you have questions.
1. Nine well-trained cyclists participated in a study in which each was given three doses of caffeine
(0, 5, and 13 mg) and their endurance performance time was measured. The data are given in caffeineCycling.Rdata.
(a) (5 points) Using nonparametric methods, determine whether there is a significant difference in
endurance performance between the 0 and 5mg caffeine dosages. Provide a written summary of
your results along with statistical support to justify your recommendation.
(b) (5 points) Assuming you are interested in comparing all 3 dosages of caffeine, what type of model/test
would you suggest to your collaborators to investigate the relationship between caffeine and
endurance performance? [You don’t need to do the modeling/testing.]
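The data files named in this exam, such as caffeineCycling.Rdata, are R workspace files, and the names of the objects stored inside them are not stated here. A minimal sketch, assuming the file sits in the working directory, of loading such a file into its own environment so its contents can be inspected first; the helper name `load_rdata` is hypothetical and not part of the course materials:

```r
# Hypothetical helper (not part of the exam): load an .Rdata file into a
# separate environment and return its objects as a named list.
load_rdata <- function(path) {
  env <- new.env()
  object_names <- load(path, envir = env)  # load() returns the restored object names
  mget(object_names, envir = env)          # collect those objects in a named list
}

# Example usage, assuming the file is in the working directory:
# objs <- load_rdata("caffeineCycling.Rdata")
# names(objs)      # which objects the file contains
# str(objs[[1]])   # structure of the first object
```

Loading into a separate environment avoids overwriting objects that already exist in the workspace.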
2. As you may have seen in the news, both Moderna and Pfizer/BioNTech have developed vaccines
for the coronavirus with efficacy of approximately 90-95%. It has been reported recently
(https://science.sciencemag.org/content/370/6520/1022) that these vaccines have some side effects,
including headache and fatigue. The following table provides information about the results of
the clinical trials of both Moderna and Pfizer/BioNTech. Note that both clinical trials administered
the vaccine to half of the participants, while the other half received a placebo. The trials were
double-blind, meaning that neither the physician nor the participant knew whether the participant
received the vaccine or the placebo.
                                                                      Moderna    Pfizer/BioNTech
Total number of participants in the clinical trial                    30,000     43,000
Proportion of vaccine recipients reporting fatigue                    0.097      0.038
Proportion of vaccine recipients reporting headache                   0.045      0.020
Proportion of vaccine recipients reporting both headache & fatigue    0.024      0.017
(a) (5 points) Compute a 95% confidence interval for the risk of fatigue from the Moderna vaccine.
(b) (5 points) Perform a statistical test to determine whether there is a difference in the risk of headache
between the two vaccines.
(c) (5 points) Suppose that, with a type I error rate of α = 0.05, you want to simultaneously test the
hypotheses that the risk of having both a headache and fatigue for the Moderna vaccine is different
from 0.02 and that the risk of having both a headache and fatigue for the Pfizer/BioNTech vaccine is
different from 0.02. What is the power of each test?
(d) (5 points) Pooling the data from both Moderna and Pfizer/BioNTech, test whether headache and
fatigue are independent side effects of the vaccine. Provide statistical support to justify your answer.
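Because the table reports total enrolment and proportions among vaccine recipients, the vaccine-arm sample sizes have to be derived before any interval or test is computed. A minimal sketch of that bookkeeping, assuming the "half of the participants" split is exact; the variable names are illustrative only:

```r
# Total enrolment from the table; half of each trial received the vaccine
# (assumed here to be an exact 50/50 split).
total <- c(Moderna = 30000, "Pfizer/BioNTech" = 43000)
n_vaccine <- total / 2

# Reported proportions among vaccine recipients, copied from the table.
props <- rbind(
  fatigue  = c(Moderna = 0.097, "Pfizer/BioNTech" = 0.038),
  headache = c(Moderna = 0.045, "Pfizer/BioNTech" = 0.020),
  both     = c(Moderna = 0.024, "Pfizer/BioNTech" = 0.017)
)

# Approximate event counts: multiply each column by its vaccine-arm size.
# The proportions are rounded, so the counts need not be whole numbers.
counts <- sweep(props, 2, n_vaccine, "*")

n_vaccine
counts
```

These sample sizes and approximate counts are the inputs that standard one- and two-sample proportion procedures require.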
3. (30 points) Consider the data lakeNitrogen.Rdata, which are observations taken at 200 lakes across
the northeastern part of the United States. The data, which come from a research project I am currently
working on, consist of measures of total nitrogen in the lake (measured in micrograms per liter) as well
as important explanatory variables of nitrogen (e.g., land use, lake characteristics). Specifics about the
variables are given in the table below. The researcher is interested in determining what variables are
important for explaining total nitrogen in lakes.
Variable       Description
TN             total nitrogen in the lake
Baseflow       measure of stream flow between precipitation events
NO3Depo        nitrate deposited into the lake from the atmosphere, usually through precipitation
TotalDepo      total nitrogen deposited into the lake from the atmosphere, usually through precipitation
Runoff         measure of precipitation that flows off the surface of the land into the lake
Urban          percent of area around the lake classified as urban
Rowcrop        percent of area around the lake classified as rowcrop (agricultural land)
Pasture        percent of area around the lake classified as pasture
Forest         percent of area around the lake classified as forest
Wetland        percent of area around the lake classified as wetland
LakeArea       area of the lake
MaxDepth       maximum depth of the lake
Connectivity   categorical variable for whether the lake is located downriver from a lake & stream,
               located downriver from a stream, or an isolated/headwater lake
LWR            lake-to-watershed ratio, the proportion of the watershed that the lake makes up
Your task is to serve as the data analyst on the project. Build a regression model for total nitrogen using
the techniques we learned in class. Prepare a final report outlining the steps of your analysis. Include
a detailed summary of your final model providing statistical justification for your choice. Make sure to
report the fitted regression equation. Your model must fit the data well, be interpretable, parsimonious,
and not violate any model assumption. I would encourage you to do a fair amount of exploratory data
analysis to get a feel for the different variables before you begin building your model.
4. The dataset training.Rdata contains training session data for four collegiate athletes. In particular,
the dataset includes the rate of perceived effort (RPE) for each athlete for 16 different training
sessions. RPE is a measure that each athlete assigns to a training session based on their perceived
level of difficulty. Also included in the data are a predefined ordinal intensity for each training session
(low=L1, low/moderate=L2, moderate/high=L3, and high=L4) and the position played (F=forward,
M=midfield, D=defense).
(a) (5 points) Disregard the athlete variable to start. Is there a significant difference in RPE for the
different intensities of training session? Is there a significant difference in RPE for the different
positions? Summarize your results in context providing statistical justification.
(b) (5 points) Consider an appropriate two-factor fixed effects model for RPE with intensity and
position as fixed factors. Write out the model explicitly in terms of the variables/parameters. Make
sure to define all terms.
(c) (5 points) Fit the two-factor fixed effects model in (b) and interpret your results.
(d) (5 points) For the two-factor fixed effects model in (b), compare the three positions in terms of
the linear and quadratic contrasts of intensity. Summarize your findings in context and provide
statistical justification.
(e) (5 points) Returning to the full data, notice that 4 athletes were observed in these data. That is, each
athlete was subjected to each combination of training intensity and position. One could consider
the 12 intensity × position combinations as 12 different treatments. Explain why we should include
athlete as a “block” in our model.
(f) (5 points) Discuss under what scenarios “athlete” should be considered a fixed block effect versus
a random block effect. Is randomization necessary in the experimental design? Why or why not?
(g) (5 points) Assume a mixed effects model with athlete as a random block effect and intensity,
position, and their interaction as fixed effects. Let $Y_{ijk}$ be the response for athlete $i$, intensity $j$,
position $k$. What are the following:
• $\operatorname{cov}(Y_{111}, Y_{122})$
• $\operatorname{cov}(Y_{111}, Y_{211})$
• $\operatorname{var}(Y_{111})$
(h) (5 points) Fit the mixed effects model with athlete as a random block effect and intensity, position,
and their interaction as fixed effects. Report the efficiency gain of including the random block effect.
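As background for the linear and quadratic contrasts of intensity mentioned in part (d), the sketch below shows the orthogonal polynomial contrast coefficients that base R assigns to a four-level ordered factor. It assumes the levels L1 to L4 are treated as equally spaced, which the question does not state; the object names are illustrative:

```r
# Intensity as an ordered factor with the four levels named in the question.
intensity <- factor(c("L1", "L2", "L3", "L4"),
                    levels = c("L1", "L2", "L3", "L4"), ordered = TRUE)

# Orthogonal polynomial contrast coefficients for four equally spaced levels.
# Columns are named .L (linear), .Q (quadratic), and .C (cubic).
poly_coefs <- contr.poly(nlevels(intensity))
rownames(poly_coefs) <- levels(intensity)

# Keep only the linear and quadratic columns.
poly_coefs[, c(".L", ".Q")]
```

The linear column is proportional to (-3, -1, 1, 3) and the quadratic column to (1, -1, -1, 1), each scaled to unit length.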