Hello, dear friend, you can consult us at any time if you have any questions, add WeChat: THEend8_
STA238 Embedded R chunks
In this R Lab, you will be guided by your TA on how to adapt bootstrapping and simulation to construct
confidence intervals and perform hypothesis tests on two-sample data when normality assumptions fail. At
this point, it is assumed that you will have decent familiarity in creating histograms using ggplot to assess
distribution shape. You are expected to have access to R + R Studio + LaTeX in some capacity by this
tutorial, whether it be locally installed or accessed through JupyterHub.
Note: You are expected to join your tutorial meeting with your U of T licensed Office 365
account. Your TAs will not be letting guests join the meeting!
R Lab Learning Goals
• Review: Using filter in dplyr (a package already contained in tidyverse) to filter data in dataframe
according to some criteria.
• Review: Using histograms to gauge normality, and a visual check on equal-variance assumption.
• Use empirical bootstrap to generate confidence intervals for the difference in population means (for
simplicity, you will use the percentile method instead of the studentized method).
• Use permutation test to test for equal population means under assumptions of equal variance.
Use the tasks in this R lab to complete the exercises at the end of this document.
Some additional resources you may find helpful as you learn is the R Markdown Cheatsheet and the LaTeX
Cheatsheet. As you work more in LaTeX, the notation will get easier and become second nature!
Submission Instructions
Your submission should be completed in a NEW R Markdown file that is set to ‘knit to pdf’ that contains
only the exercise portion found at the end of this document. Your final submission should include:
• The original questions (copy and paste is fine)
• Embedded R chunks
• Corresponding outputs for the exercises. Note that you must not include a printout of large data sets!
• Required graphs should include informative and clearly communicated axis labels and titles. Adjust
your R chunk options to produce reasonably sized graphs in your output
• Discussion questions should be answered using proper punctuation and grammar, to the best of your
abilities. Treat these responses as a formal write-up, even if it is short!
• You must submit both your .rmd and knitted .pdf file on Quercus for your current tutorial assignment
as it appears on Quercus.
• The grading rubric for the R Lab is available on your R Lab # 4 submission page.
1
-1. Note
The presentation of this tutorial is adapted from section 10.6 of Modern Mathematical Statistics with
Applications. Refer to this if you are interested in the material presented here as well as non-parametric
approaches to inference on paired data and variances!
0. Motivation
The data we will be working with today is from a study reported in A Longitudinal Study of the Development
of Elementary School Children’s Private Speech. In this study, researchers recorded instances where children
were engaged in private speech during a randomly chosen 10 second interval. Private speech refers to speech
spoken aloud that is either addressed to oneself or to no particular listener. Many such intervals were observed
for each child, and the percentage of intervals where private speech occurred for each child was recorded.
One of the outcomes of interest for this study and the focus of our tutorial is the following research question:
Does the amount of private speech differ between boys and girls?
To put this question into a mathematical context, let (X1, ...Xm) and (Y1, ..., Yn) denote the percentage of
private speech observed for boys and girls, respectively. Furthermore, assume that each observation Xi comes
from a population with mean µ1 and Yj from a population with mean µ2. This problem can then be viewed
as that of making inferences about differences in means µ1 − µ2, or determining whether boys and girls have
different engagements in private speech.
1. Exploratory data analysis
To begin exploring this question, let us first load in the study data and the relevant libraries for our analysis.
library(tidyverse)
library(latex2exp)
library(gridExtra)
# Load in the speech data
# Remember to set your working directory and ensure data files are stored in the same location
speech <- read.csv("speech.csv")
Next we will plot the data for boys and girls separately to get an idea of their distribution. For the purposes
of this analysis, the labels boy and girl refer to a child being labeled as male or female in the data set.
###########################################
### Remove the eval=F from the R chunk! ###
###########################################
# Plot the samples for boys and girls using gridExtra
# Density histograms to compare the density histogram of the two samples
plot.boys <- speech %>%
## INPUT ## %>%
ggplot(aes(x = percentage, y =..density..)) + #density histogram
geom_histogram(binwidth = 0.5, center = 0.25, #bin aesthetics
color = "black",
fill = "thistle2") +
labs(title = "Private speech: Boys",
x = "Percentage",
y = "Density") +
xlim(c(0,30)) #common range
2
plot.girls <- speech %>%
## INPUT ## %>%
ggplot(aes(x = percentage, y =..density..)) +
geom_histogram(binwidth = 0.5, center = 0.25,
color = "black",
fill ="thistle2") +
labs(title = "Private speech: Girls",
x = "Percentage",
y = "Density") +
xlim(c(0,30))
grid.arrange(plot.boys, plot.girls)
From these plots it is clear that the data is not normal and with the limited data size, any confidence intervals
or hypothesis tests based on the assumption of normality will be misleading. In such situations, we typically
relax the assumption of normal data and turn to non-parametric approaches to form inferences about the
quantities of interest.
In the following sections, we will demonstrate how confidence intervals can be obtained using a bootstrap
approach and how the so-called permutation test can be used to perform hypothesis testing with minimal
distributional assumptions.
3
2. Bootstrap confidence interval for the difference in means
In this first section we will demonstrate how to construct a confidence interval for the difference in means
µ1 − µ2 using a bootstrap approach. The general approach is very similar to that of the one-sample problem:
• Draw a bootstrap sample (X∗1 , ..., X∗m) with replacement from the first group (X1, ..., Xm)
• Draw a bootstrap sample (Y ∗1 , ..., Y ∗m) with replacement from the second group (Y1, ..., Ym)
• Compute the difference in means X¯∗ − Y¯ ∗ for the bootstrap samples
• Repeat this process many times to obtain the bootstrapped sampling distribution
• Construct a bootstrap confidence interval (C∗lower, C∗upper) from the sampling distribution.
In this presentation, we will be using the percentile method to construct the (1 − α)-level confidence
intervals in the final step, where C∗lower and C∗upper are taken to be the α/2 and 1− α/2 sample quantiles of
the bootstrap sampling distribution, respectively. This method is simple, and makes few assumptions about
the shape of the bootstrap sampling distribution, though the resulting interval is typically biased. To see
another approach similar to the studentized approach from the previous tutorial, see the discussion in section
10.6 of the textbook.
2.1 Implementation in R
Let’s use this recipe to construct a bootstrapped confidence interval in R:
###########################################
### Remove the eval=F from the R chunk! ###
###########################################
# Create vectors containing the observed data for each group
# Define the bootstrap experiment
The 95% bootstrap confidence interval for the difference in means is then
###########################################
### Remove the eval=F from the R chunk! ###
###########################################
# Compute the 95% percentile confidence interval
ci.mean.diff <- quantile(boot.mean.diff, probs = c(0.025, 0.975))
The corresponding confidence interval based on the assumption of normal data can be calculated to be
(0.55, 9.63), which has a smaller lower limit compared to our bootstrapped interval. Given that our data is
definitely not normal, we have more confidence in the validity of the bootstrap-based interval.
4
3. Permutation test for hypothesis testing
Another common approach to studying the difference in means between two groups is the hypothesis testing
approach. Here, we consider the following null and alternative hypotheses:
• H0 : µ1 − µ2 = 0
• H1 : µ1 − µ2 6= 0
Since our data is not normally distributed, we can no longer use the two-sample t−test approach, where the
null distribution of the sample statistic is a known t−distribution.
3.1 The permutation test
If we are comfortable assuming that the only possible difference in the distribution of the two samples is in
their means µ1, µ2, then we can use what is called the permutation test to perform this hypothesis test.
Under the null hypothesis µ1 = µ2, where both sampling distributions are the same, every observation
could equally well belong to either group and so the group labels have no effect on the outcome. Since the
group labels are exchangeable, we can randomly assign them to the observed values and we will obtain what
looks like a new observation from the null distribution with the same probability as the original sample. By
comparing our observed difference X¯ − Y¯ to the difference measured on these new permuted data, we can get
a sense of how unlikely it is to observe such a difference under the null hypothesis.
Following this reasoning, the recipe for the permutation test is as follows:
• Compute the test statistic Tobs = X¯ − Y¯ on the original data (this test statistic is labeled T, but does
not imply that it has a T-distribution).
• Create a permuted data set by randomly assigning the group labels to the observed values and compute
the test statistic T ∗ = X¯∗ − Y¯ ∗. Note: this is not the same as sampling with replacement.
Rather we are relabeling each of the data points as a ’male’ or ’female’.
• Repeat this process for every permutation of the data (or a very large number of permutations)
• Compute p−value of Tobs by computing the proportion of permuted test statistics T ∗ that are at least
as “extreme” as Tobs.
3.2 Implementation in R
We implement the permutation test for our data as follows:
# Compute the observed difference between genders
# Permutation test
The p−value for the our particular value of (FILL THIS IN) is then
# Compute p-value
where we have defined “extreme” to mean |T | ≥ |Tobs|, which is a two-tailed test. Therefore we reject the
null hypothesis that µ1 = µ2 at the 5% level.
Note
The assumption that both distributions are the same in every possible way except perhaps the mean is a very
strong assumption that should typically be checked. In practice we usually require this be approximately
true by checking that the sample variances are roughly equal. If we check this for our the speech data set, we
find that this is unlikely satisfied and that we may want to use another approach for testing this hypothesis.
5
4. Exercises
In the exercises you will be working with a data set from a study of survival times for patients with stomach
and breast cancer, both treated with ascorbate (Vitamin C). The data is a modified version of the found
in the article Supplemental ascorbate in the supportive treatment of cancer: Re-evaluation of prolongation
of survival times in terminal human cancer by E. Cameron and L. Pauling (1978). The outcome we are
interested in studying is the difference in average survival time in days between the two groups.
1.) Load the cancer data from cancer.csv using read.csv
# Load in cancer data
2.) Create a new column survival.time.log in the data.frame equal to the logarithm of the survival time.
(We make this transformation to make the sample variances roughly equal for the permutation test)
# Define logarithm of survival time
3.) a.) Plot and label a histogram of the log survival time for each type of cancer and combine them using
grid.arrange. Calculate the sample standard deviation of log survival time for each cancer type.
# Plot survival times for each cancer type
b.) Comment on the distribution of the samples.
4.) a.) Perform a permutation test and obtain the p-value of the observed difference in survival times based
on the null hypothesis H0 : µ1 = µ2, where µ1 and µ2 are the mean survival times in log-days for stomach and
breast cancer patients, respectively. Include in-line comments that explain the purpose of your lines of code.
b.) Can a conclusion be drawn about any differences in survival times of the two cancer types? Why or why
not?