Hello, dear friend, you can consult us at any time if you have any questions, add WeChat: THEend8_
STA238 Embedded R chunks
In this R Lab, you will be guided by your TA on how to adapt bootstrapping and simulation to construct confidence intervals and perform hypothesis tests on two-sample data when normality assumptions fail. At this point, it is assumed that you will have decent familiarity in creating histograms using ggplot to assess distribution shape. You are expected to have access to R + R Studio + LaTeX in some capacity by this tutorial, whether it be locally installed or accessed through JupyterHub. Note: You are expected to join your tutorial meeting with your U of T licensed Office 365 account. Your TAs will not be letting guests join the meeting! R Lab Learning Goals • Review: Using filter in dplyr (a package already contained in tidyverse) to filter data in dataframe according to some criteria. • Review: Using histograms to gauge normality, and a visual check on equal-variance assumption. • Use empirical bootstrap to generate confidence intervals for the difference in population means (for simplicity, you will use the percentile method instead of the studentized method). • Use permutation test to test for equal population means under assumptions of equal variance. Use the tasks in this R lab to complete the exercises at the end of this document. Some additional resources you may find helpful as you learn is the R Markdown Cheatsheet and the LaTeX Cheatsheet. As you work more in LaTeX, the notation will get easier and become second nature! Submission Instructions Your submission should be completed in a NEW R Markdown file that is set to ‘knit to pdf’ that contains only the exercise portion found at the end of this document. Your final submission should include: • The original questions (copy and paste is fine) • Embedded R chunks • Corresponding outputs for the exercises. Note that you must not include a printout of large data sets! • Required graphs should include informative and clearly communicated axis labels and titles. Adjust your R chunk options to produce reasonably sized graphs in your output • Discussion questions should be answered using proper punctuation and grammar, to the best of your abilities. Treat these responses as a formal write-up, even if it is short! • You must submit both your .rmd and knitted .pdf file on Quercus for your current tutorial assignment as it appears on Quercus. • The grading rubric for the R Lab is available on your R Lab # 4 submission page. 1 -1. Note The presentation of this tutorial is adapted from section 10.6 of Modern Mathematical Statistics with Applications. Refer to this if you are interested in the material presented here as well as non-parametric approaches to inference on paired data and variances! 0. Motivation The data we will be working with today is from a study reported in A Longitudinal Study of the Development of Elementary School Children’s Private Speech. In this study, researchers recorded instances where children were engaged in private speech during a randomly chosen 10 second interval. Private speech refers to speech spoken aloud that is either addressed to oneself or to no particular listener. Many such intervals were observed for each child, and the percentage of intervals where private speech occurred for each child was recorded. One of the outcomes of interest for this study and the focus of our tutorial is the following research question: Does the amount of private speech differ between boys and girls? To put this question into a mathematical context, let (X1, ...Xm) and (Y1, ..., Yn) denote the percentage of private speech observed for boys and girls, respectively. Furthermore, assume that each observation Xi comes from a population with mean µ1 and Yj from a population with mean µ2. This problem can then be viewed as that of making inferences about differences in means µ1 − µ2, or determining whether boys and girls have different engagements in private speech. 1. Exploratory data analysis To begin exploring this question, let us first load in the study data and the relevant libraries for our analysis. library(tidyverse) library(latex2exp) library(gridExtra) # Load in the speech data # Remember to set your working directory and ensure data files are stored in the same location speech <- read.csv("speech.csv") Next we will plot the data for boys and girls separately to get an idea of their distribution. For the purposes of this analysis, the labels boy and girl refer to a child being labeled as male or female in the data set. ########################################### ### Remove the eval=F from the R chunk! ### ########################################### # Plot the samples for boys and girls using gridExtra # Density histograms to compare the density histogram of the two samples plot.boys <- speech %>% ## INPUT ## %>% ggplot(aes(x = percentage, y =..density..)) + #density histogram geom_histogram(binwidth = 0.5, center = 0.25, #bin aesthetics color = "black", fill = "thistle2") + labs(title = "Private speech: Boys", x = "Percentage", y = "Density") + xlim(c(0,30)) #common range 2 plot.girls <- speech %>% ## INPUT ## %>% ggplot(aes(x = percentage, y =..density..)) + geom_histogram(binwidth = 0.5, center = 0.25, color = "black", fill ="thistle2") + labs(title = "Private speech: Girls", x = "Percentage", y = "Density") + xlim(c(0,30)) grid.arrange(plot.boys, plot.girls) From these plots it is clear that the data is not normal and with the limited data size, any confidence intervals or hypothesis tests based on the assumption of normality will be misleading. In such situations, we typically relax the assumption of normal data and turn to non-parametric approaches to form inferences about the quantities of interest. In the following sections, we will demonstrate how confidence intervals can be obtained using a bootstrap approach and how the so-called permutation test can be used to perform hypothesis testing with minimal distributional assumptions. 3 2. Bootstrap confidence interval for the difference in means In this first section we will demonstrate how to construct a confidence interval for the difference in means µ1 − µ2 using a bootstrap approach. The general approach is very similar to that of the one-sample problem: • Draw a bootstrap sample (X∗1 , ..., X∗m) with replacement from the first group (X1, ..., Xm) • Draw a bootstrap sample (Y ∗1 , ..., Y ∗m) with replacement from the second group (Y1, ..., Ym) • Compute the difference in means X¯∗ − Y¯ ∗ for the bootstrap samples • Repeat this process many times to obtain the bootstrapped sampling distribution • Construct a bootstrap confidence interval (C∗lower, C∗upper) from the sampling distribution. In this presentation, we will be using the percentile method to construct the (1 − α)-level confidence intervals in the final step, where C∗lower and C∗upper are taken to be the α/2 and 1− α/2 sample quantiles of the bootstrap sampling distribution, respectively. This method is simple, and makes few assumptions about the shape of the bootstrap sampling distribution, though the resulting interval is typically biased. To see another approach similar to the studentized approach from the previous tutorial, see the discussion in section 10.6 of the textbook. 2.1 Implementation in R Let’s use this recipe to construct a bootstrapped confidence interval in R: ########################################### ### Remove the eval=F from the R chunk! ### ########################################### # Create vectors containing the observed data for each group # Define the bootstrap experiment The 95% bootstrap confidence interval for the difference in means is then ########################################### ### Remove the eval=F from the R chunk! ### ########################################### # Compute the 95% percentile confidence interval ci.mean.diff <- quantile(boot.mean.diff, probs = c(0.025, 0.975)) The corresponding confidence interval based on the assumption of normal data can be calculated to be (0.55, 9.63), which has a smaller lower limit compared to our bootstrapped interval. Given that our data is definitely not normal, we have more confidence in the validity of the bootstrap-based interval. 4 3. Permutation test for hypothesis testing Another common approach to studying the difference in means between two groups is the hypothesis testing approach. Here, we consider the following null and alternative hypotheses: • H0 : µ1 − µ2 = 0 • H1 : µ1 − µ2 6= 0 Since our data is not normally distributed, we can no longer use the two-sample t−test approach, where the null distribution of the sample statistic is a known t−distribution. 3.1 The permutation test If we are comfortable assuming that the only possible difference in the distribution of the two samples is in their means µ1, µ2, then we can use what is called the permutation test to perform this hypothesis test. Under the null hypothesis µ1 = µ2, where both sampling distributions are the same, every observation could equally well belong to either group and so the group labels have no effect on the outcome. Since the group labels are exchangeable, we can randomly assign them to the observed values and we will obtain what looks like a new observation from the null distribution with the same probability as the original sample. By comparing our observed difference X¯ − Y¯ to the difference measured on these new permuted data, we can get a sense of how unlikely it is to observe such a difference under the null hypothesis. Following this reasoning, the recipe for the permutation test is as follows: • Compute the test statistic Tobs = X¯ − Y¯ on the original data (this test statistic is labeled T, but does not imply that it has a T-distribution). • Create a permuted data set by randomly assigning the group labels to the observed values and compute the test statistic T ∗ = X¯∗ − Y¯ ∗. Note: this is not the same as sampling with replacement. Rather we are relabeling each of the data points as a ’male’ or ’female’. • Repeat this process for every permutation of the data (or a very large number of permutations) • Compute p−value of Tobs by computing the proportion of permuted test statistics T ∗ that are at least as “extreme” as Tobs. 3.2 Implementation in R We implement the permutation test for our data as follows: # Compute the observed difference between genders # Permutation test The p−value for the our particular value of (FILL THIS IN) is then # Compute p-value where we have defined “extreme” to mean |T | ≥ |Tobs|, which is a two-tailed test. Therefore we reject the null hypothesis that µ1 = µ2 at the 5% level. Note The assumption that both distributions are the same in every possible way except perhaps the mean is a very strong assumption that should typically be checked. In practice we usually require this be approximately true by checking that the sample variances are roughly equal. If we check this for our the speech data set, we find that this is unlikely satisfied and that we may want to use another approach for testing this hypothesis. 5 4. Exercises In the exercises you will be working with a data set from a study of survival times for patients with stomach and breast cancer, both treated with ascorbate (Vitamin C). The data is a modified version of the found in the article Supplemental ascorbate in the supportive treatment of cancer: Re-evaluation of prolongation of survival times in terminal human cancer by E. Cameron and L. Pauling (1978). The outcome we are interested in studying is the difference in average survival time in days between the two groups. 1.) Load the cancer data from cancer.csv using read.csv # Load in cancer data 2.) Create a new column survival.time.log in the data.frame equal to the logarithm of the survival time. (We make this transformation to make the sample variances roughly equal for the permutation test) # Define logarithm of survival time 3.) a.) Plot and label a histogram of the log survival time for each type of cancer and combine them using grid.arrange. Calculate the sample standard deviation of log survival time for each cancer type. # Plot survival times for each cancer type b.) Comment on the distribution of the samples. 4.) a.) Perform a permutation test and obtain the p-value of the observed difference in survival times based on the null hypothesis H0 : µ1 = µ2, where µ1 and µ2 are the mean survival times in log-days for stomach and breast cancer patients, respectively. Include in-line comments that explain the purpose of your lines of code. b.) Can a conclusion be drawn about any differences in survival times of the two cancer types? Why or why not?