Power Calculation
• Suppose, for illustration, that we are interested in testing the hypothesis $H_0\colon \theta_1 = \theta_2$ vs. $H_A\colon \theta_1 \neq \theta_2$
• Suppose, also for illustration, that the test statistic associated with this test has the form
$$T = \frac{(\bar{Y}_1 - \bar{Y}_2) - (\theta_1 - \theta_2)}{\sqrt{\frac{\text{Var}[Y_1]}{n} + \frac{\text{Var}[Y_2]}{n}}} \sim N(0, 1)$$
• It will be useful to define the notion of a rejection region $R$: the set of all values of the observed test statistic $t$ that would lead to the rejection of $H_0$:
$$R = \{t \mid H_0 \text{ is rejected}\}$$
  – If $t \in R$, we reject $H_0$
  – If $t \in R^c$, we do not reject $H_0$
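To make these definitions concrete, here is a minimal R sketch (not from the notes): it computes the two-sided Z statistic and rejection-region decision for two hypothetical response vectors y1 and y2 of common per-group size n, and evaluates the power of the test at a hypothesized true difference delta with assumed variances v1 and v2.

    # Minimal sketch: Z statistic and two-sided rejection region at level alpha,
    # for hypothetical response vectors y1 and y2 of common length n
    z_decision <- function(y1, y2, alpha = 0.05) {
      n <- length(y1)
      se <- sqrt(var(y1) / n + var(y2) / n)
      t_obs <- (mean(y1) - mean(y2)) / se     # T evaluated under H0: theta1 = theta2
      crit <- qnorm(1 - alpha / 2)            # R = { t : |t| > crit }
      list(t = t_obs, reject = abs(t_obs) > crit)
    }

    # Power: Pr(T in R) when the true difference theta1 - theta2 equals delta,
    # for assumed (known) variances v1 and v2; equals alpha when delta = 0
    z_power <- function(delta, n, v1, v2, alpha = 0.05) {
      se <- sqrt(v1 / n + v2 / n)
      crit <- qnorm(1 - alpha / 2)
      pnorm(-crit - delta / se) + pnorm(-crit + delta / se)
    }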
• Defining Type I and Type II error rates in terms of a rejection region is also useful:
$$\alpha = \Pr(\text{Type I Error}) = \Pr(\text{Reject } H_0 \mid H_0 \text{ is true}) = \Pr(T \in R \mid H_0 \text{ is true})$$
$$\beta = \Pr(\text{Type II Error}) = \Pr(\text{Do Not Reject } H_0 \mid H_0 \text{ is false}) = \Pr(T \in R^c \mid H_0 \text{ is false})$$

Permutation and Randomization Tests

• All of the previous tests have made some kind of distributional assumption about the response measurements
• It would be preferable to have a test that does not rely on any such assumptions
• This is precisely the purpose of permutation and randomization tests.
  – These tests are nonparametric and rely on resampling.
  – The motivation is that if $H_0\colon \theta_1 = \theta_2$ is true, any random rearrangement of the data is equally likely to have been observed.
  – With $n_1$ and $n_2$ units in each condition, there are
$$\binom{n_1 + n_2}{n_1} = \binom{n_1 + n_2}{n_2}$$
arrangements of the $n_1 + n_2$ observations into two groups of sizes $n_1$ and $n_2$, respectively
• A true permutation test considers all possible rearrangements of the original data
  – The test statistic $t$ is calculated on the original data and on every one of its rearrangements
  – This collection of test statistic values generates the empirical null distribution
• A randomization test is carried out similarly, except that we do not consider all possible rearrangements
  – We just consider a large number $N$ of them

Randomization Test Algorithm
1. Collect response observations in each condition.
2. Calculate the test statistic $t$ on the original data.
3. Pool all of the observations together and randomly sample (without replacement) $n_1$ observations to assign to "Condition 1"; the remaining $n_2$ observations are assigned to "Condition 2". Repeat this $N$ times.
4. Calculate the test statistic $t^{*}_{k}$ on each of the "shuffled" datasets, $k = 1, 2, \ldots, N$.
5. Compare $t$ to $\{t^{*}_{1}, t^{*}_{2}, \ldots, t^{*}_{N}\}$, the empirical null distribution, and calculate the p-value:
$$\text{p-value} = \frac{\#\text{ of } t^{*}\text{'s that are at least as extreme as } t}{N}$$

Example: Pokémon Go
• Suppose that Niantic is experimenting with two different promotions within Pokémon Go:
  – Condition 1: Give users nothing
  – Condition 2: Give users 200 free Pokécoins
  – Condition 3: Give users a 50% discount on Shop purchases
• In a small pilot experiment, $n_1 = n_2 = n_3 = 100$ users are randomized to each condition
• For each user, the amount of real money (in USD) they spend in the 30 days following the experiment is recorded
• The data summaries are:
  – $\bar{y}_1 = \$10.74$, $Q_1(0.5) = \$9$
  – $\bar{y}_2 = \$9.53$, $Q_2(0.5) = \$8$
  – $\bar{y}_3 = \$13.41$, $Q_3(0.5) = \$10$
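As an illustration of the algorithm, here is a minimal R sketch (not from the notes) of a two-condition randomization test, assuming hypothetical response vectors y1 and y2 and the difference in sample means as the test statistic:

    # Minimal sketch of the Randomization Test Algorithm above
    randomization_test <- function(y1, y2, N = 10000) {
      t_obs <- mean(y1) - mean(y2)            # step 2: statistic on the original data
      pooled <- c(y1, y2)                     # step 3: pool the observations
      n1 <- length(y1)
      t_star <- replicate(N, {                # steps 3-4: N "shuffled" datasets
        shuffled <- sample(pooled)            # random permutation (without replacement)
        mean(shuffled[1:n1]) - mean(shuffled[-(1:n1)])
      })
      mean(abs(t_star) >= abs(t_obs))         # step 5: two-sided empirical p-value
    }

With three conditions, as in the Pokémon Go example, the same idea applies: either use a statistic that contrasts all three groups at once, or run the pairwise version above on each pair of conditions.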
3 Experiments with More than Two Conditions

3.1 Anatomy of an A/B/m Test

• We now consider the design and analysis of an experiment consisting of more than two experimental conditions, or what many data scientists broadly refer to as "A/B/m Testing".
  – Canonical A/B/m test:
Figure 1: Button-Colour Experiment
• Other, more tangible, examples:
  – Netflix
  – Etsy
• Typically the goal of such an experiment is to decide which condition is optimal with respect to some metric of interest $\theta$. This could be a
  – mean
  – proportion
  – variance
  – quantile
  – technically, any statistic that can be calculated from sample data
• From a design standpoint, such an experiment is very similar to a two-condition experiment:
1. Choose a metric of interest $\theta$ which addresses the question you are trying to answer.
2. Determine the response variable $y$ that must be measured on each unit in order to compute $\hat{\theta}$.
3. Choose the design factor $x$ and the $m$ levels you will experiment with.
4. Choose $n_1, n_2, \ldots, n_m$ and assign units to conditions at random.
5. Collect the data and estimate the metric of interest in each condition: $\hat{\theta}_1, \hat{\theta}_2, \ldots, \hat{\theta}_m$.

• Determining which condition is optimal typically involves a series of pairwise comparisons
• But it is useful to begin such an investigation with a gatekeeper test, which serves to determine whether there is any difference at all between the $m$ experimental conditions. Formally, such a question is phrased as the following statistical hypothesis:
$$H_0\colon \theta_1 = \theta_2 = \cdots = \theta_m \quad \text{versus} \quad H_A\colon \theta_j \neq \theta_k \ \text{for some}\ j \neq k \qquad (1)$$

3.2 Comparing Multiple Means with an F-test

• We assume that our response variable follows a normal distribution, that the mean of this distribution depends on the condition in which the measurements were taken, and that the variance is the same across all conditions.
• The "gatekeeper" hypothesis for means is tested using an F-test.
• In particular, we use the F-test for overall significance in an appropriately defined linear regression model:
  – The appropriately defined linear regression model in this situation is one in which the response variable depends on $m - 1$ indicator variables:
$$x_{ij} = \begin{cases} 1 & \text{if unit } i \text{ is in condition } j \\ 0 & \text{otherwise} \end{cases}$$
for $j = 1, 2, \ldots, m - 1$.
  – For a particular unit $i$, we adopt the model
$$Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_{m-1} x_{i,m-1} + \varepsilon_i$$
  – In this model the $\beta$'s are unknown parameters and may be interpreted in the context of the following expectations:
$$E[Y_i \mid x_{i1} = x_{i2} = \cdots = x_{i,m-1} = 0] = \beta_0$$
$$E[Y_i \mid x_{ij} = 1] = \beta_0 + \beta_j$$
  – Based on these assumptions, $H_0$ in (1) is true if and only if $\beta_1 = \beta_2 = \cdots = \beta_{m-1} = 0$. Thus testing (1) is equivalent to testing
$$H_0\colon \beta_1 = \beta_2 = \cdots = \beta_{m-1} = 0 \quad \text{vs.} \quad H_A\colon \beta_j \neq 0 \ \text{for some}\ j$$
  – This hypothesis corresponds, as noted, to the F-test for overall significance in the model.
• In regression parlance, the test statistic is defined to be the ratio of the regression mean square (MSR) to the mean squared error (MSE) in a standard regression-based analysis of variance (ANOVA):
$$t = \frac{MSR}{MSE}$$
• In our setting we can more intuitively think of this test statistic as comparing the response variability between conditions to the response variability within conditions:
$$t = \frac{\sum_{j=1}^{m} n_j (\bar{y}_j - \bar{y})^2 / (m-1)}{\sum_{j=1}^{m} \sum_{i=1}^{n_j} (y_{ij} - \bar{y}_j)^2 / (N-m)}$$
where $\bar{y}_j$ is the sample mean in condition $j$ and $\bar{y}$ is the overall sample mean.
• The null distribution for this test is $F_{(m-1,\,N-m)}$, where $N = n_1 + n_2 + \cdots + n_m$
• The p-value for this test is calculated by
$$\text{p-value} = P(T \geq t), \quad \text{where } T \sim F_{(m-1,\,N-m)}$$
• Example: Candy Crush Boosters
  – Candy Crush is experimenting with three different versions of in-game "boosters": the lollipop hammer, the jelly fish, and the color bomb.
Figure 2: Candy Crush Experiment
  – Users are randomized to one of these three conditions ($n_1 = 121$, $n_2 = 135$, $n_3 = 117$) and they receive (for free) 5 boosters corresponding to their condition. Interest lies in evaluating the effect of these different boosters on the length of time a user plays the game.
  – Let $\mu_j$ represent the average length of game play (in minutes) associated with booster condition $j = 1, 2, 3$. While interest lies in finding the condition associated with the longest average length of game play, here we first rule out the possibility that booster type does not influence the length of game play (i.e., $\mu_1 = \mu_2 = \mu_3$).
  – In order to do this we fit the linear regression model
$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$$
where $x_1$ and $x_2$ are indicator variables indicating whether a particular value of the response was observed in the jelly fish or color bomb conditions, respectively. The lollipop hammer is therefore the reference condition. (A code sketch of this fit appears after the exercises below.)

Optional Exercises:
• Calculations: 2, 7
• Proofs: 1, 5, 6, 9, 10, 14, 17, 18
• R Analysis: 2, 5, 6, 8, 13(g), 17 (not g,h), 22(h), 23(a-f)
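As referenced in the Candy Crush example above, here is a minimal R sketch (not from the notes) of the regression-based gatekeeper F-test. It assumes a hypothetical data frame candy with a numeric response column minutes and a factor column booster whose (hypothetical) level names are "hammer", "jellyfish", and "colorbomb":

    # Minimal sketch: overall F-test via linear regression on m - 1 = 2 indicators,
    # assuming a hypothetical data frame `candy` with columns `minutes` and `booster`
    candy$booster <- relevel(factor(candy$booster), ref = "hammer")  # reference condition
    fit <- lm(minutes ~ booster, data = candy)  # R constructs the indicator variables
    summary(fit)                                # overall F-statistic on (2, N - 3) df
    anova(lm(minutes ~ 1, data = candy), fit)   # equivalent comparison vs. the null model

Here lm() with a factor covariate builds exactly the indicator coding described in Section 3.2, and the F-statistic reported by summary() is the MSR/MSE ratio with null distribution $F_{(m-1,\,N-m)}$.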