STA 238
Probability, Statistics and Data Analysis II
Learning Outcomes
By the end of this lecture, we will have covered how to...
Construct confidence intervals for estimating the population variance $\sigma^2$ for normal data
Construct confidence intervals for estimating the population proportion $p$
Work through additional examples of confidence intervals
Implement bootstrapping to construct confidence intervals when working with estimators with unknown sampling distributions
Confidence Interval - Example (Last Week)
Example 1 (MIPS 24.8) A bottling machine is known to fill wine bottles with
amounts that follow an $N(\mu, \sigma^2)$ distribution, with $\sigma = 5$. Additionally, the
weights of the bottles involved are normally distributed, with an average of 250 grams
and a standard deviation of 15 grams. For a sample of 16 bottles, an average weight of
998 grams was found. You may assume that 1 mL wine = 1 g. Construct a 95%
confidence interval for the expected amount of wine per bottle, $\mu$, and interpret.
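One way to set up Example 1 is sketched below. It assumes that the amount of wine and the weight of the bottle are independent, so that the weight of a filled bottle is $N(\mu + 250,\ 5^2 + 15^2)$ and the 998 g figure is the mean weight of 16 filled bottles. That reading of the problem, and all variable names, are assumptions made for illustration.

```python
from math import sqrt
from statistics import NormalDist

n = 16                 # number of bottles sampled
mean_total = 998.0     # observed mean weight of a filled bottle (g)
mean_bottle = 250.0    # known mean weight of the bottle itself (g)
sd_wine, sd_bottle = 5.0, 15.0

# Assumption: wine amount and bottle weight are independent, so variances add.
sd_total = sqrt(sd_wine**2 + sd_bottle**2)

# Both standard deviations are known, so a z critical value is used.
z = NormalDist().inv_cdf(0.975)

estimate = mean_total - mean_bottle      # point estimate of mu
margin = z * sd_total / sqrt(n)
ci_mu = (estimate - margin, estimate + margin)

print(f"95% CI for mu: ({ci_mu[0]:.2f}, {ci_mu[1]:.2f})")
```

Under these assumptions the interval is roughly (740.25, 755.75) mL.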
Other Confidence Intervals
In our course, we have also studied the following estimators with known sampling
distributions:
Sample variance $S^2$ as an estimator for population variance $\sigma^2$ when working
with normally distributed data. We know that
$$\frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}$$
Sample proportion $\hat{p}$ as an estimator for some population probability of
success $p$ when we have large enough $n$ such that $np \ge 10$ and
$n(1-p) \ge 10$. We know that
$$\hat{p} \,\dot{\sim}\, N\left(p, \frac{p(1-p)}{n}\right)$$
March 6, 2022 4 / 15
Deriving CI for $\sigma^2$
If we have normally distributed data, we can estimate $\sigma^2$ with the sample
variance. Let’s derive the confidence interval for $\sigma^2$ like we did the previous week:
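One way the derivation can go is sketched below. The notation $\chi^2_{n-1,\,q}$, meaning the value with upper-tail probability $q$ under a $\chi^2_{n-1}$ distribution, is an assumption chosen to mirror the $z_{\alpha/2}$ convention used for the normal distribution in these slides.

```latex
P\left(\chi^2_{n-1,\,1-\alpha/2} < \frac{(n-1)S^2}{\sigma^2} < \chi^2_{n-1,\,\alpha/2}\right) = 1-\alpha

% Take reciprocals (which flips the inequalities) and multiply by (n-1)S^2:
P\left(\frac{(n-1)S^2}{\chi^2_{n-1,\,\alpha/2}} < \sigma^2 < \frac{(n-1)S^2}{\chi^2_{n-1,\,1-\alpha/2}}\right) = 1-\alpha

% Plugging in the observed s^2 gives the (1-\alpha)100\% confidence interval:
\left(\frac{(n-1)s^2}{\chi^2_{n-1,\,\alpha/2}},\ \frac{(n-1)s^2}{\chi^2_{n-1,\,1-\alpha/2}}\right)
```

Because the $\chi^2$ distribution is not symmetric, the two critical values differ in magnitude and the interval is not centred at $s^2$.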
Confidence Interval Example
Example 2 Bavarian forest sulfur dioxide concentration question from last
week:
$\bar{x}_{24} = 53.9167$, $s = 10.0737$, $s^2 = 101.4797$
Use the data to estimate the variance in sulfur dioxide concentration in
the Bavarian forest and construct a 98% confidence interval for your
estimate.
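A numerical sketch of Example 2 follows, using only the Python standard library. The helpers `chi2_cdf` and `chi2_quantile` are hypothetical names written for this illustration (a series for the regularized incomplete gamma function plus bisection). If SciPy is available, `scipy.stats.chi2.ppf` would be the usual way to get the same quantiles.

```python
import math

def chi2_cdf(x, df):
    """CDF of chi-square(df), via the series for the regularized lower
    incomplete gamma function P(a, y) with a = df/2 and y = x/2."""
    if x <= 0:
        return 0.0
    a, y = df / 2.0, x / 2.0
    # P(a, y) = y^a e^{-y} * sum_{k>=0} y^k / Gamma(a + k + 1)
    term = math.exp(a * math.log(y) - y - math.lgamma(a + 1.0))
    total = term
    k = 1
    while term > 1e-16 * total:
        term *= y / (a + k)
        total += term
        k += 1
    return total

def chi2_quantile(q, df):
    """Value x with P(X <= x) = q for X ~ chi-square(df), found by bisection."""
    lo, hi = 0.0, df + 10.0 * math.sqrt(2.0 * df) + 50.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if chi2_cdf(mid, df) < q:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

n, s2, alpha = 24, 101.4797, 0.02
df = n - 1

q_low = chi2_quantile(alpha / 2, df)        # lower-tail 1% point
q_high = chi2_quantile(1 - alpha / 2, df)   # lower-tail 99% point

# Dividing by the larger quantile gives the lower limit, and vice versa.
ci_var = (df * s2 / q_high, df * s2 / q_low)
print(f"98% CI for sigma^2: ({ci_var[0]:.2f}, {ci_var[1]:.2f})")
```

With these inputs the interval comes out at roughly (56.1, 228.9), and the point estimate $s^2 = 101.4797$ sits well left of its centre.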
Deriving CI for p - Wilson Method (Accurate)
If we have a sufficiently large sample size of Bernoulli($p$) trials, the Central Limit
Theorem provides us with an approximate sampling distribution for $\hat{p}$:
$$\hat{p} \,\dot{\sim}\, N\left(p, \frac{p(1-p)}{n}\right)$$
Following a similar derivation for CI from last week, let’s construct the confidence
interval by first finding a probability interval of $\hat{p} - p$ that has a $(1-\alpha)100\%$
probability of occurring:
$$P\left(-z_{\alpha/2} < \frac{\hat{p}-p}{\sqrt{\frac{p(1-p)}{n}}} < z_{\alpha/2}\right) \approx 1-\alpha$$
For a given $\hat{p}$, solving for $p$ in the event:
$$\frac{(\hat{p}-p)^2}{\frac{p(1-p)}{n}} < z^2_{\alpha/2}$$
will give us the corresponding $(1-\alpha)100\%$ confidence interval.
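For reference, the inequality above can be solved for $p$ in general. The sketch below writes $z$ for $z_{\alpha/2}$ to keep the algebra readable, which is a shorthand introduced here and not notation from the slides.

```latex
(\hat{p}-p)^2 < \frac{z^2}{n}\,p(1-p)
\iff
\left(1+\frac{z^2}{n}\right)p^2 - \left(2\hat{p}+\frac{z^2}{n}\right)p + \hat{p}^2 < 0

% The quadratic opens upward, so the inequality holds between its two roots:
p = \frac{\hat{p} + \dfrac{z^2}{2n} \pm z\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n} + \dfrac{z^2}{4n^2}}}{1+\dfrac{z^2}{n}}
```

The two roots are the endpoints of the interval. Its centre is pulled from $\hat{p}$ toward $1/2$, which is why it behaves better than a symmetric interval when $p$ is near 0 or 1.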
Deriving CI for p
Example 3 - MIPS p. 362 Suppose in a sample of 125 voters, 78 support one
candidate. What is the 95% confidence interval for the population proportion p
supporting that candidate?
Our data: Realization of $X$ is $x = 78$ on a sample of $n = 125$, or a
realization of $\hat{p} = 78/125 = 0.624$.
Z-score for 95% CI: $\alpha = 0.05$, $z_{0.025} = 1.96$
Finding the corresponding values of $p$ that satisfy the following event:
$$P\left(-1.96 < \frac{\hat{p}-p}{\sqrt{\frac{p(1-p)}{n}}} < 1.96\right) \approx 0.95$$
means solving for the event:
$$\frac{(0.624-p)^2}{\frac{p(1-p)}{125}} < 1.96^2$$
Deriving CI for p
$$\frac{(0.624-p)^2}{\frac{p(1-p)}{125}} < 1.96^2$$
$$(0.624-p)^2 < \frac{1.96^2}{125}\cdot p(1-p)$$
$$0.624^2 - 1.248p + p^2 < \frac{1.96^2}{125}\,p - \frac{1.96^2}{125}\,p^2$$
$$1.0307p^2 - 1.2787p + 0.3894 < 0$$
Recall for any quadratic equation $ax^2 + bx + c = 0$, the quadratic formula will
return the roots of the equation:
$$x = \frac{-b \pm \sqrt{b^2-4ac}}{2a}$$
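The last step can be checked numerically. The sketch below rebuilds the coefficients of the quadratic from $\hat{p} = 0.624$, $n = 125$ and $z = 1.96$ and applies the quadratic formula. Variable names are illustrative.

```python
from math import sqrt

x, n = 78, 125
z = 1.96               # z_{0.025}, as used in the example
p_hat = x / n          # 0.624

# Coefficients of a*p^2 + b*p + c < 0 after collecting terms in p.
a = 1 + z**2 / n             # about 1.0307
b = -(2 * p_hat + z**2 / n)  # about -1.2787
c = p_hat**2                 # about 0.3894

disc = b**2 - 4 * a * c
wilson_lower = (-b - sqrt(disc)) / (2 * a)
wilson_upper = (-b + sqrt(disc)) / (2 * a)

# The parabola opens upward (a > 0), so the inequality holds between the roots.
print(f"95% CI for p: ({wilson_lower:.4f}, {wilson_upper:.4f})")
```

The roots are approximately 0.5366 and 0.7040, so the 95% confidence interval for $p$ is about (0.537, 0.704).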
Deriving CI for p - Method 2 (Less Accurate)
An alternative way to compute confidence intervals was proposed by Agresti and
Coull, which results in a more conservative (wider) interval than the method
described above, especially when $p$ is near 0 or 1.
Instead of treating both $p$ and the variance $p(1-p)/n$ as unknown, we estimate the
variance using the sample proportion: $\hat{V}(\hat{p}) = \frac{\hat{p}(1-\hat{p})}{n}$. Then:
$$P\left(-z_{\alpha/2} < \frac{\hat{p}-p}{\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}} < z_{\alpha/2}\right) \approx 1-\alpha$$
can be rearranged to yield the random interval:
$$\left(\hat{p} - z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}},\ \hat{p} + z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\right)$$
Note: The interval can be made even more conservative by using $p = 0.5$ in the
variance estimate in place of $\hat{p}$. Why?
Deriving CI for p - Method 2
Example 4 - MIPS p. 362 Suppose in a sample of 125 voters, 78 support one
candidate. What is the 95% confidence interval for the population proportion p
supporting that candidate?
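A sketch of the Method 2 calculation for this example is given below. The plug-in interval used here, with $\hat{p}$ substituted into the variance, is often called the Wald interval in the literature. The second interval replaces $\hat{p}$ with 0.5 in the variance estimate, as suggested in the note on the previous slide. Variable names are illustrative.

```python
from math import sqrt

x, n = 78, 125
z = 1.96               # z_{0.025}
p_hat = x / n

# Plug-in standard error: p_hat replaces the unknown p in p(1 - p)/n.
se_plugin = sqrt(p_hat * (1 - p_hat) / n)
ci_plugin = (p_hat - z * se_plugin, p_hat + z * se_plugin)

# p(1 - p) is largest at p = 0.5, so this standard error can never be
# smaller than the plug-in one, and the interval can never be narrower.
se_conservative = sqrt(0.5 * 0.5 / n)
ci_conservative = (p_hat - z * se_conservative, p_hat + z * se_conservative)

print(f"plug-in:      ({ci_plugin[0]:.4f}, {ci_plugin[1]:.4f})")
print(f"conservative: ({ci_conservative[0]:.4f}, {ci_conservative[1]:.4f})")
```

For these data the plug-in interval is about (0.539, 0.709) and the conservative one about (0.536, 0.712). Both are symmetric around $\hat{p} = 0.624$.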
What if CLT does not apply?
Problem: If data is not normal, using $\bar{x}_n \pm t_{df,\alpha/2}\,\frac{s}{\sqrt{n}}$ can lead to
inaccurate results. We need the distribution of the studentized mean
$$\frac{\bar{X}_n - \mu}{S_n/\sqrt{n}}$$
in order to find the critical values that encompass $(1-\alpha)100\%$ of
random deviations from the true mean.
Solution: Estimate these critical values through bootstrapping. Since the
underlying distribution is unknown, use the ECDF of the collected data, $F_n$,
as an approximation to $F$. The expectation of $F_n$ is $\mu^* = \bar{x}_n$.
What if CLT does not apply?
In the case where the sampling distribution cannot be derived or simulated, we
can rely on bootstrapping to construct approximate confidence intervals. Let’s
examine the case for bootstrapped CI for mean µ:
Empirical Bootstrap of Studentized Mean
Given a dataset $x_1, x_2, \ldots, x_n$, determine its ECDF $F_n$ as an estimate of $F$. The
expectation corresponding to $F_n$ is $\mu^* = \bar{x}_n$.
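A minimal sketch of how the empirical bootstrap of the studentized mean can be carried out is given below. The steps beyond determining $F_n$ (resampling with replacement, studentizing each resample around $\bar{x}_n$, and reading critical values off the bootstrap distribution) follow the usual recipe and are an assumption about where this procedure is heading. The function name `bootstrap_t_ci`, the number of resamples, and the synthetic exponential data are all hypothetical choices made for illustration.

```python
import math
import random
import statistics

def bootstrap_t_ci(data, alpha=0.05, n_boot=2000, rng=None):
    """Approximate (1 - alpha) CI for the mean via the studentized bootstrap."""
    rng = rng or random.Random()
    n = len(data)
    xbar = statistics.fmean(data)     # mu* = mean of the ECDF F_n
    s = statistics.stdev(data)

    t_stars = []
    for _ in range(n_boot):
        # Sampling from F_n is sampling the data with replacement.
        resample = rng.choices(data, k=n)
        s_star = statistics.stdev(resample)
        if s_star == 0:               # degenerate resample, skip it
            continue
        t_star = (statistics.fmean(resample) - xbar) / (s_star / math.sqrt(n))
        t_stars.append(t_star)

    # Critical values: empirical alpha/2 and 1 - alpha/2 points of t*.
    t_stars.sort()
    m = len(t_stars)
    c_lower = t_stars[int(m * alpha / 2)]
    c_upper = t_stars[int(m * (1 - alpha / 2)) - 1]

    # Note the swap: the upper critical value gives the lower limit.
    half = s / math.sqrt(n)
    return xbar - c_upper * half, xbar - c_lower * half

# Synthetic, right-skewed data purely for illustration (not from the course).
rng = random.Random(238)
sample = [rng.expovariate(1 / 5) for _ in range(30)]
lower, upper = bootstrap_t_ci(sample, rng=rng)
print(f"bootstrap 95% CI for the mean: ({lower:.2f}, {upper:.2f})")
```

Unlike the $t$ interval, the bootstrap critical values need not be symmetric around zero, so the resulting interval can extend further on one side of $\bar{x}_n$ than the other.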