ECON 7310 Elements of Econometrics
Lecture 3: Linear Regression with One Regressor –
Hypothesis Tests and Confidence Intervals
Outline
▶ Hypothesis tests concerning β1
▶ Confidence intervals for β1
▶ Regression when X is binary
▶ Heteroskedasticity and homoskedasticity
▶ Efficiency of OLS and the Student t distribution
A big picture review of where we are going
We want to learn about the slope of the population regression line. We have
data from a sample, so there is sampling uncertainty. There are five steps
towards this goal:
1. State the population object of interest
2. Provide an estimator of this population object
3. Derive the sampling distribution of the estimator (this requires certain
assumptions). In large samples this sampling distribution will be normal
by the CLT.
4. The square root of the estimated variance of the sampling distribution is
the standard error (SE) of the estimator
5. Use the SE to construct t-statistics (for hypothesis tests) and confidence
intervals.
Object of interest: β1
Yi = β0 + β1Xi + ui , i = 1, . . . , n
where β1 = ∆Y/∆X , for an autonomous change in X (causal effect)
▶ Estimator: the OLS estimator β̂1.
▶ The Sampling Distribution of β̂1:
To derive the large-sample distribution of β̂1, we make the following
assumptions:
▶ The Least Squares Assumptions:
1. E(u|X = x) = 0.
2. (Xi ,Yi ), i = 1, . . . , n, are i.i.d.
3. Large outliers are rare (E(X⁴) < ∞, E(Y⁴) < ∞).
Hypothesis Testing of β1 (Section 5.1)
▶ The objective is to test a hypothesis, like β1 = 0, using data – to reach a
tentative conclusion whether the (null) hypothesis is correct or incorrect.
▶ General Setup
▶ Null hypothesis and two-sided alternative:
H0 : β1 = β01 vs. H1 : β1 ̸= β01
where β01 is the hypothesized value under the null.
▶ Null hypothesis and one-sided alternative:
H0 : β1 = β01 vs. H1 : β1 < β01
▶ We are going to use the asymptotic distribution of β̂1: under the Least
Squares Assumptions, for n large,
(β̂1 − β1) / SE(β̂1) is approximately distributed N(0, 1)
General approach: construct t-statistic, and compute p-value (or
compare to the N(0,1) critical value)
▶ In general:
t = (estimator − hypothesized value) / (standard error of the estimator)
where the SE of the estimator is the square root of an estimator of the
variance of the estimator.
▶ For testing the mean of Y :
t = (Ȳ − µY,0) / SE(Ȳ)
▶ For testing β1:
t = (β̂1 − β01) / SE(β̂1)
Asymptotic Distribution of t
▶ Recall that SE(β̂1) is a consistent estimate of √V(β̂1). We discussed
V(β̂1) in the previous lecture (but did not derive it).
▶ Econometric software packages such as R and Stata report standard errors.
(We will not derive SE(β̂1).)
▶ We use the fact that when n is large,
t = (β̂1 − β01) / SE(β̂1) is approximately distributed N(0, 1)
under H0, i.e., if the true value of β1 is really β01.
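As an illustration (not from the slides), here is a minimal R simulation sketch of this result, assuming the estimatr package is installed; it checks that the t-statistic based on a robust SE is approximately N(0,1) under H0 even when the errors are heteroskedastic:

# Simulation sketch (illustrative only): under H0, the robust t-statistic
# is approximately standard normal when n is large.
set.seed(1)
n <- 200; reps <- 1000
beta0 <- 1; beta1 <- 0.5              # true values; we test the true H0: beta1 = 0.5
tstats <- replicate(reps, {
  x <- rnorm(n)
  u <- rnorm(n, sd = 1 + 0.5 * abs(x))           # heteroskedastic errors
  y <- beta0 + beta1 * x + u
  fit <- estimatr::lm_robust(y ~ x, se_type = "HC1")
  (coef(fit)["x"] - beta1) / fit$std.error["x"]  # t-statistic under H0
})
mean(abs(tstats) > 1.96)              # rejection rate; should be close to 0.05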
To test H0 : β1 = β01 vs. H1 : β1 ̸= β01
▶ Construct the t-statistic, i.e., t = (β̂1 − β01)/SE(β̂1) and choose the level
of significance α. Suppose we choose α = 0.05. Then, make a decision
on whether to reject H0, using any of the following criteria
1. Reject H0 if |t | > 1.96
2. Reject H0 if p-value < α.
3. Reject H0 if β01 is outside the 95% confidence interval,
β̂1 ± 1.96× SE(β̂1) = [β̂1 − 1.96SE(β̂1), β̂1 + 1.96SE(β̂1)]
▶ This procedure relies on the large n approximation that t is normally
distributed under H0; typically n = 50 is large enough for the
approximation to be adequate
▶ If α = 0.01, use 2.576 instead of 1.96
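For concreteness, a small R sketch of the three (equivalent) decision rules, using the TestScore–STR estimates reported on the next slide (β̂1 = −2.28, SE = 0.52) as placeholder inputs:

# Sketch of the three equivalent decision rules for a two-sided test at alpha = 0.05.
beta1_hat <- -2.28; se_beta1 <- 0.52   # placeholder values from the next slide
beta1_null <- 0; alpha <- 0.05
t_stat  <- (beta1_hat - beta1_null) / se_beta1
p_value <- 2 * pnorm(-abs(t_stat))                          # two-sided p-value
ci95    <- beta1_hat + c(-1, 1) * qnorm(1 - alpha / 2) * se_beta1
abs(t_stat) > qnorm(1 - alpha / 2)                          # rule 1: |t| > 1.96
p_value < alpha                                             # rule 2: p-value < alpha
beta1_null < ci95[1] | beta1_null > ci95[2]                 # rule 3: null outside the 95% CI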
Example: TestScores and STR, California data
▶ We test H0 : β1 = 0 at α = 0.05. The R output (not reproduced here) gives β̂1 = −2.28 with SE(β̂1) = 0.52.
▶ We reject H0 because
1. |t | = |(−2.28− 0)/0.52| = 4.38 > 1.96
2. p-value = 0.000 < α=0.05
3. The 95% CI = [−3.30,−1.26] does not include the hypothesised value 0.
▶ We could do a similar test for the intercept β0 using the R output.
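A minimal R sketch of how such output could be produced; the slides do not name the data file, so the CASchools data from the AER package (plus the estimatr package) is an assumption:

# Sketch only: California school data (assumed to be AER's CASchools) with
# heteroskedasticity-robust standard errors via estimatr::lm_robust.
library(estimatr)
data("CASchools", package = "AER")
CASchools$str   <- CASchools$students / CASchools$teachers   # student-teacher ratio
CASchools$score <- (CASchools$read + CASchools$math) / 2     # average test score
fit <- lm_robust(score ~ str, data = CASchools, se_type = "stata")
summary(fit)     # estimates, SEs, t-statistics, p-values for H0: coefficient = 0
confint(fit)     # 95% confidence intervals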
Some comments on p-values and CIs
▶ For a given t-statistic, the p-value measures Pr(|Z | > |t |) where
Z ∼ N (0, 1). We can consider the p-value as the smallest α at which we
reject H0.
▶ The (1 − α) × 100% confidence interval is the set of parameter values
that cannot be rejected by a two-sided test at significance level α.
▶ Consider repeated sampling, i.e., every day we collect a new sample of the
same size. Then each day we will have different data, and hence a different
estimate, a different SE, and a different 95% CI.
▶ In repeated sampling, the 95% CI would contain the true parameter
(β1) 95% of the time.
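A minimal R simulation sketch (not from the slides; estimatr assumed) of this repeated-sampling interpretation:

# Across many simulated samples, roughly 95% of the 95% CIs should cover the true beta1.
set.seed(2)
n <- 200; reps <- 1000; beta1 <- 2
covered <- replicate(reps, {
  x <- runif(n)
  y <- 1 + beta1 * x + rnorm(n)
  ci <- confint(estimatr::lm_robust(y ~ x))["x", ]   # 95% CI for the slope
  ci[1] <= beta1 & beta1 <= ci[2]
})
mean(covered)    # empirical coverage; should be close to 0.95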
Regression when X is Binary (Section 5.3)
▶ Sometimes a regressor is binary:
▶ X = 1 if small class size, X = 0 if not
▶ X = 1 if female, X = 0 if male
▶ X = 1 if treated (experimental drug), X = 0 if not
▶ Binary regressors are sometimes called “dummy” variables.
▶ So far, β1 has been called a “slope”, but that doesn’t make sense if X is
binary.
▶ How do we interpret regression with a binary regressor?
Interpreting regressions with a binary regressor
Yi = β0 + β1Xi + ui , i = 1, . . . , n
where Xi is binary (Xi = 0 or Xi = 1)
▶ Linear regression = we are estimating the conditional expectation
function E [Y |X ] under the assumption that E [Y |X ] = β0 + β1X
▶ When Xi = 0, we have Yi = β0 + ui . That is, the mean of Yi is β0,
E [Y |X = 0] = β0
▶ When Xi = 1, we have Yi = β0 + β1 + ui . That is, the mean of Yi is
β0 + β1,
E [Y |X = 1] = β0 + β1
▶ So
β1 = E [Y |X = 1]− E [Y |X = 0]
= population difference in group means
Example: TestScore and STR
▶ But, suppose we observe Di = 1 if STRi < 20 and Di = 0 if STRi ≥ 20
▶ OLS regression:
TestScorei = 650.0 + 7.4 Di
            (1.3)   (1.8)        (standard errors in parentheses)
▶ Tabulation of group means:
Class Size            Average score (Ȳ)   Standard deviation (SDev(Y))   n
Small (STRi < 20)     657.4                19.4                           238
Large (STRi ≥ 20)     650.0                17.9                           182
▶ Difference in means: Ȳsmall − Ȳlarge = 657.4 − 650.0 = 7.4
▶ Standard error: SE = √(s²s/ns + s²ℓ/nℓ) = √(19.4²/238 + 17.9²/182) = 1.8
Summary: regression when Xi is binary (0/1)
▶ β0 = mean of Y when X = 0
▶ β0 + β1 = mean of Y when X = 1
▶ β1 = difference in means between those two groups.
▶ This is another way (an easy way) to do difference-in-means analysis,
and we can construct t-statistics and confidence intervals as usual
▶ The regression formulation is especially useful when we have additional
regressors (as we will see next week)
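A short R sketch of this difference-in-means regression, reusing the (assumed) CASchools data and estimatr setup from the earlier sketch:

# Regression on a dummy reproduces the difference in group means (sketch only).
CASchools$small <- as.numeric(CASchools$str < 20)            # D = 1 if STR < 20
fit_d <- lm_robust(score ~ small, data = CASchools, se_type = "stata")
coef(fit_d)["small"]                                         # estimate of beta1
mean(CASchools$score[CASchools$small == 1]) -
  mean(CASchools$score[CASchools$small == 0])                # same number: difference in means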
Homoskedasticity vs. Heteroskedasticity (Section 5.4)
1. What are homoskedasticity and heteroskedasticity?
▶ The error u is said to be homoskedastic if V (u|X = x) is constant (does not depend on x).
▶ Otherwise, the error u is said to be heteroskedastic.
2. Consequences of homoskedasticity
3. Implication for computing standard errors
Example: hetero/homoskedasticity in the case of a binary regressor
▶ Standard error when group variances are unequal: SE = √(s²s/ns + s²ℓ/nℓ)
▶ Standard error when group variances are equal: SE = sp √(1/ns + 1/nℓ),
where sp = “pooled estimator of the standard deviation” (Section 3.6)
▶ Equal group variances = homoskedasticity
▶ Unequal group variances = heteroskedasticity
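Evaluating the two formulas with the group statistics from the TestScore table above (a quick R check):

# Unequal-variance (heteroskedasticity-robust) SE vs. pooled (homoskedasticity-only) SE.
s_s <- 19.4; n_s <- 238      # small classes
s_l <- 17.9; n_l <- 182      # large classes
sqrt(s_s^2 / n_s + s_l^2 / n_l)                                         # about 1.8
s_p <- sqrt(((n_s - 1) * s_s^2 + (n_l - 1) * s_l^2) / (n_s + n_l - 2))  # pooled SD (Section 3.6)
s_p * sqrt(1 / n_s + 1 / n_l)                                           # pooled SE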
Heteroskedasticity in a picture:
▶ This shows the conditional distribution of test scores for three different
class sizes.
▶ The distributions become more spread out (have a larger variance) for
larger class sizes.
▶ Because V (u|X = x) depends on x , u is heteroskedastic.
Another example:
A real-data example from labor economics: average hourly earnings vs.
years of education (data source: Current Population Survey):
Heteroskedastic or homoskedastic?
The class size data:
Heteroskedastic or homoskedastic?
So far we have (without saying so) assumed that u might be
heteroskedastic
Recall the three least squares assumptions:
1. E(u|X = x) = 0.
2. (Xi ,Yi), i = 1, . . . , n, are i.i.d.
3. Large outliers are rare (E(X⁴) < ∞, E(Y⁴) < ∞).
Heteroskedasticity and homoskedasticity concern V (u|X = x). Because we
have not explicitly assumed homoskedastic errors, we have implicitly allowed
for heteroskedasticity.
What if the errors are in fact homoskedastic?
▶ You can prove that OLS has the lowest variance among unbiased estimators
that are linear in Y. This result is called the Gauss-Markov theorem; we
will return to it shortly.
▶ The formula for β̂1 stays the same, but the formula for the variance of β̂1
becomes simpler, and so does the formula for SE(β̂1).
▶ Homoskedasticity-only standard errors are valid only if the errors are
homoskedastic.
▶ The usual standard errors (to differentiate the two, it is conventional to
call these heteroskedasticity-robust standard errors) are valid whether
or not the errors are heteroskedastic.
Two standard errors? Practical Implications
▶ The main advantage of the homoskedasticity-only standard errors is that
the formula is simpler. But the disadvantage is that the formula is only
correct if the errors are homoskedastic.
▶ Errors are likely to be heteroskedastic in almost all economic data. So, if
you use homoskedasticity-only standard errors, your inference (testing,
confidence intervals) will be very likely to be wrong.
▶ Even if data are homoskedastic (again, very unlikely), heteroskedasticity
robust standard errors are still valid.
▶ Hence, it is recommended that you always use heteroskedasticity-robust
standard errors, especially for micro-level data.
Heteroskedasticity-robust standard errors in R
▶ The R command lm_robust, available in the R package estimatr, computes
heteroskedasticity-robust standard errors.
▶ There are different versions of heteroskedasticity-robust standard errors. In
this course, we use the version that the software Stata uses:
se_type = "stata", which is the same as se_type = "HC1".
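A short sketch comparing the two kinds of standard errors on the (assumed) CASchools regression from the earlier sketch:

# lm() reports homoskedasticity-only SEs; lm_robust() reports robust SEs.
library(estimatr)
fit_ols    <- lm(score ~ str, data = CASchools)
fit_robust <- lm_robust(score ~ str, data = CASchools, se_type = "stata")  # same as "HC1"
summary(fit_ols)$coefficients[, "Std. Error"]   # homoskedasticity-only SEs
fit_robust$std.error                            # heteroskedasticity-robust SEs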
Some Additional Theoretical Foundations of OLS (Section 5.5)
▶ We have already learned a lot about OLS:
▶ OLS is unbiased and consistent;
▶ we have a formula for heteroskedasticity-robust standard errors; and
▶ we can construct confidence intervals and test statistics.
▶ Also, a very good reason to use OLS is that everyone else does – so by
using it, others will understand what you are doing.
▶ Still, you may wonder:
▶ Is this really a good reason to use OLS? Aren’t there other estimators that
might be better – in particular, ones that might have a smaller variance?
▶ Also, what happened to our old friend, the Student t distribution?
▶ We will now answer these questions.
Efficiency of OLS estimators
▶ Let's consider the extended least squares assumptions:
1. E(u|X = x) = 0.
2. (Xi ,Yi ), i = 1, . . . , n, are i.i.d.
3. Large outliers are rare (E(X⁴) < ∞, E(Y⁴) < ∞).
4. u is homoskedastic
5. u is normally distributed.
▶ Gauss-Markov theorem: Under extended LS assumptions 1–4 (the
basic three, plus homoskedasticity), β̂1 has the smallest variance among
all linear unbiased estimators1 (unbiased estimators that are linear
functions of Y1, . . . ,Yn).
▶ Optimality of OLS: Under extended LS assumptions 1–5, β̂1 has the
smallest variance of all consistent estimators (linear or nonlinear
functions of Y1, . . . ,Yn), as n→∞.
1 i.e., OLS estimator is the best linear unbiased estimator (BLUE)
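A small R simulation sketch (illustrative, not a proof) of the Gauss-Markov idea: with homoskedastic errors, OLS has a smaller variance than another linear, conditionally unbiased estimator, here the slope formed from group means of Y for X above vs. below its median:

# Compare the sampling variance of OLS with a cruder linear unbiased estimator.
set.seed(3)
n <- 100; reps <- 2000; beta1 <- 1
x  <- runif(n)                     # keep X fixed across replications
hi <- x > median(x)
est <- replicate(reps, {
  y <- 2 + beta1 * x + rnorm(n)    # homoskedastic errors
  c(ols   = unname(coef(lm(y ~ x))[2]),
    crude = (mean(y[hi]) - mean(y[!hi])) / (mean(x[hi]) - mean(x[!hi])))
})
apply(est, 1, var)                 # OLS should have the smaller variance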
Some not-so-good things about OLS
▶ The foregoing results are impressive, but these results – and the OLS
estimator – have important limitations.
1. The GM theorem really isn’t that compelling:
▶ The condition of homoskedasticity is not plausible for many economic data
▶ The result is only for linear estimators – only a small subset of estimators (more
on this in a moment)
2. The optimality result requires homoskedastic normal errors – not plausible in
applications
3. OLS is more sensitive to outliers than some other estimators. In the case of
estimating the population “central tendency”, if there are big outliers, then
the median is preferred to the mean because the median is less sensitive to
outliers
▶ In almost all applied regression analysis, OLS is used – and that is what
we will do in this course, too.
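To illustrate point 3 above, a tiny R example of the mean's sensitivity to an outlier:

y     <- c(1, 2, 3, 4, 5)
y_out <- c(1, 2, 3, 4, 500)        # one large outlier
c(mean(y), mean(y_out))            # mean jumps from 3 to 102
c(median(y), median(y_out))        # median stays at 3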