CECN 702: Econometrics
Lectures: Material posted Tuesday each week; accessible at your
discretion
R Lab: Tuesdays 8:00-9:00pm, live via Zoom (be prepared to go over)
Office hours: Fridays 2:30-5pm, via Zoom by drop-in or appointment
What this class is about
▶ Econometrics (II).
▶ In this course, we will continue the study of the ordinary least squares regression
model and its important variants that we began in CECN/ECN 627
▶ In 627, you learned the basics of linear regression
▶ Its theory
▶ Its application in regression software (R or other software)
▶ This course introduces some important variations on the standard ordinary least
squares (OLS) regression model
▶ These variations:
▶ Allow us to work with more complex data structures than we saw in 627
▶ Address issues in estimation that challenge our ability to do causal inference
What this course is about (II)
▶ We will begin with a review of the OLS model – the basics of estimation and
hypothesis testing using multiple regression – from 627
▶ This review will take up this week and next
▶ Then, we will move on to a discussion of how to evaluate multiple regression
studies (Chapter 9) and how to address econometric issues that may plague
these studies
▶ Specifically, we are concerned about the presence of bias (from various sources) in
regression results
▶ Biased estimates mean that a regression analysis isn’t accurately identifying the causal
effect of our independent variable(s) of interest on our outcome or dependent variable
▶ Often, we say this is because our independent variable of interest is endogenous
▶ This part of the course also represents a review of some topics (like omitted
variable bias) we studied in 627, while introducing some new issues that may
complicate regression analysis
What this course is about (III)
▶ From there, we will look at two important ways of solving these econometric
problems and eliminating bias from our estimates:
▶ Instrumental variables regression, and in particular the two-stage least squares
regression model
▶ Using panel data in which each entity we sample is observed at multiple points over time
▶ Doing justice to these topics (Chapters 12 and 10 of the textbook) will take
about 3-4 weeks, bringing us past the halfway point of the course
▶ In the second half of the course we will turn to looking at some extensions of the
OLS model that allow us to use non-traditional data structures...
What this course is about (IV)
▶ First, we will look at how to do regression when our dependent (outcome) variable
is discrete (typically binary or dummy) rather than continuous (Chapter 11)
▶ In this case, we are looking at the change in the probability that our dependent variable
takes the value of one rather than zero as our independent variables X change
▶ Standard models for binary dependent variable regression include the linear probability,
probit, and logit models
▶ Next, we will look at how regression analysis changes when we are working with
Big Data (Chapter 15)
▶ “Big data” for us means regression with many different independent variables (for which
it is difficult to determine causality/importance), so we need techniques to help generate
valid predictions
▶ Finally, we will conclude with a discussion of forecasting in macroeconomics
(Chapter 16): another application of regression to engage in prediction out of
sample
▶ Recall that the properties and required assumptions underlying regression are different
when we are using regression for prediction rather than for causal inference!
Programming and doing estimation in R
▶ The applied component of 702 is taught in R
▶ R has the major advantage of being a free download for your personal laptops
▶ R has been growing in popularity as an open-source alternative to Stata and other
proprietary statistical software, so it is worth learning
▶ Learning R will allow you to work on your assignments from anywhere (in this or
any other empirical class)
▶ Those of you who took ECN/CECN 627 with me have already been introduced to R in
the 627-702 context
▶ For the rest, our first two tutorials (of six) will help catch you up
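▶ As a quick warm-up before the first tutorial, here is a minimal R sketch (not part of the course materials; the built-in mtcars data set is used purely for illustration) of estimating and summarizing a simple regression:

```r
# A first R session: fit a simple regression on a data set that ships with base R.
data(mtcars)
fit <- lm(mpg ~ wt, data = mtcars)  # regress fuel economy on car weight
summary(fit)                        # coefficients, standard errors, t-stats, R^2
```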
Evaluation of ECN/CECN 702
▶ Evaluation of 702 will be slightly different than usual:
1. Four assignments: due Saturday before midnight Weeks 4, 7, 10, and 12
▶ If you do all four assignments, they are worth 15% each
▶ You can opt to skip assignment 1, in which case the remaining three assignments are worth 20% each
2. Lab attendance (online): 10%
▶ 1 point for showing up (attendance)
▶ 1 point for completing the lab exercise
▶ There are 6 labs and therefore 12 total points available
3. Take home final exam: 30% (August 5)
▶ The assignments will be made available 10 days before they are due (on
Wednesdays the week before)
▶ Make sure you read the CECN 702 syllabus!
Week 1: Review of Regression (Part I)
(Review of ECN/CECN 627, Stock and
Watson Chapters 4-8)
The population regression line
▶ Linear regression is the estimation of the population regression line and its slope
▶ Consider a data set consisting of two random variables which follow some kind of
joint distribution:
▶ Y : the dependent variable or regressand
▶ X : the independent variable or regressor
▶ The population regression line is the expected value of Y given X : E(Y |X)
▶ The estimated regression can be used either for:
1. causal inference (learning about the causal effect on Y of a change in X )
2. prediction (predicting the value of Y given X for an observation not in the data set)
▶ Causal inference and prediction place different requirements on the data, but both
use the same regression toolkit
The population regression line: Definitions
▶ For a randomly drawn sample, (X1,Y1), ..., (Xn,Yn), if we fit a regression line to
the data (using Y as the dependent and X as the independent variable) we have,
for any observation i:
Yi = β0 + β1Xi + ui
▶ ...where X is the independent variable (or regressor)
▶ Y is the dependent variable (or regressand)
▶ β0 is the intercept
▶ β1 is the slope
▶ ui is the regression error for observation i
▶ In practical terms, the regression error consists of factors that influence Y but are
omitted from the set of independent variables
▶ everything except regressor X in this case
▶ The regression error also includes error in the measurement of Y
Visualizing the regression line and the error
▶ This example shows a subset (n = 7) of the California Test Score data set
▶ For each observation i = 1, ..., 7, ui is the vertical distance between the data point
and the regression line
▶ How do we get the parameters for the regression line?
The Ordinary Least Squares Estimator
▶ We use a “least squares” estimator to estimate the parameters of the population regression equation:
▶ the intercept (β0)
▶ the slope (β1)
▶ The most popular and standard version of the regression estimator is the ordinary
least squares (OLS) estimator of the population regression line
▶ In a model with a single independent variable (the “univariate regression”) the
OLS coefficients solve:
$$\min_{b_0,\, b_1} \; \sum_{i=1}^{n} \left[\, Y_i - (b_0 + b_1 X_i) \,\right]^2$$
...where b0 and b1 are the candidate values of the intercept and slope
▶ Visually: the OLS estimator minimizes the sum of squared differences between
the actual values of Yi and the predicted values of Yi (typically written Ŷi), and the
estimates that solve the minimization problem are called β̂0 and β̂1
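▶ To make the minimization concrete, here is a minimal R sketch (simulated data; all names are illustrative) checking that lm() returns the same coefficients as a direct numerical minimization of the sum of squared residuals:

```r
# Simulate data from a known line, then compare a numerical minimization of the
# sum of squared residuals with lm()'s closed-form OLS solution.
set.seed(702)
n <- 200
X <- rnorm(n)
Y <- 5 + 2 * X + rnorm(n)

ssr <- function(b) sum((Y - (b[1] + b[2] * X))^2)  # objective as a function of (b0, b1)

optim(c(0, 0), ssr)$par   # numerical minimizer of the OLS objective
coef(lm(Y ~ X))           # OLS estimates -- should agree closely
```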
Formulas for β̂0 and β̂1
▶ β̂0 and β̂1 are the OLS estimates of β0 and β1 estimated from the sample
▶ They are given by:
$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (Y_i - \bar{Y})(X_i - \bar{X})}{\sum_{i=1}^{n} (X_i - \bar{X})^2} = \frac{s_{XY}}{s_X^2}$$
...and
$$\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}$$
▶ ... where Ȳ and X̄ are (again) the sample averages of Y and X respectively, s_XY is
the sample covariance of X and Y, and s²_X is the sample variance of X
▶ (Note that in the last step of the β̂1 expression we multiply the numerator
and denominator by 1/(n − 1) to obtain s_XY / s²_X)
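▶ The formulas above translate directly into R code. A minimal sketch (this assumes the AER package and its CASchools data are installed; the variable construction follows the California Test Score example):

```r
# Reproduce beta1-hat = s_XY / s_X^2 and beta0-hat = Ybar - beta1-hat * Xbar.
library(AER)           # assumed installed; provides the CASchools data
data("CASchools")
STR   <- CASchools$students / CASchools$teachers  # student-teacher ratio
score <- (CASchools$read + CASchools$math) / 2    # average test score

b1 <- cov(STR, score) / var(STR)    # sample covariance over sample variance of X
b0 <- mean(score) - b1 * mean(STR)  # intercept formula
c(b0, b1)                           # roughly 698.9 and -2.28
coef(lm(score ~ STR))               # lm() returns the same estimates
```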
Visualizing the regression line for the whole California Test Score sample
▶ Estimated β̂1 = −2.28
▶ Estimated β̂0 = 698.9
▶ Estimated regression line: $\widehat{\text{Test score}} = 698.9 - 2.28 \times STR$
Interpreting β̂0 and β̂1
▶ The estimated equation from our worked problem was:
$\widehat{\text{Test score}} = 698.9 - 2.28 \times STR$
▶ What does this equation tell us?
1. Districts with one more student per teacher on average have test scores that are 2.28
points lower
▶ In other words, ΔE(Test score | STR) / ΔSTR = −2.28
2. Districts with zero students per teacher would have a (predicted) test score of 698.9
▶ Clearly, this (literal) interpretation of the intercept makes no sense – it extrapolates the line
outside the range of the data.
▶ In this worked problem, the intercept is not economically meaningful
Measures of Fit: The R²
▶ Next, we ask how well the estimated regression line “fits” or explains the data
▶ The most commonly referenced measure of fit is the regression R², which
measures the fraction of the total variance of Y that is explained by X (the
“explained sum of squares”)
▶ “TSS” is the total sum of squares
▶ “ESS” is the explained sum of squares
▶ “RSS” is the residual sum of squares
▶ The R² is given by
$$R^2 = \frac{ESS}{TSS} = 1 - \frac{RSS}{TSS} = \frac{\sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2}{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}$$
▶ When R² = 1, ESS = TSS
▶ When R² = 0, ESS = 0
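▶ A minimal sketch of this sums-of-squares decomposition in R (simulated data; names are illustrative):

```r
# Compute R^2 by hand from ESS, TSS, and RSS, and compare with lm()'s report.
set.seed(1)
X <- rnorm(100)
Y <- 1 + 0.5 * X + rnorm(100)
fit  <- lm(Y ~ X)
Yhat <- fitted(fit)

TSS <- sum((Y - mean(Y))^2)      # total sum of squares
ESS <- sum((Yhat - mean(Y))^2)   # explained sum of squares
RSS <- sum(residuals(fit)^2)     # residual sum of squares

ESS / TSS                # R^2 from the definition
1 - RSS / TSS            # equivalent expression
summary(fit)$r.squared   # value reported by lm()
```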
The worked example again: Test scores and class size
▶ Estimated regression line: $\widehat{\text{Test score}} = 698.9 - 2.28 \times STR$
▶ R² = 0.05
▶ STR is not explaining much of the variance in test scores even though a (linear)
relationship exists
Using Least Squares estimation for causal inference and prediction
▶ Although we will not repeat the proofs here, in 627 we established the conditions
under which βˆ0 and βˆ1 are unbiased and consistent estimates of β0 and β1
▶ In particular: under the Least Squares Assumptions for Causal Inference, βˆ1 is an
unbiased (and consistent) estimate of the true causal effect of X on Y (i.e. β1)
▶ This is also true of βˆ0 for β0 but in general this is less interesting/important
▶ The causal effect on Y of a unit change in X is the expected difference in Y as
measured in a randomized controlled experiment
▶ Experimental data is an ideal, but we usually only have observational data to work with
▶ To demonstrate these properties of the OLS estimator, we established some
important facts about the sampling distribution of βˆ1 in large samples
The Least Squares Assumptions (LSAs) for Causal Inference
▶ Let β1 be the causal effect of a unit change in X on Y :
Yi = β0 + β1Xi + ui , i = 1, ...n
▶ Then:
1. The conditional distribution of u given X has mean zero, that is,
E(u|X = x) = 0
for any value of x
2. (Xi ,Yi ) for i = 1, ..., n are i.i.d. draws from the population
3. Large outliers in X and/or Y are rare
▶ Technically, the fourth moments of X and Y are both finite
Summary of the sampling distribution of β̂1
▶ If the three Least Squares Assumptions hold, then:
1. The exact, finite-sample distribution of β̂1 has E(β̂1) = β1, so β̂1 is an unbiased
estimate of β1
2. $\mathrm{var}(\hat{\beta}_1) = \dfrac{\mathrm{var}\left[(X_i - \mu_X)\, u_i\right]}{n\, \sigma_X^4}$
3. β̂1 →p β1 (convergence in probability) as n gets large, so β̂1 is a consistent estimate of β1
4. Also, when n is large, $\dfrac{\hat{\beta}_1 - E(\hat{\beta}_1)}{\sqrt{\mathrm{var}(\hat{\beta}_1)}} \sim N(0, 1)$ by the Central Limit Theorem and the rule for
standardization of a normal distribution
▶ All of this depends crucially, for points (2)-(4), on n being large
▶ Nearly all of modern econometrics is concerned with large sample estimation
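▶ A small Monte Carlo sketch of these results (simulated data; the true slope is set to 2 purely for illustration):

```r
# Draw many samples from a known DGP and inspect the sampling distribution
# of the OLS slope estimate.
set.seed(702)
beta1_hat <- replicate(2000, {
  X <- rnorm(100)
  Y <- 1 + 2 * X + rnorm(100)
  coef(lm(Y ~ X))[2]
})

mean(beta1_hat)   # close to the true slope of 2 (unbiasedness)
sd(beta1_hat)     # sampling variability of the estimator
hist(beta1_hat)   # roughly bell-shaped, as the CLT result suggests
```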
The Least Squares Assumptions for Prediction
▶ If our goal is to engage in prediction rather than causal inference, the Least
Squares Assumptions change slightly:
1. The out-of-sample observation (X_OOS, Y_OOS) is drawn from the same
distribution as the estimation sample (Xi, Yi), i = 1, ..., n
▶ This (new) assumption ensures that the regression line fit using the estimation sample
also applies to the out-of-sample data to be predicted
2. (Xi ,Yi ), i = 1, ..., n are i.i.d.
3. Large outliers in X and/or Y are rare (X and Y have finite fourth moments)
▶ Assumptions #2 and #3 are the same as the corresponding LSAs for causal
inference but #1 changes
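▶ A minimal sketch of the prediction use case (simulated data; the "out-of-sample" split is purely illustrative):

```r
# Fit on an estimation sample, then predict Y for held-out observations
# drawn from the same distribution.
set.seed(10)
X <- rnorm(120)
Y <- 3 + 1.5 * X + rnorm(120)
est <- data.frame(X = X[1:100],   Y = Y[1:100])    # estimation sample
oos <- data.frame(X = X[101:120], Y = Y[101:120])  # out-of-sample observations

fit  <- lm(Y ~ X, data = est)
pred <- predict(fit, newdata = oos)   # predicted Y at the out-of-sample X values
head(cbind(actual = oos$Y, predicted = pred))
```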
Hypothesis testing with OLS
▶ Under LSA #1-#3 for causal inference (i.e. for determining the causal effect of X
on Y in the population):
$$\hat{\beta}_1 \sim N\!\left(\beta_1,\; \frac{\mathrm{var}\left[(X_i - \mu_X)\, u_i\right]}{n\,(\sigma_X^2)^2}\right)$$
▶ Or, using the fact that β̂1 is an unbiased estimator of β1 and standardizing:
$$\frac{\hat{\beta}_1 - E(\hat{\beta}_1)}{\sqrt{\mathrm{var}(\hat{\beta}_1)}} \sim N(0, 1)$$
...where
$$\frac{\mathrm{var}\left[(X_i - \mu_X)\, u_i\right]}{n\,(\sigma_X^2)^2} \equiv \mathrm{var}(\hat{\beta}_1)$$
▶ If we compute the sample analogue of var(βˆ1), we have a t-statistic that allows us
to test a null hypothesis such as E(βˆ1) = β1 = 0 as this statistic follows the
standard normal distribution in large samples whenever the null hypothesis is true
Hypothesis testing and the standard error of β̂1
▶ Consider the following general setup:
1. Null hypothesis and two-sided alternative: H0 : β1 = β1,0 vs. H1 : β1 ≠ β1,0
2. Null hypothesis and one-sided alternative: H0 : β1 = β1,0 vs. H1 : β1 < β1,0
▶ In both cases, β1,0 is the value β1 takes under the null (i.e. if the null is true)
▶ For testing a hypothesis β1 = β1,0:
$$t = \frac{\hat{\beta}_1 - \beta_{1,0}}{SE(\hat{\beta}_1)}$$
...where SE(β̂1) is the square root of an estimate of the variance of the sampling
distribution of β̂1
▶ The formula for SE(βˆ1) is in general messy (we derived it in 627) but easily
computed for us by statistical software like Stata or R
Hypothesis testing and confidence intervals
▶ Construct the t-statistic and call it t^act since it is computed from the actual sample:
$$t^{act} = \frac{\hat{\beta}_1 - \beta_{1,0}}{SE(\hat{\beta}_1)}$$
▶ Reject the null hypothesis of β1 = β1,0 at the 5% significance level if |t^act| > 1.96 (for
a two-sided test)
▶ The p-value is Pr[|t| > |t^act|] = the probability in the tails of the standard normal
distribution outside |t^act|
▶ Recall you reject at the 5% significance level if the p-value is < 5%
▶ We can also use this information to construct a 95% confidence interval for β1:
95% confidence interval for β1 = {β̂1 ± 1.96 × SE(β̂1)}
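▶ A minimal sketch of these calculations in R (simulated data; the standard error used here is lm()'s default, not the heteroskedasticity-robust version often preferred in practice):

```r
# t-statistic, large-sample p-value, and 95% confidence interval for the slope.
set.seed(5)
X <- rnorm(200)
Y <- 1 - 0.3 * X + rnorm(200)
fit <- lm(Y ~ X)

b1  <- coef(fit)[2]                         # slope estimate
se1 <- coef(summary(fit))[2, "Std. Error"]  # its standard error

t_act <- (b1 - 0) / se1            # test of H0: beta1 = 0
p_val <- 2 * pnorm(-abs(t_act))    # p-value from the normal approximation
ci95  <- b1 + c(-1.96, 1.96) * se1 # 95% confidence interval
list(t = t_act, p = p_val, ci = ci95)
```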
Back to our worked example (1 of 2)
▶ Recall our estimated regression line was:
$\widehat{\text{Test score}} = 698.9 - 2.28 \times STR$
▶ Regression software (Stata or R) reports the standard errors:
SE(β̂0) = 10.4; SE(β̂1) = 0.52
▶ So our t-test for H0: β1 = β1,0 = 0 is given by
$$t = \frac{\hat{\beta}_1 - \beta_{1,0}}{SE(\hat{\beta}_1)} = \frac{-2.28 - 0}{0.52} = -4.38$$
▶ The two-sided critical value at the 1% significance level is 2.58, and |−4.38| > 2.58, so we reject the null at the 1%
significance level
▶ Of course, this also means we reject it at the 5% level and any other higher level
▶ Alternatively, we can compute the p-value from our t^act directly
Back to our worked example (2 of 2)
▶ The p-value based on the large-n standard normal approximation to the t-statistic
is 0.00001 (on the order of 10⁻⁵)
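▶ The reported p-value can be checked in one line of R, using the normal approximation described above:

```r
2 * pnorm(-abs(-4.38))   # two-sided p-value, about 1.2e-05 (on the order of 10^-5)
```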