ECMT2150 – Lecture 6
Topics Today
Week 6
Model Specification
Specification errors
=> including irrelevant variables
=> omitting relevant variables – Omitted Variable Bias
More on multicollinearity and sampling variance
Endogeneity
Proxy Variables
Reference: Chapters 3.3b; 3.4a; Chp 9.2
Recap
• Econometrics is about estimating economic relationships, e.g.
between wages and education, etc.
• Simple and multiple linear regression – many examples
– Incorporated non-linearities: logs & polynomials
– Incorporated qualitative and categorical information
– Discussed goodness of fit
Recap
• Derived the OLS estimator – understand the steps/the
intuition. Derivation is not examinable
• Statistical Properties of OLS that hold for any sample under
given assumptions
– Expected values/unbiasedness under MLR.1 – MLR.4
– Variance formulas under MLR.1 – MLR.5
– Gauss-Markov Theorem under MLR.1 – MLR.5
– Exact sampling distributions/tests under MLR.1 – MLR.6
• Asymptotic Properties of OLS (Consistency and Asymptotic
normality)
Recap
Why?
• Importance of the Zero Conditional Mean Assumption:
E(u|x) = E(u) = 0
• Reliability of estimator and inference rest on whether the
assumptions hold
• Causality - understanding what we have to assume to obtain
causal estimates of our parameters of interest
• In addition, we need the variance formulas and the sampling
distribution assumptions to conduct inference
Recap
• Inference
– Hypothesis tests
• t-tests:
– One-sided
– Two-sided (tests of statistical significance)
• p-values
• Confidence intervals
– Testing more general alternatives
• A parameter is equal to a constant
• One parameter is equal to another => Linear combination of
parameters
• Testing multiple linear restrictions
Specification Errors
Specification errors
• Until now we have assumed our population model
$y = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k + u$
has been correctly specified
• A bit unrealistic – we can never be sure of the true
population model
• Types of specification errors:
– Choice of independent variables
• Over-specification & sampling variance effects
• Omitted variables & Endogeneity
– Heteroskedasticity
– Functional Form
– Measurement Error
– Missing Data, Non-random Sampling & Outliers
Misspecification in the choice of
independent variables
Misspecification in the choice of independent variables
A. Including irrelevant variables in a regression
model (over-specifying)
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + u$, where $\beta_3 = 0$ in the population
– This model satisfies MLR.1-MLR.4
– x3 may be correlated with x1 and x2
– Crucially, in the population, x3 has no effect on y after we
control for x1 and x2
⇒ Inclusion of x3 has no cost in terms of bias in the estimates of
any of the parameters, because OLS remains unbiased under
MLR.1-MLR.4: $E(\hat\beta_j) = \beta_j$ for every j, including $E(\hat\beta_3) = \beta_3 = 0$
– However, including irrelevant variables may increase the
sampling variance (more on this shortly)
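The unbiasedness claim for an over-specified model can be checked with a small Monte Carlo sketch. Everything in it is an illustrative assumption rather than part of the lecture: the coefficient values, the sample size, and the way x3 is made correlated with x1. The population coefficient on x3 is set to zero, and the average OLS estimates across replications should sit close to the true values.

```python
import numpy as np

def simulate_irrelevant_regressor(n=200, reps=2000, seed=0):
    """Average OLS estimates when an irrelevant regressor x3 is included."""
    rng = np.random.default_rng(seed)
    # Hypothetical population coefficients; the last one (on x3) is zero.
    beta = np.array([1.0, 0.5, -0.3, 0.0])
    estimates = np.empty((reps, 4))
    for r in range(reps):
        x1 = rng.normal(size=n)
        x2 = rng.normal(size=n)
        x3 = 0.7 * x1 + rng.normal(size=n)  # x3 is correlated with x1
        X = np.column_stack([np.ones(n), x1, x2, x3])
        y = X @ beta + rng.normal(size=n)   # u satisfies E(u|x) = 0
        estimates[r] = np.linalg.lstsq(X, y, rcond=None)[0]
    return beta, estimates.mean(axis=0)

beta, mean_estimates = simulate_irrelevant_regressor()
print("true coefficients:   ", beta)
print("average OLS estimate:", mean_estimates.round(3))
```

The averages track the true coefficients even though x3 does not belong in the model; the cost of including it shows up in the variance, not in the mean.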
Misspecification in the choice of independent variables
B. Omitting relevant variables (Wooldridge 3.3b)
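⇒ Omitted Variable Bias
⇒ Violates MLR.4, E(u|x) = 0
The simple case:
True model (contains x1 and x2): $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + u$
but due to our ignorance or data unavailability, we estimate
Estimated model (x2 is omitted): $\tilde{y} = \tilde\beta_0 + \tilde\beta_1 x_1$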
Misspecification in the choice of independent variables
B. Omitting relevant variables (Wooldridge 3.3b)
Example: Omitting ability in a wage equation
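True model: $\ln(wage) = \beta_0 + \beta_1 educ + \beta_2 abil + u$
We estimate: $\widetilde{\ln(wage)} = \tilde\beta_0 + \tilde\beta_1 educ$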
Omitting a relevant variable causes bias when the omitted
variable is correlated with any of the other explanatory variables
in the model
Omitted Variable Bias – the simple case
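Let's look at this in more detail:
• If x1 and x2 are correlated, assume a linear regression
relationship between them: $x_2 = \delta_0 + \delta_1 x_1 + v$
• The true model is:
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + u$
$\;\; = \beta_0 + \beta_1 x_1 + \beta_2 (\delta_0 + \delta_1 x_1 + v) + u$
$\;\; = (\beta_0 + \beta_2 \delta_0) + (\beta_1 + \beta_2 \delta_1) x_1 + (\beta_2 v + u)$
• $\beta_0 + \beta_2 \delta_0$: if y is only regressed on x1, this will be the estimated intercept
• $\beta_1 + \beta_2 \delta_1$: if y is only regressed on x1, this will be the estimated slope on x1
• $\beta_2 v + u$: error term
And the bias in the slope = $\beta_2 \delta_1$
Conclusion: All estimated coefficients will be biased.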
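The decomposition of the short-regression coefficients also holds exactly in any sample: the slope from regressing y on x1 alone equals the long-regression slope on x1 plus the long-regression slope on x2 times the slope from regressing x2 on x1. The sketch below checks this on simulated data. The data-generating values (0.6, 0.5, 0.8) and the helper `ols` are hypothetical choices for illustration only.

```python
import numpy as np

def ols(y, *regressors):
    """OLS coefficients (intercept first) of y on the given regressors."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    return np.linalg.lstsq(X, y, rcond=None)[0]

rng = np.random.default_rng(1)
n = 500
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + rng.normal(size=n)                   # assumed link between x2 and x1
y = 1.0 + 0.5 * x1 + 0.8 * x2 + rng.normal(size=n)   # assumed true model

b_long = ols(y, x1, x2)   # regression including x2
b_short = ols(y, x1)      # regression omitting x2
d = ols(x2, x1)           # auxiliary regression of x2 on x1

implied_intercept = b_long[0] + b_long[2] * d[0]
implied_slope = b_long[1] + b_long[2] * d[1]

print("short-regression slope:", b_short[1])
print("long slope + (x2 slope) * (aux slope):", implied_slope)
```

Because the omitted variable has a positive effect and is positively related to x1 in this setup, the short-regression slope comes out above the long-regression slope.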
Omitted Variable Bias – the simple case
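Our example again: Omitting ability in a wage equation
$\ln(wage) = \beta_0 + \beta_1 educ + \beta_2 abil + u$, with $abil = \delta_0 + \delta_1 educ + v$
Both $\beta_2$ and $\delta_1$ will be positive.
The return to education $\beta_1$ will be over-estimated because $\beta_2 \delta_1 > 0$.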
It will look as if people with many years of education earn very high
wages, but this is partly due to the effect of ability - the fact that
people with more education are also more able on average.
Omitted Variable Bias – the simple case
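Summarising the direction of the bias (the sign of $\beta_2 \delta_1$ when x2 is omitted):
                   Corr(x1, x2) > 0     Corr(x1, x2) < 0
$\beta_2 > 0$      Positive bias        Negative bias
$\beta_2 < 0$      Negative bias        Positive bias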
Omitted Variable Bias – the simple case
What about the size of the bias?
Our example again: Omitting ability in a wage equation
$\ln(wage) = \beta_0 + \beta_1 educ + \beta_2 abil + u$
As above, the return to education $\beta_1$ will be over-estimated
because $\beta_2 \delta_1 > 0$.
By how much?
For example, if the return to educ in the population is 8.6%
• a bias of $\beta_2 \delta_1$ = 0.1 percentage points – not so worrying
• a bias of $\beta_2 \delta_1$ = 3 percentage points – big concern
Omitted Variable Bias – the simple case
Q: When is there no omitted variable bias?
A: When the omitted factors are
i) unrelated to the included regressors => $Corr(x_1, x_2) = 0$, $\delta_1 = 0$
ii) or when they don't affect the outcome => $\beta_2 = 0$
In our wages example:
• $Corr(x_1, x_2) > 0$ if individuals with high innate ability
tend to have higher education.
• $\beta_2 > 0$ if individuals with high ability tend to have higher
productivity and wages.
Together, we would expect that OLS overestimates $\beta_1$
Omitted Variable Bias: more general cases
(Wooldridge 3.3c)
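True model (contains x1, x2, and x3): $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + u$
Estimated model (x3 is omitted): $\tilde{y} = \tilde\beta_0 + \tilde\beta_1 x_1 + \tilde\beta_2 x_2$
– No general statements possible about direction of bias
– If $Corr(x_1, x_3) \neq 0$, we can analyse as per the simple
case if and only if $x_2$ is uncorrelated with the others, $x_1$ and $x_3$
• Example: Omitting ability in a wage equation
$\ln(wage) = \beta_0 + \beta_1 educ + \beta_2 exper + \beta_3 abil + u$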
If experience is approximately uncorrelated with educ and abil, then the
direction of the omitted variable bias can be as analyzed in the simple two
variable case.
Omitting relevant variables =>
Inconsistency
• Not only is the OLS estimator biased when we omit
relevant variables, it is also inconsistent
see section 5.1a, Wooldridge
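• We can show that:
$\tilde\beta_1 = \beta_1 + \beta_2 \frac{\sum_i (x_{i1} - \bar{x}_1) x_{i2}}{\sum_i (x_{i1} - \bar{x}_1)^2} + \frac{\sum_i (x_{i1} - \bar{x}_1) u_i}{\sum_i (x_{i1} - \bar{x}_1)^2}$
• Then taking the plim on both sides, we have:
$plim\ \tilde\beta_1 = \beta_1 + \beta_2 \frac{Cov(x_1, x_2)}{Var(x_1)}$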
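Inconsistency means the estimate settles on the wrong value as the sample grows, rather than drifting toward the true slope. The sketch below illustrates this with simulated data in which the slope from regressing y on x1 alone is computed for increasing n. All numerical values are assumptions made for the illustration: a true slope of 0.5, an effect of 0.8 for the omitted variable, Cov(x1, x2) = 0.6 and Var(x1) = 1, so the probability limit is 0.5 + 0.8 × 0.6 = 0.98.

```python
import numpy as np

beta1, beta2 = 0.5, 0.8        # hypothetical population coefficients
cov_x1_x2, var_x1 = 0.6, 1.0   # implied by the simulated design below
plim_value = beta1 + beta2 * cov_x1_x2 / var_x1

def short_regression_slope(n, rng):
    """Slope from regressing y on x1 only, with x2 wrongly omitted."""
    x1 = rng.normal(size=n)
    x2 = 0.6 * x1 + rng.normal(size=n)
    y = 1.0 + beta1 * x1 + beta2 * x2 + rng.normal(size=n)
    c = np.cov(x1, y)          # 2x2 sample covariance matrix
    return c[0, 1] / c[0, 0]   # sample Cov(x1, y) / sample Var(x1)

rng = np.random.default_rng(2)
slopes = {n: short_regression_slope(n, rng) for n in (100, 10_000, 1_000_000)}
for n, slope in slopes.items():
    print(f"n = {n:>9,}: slope = {slope:.4f} (true beta1 = {beta1}, plim = {plim_value:.2f})")
```

With more data the estimate becomes more precise, but it is precise about 0.98 rather than about the true value of 0.5.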
Sampling Variances
Specification Errors and Multi-collinearity
• So omitting a relevant variable can cause bias.
• Including irrelevant variables does not cause bias
• BUT, you might consider leaving them out so as to not
unnecessarily inflate the sampling variance
⇒there is a trade-off to be made with the effect on the
variance of our slope parameters
Sampling Variance: Mis-specified Models
(Wooldridge 3.4b)
• The choice of whether to include a particular variable in a
regression can be made by analyzing this trade-off
between bias and variance
• It might be the case that the likely omitted variable bias in
the misspecified model 2 is compensated by a smaller
variance
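True population model: $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + u$
Estimated model 1: $\hat{y} = \hat\beta_0 + \hat\beta_1 x_1 + \hat\beta_2 x_2$
Estimated model 2: $\tilde{y} = \tilde\beta_0 + \tilde\beta_1 x_1$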
Sampling Variance: Misspecified Models
(Wooldridge 3.4b)
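• Variances of the slope on x1 in the two models, where $R_1^2$ is the
R-squared from regressing x1 on x2:
$Var(\hat\beta_1) = \sigma^2 / [SST_1 (1 - R_1^2)]$ (model 1, x2 included)
$Var(\tilde\beta_1) = \sigma^2 / SST_1$ (model 2, x2 omitted)
Conditional on x1 and x2, the variance in model 2 is always smaller than
that in model 1 (the two are equal only if x1 and x2 are uncorrelated
in the sample)
• Case 1: $\beta_2 = 0$ ⇒ $E(\hat\beta_1) = \beta_1$, $E(\tilde\beta_1) = \beta_1$, $Var(\tilde\beta_1) < Var(\hat\beta_1)$
Conclusion: Do not include irrelevant regressors
• Case 2: $\beta_2 \neq 0$ ⇒ $E(\hat\beta_1) = \beta_1$, $E(\tilde\beta_1) \neq \beta_1$, $Var(\tilde\beta_1) < Var(\hat\beta_1)$
Conclusion: Trade off bias and variance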
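The variance comparison between the two estimated models can be made concrete with the standard formulas under MLR.1-MLR.5: σ²/SST₁ when x2 is omitted, and σ²/[SST₁(1 − R₁²)] when x2 is included, where R₁² is the R-squared from regressing x1 on x2. The sketch below evaluates both on one simulated design and cross-checks the second against the matrix expression σ²(X'X)⁻¹. The design, the error variance, and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
n, sigma2 = 300, 1.0                       # assumed sample size and error variance
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=n)   # x2 strongly correlated with x1

sst1 = np.sum((x1 - x1.mean()) ** 2)       # total sample variation in x1
r2_1 = np.corrcoef(x1, x2)[0, 1] ** 2      # R-squared from regressing x1 on x2

var_model1 = sigma2 / (sst1 * (1.0 - r2_1))   # slope on x1 with x2 included
var_model2 = sigma2 / sst1                    # slope on x1 with x2 omitted

# Cross-check model 1 against sigma^2 * (X'X)^{-1}, element for x1.
X = np.column_stack([np.ones(n), x1, x2])
var_model1_matrix = sigma2 * np.linalg.inv(X.T @ X)[1, 1]

print(f"R_1^2 = {r2_1:.3f}")
print(f"Var(slope | model 1) = {var_model1:.5f}")
print(f"Var(slope | model 2) = {var_model2:.5f}")
print(f"ratio = {var_model1 / var_model2:.2f}, 1 / (1 - R_1^2) = {1.0 / (1.0 - r2_1):.2f}")
```

The ratio of the two variances equals 1/(1 − R₁²), so the variance penalty for including x2 grows with its correlation with x1.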
Sampling Variance: Misspecified Models
Caution! Bias will not vanish even in large samples, but the variance
of $\hat\beta_1$ will decrease as the sample size grows.
Multicollinearity & sampling variance
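– Recall: $Var(\hat\beta_j) = \sigma^2 / [SST_j (1 - R_j^2)]$, where $R_j^2$ is the
R-squared from regressing $x_j$ on all other independent variables
– Linear relationships between explanatory variables can create
problems
– High multicollinearity can occur when $R_j^2$ is 'close' to 1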
– Ideally, we have little correlation between xj and other
independent variables – Yet this may not be the case.
– For example, examining the effect of various school expenditure
categories on school performance
• It is expected that wealthier schools will spend more on everything than less
wealthy schools
• It can be difficult to estimate the effect of any category of school expenditure on
student performance when there is little variation in one category that is not
explained by variation in the other categories
Multicollinearity & sampling variance
[Regression equation: the average standardized test score of a school regressed on
expenditure on teachers, expenditure on instructional materials, and other expenditures]
The different expenditure categories will be strongly correlated: if a school has a
lot of resources it will spend a lot on everything.
It will be hard to estimate the differential effects of different expenditure
categories because all expenditures are either high or low.
For precise estimates of the differential effects, one would need information
about situations where expenditure categories change differentially.
So the sampling variance of the estimated effects will often be large.
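One way to see how strongly correlated expenditure categories inflate the sampling variance is to compute, for each category, the R-squared from regressing it on the other categories and the implied factor 1/(1 − R²), commonly called the variance inflation factor. The sketch below does this on simulated data. The three category names, the common "resources" driver, and the noise scale are hypothetical stand-ins, not data from the lecture.

```python
import numpy as np

def r_squared_on_others(X, j):
    """R-squared from regressing column j of X on the other columns plus an intercept."""
    target = X[:, j]
    others = np.delete(X, j, axis=1)
    Z = np.column_stack([np.ones(len(target)), others])
    fitted = Z @ np.linalg.lstsq(Z, target, rcond=None)[0]
    ssr = np.sum((target - fitted) ** 2)
    sst = np.sum((target - target.mean()) ** 2)
    return 1.0 - ssr / sst

rng = np.random.default_rng(4)
n = 500
resources = rng.normal(size=n)   # hypothetical common driver: school wealth
names = ["teachers", "materials", "other"]
# Each category is mostly driven by overall resources, plus a little noise.
X = np.column_stack([resources + 0.2 * rng.normal(size=n) for _ in names])

vif = {}
for j, name in enumerate(names):
    r2 = r_squared_on_others(X, j)
    vif[name] = 1.0 / (1.0 - r2)
    print(f"{name:>9}: R_j^2 = {r2:.3f}, 1 / (1 - R_j^2) = {vif[name]:.1f}")
```

Raising the noise scale in the simulated design gives each category more independent variation and brings the factors back toward 1.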
Multicollinearity & sampling variance
• Because effects cannot be disentangled, it may be better to lump
all expenditure categories together
• In other cases, dropping some of the x's may reduce
multicollinearity (but this may lead to omitted variable bias!)
• Only the sampling variance of the variables involved in
multicollinearity will be inflated; the estimates of other effects
may be very precise
• Note that multicollinearity is not a violation of MLR.3, which rules out only
perfect collinearity
• Multicollinearity may be detected through variance inflation