ECMT2150 – Lecture 6
Topics Today: Week 6

Model Specification
• Specification errors
  – Including irrelevant variables
  – Omitting relevant variables: Omitted Variable Bias
• More on multicollinearity and sampling variance
• Endogeneity
• Proxy Variables
Reference: Chapters 3.3b; 3.4a; Chp 9.2

Recap
• Econometrics is about estimating economic relationships, e.g. between wages and education.
• Simple and multiple linear regression, with many examples
  – Incorporated non-linearities: logs and polynomials
  – Incorporated qualitative and categorical information
  – Discussed goodness of fit
• Derived the OLS estimator: understand the steps and the intuition. The derivation is not examinable.
• Statistical properties of OLS that hold for any sample under given assumptions
  – Expected values/unbiasedness under MLR.1-MLR.4
  – Variance formulas under MLR.1-MLR.5
  – Gauss-Markov Theorem under MLR.1-MLR.5
  – Exact sampling distributions/tests under MLR.1-MLR.6
• Asymptotic properties of OLS (consistency and asymptotic normality)

Recap: why?
• Importance of the zero conditional mean assumption: $E(u|x) = E(u) = 0$
• The reliability of the estimator and of inference rests on whether the assumptions hold
• Causality: understanding what we have to assume to obtain causal estimates of our parameters of interest
• In addition, we need the variance formulas and the sampling distribution assumptions to conduct inference

Recap
• Inference
  – Hypothesis tests
    • t-tests: one-sided; two-sided (tests of statistical significance)
    • p-values
    • Confidence intervals
  – Testing more general alternatives
    • A parameter is equal to a constant
    • One parameter is equal to another ⇒ linear combination of parameters
    • Testing multiple linear restrictions

Specification Errors
• Until now we have assumed that our population model
  $y = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k + u$
  has been correctly specified
• This is a bit unrealistic: we can never be sure of the true population model
• Types of specification errors:
  – Choice of independent variables
    • Over-specification and sampling variance effects
    • Omitted variables and endogeneity
  – Heteroskedasticity
  – Functional form
  – Measurement error
  – Missing data, non-random sampling and outliers

Misspecification in the choice of independent variables
A. Including irrelevant variables in a regression model (over-specifying)
  $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + u$, with $\beta_3 = 0$ in the population
  – This model satisfies MLR.1-MLR.4
  – $x_3$ may be correlated with $x_1$ and $x_2$
  – Crucially, in the population, $x_3$ has no effect on $y$ after we control for $x_1$ and $x_2$
  ⇒ Including $x_3$ has no cost in terms of bias in the estimates of any of the parameters, because OLS is unbiased under MLR.1-MLR.4 whatever the parameter values; in particular $E(\hat\beta_3) = \beta_3 = 0$
  – However, including irrelevant variables may increase the sampling variance (more on this shortly)

B. Omitting relevant variables (Wooldridge 3.3b)
  ⇒ Omitted Variable Bias
  ⇒ Violates MLR.4, $E(u|x) = 0$
The simple case:
  True model (contains $x_1$ and $x_2$): $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + u$
  but, due to our ignorance or data unavailability, we estimate
  Estimated model ($x_2$ is omitted): $\tilde{y} = \tilde\beta_0 + \tilde\beta_1 x_1$

Example: omitting ability in a wage equation
  True model: $\ln(wage) = \beta_0 + \beta_1 educ + \beta_2 abil + u$
  We estimate: a regression of $\ln(wage)$ on $educ$ only
Omitting a relevant variable causes bias when the omitted variable is correlated with any of the other explanatory variables in the model.

Omitted Variable Bias: the simple case
Let's look at this in more detail:
• If $x_1$ and $x_2$ are correlated, assume a linear regression relationship between them:
  $x_2 = \delta_0 + \delta_1 x_1 + v$
• The true model is $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + u$. Substituting for $x_2$ gives
  $y = \beta_0 + \beta_1 x_1 + \beta_2(\delta_0 + \delta_1 x_1 + v) + u = (\beta_0 + \beta_2\delta_0) + (\beta_1 + \beta_2\delta_1) x_1 + (\beta_2 v + u)$
  – $\beta_0 + \beta_2\delta_0$: if $y$ is only regressed on $x_1$, this is what the intercept estimates
  – $\beta_1 + \beta_2\delta_1$: if $y$ is only regressed on $x_1$, this is what the slope on $x_1$ estimates
  – $\beta_2 v + u$: the error term
• The bias in the slope is therefore $E(\tilde\beta_1) - \beta_1 = \beta_2\delta_1$
Conclusion: all estimated coefficients will be biased.

Our example again: omitting ability in a wage equation
  $\ln(wage) = \beta_0 + \beta_1 educ + \beta_2 abil + u$, with $abil = \delta_0 + \delta_1 educ + v$
  ⇒ $\ln(wage) = (\beta_0 + \beta_2\delta_0) + (\beta_1 + \beta_2\delta_1) educ + (\beta_2 v + u)$
  The term $\beta_2\delta_1$ will be positive.
The return to education $\beta_1$ will be over-estimated because $\beta_2\delta_1 > 0$. It will look as if people with many years of education earn very high wages, but this is partly due to the effect of ability: people with more education are also more able on average.

Summarising the direction of the bias (the sign of $\beta_2\delta_1$, where $\delta_1$ has the sign of $Cov(x_1, x_2)$):
  $\beta_2 > 0$ and $Cov(x_1, x_2) > 0$: positive bias
  $\beta_2 > 0$ and $Cov(x_1, x_2) < 0$: negative bias
  $\beta_2 < 0$ and $Cov(x_1, x_2) > 0$: negative bias
  $\beta_2 < 0$ and $Cov(x_1, x_2) < 0$: positive bias

What about the size of the bias?
In the wage example, the return to education $\beta_1$ is over-estimated because $\beta_2\delta_1 > 0$. By how much? For example, if the return to educ in the population is 8.6%:
• a bias of $\beta_2\delta_1$ = 0.1 percentage points is not so worrying
• a bias of $\beta_2\delta_1$ = 3 percentage points is a big concern

Q: When is there no omitted variable bias?
A: When the omitted factors
  i) are unrelated to the included regressor ⇒ $Cov(x_1, x_2) = 0$, so $\delta_1 = 0$, or
  ii) do not affect the outcome ⇒ $\beta_2 = 0$
In our wages example:
• $Cov(x_1, x_2) > 0$ if individuals with high innate ability tend to have higher education.
• $\beta_2 > 0$ if individuals with high ability tend to have higher productivity and wages.
Together, we would expect OLS to overestimate $\beta_1$.

Omitted Variable Bias: more general cases (Wooldridge 3.3c)
– No general statements are possible about the direction of the bias
– If $Cov(x_1, x_3) \neq 0$, we can analyse the bias as in the simple case if and only if $x_2$ is uncorrelated with $x_1$ and $x_3$
• Example: omitting ability in a wage equation, with $x_1 = educ$, $x_2 = exper$, $x_3 = abil$
  True model (contains $x_1$, $x_2$ and $x_3$): $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + u$
  Estimated model ($x_3$ is omitted): $\tilde{y} = \tilde\beta_0 + \tilde\beta_1 x_1 + \tilde\beta_2 x_2$
  If experience is approximately uncorrelated with educ and abil, then the direction of the omitted variable bias can be analysed as in the simple two-variable case.

Omitting relevant variables ⇒ Inconsistency
• Not only is the OLS estimator biased when we omit relevant variables, it is also inconsistent (see Wooldridge, section 5.1a)
• We can show that:
  $\tilde\beta_1 = \beta_1 + \beta_2 \frac{\sum_i (x_{i1} - \bar{x}_1) x_{i2}}{\sum_i (x_{i1} - \bar{x}_1)^2} + \frac{\sum_i (x_{i1} - \bar{x}_1) u_i}{\sum_i (x_{i1} - \bar{x}_1)^2}$
• Taking the plim on both sides, the last term vanishes because $Cov(x_1, u) = 0$, and we have:
  $\text{plim}\, \tilde\beta_1 = \beta_1 + \beta_2 \frac{Cov(x_1, x_2)}{Var(x_1)} = \beta_1 + \beta_2\delta_1$

Sampling Variances
Specification Errors and Multi-collinearity
• So omitting a relevant variable can cause bias.
• Including irrelevant variables does not cause bias
• BUT, you might consider leaving them out so as not to inflate the sampling variance unnecessarily
  ⇒ there is a trade-off to be made with the effect on the variance of our slope parameters

Sampling Variance: Misspecified Models (Wooldridge 3.4b)
• The choice of whether to include a particular variable in a regression can be made by analysing this trade-off between bias and variance
  True population model: $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + u$
  Estimated model 1: $\hat{y} = \hat\beta_0 + \hat\beta_1 x_1 + \hat\beta_2 x_2$
  Estimated model 2: $\tilde{y} = \tilde\beta_0 + \tilde\beta_1 x_1$
• It might be the case that the likely omitted variable bias in the misspecified model 2 is compensated by a smaller variance
• Variance in the misspecified model:
  $Var(\tilde\beta_1) = \frac{\sigma^2}{SST_1}$, compared with $Var(\hat\beta_1) = \frac{\sigma^2}{SST_1 (1 - R_1^2)}$ in model 1
  Conditional on $x_1$ and $x_2$, the variance in model 2 is never larger than that in model 1, and it is strictly smaller unless $x_1$ and $x_2$ are uncorrelated in the sample ($R_1^2 = 0$)
• Case 1: $\beta_2 = 0$ ⇒ $E(\hat\beta_1) = \beta_1$, $E(\tilde\beta_1) = \beta_1$, $Var(\tilde\beta_1) < Var(\hat\beta_1)$
  Conclusion: do not include irrelevant regressors
• Case 2: $\beta_2 \neq 0$ ⇒ $E(\hat\beta_1) = \beta_1$, $E(\tilde\beta_1) \neq \beta_1$, $Var(\tilde\beta_1) < Var(\hat\beta_1)$
  Conclusion: trade off bias and variance
Caution! The bias will not vanish even in large samples, but the variance of the estimator will shrink as the sample size grows.

Multicollinearity & sampling variance
– Recall: $Var(\hat\beta_j) = \frac{\sigma^2}{SST_j (1 - R_j^2)}$, where $SST_j = \sum_i (x_{ij} - \bar{x}_j)^2$ and $R_j^2$ is the R-squared from regressing $x_j$ on all the other independent variables
– Linear relationships between explanatory variables can create problems
– High multicollinearity occurs when $R_j^2$ is 'close' to 1
– Ideally, we have little correlation between $x_j$ and the other independent variables
– Yet this may not be the case
– For example, examining the effect of various school expenditure categories on school performance:
  • It is expected that wealthier schools will spend more on everything than less wealthy schools
  • It can be difficult to estimate the effect of any one category of school expenditure on student performance when there is little variation in that category that is not explained by variation in the others

Multicollinearity & sampling variance
  Dependent variable: average standardized test score of a school
  Explanatory variables: expenditure on teachers, expenditure on instructional materials, other expenditures
The different expenditure categories will be strongly correlated: if a school has a lot of resources it will spend a lot on everything.
It will be hard to estimate the differential effects of different expenditure categories because all expenditures are either high or low. For precise estimates of the differential effects, one would need information about situations where expenditure categories change differentially. So the sampling variance of the estimated effects will often be large.

Multicollinearity & sampling variance
• Because the effects cannot be disentangled, it may be better to lump all expenditure categories together
• In other cases, dropping some of the x's may reduce multicollinearity (but this may lead to omitted variable bias!)
• Only the sampling variance of the variables involved in multicollinearity will be inflated; the estimates of other effects may be very precise
• Note that multicollinearity is not a violation of MLR.3, which rules out only perfect collinearity
• Multicollinearity may be detected through variance inflation
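
The two effects discussed in this lecture, omitted variable bias and the inflation of sampling variance when regressors are correlated, can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: the sample size, the coefficient values, the strength of the link between x1 and x2, the random seed and the helper name `ols` are all hypothetical choices, not part of the lecture. It checks that the slope from the short regression equals the long-regression slope plus (coefficient on x2) times (slope of x2 on x1), and it computes $R_1^2$ and the inflation term $1/(1 - R_1^2)$ from the variance formula.

```python
import numpy as np


def ols(X, y):
    """Return OLS coefficients of y on X (X must already contain a constant column)."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


rng = np.random.default_rng(0)          # arbitrary seed, for reproducibility only
n = 5_000                               # hypothetical sample size
beta0, beta1, beta2 = 1.0, 0.5, 0.3     # hypothetical population parameters

# x2 is strongly related to x1, so omitting x2 biases the slope on x1
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + rng.normal(scale=0.5, size=n)
u = rng.normal(size=n)
y = beta0 + beta1 * x1 + beta2 * x2 + u

const = np.ones(n)
b_long = ols(np.column_stack([const, x1, x2]), y)    # y on x1 and x2
b_short = ols(np.column_stack([const, x1]), y)       # y on x1 only (x2 omitted)
delta = ols(np.column_stack([const, x1]), x2)        # x2 on x1

# In-sample identity behind the bias formula:
# short slope = long slope on x1 + (long slope on x2) * (slope of x2 on x1)
implied_short_slope = b_long[1] + b_long[2] * delta[1]

# R_1^2 from regressing x1 on the other regressor, and the inflation term
Z = np.column_stack([const, x2])
resid = x1 - Z @ ols(Z, x1)
r2_1 = 1.0 - (resid @ resid) / np.sum((x1 - x1.mean()) ** 2)
inflation_1 = 1.0 / (1.0 - r2_1)

print("slope on x1, long regression :", b_long[1])
print("slope on x1, short regression:", b_short[1])
print("implied by bias formula      :", implied_short_slope)
print("R_1^2 and 1/(1 - R_1^2)      :", r2_1, inflation_1)
```

With these assumed values the short-regression slope sits above the long-regression slope, which is the positive-bias case (positive coefficient on the omitted variable, positive covariance). The inflation term shows how much larger the variance of the slope on x1 is in the long regression than it would be if x1 and x2 were uncorrelated.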