Introduction to Econometrics
The problem set is graded out of 60 points.
1. [20 points] Use the data in hprice1.dta to estimate the following model (the variables in the data set are described in Table 1 below):
price = β0 + β1sqrft + β2bdrms + u
where price = the (selling) price of the house (in 1000 dollars), sqrft = size of house (square feet) and bdrms = number of bedrooms in the house.
(a) Write out the estimation result in equation form. [2 points]
(b) What is the estimated increase in price for a house with one more bedroom, keeping square footage constant? [2 points]
(c) What is the estimated increase in price for a house with an additional bedroom that is 1400 square feet in size? Compare this to your answer in (b). [4 points]
(d) What percentage of the variation in price is explained by square footage and number of bedrooms? Compare your answer to the adjusted R2. Explain the difference. [4 points]
(e) Consider the first house in the sample. Report the square footage and number of bedrooms for this house. Find the predicted selling price for this house from the OLS regression line. [4 points]
(f) What is the actual selling price of the first house in the sample? Find the residual of this house. Does it suggest that the buyer underpaid or overpaid for the house? Explain. [4 points]
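The quantities asked for in parts (a) to (f) can be sketched as follows. This is only an illustration under stated assumptions: the six houses and prices are made up (they are not the hprice1.dta observations), and the helper names solve_linear_system, ols, predict, r_squared and adjusted_r_squared are hypothetical. In Stata the same fit would normally come from the regress command; the pure-Python version below only makes the arithmetic visible.

```python
# Minimal OLS sketch in pure Python. The data below are made up for
# illustration; they are NOT the hprice1.dta observations.

def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix, so row operations act on both sides at once.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def ols(rows, y):
    """OLS coefficients from the normal equations (intercept first)."""
    x = [[1.0] + list(r) for r in rows]
    k = len(x[0])
    xtx = [[sum(xi[p] * xi[q] for xi in x) for q in range(k)] for p in range(k)]
    xty = [sum(xi[p] * yi for xi, yi in zip(x, y)) for p in range(k)]
    return solve_linear_system(xtx, xty)

def predict(beta, row):
    """Fitted value: intercept plus slope times each regressor."""
    return beta[0] + sum(b * v for b, v in zip(beta[1:], row))

def r_squared(beta, rows, y):
    y_bar = sum(y) / len(y)
    ssr = sum((yi - predict(beta, r)) ** 2 for r, yi in zip(rows, y))
    sst = sum((yi - y_bar) ** 2 for yi in y)
    return 1.0 - ssr / sst

def adjusted_r_squared(r2, n, k):
    """k is the number of regressors, excluding the intercept."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

# Hypothetical houses: (sqrft, bdrms), with price in $1000.
houses = [(1500, 3), (2000, 4), (1200, 2), (2600, 4), (1800, 3), (2200, 5)]
prices = [250.0, 310.0, 195.0, 400.0, 280.0, 335.0]

beta = ols(houses, prices)
fitted_first = predict(beta, houses[0])
residual_first = prices[0] - fitted_first   # actual minus predicted
r2 = r_squared(beta, houses, prices)
adj_r2 = adjusted_r_squared(r2, n=len(prices), k=2)
print(beta, fitted_first, residual_first, r2, adj_r2)
```

A negative residual means the actual price is below the fitted price, and a positive residual means it is above. The adjusted R2 is never larger than R2 because it penalizes each added regressor.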
Table 1
DATA DESCRIPTION, FILE: hprice1.dta
Variable | Definition
price | House price, in $1000.
assess | Assessed value, in $1000.
bdrms | Number of bedrooms.
lotsize | Size of lot, in square feet.
sqrft | Size of house, in square feet.
colonial | = 1 if house is in Colonial style, = 0 otherwise.
lprice | log(price)
lassess | log(assess)
llotsize | log(lotsize)
lsqrft | log(sqrft)
2. [20 points] Allcott and Gentzkow (2017) conducted an online survey of US adults regarding fake news after the 2016 presidential election. In their survey, they showed respondents news headlines about the 2016 election and asked whether the headlines were true or false. Some of the headlines were fake and others were true.
Their dependent variable Yi takes value 1 if survey respondent i correctly identifies whether the headline is true or false, value 0.5 if the respondent is “not sure”, and value 0 otherwise.
Suppose that one conducts a similar survey and obtains the following regression result:
$$\hat{Y}_i = \underset{(0.02)}{0.65} + \underset{(0.004)}{0.012}\,\text{college} + \underset{(0.003)}{0.015}\,\ln(\text{Daily media time}) + \underset{(0.001)}{0.003}\,\text{Age}, \quad R^2 = 0.14, \quad n = 828,$$
where college is a binary indicator that equals 1 if a survey respondent is a college graduate and 0 otherwise, ln(Daily media time) is the logarithm of daily time spent consuming media, and Age is age in years.
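The numbers in parentheses are the standard errors of the coefficients above them, so a t-statistic for the null hypothesis that a coefficient is zero is the estimate divided by its standard error. A minimal sketch follows; the dictionary keys are hypothetical labels, and the standard normal is assumed as a large-sample approximation to the t distribution, which seems reasonable with n = 828.

```python
from statistics import NormalDist

# Estimates and standard errors copied from the regression output above.
estimates = {
    "intercept": (0.65, 0.02),
    "college": (0.012, 0.004),
    "ln_daily_media_time": (0.015, 0.003),
    "age": (0.003, 0.001),
}

# t-statistic for H0: coefficient = 0 is estimate / standard error.
t_stats = {name: b / se for name, (b, se) in estimates.items()}

# With n = 828 the t distribution is close to the standard normal, so
# normal quantiles are used here as a large-sample approximation.
one_sided_1pct = NormalDist().inv_cdf(0.99)    # about 2.33
two_sided_1pct = NormalDist().inv_cdf(0.995)   # about 2.58

for name, t in t_stats.items():
    print(f"{name}: t = {t:.2f}")
```

Whether the one-sided or the two-sided critical value applies depends on how the alternative hypothesis is stated.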
(a) Suppose that you would like to test that people with higher education have more accurate beliefs about news at the 1% level. State your null hypothesis precisely and report your test result. [4 points]
(b) The estimated coefficient for ln(Daily media time) is significantly positive. Interpret this result. Explain why this is plausible. [4 points]
(c) Even if Age is omitted, there will be little concern about the omitted variable bias problem. Do you agree? Explain briefly. [6 points]
(d) Suppose that you now conjecture that Republicans may have different beliefs about news than Democrats. Assume that there are three groups in the data: Democrats, Republicans and Independents. How would you change the specification of the linear regression model by adding or subtracting regressors? Explain briefly. [6 points]
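For a categorical variable with three groups, one common coding is a set of 0/1 indicators with one group left out as the base category. The sketch below is only illustrative: the party labels are invented, the helper make_dummies is hypothetical, and using Democrats as the base group is an arbitrary choice.

```python
# Hypothetical party labels for five respondents (not real survey data).
parties = ["Democrat", "Republican", "Independent", "Republican", "Democrat"]

def make_dummies(labels, base):
    """One 0/1 column per category except the base (omitted) category."""
    categories = sorted(set(labels) - {base})
    return {c: [1 if lab == c else 0 for lab in labels] for c in categories}

# Democrats are the base group here; that choice is arbitrary.
dummies = make_dummies(parties, base="Democrat")

# If a Democrat dummy were included as well, the three dummies would sum
# to 1 for every respondent and duplicate the intercept column (the
# dummy variable trap).
all_three = make_dummies(parties, base=None)
row_sums = [sum(col[i] for col in all_three.values()) for i in range(len(parties))]
print(dummies, row_sums)
```

With an intercept in the model, each included dummy coefficient is read as a difference relative to the omitted group.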
3. [20 Points] Consider the following Population Linear Regression Function (PLRF):
yi = β0 + β1x1i + β2x2i + β3x3i + β4x4i + β5x5i + ui    (1)
where yi = average hourly earnings/wage in $, x1 = years of education, x2 = years of potential experience, x3 = years with current employer (tenure), x4 = 1 if female, x5 = 1 if nonwhite, and ui = the usual error term of the model.
For this question, use the WAGE data set that you used in PS#2. The variables in the data set are described below for your reference. We may use this data set in upcoming problem sets as well.
Obs: 526
1. wage average hourly earnings
2. educ years of education
3. exper years potential experience
4. tenure years with current employer
5. nonwhite =1 if nonwhite
6. female =1 if female
7. married =1 if married
8. numdep number of dependents
9. smsa =1 if live in SMSA
10. northcen =1 if live in north central U.S.
11. south =1 if live in southern region
12. west =1 if live in western region
13. construc =1 if work in construc. indus.
14. ndurman =1 if in nondur. manuf. indus.
15. trcommpu =1 if in trans, commun, pub ut
16. trade =1 if in wholesale or retail
17. services =1 if in services indus.
18. profserv =1 if in prof. serv. indus.
19. profocc =1 if in profess. occupation
20. clerocc =1 if in clerical occupation
21. servocc =1 if in service occupation
22. lwage log(wage)
23. expersq exper^2
24. tenursq tenure^2
(a) Consider the following restricted version of equation (1): yi = β0 + β1x1i + β2x2i + ui. Suppose that x2 is omitted from the model by the researcher. For the omission of x2 to cause omitted variable bias (OVB), what conditions should x2 satisfy? Show mathematically that the OLS estimator of β1 is biased if x2 is omitted from the model. [4 Points]
(b) Run a regression of yi = β0 + β4x4 + ui and interpret the slope coefficient β4. (Hint: x4 is a binary explanatory variable.) [2 Points]
(c) First generate a dummy variable Di such that Di = 1 if male and Di = 0 if female. Then run a regression of yi = β0 + β1x1 + β4x4 + β6Di + ui. What do you notice in the result? Explain why. Show mathematically that if x4 and Di are related, this result is inevitable. [6 Points]
(d) First run the simple regression yi = β0 + β1x1 + ui, then run yi = β0 + β1x1 + β2x2 + ui. Explain what happened to the estimate of β1 (before and after) and why it happened. [2 Points]
(e) Now run the full model (1) twice, once with homoskedasticity-only standard errors and once with heteroskedasticity-robust standard errors. Interpret and compare the results of the two regressions. Why do we care about heteroskedasticity that might exist in the data? [4 Points]
(f) Based on the regression result of the latter (i.e., with heteroskedasticity-robust standard errors), conduct the following hypothesis tests:
i. H0 : βi = 0 vs H1 : βi ≠ 0, where i = 1, 2, … , 5
ii. H0 : β1 = β2 = β3 = β4 = β5 = 0 vs H1 : At least one βi ≠ 0 [2 Points]
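The omitted variable logic in part (a) can be checked numerically. When y depends on x1 and x2 but only x1 is included, the slope from the short regression equals β1 + β2·δ1, where δ1 is the slope from regressing x2 on x1. The sketch below uses invented numbers and no error term, so the identity holds exactly in the sample; with an error term it holds in expectation under the usual assumptions. The helper names mean and slope are hypothetical.

```python
def mean(v):
    return sum(v) / len(v)

def slope(x, y):
    """Slope from a simple OLS regression of y on x (with intercept)."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    return sxy / sxx

# Made-up data: x2 is negatively correlated with x1, and y depends on both.
x1 = [8, 10, 12, 12, 14, 16, 16, 18]
x2 = [20, 15, 12, 18, 9, 6, 10, 3]
beta0, beta1, beta2 = 1.0, 0.6, 0.1
y = [beta0 + beta1 * a + beta2 * b for a, b in zip(x1, x2)]  # no error term

short_slope = slope(x1, y)     # regression of y on x1 only
delta1 = slope(x1, x2)         # regression of x2 on x1
implied = beta1 + beta2 * delta1
print(short_slope, implied)
```

The bias term β2·δ1 vanishes only if β2 = 0 or if x1 and x2 are uncorrelated, which matches the two conditions for omitted variable bias.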
The following questions will not be graded; they are for practice and will be discussed at the recitation:
1. SW Exercise 7.1
2. SW Exercise 7.4
(a) The F-statistic testing that the coefficients on the regional regressors are zero is 6.10. The 1% critical value (from the F3,∞ distribution) is 3.78. Because 6.10 > 3.78, the regional effects are significant at the 1% level.
(bi) The expected difference between Juanita and Molly is (X6,Juanita − X6,Molly) × β6 = β6. Thus a 95% confidence interval is −0.27 ± 1.96 × 0.26.
(bii) The expected difference between Juanita and Jennifer is (X5,Juanita − X5,Jennifer) × β5 + (X6,Juanita − X6,Jennifer) × β6 = −β5 + β6. A 95% confidence interval could be constructed using the general methods discussed in Section 7.3. In this case, an easy way to do this is to omit Midwest from the regression and replace it with X5 = West. In this new regression the coefficient on South measures the difference in wages between the South and the Midwest, and a 95% confidence interval can be computed directly.
3. SW Empirical Exercises 7.1
Regressor | Model (a) | Model (b)
Age | 0.60 (0.04) | 0.59 (0.04)
Female | | −3.66 (0.21)
Bachelor | | 8.08 (0.21)
Intercept | 1.08 (1.17) | −0.63 (1.08)
SER | 9.99 | 9.07
R2 | 0.029 | 0.200
Adjusted R2 | 0.029 | 0.199
(a) The estimated slope is 0.60. The estimated intercept is 1.08.
(b) The estimated marginal effect of Age on AHE is 0.59 dollars per year. The 95% confidence interval is 0.59 ± 1.96 × 0.04 or 0.51 to 0.66.
(c) The results are quite similar. Evidently the regression in (a) does not suffer from important omitted variable bias.
(d) Bob’s predicted average hourly earnings = (0.59 × 26) + (−3.66 × 0) + (8.08 × 0) − 0.63 = $14.71. Alexis’s predicted average hourly earnings = (0.59 × 30) + (−3.66 × 1) + (8.08 × 1) − 0.63 = $21.49.
(e) The regression in (b) fits the data much better. Gender and education are important predictors of earnings. The R2 and adjusted R2 are similar because the sample size is large (n = 7711).
(f) Gender and education are important. The F-statistic is 781, which is (much) larger than the 1% critical value of 4.61.
(g) The omitted variables must have non-zero coefficients and must be correlated with the included regressor. From (f), Female and Bachelor have non-zero coefficients; yet there does not seem to be important omitted variable bias, suggesting that the correlations between Age and Female and between Age and Bachelor are small. (The sample correlations are Cor(Age, Female) = −0.03 and Cor(Age, Bachelor) = 0.00.)
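The predictions in (d) are a direct plug-in of regressor values into model (b). A short sketch using the rounded coefficients from the table above; the names COEF and predicted_ahe are hypothetical, and the regressor values are the ones that appear in the arithmetic of part (d).

```python
# Rounded coefficients of model (b) from the table above.
COEF = {"intercept": -0.63, "age": 0.59, "female": -3.66, "bachelor": 8.08}

def predicted_ahe(age, female, bachelor):
    """Predicted average hourly earnings in dollars."""
    return (COEF["intercept"] + COEF["age"] * age
            + COEF["female"] * female + COEF["bachelor"] * bachelor)

bob = predicted_ahe(age=26, female=0, bachelor=0)
alexis = predicted_ahe(age=30, female=1, bachelor=1)
print(f"Bob: {bob:.2f}, Alexis: {alexis:.2f}")
```

Because the coefficients are rounded to two decimals, results computed from the unrounded estimates may differ slightly.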
4. How would you construct a confidence interval for a single coefficient in multiple regression?
5. Describe how to obtain a confidence set for two parameters in the multiple regression model.
6. What is a control variable in multiple regression? Give an example and explain why it can be useful in practice.
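As a starting point for question 4, a large-sample confidence interval for a single coefficient is usually formed as the estimate plus or minus a critical value times its standard error. The sketch below assumes the standard normal approximation; the function name confidence_interval is hypothetical, and the numbers are the college coefficient and standard error from problem 2, reused purely for illustration.

```python
from statistics import NormalDist

def confidence_interval(estimate, std_error, level=0.95):
    """Large-sample CI: estimate +/- z * SE, with z from the standard normal."""
    z = NormalDist().inv_cdf(0.5 + level / 2)   # about 1.96 for level = 0.95
    return estimate - z * std_error, estimate + z * std_error

# Illustration with the college coefficient and standard error from problem 2.
low, high = confidence_interval(0.012, 0.004)
print(f"95% CI: [{low:.4f}, {high:.4f}]")
```

If the interval excludes zero, the two-sided test of a zero coefficient rejects at the matching significance level.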