UN3412
Problem Set 1
Introduction to Econometrics
(Erden - Section 1)
Please make sure to select the page number for each question while you are uploading your solutions to Gradescope. Otherwise, it is tough to grade your answers, and you may lose points.
“Calculator” was once a job description. This problem set gives you an opportunity to do some calculations on the relation between smoking and lung cancer, using a (very) small sample of five countries. The purpose of this exercise is to illustrate the mechanics of ordinary least squares (OLS) regression. You will calculate the regression “by hand” using formulas from class and the textbook. For these calculations, you may relive history and use long multiplication, long division, and tables of square roots and logarithms; or you may use an electronic calculator or a spreadsheet.
The data are summarized in the following table. The variables are per capita cigarette consumption in 1930 (the independent variable, “X”) and the death rate from lung cancer in 1950 (the dependent variable, “Y”). The cancer rates are shown for a later time period because it takes time for lung cancer to develop and be diagnosed.
| Observation # | Country | Cigarettes consumed per capita in 1930 (X) | Lung cancer deaths per million people in 1950 (Y) |
| 1 | Switzerland | 530 | 250 |
| 2 | Finland | 1115 | 350 |
| 3 | Great Britain | 1145 | 465 |
| 4 | Canada | 510 | 150 |
| 5 | Denmark | 380 | 165 |
Source: Edward R. Tufte, Data Analysis for Politics and Management, Table 3.3.
1. (21p) Use a calculator, a spreadsheet, or “by hand” methods to compute the following. Refer to the textbook for the necessary formulas. (Note: if you use a spreadsheet, attach a printout.)
(a) (3p) The sample means of X and Y, $\bar{X}$ and $\bar{Y}$.
(b) (3p) The standard deviations of X and Y, $s_X$ and $s_Y$.
(c) (3p) The correlation coefficient, r, between X and Y.
(d) (3p) $\hat{\beta}_1$, the OLS estimated slope coefficient from the regression $Y_i = \beta_0 + \beta_1 X_i + u_i$.
(e) (3p) $\hat{\beta}_0$, the OLS estimated intercept term from the same regression.
(f) (3p) $\hat{Y}_i$, i = 1, …, n, the predicted values for each country from the regression.
(g) (3p) $\hat{u}_i$, the OLS residual for each country.
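The quantities in parts (a) through (g) can be sketched in a few lines of Python. This is a minimal sketch using only the standard textbook formulas. The function name `ols_by_hand` and the toy data are hypothetical and are deliberately not the smoking data above.

```python
import math

def ols_by_hand(x, y):
    """Summary statistics and OLS estimates for y = b0 + b1*x + u (hypothetical helper)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sums of squared deviations and cross-products around the sample means.
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    # Sample standard deviations use the n - 1 divisor.
    s_x = math.sqrt(sxx / (n - 1))
    s_y = math.sqrt(syy / (n - 1))
    # Correlation, slope, and intercept.
    r = sxy / math.sqrt(sxx * syy)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    # Predicted values and residuals for each observation.
    fitted = [b0 + b1 * xi for xi in x]
    residuals = [yi - fi for yi, fi in zip(y, fitted)]
    return {"x_bar": x_bar, "y_bar": y_bar, "s_x": s_x, "s_y": s_y,
            "r": r, "b1": b1, "b0": b0,
            "fitted": fitted, "residuals": residuals}

# Toy data for illustration only (not the data in the table above).
toy_x = [1, 2, 3, 4, 5]
toy_y = [2, 4, 5, 4, 5]
result = ols_by_hand(toy_x, toy_y)
print(result["b1"], result["b0"], result["r"])
```

A useful sanity check on any hand calculation is that the OLS residuals sum to zero, up to rounding.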
2. (4p) On graph paper or using a spreadsheet, graph the scatterplot of the five data points and the regression line. Be sure to label the axes and clearly show the data points.
3. (15p) You are hired by the governor to study whether a tax on liquor has decreased average liquor consumption in New York. From a random sample of n individuals in New York, you obtain each person’s liquor consumption both for the year before and for the year after the introduction of the tax. From these data, you compute $Y_i$ = “change in liquor consumption” for individual i = 1, …, n. $Y_i$ is measured in ounces, so if, for example, $Y_i = 10$, then individual i increased his liquor consumption by 10 ounces. Let the parameters $\mu_Y$ and $\sigma_Y^2$ denote the population mean and variance of Y.
(a) (3p) You are interested in testing the hypothesis H0 that there was no change in liquor consumption due to the tax. State this formally in terms of the population parameters.
(b) (3p) The alternative, H1, is that there was a decline in liquor consumption; state the alternative in terms of the population parameters.
(c) (3p) Suppose that your sample size is n = 900 and you obtain estimates $\bar{Y} = -32.8$ and $s_Y = 466.4$. Report the t-statistic for testing H0 against H1. Obtain the p-value for the test [use Table 1 in Stock and Watson, pp. 749-750]. Do you reject at the 5% level? At the 1% level?
(d) (3p) Would you say that the estimated fall in consumption is large in magnitude? Comment on the practical versus statistical significance of this estimate.
(e) (3p) In your analysis, what has been implicitly assumed about other determinants of liquor consumption over the two-year period in order to infer causality from the tax change to liquor consumption?
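The mechanics of a large-sample t-test on a mean can be sketched as follows. This is a minimal sketch that assumes the normal approximation is adequate. The helper names `normal_cdf` and `one_sample_t_test` and the input numbers are hypothetical and are not the values in part (c).

```python
import math

def normal_cdf(z):
    """Standard normal CDF, computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def one_sample_t_test(y_bar, s_y, n, mu_0=0.0):
    """t-statistic and large-sample p-values for H0: mu = mu_0 (hypothetical helper)."""
    se = s_y / math.sqrt(n)               # standard error of the sample mean
    t = (y_bar - mu_0) / se
    p_lower = normal_cdf(t)               # one-sided alternative: mu < mu_0
    p_two_sided = 2.0 * normal_cdf(-abs(t))
    return t, p_lower, p_two_sided

# Illustrative inputs only (not the numbers from the problem).
t_stat, p_lower, p_two = one_sample_t_test(y_bar=-1.0, s_y=10.0, n=100)
print(t_stat, p_lower, p_two)
```

The one-sided p-value here is the lower-tail probability because the alternative in this sketch is a decline. For an alternative of an increase, the upper tail would be used instead.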
4. (6p) Let Y be a Bernoulli random variable with success probability Pr(Y = 1) = p, and let $Y_1, \dots, Y_n$ be i.i.d. draws from this distribution. Let $\hat{p}$ be the fraction of successes (1s) in this sample.
(a) (2p) Show that $\hat{p} = \bar{Y}$.
(b) (2p) Show that $\hat{p}$ is an unbiased estimator of p.
(c) (2p) Show that $\mathrm{var}(\hat{p}) = p(1-p)/n$.
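A simulation does not replace the proofs asked for above, but it can serve as a numerical check on them. The sketch below draws many Bernoulli samples and compares the mean and variance of the sample proportion with p and p(1-p)/n. The function name `simulate_p_hat`, the parameter values, and the seed are all hypothetical choices.

```python
import random

def simulate_p_hat(p, n, reps, seed=0):
    """Mean and variance of the sample proportion across simulated samples (hypothetical helper)."""
    rng = random.Random(seed)
    p_hats = []
    for _ in range(reps):
        # One sample of n Bernoulli(p) draws; the proportion of 1s is p-hat.
        successes = sum(rng.random() < p for _ in range(n))
        p_hats.append(successes / n)
    mean_p_hat = sum(p_hats) / reps
    var_p_hat = sum((ph - mean_p_hat) ** 2 for ph in p_hats) / (reps - 1)
    return mean_p_hat, var_p_hat

# Illustrative parameter values, chosen arbitrarily.
mean_p_hat, var_p_hat = simulate_p_hat(p=0.3, n=50, reps=10000)
theoretical_var = 0.3 * (1 - 0.3) / 50
print(mean_p_hat, var_p_hat, theoretical_var)
```

With enough replications, the simulated mean should sit close to p and the simulated variance close to p(1-p)/n.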
5. (8p) Let $Y_1, Y_2, Y_3, Y_4$ be independently and identically distributed random variables from a population with mean μ and variance σ². Let $\bar{Y} = (1/4)(Y_1 + Y_2 + Y_3 + Y_4)$ denote the average of these four random variables.
(a) (2p) What are the expected value and variance of $\bar{Y}$ in terms of μ and σ²?
(b) (2p) Now, consider a different estimator of μ: $\tilde{Y} = (1/8)Y_1 + (1/8)Y_2 + (1/4)Y_3 + (1/2)Y_4$. This is an example of a weighted average of the $Y_i$’s. Show that $\tilde{Y}$ is also an unbiased estimator of μ. Find the variance of $\tilde{Y}$.
(c) (2p) Based on your answers to parts (a) and (b), which estimator of μ do you prefer, $\bar{Y}$ or $\tilde{Y}$?
(d) (2p) Suppose $Y_1, Y_2, Y_3, Y_4$ follow a Normal distribution with mean μ = 5 and variance σ² = 3. What are the distributions of $\bar{Y}$ and $\tilde{Y}$?
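For reference, both estimators in this question are special cases of a weighted average with fixed weights. The weights $w_1, \dots, w_n$ below are notation introduced here for illustration and are not part of the problem. For i.i.d. draws with mean μ and variance σ²:

```latex
\begin{aligned}
E\Big(\sum_{i=1}^{n} w_i Y_i\Big) &= \mu \sum_{i=1}^{n} w_i, \\
\mathrm{var}\Big(\sum_{i=1}^{n} w_i Y_i\Big) &= \sigma^2 \sum_{i=1}^{n} w_i^2 .
\end{aligned}
```

The second line relies on independence, which makes every covariance term between distinct draws equal to zero.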
6. (6p) Suppose at Columbia University, grade point average (GPA) and SAT scores are related by the conditional expectation E(GPA|SAT) = .90 + .001 SAT.
(a) (2p) Find the expected GPA when SAT = 1600.
(b) (2p) Find E(GPA|SAT=2200).
(c) (2p) If the average SAT in the university is 2000, what is the average GPA?
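Part (c) rests on the law of iterated expectations, E(Y) = E[E(Y|X)]. The sketch below checks that identity numerically on a small discrete joint distribution. The function name `iterated_expectation_check` and the probabilities in `toy_joint` are hypothetical and unrelated to the GPA and SAT setting.

```python
def iterated_expectation_check(joint):
    """Compare E[Y] with E[E[Y|X]] for a discrete joint distribution.

    `joint` maps (x, y) pairs to probabilities that sum to 1 (hypothetical input format).
    """
    # Direct expectation of Y from the joint distribution.
    e_y = sum(p * y for (x, y), p in joint.items())
    # Marginal distribution of X.
    marginal_x = {}
    for (x, y), p in joint.items():
        marginal_x[x] = marginal_x.get(x, 0.0) + p
    # Conditional means E[Y | X = x].
    cond_mean = {
        x: sum(p * y for (xx, y), p in joint.items() if xx == x) / px
        for x, px in marginal_x.items()
    }
    # Average the conditional means using the marginal probabilities of X.
    e_of_cond = sum(px * cond_mean[x] for x, px in marginal_x.items())
    return e_y, e_of_cond, cond_mean

# Toy joint distribution for illustration only.
toy_joint = {(0, 1): 0.1, (0, 2): 0.3, (1, 1): 0.4, (1, 2): 0.2}
e_y, e_of_cond, cond_mean = iterated_expectation_check(toy_joint)
print(e_y, e_of_cond, cond_mean)
```

The two expectations agree because weighting each conditional mean by Pr(X = x) recovers the unconditional mean.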
7. (12p) Suppose that X is randomly drawn from a uniform distribution on the interval [0, 3]. Also, suppose that after the value X = x has been observed (0 < x < 3), Y is randomly drawn from a uniform distribution on the interval [x, 3].
(a) (3p) For any given value of x (0 < x < 3), obtain E[Y |X = x].
(b) (3p) In view of part (a), obtain E[Y|X].
(c) (3p) What is the difference between E[Y|X = x] and E[Y |X]?
(d) (3p) Obtain E[Y].
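An analytic answer to part (d) can be cross-checked by simulating the two-stage draw. The sketch below is a minimal Monte Carlo version. The function name `simulate_two_stage_mean`, the seed, and the number of draws are hypothetical choices, and the example call uses the interval [0, 1] rather than the interval in the problem.

```python
import random

def simulate_two_stage_mean(upper, draws, seed=0):
    """Estimate E[Y] when X ~ U[0, upper] and Y | X = x ~ U[x, upper] (hypothetical helper)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(draws):
        x = rng.uniform(0.0, upper)   # first stage: draw X
        y = rng.uniform(x, upper)     # second stage: draw Y given X = x
        total += y
    return total / draws

# Illustrative call on [0, 1], not the interval used in the problem.
estimate = simulate_two_stage_mean(upper=1.0, draws=100_000)
print(estimate)
```

On [0, 1] the conditional mean is the midpoint (x + 1)/2, so the analytic value of E[Y] is 3/4, and the simulated estimate should land close to it.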
8. (18p) Adult males are taller, on average, than adult females. Visiting two recent American Youth Soccer Organization (AYSO) under-12-years-old (U12) soccer matches on a Saturday, you do not observe an obvious difference in the height of boys and girls of that age. You suggest to your little sister that she collect data on height and gender of children in 4th to 6th grades as part of her science project. The accompanying table shows her findings.
Height of Young Boys and Girls, Grades 4-6, in inches
| Boys | | | Girls | | |
| $\bar{Y}_{Boys}$ | $s_{Boys}$ | $n_{Boys}$ | $\bar{Y}_{Girls}$ | $s_{Girls}$ | $n_{Girls}$ |
| 57.8 | 3.9 | 55 | 58.4 | 4.2 | 57 |
Here $\bar{Y}_{Boys}$ is the sample average height for boys, $n_{Boys}$ is the number of boys in the sample, and $s^2_{Boys}$ is the sample variance of the height of boys.
(a) (3p) Let your null hypothesis be that there is no difference in the height of females and males at this age level. Specify the alternative hypothesis.
(b) (3p) What is the unbiased estimate of the difference in height between boys and girls? Provide a formula and check the unbiasedness. Calculate the value of this estimate for the given sample.
(c) (3p) Derive the formula for the variance of the estimate from (b). Calculate the estimate of the variance for the given sample.
(d) (3p) Create a statistic for testing the hypothesis in (a) using the Central Limit Theorem and the Law of Large Numbers.
(e) (3p) Calculate the t-statistic for comparing the two means. Is the difference statistically significant at the 1% level? Which critical value did you use? Why would this number be smaller if you had assumed a one-sided alternative hypothesis? What is the intuition behind this?
(f) (3p) Generate a 95% confidence interval for the difference in height.
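The two-sample calculations in parts (b) through (f) follow one pattern, sketched below under the assumption that the two samples are independent and large enough for the normal approximation. The function name `difference_in_means` and the input numbers are hypothetical and are not the values in the height table.

```python
import math

def difference_in_means(y_bar_a, s_a, n_a, y_bar_b, s_b, n_b, z_crit=1.96):
    """Difference in means, its standard error, t-statistic, and a confidence interval.

    Hypothetical helper; z_crit=1.96 corresponds to a 95% two-sided interval.
    """
    diff = y_bar_a - y_bar_b
    # With independent samples, the variances of the two sample means add.
    se = math.sqrt(s_a ** 2 / n_a + s_b ** 2 / n_b)
    t = diff / se                      # tests H0: no difference in population means
    ci = (diff - z_crit * se, diff + z_crit * se)
    return diff, se, t, ci

# Illustrative inputs only (not the numbers from the table above).
diff, se, t_stat, ci = difference_in_means(10.0, 3.0, 36, 9.0, 4.0, 64)
print(diff, se, t_stat, ci)
```

The inputs `s_a` and `s_b` are standard deviations, so they are squared inside the function before being divided by the sample sizes.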
9. (10p) Use the following data to show the Law of Iterated Expectations (i.e., show that E(M) = E[E(M|A)]).
The following questions will not be graded; they are for you to practice and will be discussed at the recitation:
10. [Practice question, not graded] SW 2.3
| | Rain (X=0) | No Rain (X=1) | Total |
| Long Commute (Y=0) | 0.15 | 0.07 | 0.22 |
| Short Commute (Y=1) | 0.15 | 0.63 | 0.78 |
| Total | 0.30 | 0.70 | 1.00 |
Using the random variables X and Y from Table 2.2 (given above), consider two new random variables W = 3 + 6X and V = 20 – 7Y. Compute:
(a) E(W) and E(V).
(b) $\sigma_W^2$ and $\sigma_V^2$.
(c) $\sigma_{WV}$ and Corr(W, V).
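This exercise relies on the rules for linear transformations of random variables. They are stated here for generic constants a, b, c, and d, which are notation introduced for illustration and not part of the problem:

```latex
\begin{aligned}
E(a + bX) &= a + b\,E(X), \\
\mathrm{var}(a + bX) &= b^2\,\mathrm{var}(X), \\
\mathrm{cov}(a + bX,\; c + dY) &= bd\,\mathrm{cov}(X, Y).
\end{aligned}
```

Additive constants shift the mean but drop out of variances and covariances, while multiplicative constants scale them.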
11. [Practice question, not graded] SW 2.6
The following table gives the joint probability distribution between employment status and college graduation among those either employed or looking for work (unemployed) in the working age US population, based on the 1990 US Census.
| | Unemployed (Y=0) | Employed (Y=1) | Total |
| Non-college grads (X=0) | 0.045 | 0.709 | 0.754 |
| College grads (X=1) | 0.005 | 0.241 | 0.246 |
| Total | 0.050 | 0.950 | 1.000 |
(a) Compute E(Y).
(b) The unemployment rate is the fraction of the labor force that is unemployed. Show that the unemployment rate is given by 1-E(Y).
(c) Calculate E(Y|X=1) and E(Y|X=0).
(d) Calculate the unemployment rate for (i) college graduates and (ii) non-college graduates.
(e) A randomly selected member of this population reports being unemployed. What is the probability that this worker is a college graduate? A non-college graduate?
(f) Are educational achievement and employment status independent? Explain.
12. [Practice question, not graded] SW 2.14 [Hint: Use SW Appendix Table 1.]
In a population E[Y] = 100 and Var(Y) = 43. Use the central limit theorem to answer the following questions:
(a) In a random sample of size n = 100, find $\Pr(\bar{Y} \le 101)$.
(b) In a random sample of size n = 165, find $\Pr(\bar{Y} > 98)$.
(c) In a random sample of size n = 64, find $\Pr(101 \le \bar{Y} \le 103)$.
13. [Practice question, not graded] SW 3.12
To investigate possible gender discrimination in a firm, a sample of 100 men and 64 women with similar job descriptions is selected at random. A summary of the resulting monthly salaries follows:
| | Avg. Salary (sample mean of Y) | Std. Dev. (of Y) | n |
| Men | $3100 | $200 | 100 |
| Women | $2900 | $320 | 64 |
(a) What do these data suggest about wage differences in the firm? Do they represent statistically significant evidence that wages of men and women are different? (To answer this question, first state the null and alternative hypotheses; second, compute the relevant t-statistic; and finally, use the p-value to answer the question.)
(b) Do these data suggest that the firm is guilty of gender discrimination in its compensation policies? Explain.
14. [Practice question, not graded] SW 2.10 [Hint: Use SW Appendix Table 1.]
Compute the following probabilities:
(a) If Y is distributed N(1,4), find Pr(Y≤3).
(b) If Y is distributed N(3,9), find Pr(Y>0).
(c) If Y is distributed N(50,25), find Pr(40≤Y≤52).
(d) If Y is distributed N(5,2), find Pr(6≤Y≤8).
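Probabilities like these come from standardizing and then reading the standard normal CDF, which the table lookup does by hand. The sketch below computes the same quantity numerically. The function names `normal_cdf` and `normal_interval_prob` are hypothetical, the second argument of N(·,·) is assumed to be the variance (the textbook's convention), and the example inputs are not taken from the questions above.

```python
import math

def normal_cdf(z):
    """Standard normal CDF, computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def normal_interval_prob(mean, variance, lower=-math.inf, upper=math.inf):
    """Pr(lower <= Y <= upper) for Y ~ N(mean, variance) (hypothetical helper)."""
    sd = math.sqrt(variance)
    # Standardize both endpoints: z = (value - mean) / standard deviation.
    z_lo = (lower - mean) / sd
    z_hi = (upper - mean) / sd
    return normal_cdf(z_hi) - normal_cdf(z_lo)

# Illustrative inputs only.
two_sided = normal_interval_prob(mean=0.0, variance=1.0, lower=-1.96, upper=1.96)
one_sided = normal_interval_prob(mean=10.0, variance=4.0, upper=12.0)
print(two_sided, one_sided)
```

The same helper covers central limit theorem approximations for a sample mean: pass the population mean and the population variance divided by n.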
15. [Practice question, not graded] SW 3.3
In a survey of 400 likely voters, 215 responded that they would vote for the incumbent and 185 responded that they would vote for the challenger. Let p denote the fraction of all likely voters that preferred the incumbent at the time of the survey, and let $\hat{p}$ be the fraction of survey respondents that preferred the incumbent.
(a) Use the survey results to estimate p.
(b) Use the estimator of the variance of $\hat{p}$, $\hat{p}(1 - \hat{p})/n$, to calculate the standard error of your estimator.
(c) What is the p-value for the test H0: p = 0.5 vs. H1: p ≠ 0.5?
(d) What is the p-value for the test H0: p = 0.5 vs. H1: p > 0.5?
(e) Why do the results from (c) and (d) differ?
(f) Did the survey contain statistically significant evidence that the incumbent was ahead of the challenger at the time of the survey? Explain.