QBUS2810 Statistical Modelling for Business
Exam Practice
Q1: Some MC questions
Q1(i) A poll constructed a 95% CI of (0.63, 0.73) for the proportion of NSW
residents that support the continuation of the alcohol lockout laws. What is the
accurate interpretation of this CI?
(a) There is a 95% probability that the sample proportion is between 0.63 and 0.73.
(b) The poll estimated that the proportion of NSW residents that supports the
lockouts is between 0.63 and 0.73. The true proportion will be inside an interval
formed in this manner, in 95% of repeated samples.
(c) There is a 95% probability that a random sample of the NSW population will
yield a sample proportion between 0.63 and 0.73.
(d) There is a 95% probability that the proportion of NSW residents that support
the lockouts is between 0.63 and 0.73.
(e) We are 95% confident that the true proportion is inside the interval (0.63,0.73).
(f) Both (b) and (e) are correct.
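The "repeated samples" idea behind a confidence interval can be explored by simulation. The sketch below is illustrative only: the true proportion (0.68) and the poll size (n = 334) are assumptions chosen so that the interval is roughly as wide as the one quoted in Q1(i); neither value is given in the question.

```python
import random


def wald_interval(successes, n, z=1.96):
    """Approximate 95% interval for a proportion (normal approximation)."""
    p_hat = successes / n
    half_width = z * (p_hat * (1 - p_hat) / n) ** 0.5
    return p_hat - half_width, p_hat + half_width


def coverage(true_p=0.68, n=334, reps=2000, seed=0):
    """Share of repeated samples whose interval contains true_p.

    true_p and n are hypothetical values, not taken from the poll.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        # Draw one sample of n yes/no answers and build its interval.
        successes = sum(rng.random() < true_p for _ in range(n))
        low, high = wald_interval(successes, n)
        hits += low <= true_p <= high
    return hits / reps


coverage_rate = coverage()
print(f"Share of intervals containing the true proportion: {coverage_rate:.3f}")
```

Each simulated sample gives a different interval; the printed share is the fraction of those intervals that contain the fixed true proportion.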
Q1(ii) Why do we need an error term in a regression model?
(a) Because real variables do not follow exact laws, as in e.g. physics, so we
cannot perfectly predict any rv Y.
(b) Because in practice we don’t have the complete set of explanatory variables X
that would be needed to perfectly predict Y.
(c) Because in practice we don’t know the exact form of the relationship between
Y and X, i.e. if Y = f(X) we do not know f.
(d) Because a statistical model is for an rv Y that has a distribution, conditional
upon X, implying that we cannot predict Y perfectly, only the possible values
it may take and their relative likelihood of occurring, i.e. p(Y|X).
(e) All of the above are correct reasons.
Q1(iii) Why, in OLS for SLR, is the sample average error $\bar{e} = \frac{1}{n}\sum_{i=1}^{n} e_i = 0$?
(a) Because it is an error term, it has to average 0.
(b) Because each error ei is equal to 0, therefore the average of a set of 0s is 0.
(c) Because when we do OLS, we take the 1st derivative of the RSS, and the
derivative with respect to $\beta_1$ is $-2\times$ the sum of the errors times X, i.e.
$-2 \times \sum_{i=1}^{n} e_i X_i$. We set this sum equal to 0 to get the LS estimate. Thus $\bar{e} = 0$.
(d) Because when we do OLS, we take the 1st derivative of the RSS, and the
derivative with respect to $\beta_0$ is $-2\times$ the sum of the errors, i.e.
$-2 \times \sum_{i=1}^{n} e_i$. We set this sum equal to 0 to get the OLS estimate. Thus $\bar{e} = 0$.
(e) None of the above are correct.
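The sums that appear in the RSS derivatives can be checked numerically. The sketch below fits an SLR by the usual closed-form OLS formulas on a small made-up dataset (the x and y values are hypothetical) and evaluates $\sum e_i$ and $\sum e_i X_i$ for the fitted residuals.

```python
def fit_slr(x, y):
    """Closed-form OLS estimates (intercept, slope) for simple linear regression."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


# Hypothetical data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.7]

b0, b1 = fit_slr(x, y)
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

sum_e = sum(residuals)                                   # from the derivative w.r.t. beta_0
sum_ex = sum(e * xi for e, xi in zip(residuals, x))      # from the derivative w.r.t. beta_1

print(f"sum of residuals:          {sum_e:.2e}")
print(f"sum of residuals times X:  {sum_ex:.2e}")
```

Both sums come out at zero up to floating-point rounding, since the OLS estimates are defined by setting the two first-order conditions to zero.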
Q1(iv) LSA 2 states that $E(\varepsilon_i \mid X_i) = 0$. This implies that the residual series ε and X
are uncorrelated, because:
(a) Other factors always exist and are implicitly affecting Y through ε, thus ε and
X must be uncorrelated.
(b) ε is an i.i.d error series and hence must be uncorrelated with X.
(c) If they were correlated, then the slope of the regression of $\varepsilon_i$ on $X_i$ would not
be 0, i.e. we could write $E(\varepsilon_i \mid X_i) = \gamma_0 + \gamma_1 X_i$ with $\gamma_1 \neq 0$. Thus LSA 2 would
not be correct.
(d) Other factors always exist and are implicitly affecting Y through ε, thus ε and
X must be correlated. Hence, LSA 2 does not imply they are uncorrelated.
Q1(v) Would the variable number of children cause OVB regarding the effect of Salary
on Amount Spent?
(a) Number of children would not be correlated with Salary, so: NO.
(b) Number of children is likely correlated with Salary, but it would not be a factor
determining Amount Spent, so: NO.
(c) Number of children is likely correlated with Salary. Also, number of children
could be a factor determining Amount Spent, so: YES.
(d) Even though number of children is a likely determinant of Amount Spent, it
would not be correlated with Salary, so NO.
(e) We should first look at the sample correlation between number of children and
Salary here. Then, decide whether number of children could be a determinant
of Amount Spent.
Q1(vi) Would the variable IQ level cause OVB regarding the effect of Salary on
Amount Spent?
(a) It is likely that IQ is correlated with Salary. It is unlikely that IQ is a
determinant of Amount Spent for a company like Direct Marketing which sells clothing,
books and sports gear. So: NO.
(b) IQ would not be correlated with Salary nor would it determine Amount Spent,
so NO.
(c) IQ would be correlated with Salary and thus also be correlated with Amount
Spent, since Salary is correlated with Amount Spent. Thus, YES.
(d) IQ would not be correlated with Salary, but it would help determine Amount
Spent, so NO.
Q1(vii) $V(e_i \mid X_i) = \sigma^2(1 - H_{i,i})$, i.e. estimated errors for observations with higher
leverage have lower variance. Why is this?
(a) Outliers in the response variable Y drag the estimated regression line towards
them, thus they have lower observed error variance.
(b) Outliers in the response variable Y have larger variance, thus they also have
lower leverage; thus non-outliers in Y have smaller leverage and hence larger
variance.
(c) Extreme predictor observations, i.e. those with unusual X values, have higher
variance; thus non-outliers in X have smaller leverage and hence larger variance.
(d) Extreme predictor observations, i.e. those with unusual X values, drag the
estimated regression line towards them, thus their associated error variance is
smaller.
Q1(viii) The variance of the regression line estimate at observation i is given by $\sigma^2 H_{i,i}$.
This is comparatively larger for points with high leverage. Why is this?
(a) Observations far away from the sample mean, i.e. with high leverage and high
hat values, tend to drag the estimated regression line towards them, and
potentially away from the true regression line, thus increasing its variance at those
points.
(b) Observations far away from the sample mean, i.e. with high leverage and high
hat values, tend to drag the estimated regression line towards them, thus have
smaller estimated error variance; these points thus have smaller variance of the
regression line.
(c) Observations close to the sample mean, i.e. with low leverage and low hat
values, have the highest error variance, thus they have the lowest regression line
variance.
(d) Observations close to the sample mean, i.e. with low leverage and low hat
values, have the highest error variance, thus they have the highest regression
line variance.
Q1(ix) The variance of an out of sample observation i is given by $\sigma^2(1 + H_{i,i})$. This
is again comparatively larger for points with high leverage. This is the opposite result
to that for observations in-sample: in-sample, those with high leverage have the lowest
variance. Why do we get the opposite result out of sample to in-sample?
(a) Out of sample observations with high leverage drag the regression estimate line
towards them. Therefore, they have higher variance.
(b) Out of sample observations with high leverage drag the regression estimate line
towards them. Therefore, they have lower variance.
(c) Out of sample observations cannot drag the regression estimate line towards
them. If they have high leverage, then the regression line estimate has more
variation in that region, so the out of sample observations can be further away
from the regression estimate in that region too; hence they have higher variance.
(d) Out of sample observations cannot drag the regression estimate line towards
them. If they have high leverage, then they will be outliers and hence have
higher variation.
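The three variance expressions in Q1(vii) to Q1(ix) all depend on the hat value $H_{i,i}$. As a minimal sketch, assuming an SLR with an intercept, the hat values can be computed as $H_{i,i} = 1/n + (X_i - \bar{X})^2 / \sum_j (X_j - \bar{X})^2$. The x values below are hypothetical, with one deliberately extreme point.

```python
def hat_values(x):
    """Hat (leverage) values for simple linear regression with an intercept."""
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]


# Hypothetical predictor values; the last one is far from the sample mean.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 20.0]
h = hat_values(x)

# Multipliers of sigma^2 in the three variance formulas.
residual_factor = [1 - hi for hi in h]     # in-sample estimated error: 1 - H_ii
fitted_factor = h                          # regression line estimate: H_ii
prediction_factor = [1 + hi for hi in h]   # out-of-sample observation: 1 + H_ii

for xi, r, f, p in zip(x, residual_factor, fitted_factor, prediction_factor):
    print(f"x = {xi:5.1f}   1-H = {r:.3f}   H = {f:.3f}   1+H = {p:.3f}")
```

The table makes it easy to compare how each multiplier changes between points near the sample mean and the extreme point.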
Question 2
These questions require short answers, say up to 0.5 page, but not essays. Make your
answers as objective and concise as possible, while still fully answering the question.
(a) The Capital Asset Pricing Model (CAPM) is usually written as:
$$R_t - R_{f,t} = \alpha + \beta(R_{M,t} - R_{f,t}) + \varepsilon_t,$$
where $R_t$ is the asset return, $R_{f,t}$ is the risk free rate of return, and $R_{M,t}$ is the market
return, all at time t.
Explain why the slope β is said to capture “market risk”.
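As background for part (a), the OLS slope in the CAPM regression equals the sample covariance between the asset and market excess returns divided by the sample variance of the market excess return. The sketch below illustrates this on simulated data; the "true" beta of 1.5, the return means and the volatilities are all made-up values, not estimates for any real asset.

```python
import random


def ols_slope(x, y):
    """OLS slope of y on x, written as sample covariance over sample variance."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
    var_x = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)
    return cov_xy / var_x


rng = random.Random(42)
true_beta = 1.5  # hypothetical value used only to generate the data

# Simulated daily excess returns: market first, then the asset.
market_excess = [rng.gauss(0.0005, 0.01) for _ in range(500)]
asset_excess = [0.0002 + true_beta * m + rng.gauss(0.0, 0.01) for m in market_excess]

beta_hat = ols_slope(market_excess, asset_excess)
print(f"Estimated beta: {beta_hat:.3f}")
```

The estimate measures how strongly the asset's excess return co-moves with the market's excess return, per unit of market variance.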
(b) A CAPM regression is fit where the explanatory variable is the return on the
Australian All Ordinaries index (AORD), and the response is the daily return on BHP stock.
Consider the variables:
• (i) the daily return on another mining company: Rio Tinto;
• (ii) Season of year: three dummy variables representing the seasons of the year:
Summer, Spring, Autumn.
Discuss whether each of these two variables could cause omitted variable bias here, and
then explain why or why not (for each).
Question 3
Consider the OLS estimated model, representing (log-)wages versus number of years of
education and number of years of experience in the workforce for n = 522 individuals
in the Australian workforce:
$$\widehat{\log(\text{Wage})} = \underset{(0.027)}{0.794} + \underset{(0.002)}{0.094} \times \text{Education} + \underset{(0.001)}{0.037} \times \text{Experience}$$
where SER = 1.267. Here Wage is measured in units of 10,000 dollars.
(a) Interpret the coefficients for (i) Education; and (ii) Experience.
(b) We want to predict the wage of a worker with 15 years education and 10 years of
experience. An appropriate way to do this is by using the formula:
$$\widehat{\text{Wage}} = \exp\left(\hat{\beta}_0 + \hat{\beta}_1 \times 15 + \hat{\beta}_2 \times 10\right)\left[\frac{1}{n}\sum_{i=1}^{n}\exp(e_i)\right].$$
Why do we include the term $\frac{1}{n}\sum_{i=1}^{n}\exp(e_i)$ instead of predicting the wage simply
as $\exp\left(\hat{\beta}_0 + \hat{\beta}_1 \times 15 + \hat{\beta}_2 \times 10\right)$? Discuss and give details.
(c) For this dataset and model, the term $\frac{1}{n}\sum_{i=1}^{n}\exp(e_i) = 1.05$. Predict the wage
of a worker with 15 years education and 10 years of experience.
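The arithmetic for part (c) can be sketched as follows, plugging the estimated coefficients and the stated correction term of 1.05 into the formula from part (b). Variable names are illustrative.

```python
import math

# Estimated coefficients from the fitted log-wage model.
b0, b1, b2 = 0.794, 0.094, 0.037
correction = 1.05            # stated value of (1/n) * sum(exp(e_i))
education, experience = 15, 10

# Predicted log-wage for this worker.
log_wage_hat = b0 + b1 * education + b2 * experience

# Naive back-transform versus the corrected prediction.
naive_wage = math.exp(log_wage_hat)
wage_hat = naive_wage * correction

# Wage is measured in units of 10,000 dollars.
print(f"Predicted log-wage:        {log_wage_hat:.3f}")
print(f"exp(predicted log-wage):   {naive_wage:.3f} (x 10,000 dollars)")
print(f"Corrected wage prediction: {wage_hat:.3f} (x 10,000 dollars)")
```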
(d) Assume that the errors in the log-linear model are normally distributed and that
the hat value for the worker in part (c), who is not in the dataset, is 0.01. Using
the appropriate t-value of $t_{519,0.025} = -1.964545$, find an approximate 95% prediction
interval for the log-wage of the individual worker in part (c); then find the equivalent
95% prediction interval for the actual wage of that worker. Show full details of your
methods.
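One way to organise the calculation in part (d) is sketched below. It assumes the standard error for predicting an individual out-of-sample observation is $\text{SER} \times \sqrt{1 + H_{i,i}}$, consistent with the variance $\sigma^2(1 + H_{i,i})$ quoted in Q1(ix), and it assumes the wage interval is obtained by exponentiating the endpoints of the log-wage interval.

```python
import math

# Inputs stated in Question 3.
b0, b1, b2 = 0.794, 0.094, 0.037
ser = 1.267                  # standard error of the regression
hat_value = 0.01             # hat value of the new worker
t_crit = 1.964545            # |t_{519, 0.025}|

log_wage_hat = b0 + b1 * 15 + b2 * 10

# Assumed prediction standard error for an individual observation.
se_pred = ser * math.sqrt(1 + hat_value)

# 95% prediction interval on the log scale.
log_low = log_wage_hat - t_crit * se_pred
log_high = log_wage_hat + t_crit * se_pred

# Back-transform the endpoints to the wage scale (units of 10,000 dollars).
wage_low, wage_high = math.exp(log_low), math.exp(log_high)

print(f"log-wage PI: ({log_low:.3f}, {log_high:.3f})")
print(f"wage PI:     ({wage_low:.3f}, {wage_high:.3f}) x 10,000 dollars")
```

Because exp is monotone increasing, the back-transformed endpoints keep the same 95% coverage for the wage as the original interval has for the log-wage.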
(e) Suppose that we include a new predictor that is negatively correlated with
education, for example a variable that indicates whether the individual is from a
disadvantaged background or not. Would the variance of the estimator for the coefficient on
education increase, decrease, or remain the same in this new MLR model fit by OLS?
Why?
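A useful quantity for part (e) is the variance inflation factor. In an MLR fit by OLS, the variance of a slope estimator is proportional to $1 / (1 - R_j^2)$, where $R_j^2$ is the R-squared from regressing predictor j on the other predictors. The sketch below evaluates this factor for the simplest case of one other predictor, where $R_j^2$ equals the squared correlation; the correlation values are hypothetical, and the error variance is held fixed for the comparison.

```python
def variance_inflation(correlation):
    """VIF for a predictor when there is exactly one other predictor.

    With a single other predictor, R_j^2 is the squared sample correlation.
    """
    return 1.0 / (1.0 - correlation ** 2)


# Hypothetical correlations between education and the new predictor.
for r in (0.0, -0.3, -0.6, -0.9):
    print(f"correlation = {r:5.2f}   variance inflation factor = {variance_inflation(r):.3f}")
```

The factor depends on the correlation only through its square, so the sign of the correlation does not matter for this term.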
Question 4
A recent study which was widely written about in the media examined the relationship
between the number of friends an individual has on Facebook and grey matter density
in the areas of the brain associated with social perception and associative memory.
Below we run a regression using data from this study, where GMdensity is a normalised
z-score of grey matter in relevant regions of the brain.
                            OLS Regression Results
==============================================================================
Dep. Variable:              FBfriends   R-squared:                       0.190
Model:                            OLS   Adj. R-squared:                  0.169
Method:                 Least Squares   F-statistic:                     8.936
Date:                                   Prob (F-statistic):            0.00488
Time:                                   Log-Likelihood:                -260.14
No. Observations:                  40   AIC:                             524.3
Df Residuals:                      38   BIC:                             527.7
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    366.6449     26.347     13.916      0.000     313.309     419.981
GMdensity     82.4488     27.581      2.989      0.005      26.614     138.284
==============================================================================
Omnibus:                        1.388   Durbin-Watson:                   0.340
Prob(Omnibus):                  0.500   Jarque-Bera (JB):                1.004
Skew:                           0.009   Prob(JB):                        0.605
Kurtosis:                       2.224   Cond. No.                         1.12
==============================================================================
(a) Write out the estimated regression equation, in a manner that includes the
estimated coefficients, standard errors, and $R^2$.
(b) Interpret the $R^2$ value.
(c) Test whether there is a relationship between grey matter density and number of
Facebook friends. Write the details for the test, show how the test statistic is calculated,
find the p-value in the output, and carefully state the conclusion. Note that the
assumptions have not been provided, so you have to list them and assess them, as best
you can.
(d) Calculate and then interpret a 99% confidence interval for the slope. The critical
value is 2.71 (you need to specify where this number comes from).
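The arithmetic behind parts (c) and (d) can be sketched from the regression output: the t statistic is the slope estimate divided by its standard error, and the confidence interval is the estimate plus or minus the critical value times the standard error. The critical value 2.71 is taken as given in the question.

```python
# Values read from the regression output for GMdensity.
slope = 82.4488
std_err = 27.581
critical_value = 2.71        # as stated in part (d)

# Test statistic for H0: slope = 0.
t_stat = slope / std_err

# 99% confidence interval for the slope.
margin = critical_value * std_err
ci_low, ci_high = slope - margin, slope + margin

print(f"t statistic: {t_stat:.3f}")
print(f"99% CI for the slope: ({ci_low:.3f}, {ci_high:.3f})")
```

The computed t statistic can be compared with the `t` column of the output, and in SLR its square matches the reported F-statistic up to rounding.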