STA302H1: Methods of Data Analysis I
(Lecture 5)
Distribution of the regression parameters
Distribution of β̂
• Since β̂ is a linear combination of Y, β̂ also follows a normal distribution, with
E(β̂|X) = E((X′X)⁻¹X′Y | X)
= (X′X)⁻¹X′E(Y|X)
= (X′X)⁻¹X′Xβ
= β
• Thus, β̂ is an unbiased estimator of β (conditional on X)
• The variance:
Var(β̂|X) = Var((X′X)⁻¹X′Y | X)
= (X′X)⁻¹X′ Var(Y|X) X(X′X)⁻¹
= (X′X)⁻¹X′(σ²I)X(X′X)⁻¹
= σ²(X′X)⁻¹X′X(X′X)⁻¹
= σ²(X′X)⁻¹
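As a quick numerical check (a minimal sketch with simulated data, so all variable names are illustrative), S²(X′X)⁻¹ matches the variance-covariance matrix that R's vcov() reports:

# Check: S^2 * (X'X)^{-1} equals vcov(fit)
set.seed(302)
n  <- 100
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n, sd = 1.5)

fit <- lm(y ~ x1 + x2)
X   <- model.matrix(fit)                  # n x (p + 1) design matrix
s2  <- sum(resid(fit)^2) / (n - 2 - 1)    # S^2 = RSS / (n - p - 1), with p = 2

s2 * solve(t(X) %*% X)   # estimated sigma^2 * (X'X)^{-1}
vcov(fit)                # the same matrix, computed by R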
• Thus, let C = (X′X)⁻¹. Then the variance of β̂j is σ²Cjj and
Cov(β̂k, β̂j) = σ²Ckj
• The least squares estimator is the Best Linear Unbiased Estimator (BLUE),
according to the Gauss-Markov theorem
• The proof of the Gauss-Markov theorem will be skipped for now
• However, we need to know the assumptions of the Gauss-Markov theorem:
• E(ϵ) = 0, that is, the errors have mean zero
• Var(ϵ) = σ²I, that is, the errors are homoscedastic and uncorrelated
• Normality is not assumed
• As in the simple linear regression case, the β̂j's also follow a normal distribution
• That is, β̂j ∼ N(βj, σ²Cjj)
• We can test the hypotheses
H0 : βj = βj(0)
H1 : βj ≠ βj(0)
with a z-test, when σ² is known, using the test statistic
Z = (β̂j − βj(0)) / (σ√Cjj)
• Or we can replace σ with S, where S = √( ∑ᵢ₌₁ⁿ eᵢ² / (n − p − 1) ), and perform a
t-test with
T = (β̂j − βj(0)) / (S√Cjj)
where T ∼ t(n−p−1) under H0
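In R, summary() reports exactly these t-statistics for the hypotheses βj(0) = 0. A minimal sketch with simulated data (all names illustrative), computing them by hand for comparison:

# t-statistics for H0: beta_j = 0, by hand and via summary()
set.seed(302)
n  <- 100
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n)

fit <- lm(y ~ x1 + x2)
X   <- model.matrix(fit)
C   <- solve(t(X) %*% X)                      # C = (X'X)^{-1}
S   <- sqrt(sum(resid(fit)^2) / (n - 2 - 1))  # S, with p = 2

T_stat <- coef(fit) / (S * sqrt(diag(C)))     # (beta_hat_j - 0) / (S * sqrt(Cjj))
cbind(T_stat, summary(fit)$coefficients[, "t value"])  # the two columns agree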
Interpretation
• These subset examples show us that the relationship between a predictor and the
response depends on the values that the other predictors take in our multiple
regression model.
• So when we interpret a coefficient from a multiple linear model, we must reflect
this conditioning
• For a one-inch increase in height, we see on average a 0.55% decrease in body fat
when abdomen size is fixed at a constant value.
• We always need to make such conditional interpretations
ANOVA
The RSS for Multiple Linear Regression
• The RSS for multiple regression can be written as
RSS = ∑ᵢ₌₁ⁿ (yᵢ − ŷᵢ)² = e′e
• Again recall that e = (I − H)y, where H = X(X′X)⁻¹X′
• Thus, RSS = y′[I − X(X′X)⁻¹X′]y
• What is E (RSS)?
Theorem
If y is an n × 1 random vector with mean vector µ and variance-covariance matrix V
(non-singular), and A is an n × n matrix of constants, then
E(y′Ay) = tr(AV) + µ′Aµ
• In the case of RSS we have A = I − X(X′X)⁻¹X′ and V = σ²I.
• Thus, tr(AV) = tr((I − X(X′X)⁻¹X′)σ²I) = σ² tr(I − X(X′X)⁻¹X′)
• Recall that tr(A − B) = tr(A) − tr(B)
• Thus tr(I − X(X′X)⁻¹X′) = tr(I) − tr(X(X′X)⁻¹X′)
• Obviously tr(I) = n
• Recall that tr(ABC) = tr(CAB). Thus,
tr(X(X′X)⁻¹X′) = tr(X′X(X′X)⁻¹) = tr(Ip+1) = p + 1
• Thus, σ² tr(I − X(X′X)⁻¹X′) = σ²(n − p − 1)
• Now we know µ = Xβ
• Thus, µ′Aµ = (Xβ)′(I − X(X′X)⁻¹X′)Xβ = β′X′Xβ − β′X′X(X′X)⁻¹X′Xβ = 0
• This implies that E(RSS) = E(e′e) = E(∑ᵢ₌₁ⁿ eᵢ²) = (n − p − 1)σ²
• Thus, E( ∑ᵢ₌₁ⁿ eᵢ² / (n − p − 1) ) = E(MRSS) = σ², so the mean residual sum of
squares is an unbiased estimator of σ²
• Here p is the number of covariates; setting p = 1 recovers the simple linear
regression result
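Both results are easy to verify numerically; the sketch below (simulated data, illustrative names) checks that tr(H) = p + 1 and that RSS/(n − p − 1) reproduces R's reported residual variance:

# Check the degrees of freedom and the unbiased estimate of sigma^2
set.seed(302)
n <- 100; p <- 2
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n)

fit <- lm(y ~ x1 + x2)
X   <- model.matrix(fit)
H   <- X %*% solve(t(X) %*% X) %*% t(X)  # hat matrix

sum(diag(H))                     # tr(H) = p + 1 = 3
sum(resid(fit)^2) / (n - p - 1)  # MRSS, our estimate of sigma^2
summary(fit)$sigma^2             # the same value reported by R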
The RSS and SSreg for Multiple Linear Regression
• SSreg = ∑ᵢ₌₁ⁿ (ŷᵢ − ȳ)² can be constructed similarly to the SLR case, and thus we
have
F0 = (SSreg/p) / (RSS/(n − p − 1)) ∼ F(p, n − p − 1) under the null hypothesis
• Since RSS and SSreg can both be written as quadratic forms in y, showing that the
product of the corresponding projection matrices is zero establishes that the two
sums of squares are independent
• We can therefore perform an F-test. The null hypothesis is
H0 : β1 = β2 = · · · = βp = 0
and the alternative is H1 : at least one βj ≠ 0
ANOVA table
• The ANOVA table for multiple regression looks as follows:

Source of variation   Sum of Squares   DF          Mean Squares             F value
Regression            SSreg            p           MSreg = SSreg / p        F0 = MSreg / MRSS
Residuals             RSS              n − p − 1   MRSS = RSS / (n − p − 1)
Total                 SST              n − 1
• We can create the ANOVA table in R using the anova command, as sketched below
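One caveat, offered as a sketch rather than a recipe: anova() on a single fitted model prints a sequential, per-predictor breakdown of SSreg rather than the single Regression row above, while the overall F0 appears in summary(). With simulated data (illustrative names):

# ANOVA output for a multiple regression fit
set.seed(302)
n  <- 100
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n)

fit <- lm(y ~ x1 + x2)
anova(fit)               # sequential SS; the Residuals row gives RSS and MRSS
summary(fit)$fstatistic  # overall F0 for H0: beta1 = beta2 = 0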
Adjusted R2
• Recall that the coefficient of determination is defined as R² = SSreg/SST
• However, as the number of variables in the model increases, R² also increases,
even if the model is not right (Why?)
• This is very closely related to the concept of overfitting (more on this later)
• We can correct for this increase by using an adjusted R²
• The adjusted R² is given by
R²adjusted = 1 − (RSS/(n − p − 1)) / (SST/(n − 1))
• This accounts for the addition of multiple predictors
• If we are comparing models with different numbers of predictors, we should use
R²adjusted rather than R² (WHY?)
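A minimal sketch (simulated data, illustrative names) computing the adjusted R² by hand and comparing it with the value R reports:

# Adjusted R^2 by hand versus summary()
set.seed(302)
n  <- 100
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n)

fit <- lm(y ~ x1 + x2)
RSS <- sum(resid(fit)^2)
SST <- sum((y - mean(y))^2)
p   <- 2

1 - (RSS / (n - p - 1)) / (SST / (n - 1))  # adjusted R^2, by the formula above
summary(fit)$adj.r.squared                 # the same value from R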
Partial F-test
Testing a subset of predictors
• Sometimes our interest could be to investigate the significance of a group (subset)
of predictors
• When we say we are trying to test a subset of predictors for their relationship with
the response, we are really testing which of two possible models is better.
• Consider the full model, which includes all the p predictors we think represent the
true relationship with the response
Y = β0 + β1X1 + β2X2 + ...+ βpXp + ϵ
• We fit this model and notice that the first k predictors, k < p, don’t have
significant t-tests.
• Then we can just remove the first k predictors and refit the model
Y = β0 + βk+1Xk+1 + βk+2Xk+2 + ...+ βpXp + ϵ
• The second model is called the reduced model
• We can test whether the reduced model fits adequately compared to the full model
using a partial F-test (see the sketch below)
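The comparison uses the partial F-statistic F = [(RSSreduced − RSSfull)/k] / [RSSfull/(n − p − 1)], which anova() computes when given both fitted models. A minimal sketch with simulated data (x3 and x4 are deliberately unrelated to y; all names illustrative):

# Partial F-test: can x3 and x4 (k = 2 predictors) be dropped?
set.seed(302)
n  <- 100
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n); x4 <- rnorm(n)
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n)  # x3, x4 play no role

full    <- lm(y ~ x1 + x2 + x3 + x4)
reduced <- lm(y ~ x1 + x2)

anova(reduced, full)  # F = [(RSS_red - RSS_full)/2] / [RSS_full/(n - 4 - 1)]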
Example: nyc dataset
• To perform a partial F-test we are going to use the nyc.csv dataset. The file is
uploaded on Quercus
• Data from surveys of customers of 168 Italian restaurants in the target area are
available
• The data has the following variables:
1. Y : Price = the price (in $US) of dinner (including 1 drink & a tip)
2. x1 : Food = customer rating of the food (out of 30)
3. x2 : Décor = customer rating of the décor (out of 30)
4. x3 : Service = customer rating of the service (out of 30)
5. x4 : East = dummy variable = 1 (0) if the restaurant is east (west) of Fifth Avenue
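A sketch of the corresponding R code, assuming nyc.csv sits in the working directory with columns named Price, Food, Decor, Service and East; the subset tested here (dropping Service alone) is just one illustrative choice:

# Partial F-test on the nyc data: can Service be dropped?
nyc <- read.csv("nyc.csv")

full    <- lm(Price ~ Food + Decor + Service + East, data = nyc)
reduced <- lm(Price ~ Food + Decor + East, data = nyc)

anova(reduced, full)  # partial F-test comparing reduced and full models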
Diagnostic Checking
• Recall the assumptions of linear regression:
• Linearity
• Homoscedasticity of the error terms
• Normality of the errors
• One of the important tasks before we move on with our analyses is to check these
assumptions
• These checks are often referred to as diagnostic checking
• In this lecture we are going to discuss how to check them!
• Recall the simple linear regression model
Y = β0 + β1X + ϵ
E(Y|X) = β0 + β1X
ϵ = Y − E(Y|X)
• Obviously we don't know the true relationship between Y and X
• We can only estimate the relationship
• The fitted regression ŷ = β̂0 + β̂1X produces the estimate of E(Y|X)
• e is an unbiased estimate of ϵ
• Thus e can be used to check the validity of the model
Anscombe’s Four Data Sets
• Anscombe (1973) constructed 4 small toy datasets to illustrate how blindly
fitting a simple linear regression model can lead to very misleading conclusions
about the data. (anscombe.txt from the textbook website)
• Each dataset contains 11 observations, with a single predictor variable and a
response.
• The responses in each dataset are different, but the predictors in 3 of the 4
datasets are identical.
• The dataset is as follows:

case   x1   x2   x3   x4     y1      y2      y3      y4
  1    10   10   10    8    8.04    9.14    7.46    6.58
  2     8    8    8    8    6.95    8.14    6.77    5.76
  3    13   13   13    8    7.58    8.74   12.74    7.71
  4     9    9    9    8    8.81    8.77    7.11    8.84
  5    11   11   11    8    8.33    9.26    7.81    8.47
  6    14   14   14    8    9.96    8.10    8.84    7.04
  7     6    6    6    8    7.24    6.13    6.08    5.25
  8     4    4    4   19    4.26    3.10    5.39   12.50
  9    12   12   12    8   10.84    9.13    8.15    5.56
 10     7    7    7    8    4.82    7.26    6.42    7.91
 11     5    5    5    8    5.68    4.74    5.73    6.89
• It’s not obvious by looking at the raw data, but a linear model would only be
appropriate for one of these datasets.
• How should we check that?
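R happens to ship the quartet as the built-in data frame anscombe, so we can fit all four regressions at once; a minimal sketch:

# Fit a simple linear regression to each of Anscombe's four datasets
fits <- lapply(1:4, function(i) {
  lm(anscombe[[paste0("y", i)]] ~ anscombe[[paste0("x", i)]])
})
t(sapply(fits, coef))  # each row is (intercept, slope); all near 3.00 and 0.50

# Plot each dataset to see how different they really are
par(mfrow = c(2, 2))
for (i in 1:4) {
  plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
       xlab = paste0("x", i), ylab = paste0("y", i))
  abline(fits[[i]])
}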
[Figure: scatterplots of the four Anscombe datasets — y1 vs x1, y2 vs x2, y3 vs x3, y4 vs x4]
• Now let's look at the estimated coefficients (rounded to two decimal places):

dataset    β̂0     β̂1
x1, y1    3.00    0.50
x2, y2    3.00    0.50
x3, y3    3.00    0.50
x4, y4    3.00    0.50
• Does this mean that a linear model is appropriate for all four datasets?
[Figure: the same four scatterplots, shown again alongside the conclusions below]
• Dataset 1: the linear model is appropriate
• Dataset 2: the relationship is quadratic, so a linear model is not appropriate
• Dataset 3: the fitted line is influenced by one outlier, so not appropriate
• Dataset 4: the slope of the regression is determined by a single point, so not
appropriate
• Anscombe's datasets were meant to illustrate that one should always supplement
one's modelling with an investigation of the modelling assumptions
• We first need to check whether the relationship between X and Y is linear
• A scatterplot does not always help, for example when we have more than one
predictor (multiple regression)
• An individual observation can have a massive impact on model fit
• What would be an easy way to check the model assumptions?
Residual Plots
• One way to check the assumptions is through residual plots
• Plotting residuals allows us to visually inspect the model assumptions (WHY?)
• Residuals measure the remaining variability in the data after fitting a model.
Thus, they are more sensitive to any irregularities
• There are two main types of residual scatter plots that we use:
• Residual vs predictor plot, i.e., plot e against the observed values of X. Not
appropriate for multiple regression
• Residual vs fitted plot, i.e., plot e against ŷ
• Both plots are useful for checking whether the assumptions hold (see the sketch
below)
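A minimal sketch of both plots, using simulated data where the model is correctly specified (names illustrative):

# Residual vs predictor and residual vs fitted plots
set.seed(302)
n <- 100
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)

fit <- lm(y ~ x)
par(mfrow = c(1, 2))
plot(x, resid(fit), xlab = "x", ylab = "Residuals")  # residual vs predictor
abline(h = 0, lty = 2)
plot(fitted(fit), resid(fit), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)                               # residual vs fitted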
• The assumptions hold if there is no evident pattern in the residual plots
• Specifically, we hope that the residuals are uniformly scattered around 0 over the
full range of predictor values
• Important patterns to be aware of for each of the assumptions are:
• Linearity of the relationship: any systematic pattern in the residuals, such as a
curve
• Independence of the errors: clusters of residuals that have obvious separation
from the rest
• Homoscedasticity: any pattern, but especially a fanning pattern, where residuals
gradually become more spread out (simulated below)
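To see what a violation looks like, the sketch below simulates heteroscedastic errors whose spread grows with x; the resulting residual plot fans out:

# Simulated heteroscedasticity: a fanning residual plot
set.seed(302)
n <- 200
x <- runif(n, 0, 10)
y <- 1 + 2 * x + rnorm(n, sd = 0.5 * x)  # error SD grows with x

fit <- lm(y ~ x)
plot(fitted(fit), resid(fit), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)  # residuals spread out as fitted values increase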