Multiple Linear Regression Model
Multiple Linear Regression Model - Part 1
Model description, least squares, model inference, prediction
STAT 3022/3922
Applied Linear Models
Model Description
Model description
• Suppose we have an outcome Y that we want to predict from p − 1 predictors (or covariates) X1, . . . , Xp−1. The population multiple linear regression model is
E(Y | X1 = x1, . . . , Xp−1 = xp−1) = β0 + β1x1 + . . . + βp−1xp−1,
Var(Y | X1 = x1, . . . , Xp−1 = xp−1) = σ².
• Predictors can be of any of the following types (each is illustrated in the R sketch below):
1. Separate variables, e.g. height (X1), weight (X2), ...
2. A polynomial of a previous predictor, e.g. X2 = X1²
3. A transformed variable, e.g. X2 = log(X1)
4. An interaction effect, e.g. X3 = X1X2
5. A qualitative (categorical) variable, e.g. X1 = 1 if female, 0 if male
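A minimal R sketch of how each predictor type can be written in a model formula; the variables (height, weight, sex) and the data are hypothetical, used only for illustration:

# Hypothetical data, used only to illustrate the predictor types above.
set.seed(1)
dat <- data.frame(
  y      = rnorm(100),
  height = rnorm(100, 170, 10),
  weight = rnorm(100, 70, 12),
  sex    = factor(sample(c("female", "male"), 100, replace = TRUE))
)

# 1. separate variables: height, weight     2. polynomial: I(height^2)
# 3. transformation: log(weight)            4. interaction: height:weight
# 5. categorical: sex (R creates the 0/1 dummy variable automatically)
fit <- lm(y ~ height + weight + I(height^2) + log(weight) + height:weight + sex,
          data = dat)
coef(fit)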
Model description
• When we say linear model, we mean linear in the parameters, not in the predictors.
- There is no term like β0^β1 or e^β1 in the model.
• The linear model above may cover many complicated
relationships between Y and the predictors X1, . . . , Xp−1.
• The population multiple linear regression (MLR) model can be alternatively written as
Y = β0 + β1X1 + . . . + βp−1Xp−1 + ε
with E(ε) = 0 and Var(ε) = σ². If we assume ε ∼ N(0, σ²), then we have the normal MLR model.
Model description
Assume we observe data generated from the MLR model, meaning
yi = β0 + β1xi1 + . . . + βp−1xi,p−1 + εi,  i = 1, . . . , n,
with the following assumptions:
(A1) E(εi) = 0, i = 1, . . . , n;
(A2) Var(εi) = σ², i = 1, . . . , n;
(A3) for any i ≠ j, the random noises εi and εj are independent, or at least uncorrelated.
If we additionally assume (A4) εi ∼ N(0, σ²), then we say we have data from a normal MLR model.
Matrix representation
We have p coefficients, so instead of working with each of them
separately, we can work with all of them together through the
following matrix representation of the MLR model.
First, we put all the y data to form a response vector:
yn×1 = (y1, . . . , yn)⊤,
and similarly, we can form a coefficient vector and an error vector:
βp×1 = (β0, β1, . . . , βp−1)⊤,  εn×1 = (ε1, . . . , εn)⊤.
Matrix representation
Finally, we put all the covariate data into the design matrix
Xn×p =
  [ 1  x11  x12  . . .  x1,p−1 ]
  [ 1  x21  x22  . . .  x2,p−1 ]
  [ ⋮   ⋮    ⋮            ⋮    ]
  [ 1  xn1  xn2  . . .  xn,p−1 ]
Hence, we can write the MLR model as the following concise matrix representation:
y = Xβ + ε.  (1)
Matrix representation
From the above equation, we can also write
yi = xi⊤β + εi,  i = 1, . . . , n,
where xi⊤ denotes the ith row of X.
The assumptions in matrix form are:
(B1) E(ε) = 0.
(B2) Var(ε) = σ²In, where In is the n × n identity matrix (this assumption includes assumptions (A2) and (A3) above).
If we have the normal MLR model, then
(B3) ε follows a multivariate normal distribution, ε ∼ Nn(0, σ²In).
Matrix representation
Similar to what we do in the simple linear model, we assume X to be deterministic (i.e., not random). As a result,
E(y) = Xβ,  Var(y) = σ²In.
If we assume (B3), then
y ∼ Nn(Xβ, σ²In).
Fuel consumption
We model Y = fuel consumption (mpg) of automobiles as a linear model with p − 1 = 3 covariates, namely X1 = displacement (disp), X2 = gross horsepower (hp) and X3 = rear axle ratio (drat), using n = 32 observations.
[Pairs plot of mpg, disp, hp and drat: Corr(mpg, disp) = −0.848, Corr(mpg, hp) = −0.776, Corr(mpg, drat) = 0.681, Corr(disp, hp) = 0.791, Corr(disp, drat) = −0.710, Corr(hp, drat) = −0.449.]
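A short R sketch that reproduces the correlations and a pairs plot like the one above, under the assumption that the data are R's built-in mtcars data set (which matches the variables and n = 32 quoted on the slide):

# Pairwise correlations and scatter plots for the fuel consumption variables,
# assuming the data are R's built-in mtcars.
fuel <- mtcars[, c("mpg", "disp", "hp", "drat")]
round(cor(fuel), 3)   # the correlations quoted in the figure
pairs(fuel)           # base-R scatter-plot matrix
# A panel display like the slide's can be produced with GGally:
# library(GGally); ggpairs(fuel)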
Least Squares Estimator
Least squares estimators
Similar to the simple linear model, we estimate the coefficients by minimizing the sum of squared residuals:
D(b0, b1, . . . , bp−1) = ∑_{i=1}^n (yi − b0 − b1xi1 − . . . − bp−1xi,p−1)²
= (y − Xb)⊤(y − Xb)
= y⊤y − b⊤X⊤y − y⊤Xb + b⊤X⊤Xb
= y⊤y − 2b⊤X⊤y + b⊤X⊤Xb,
with b = (b0, . . . , bp−1)⊤. The least squares estimator is therefore
β̂ = argmin_b D(b).
Normal equations
To solve for β̂, we take the partial derivative with respect to each component of b to obtain the normal equations of MLR:
∂D/∂b = 2X⊤Xb − 2X⊤y = 0,
and hence
X⊤(y − Xb) = 0,  i.e.  X⊤Xb = X⊤y.
Note that X⊤X is a p × p matrix, so the equation above has a unique solution if and only if rank(X⊤X) = p. In this case,
β̂ = (X⊤X)⁻¹X⊤y.
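A minimal R sketch of this computation, assuming the fuel data are R's built-in mtcars; the normal equations are solved directly and compared with lm():

# Least squares "by hand" for the fuel consumption model, assuming mtcars.
y <- mtcars$mpg
X <- cbind(1, as.matrix(mtcars[, c("disp", "hp", "drat")]))   # design matrix with intercept column

# Solve the normal equations X'X b = X'y (numerically preferable to
# forming (X'X)^{-1} explicitly).
beta_hat <- solve(crossprod(X), crossprod(X, y))
beta_hat

# The same estimates from lm():
coef(lm(mpg ~ disp + hp + drat, data = mtcars))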
Solving normal equations
When does the matrix X⊤X have full rank?
• Because rank(X⊤X) = rank(X), this condition is satisfied if and only if rank(X) = p.
However, since rank(X) ≤ min(n, p), the OLS estimator is defined if and only if the following two conditions are satisfied:
1. n > p (this is usually called a low-dimensional setting).
2. No predictor is a linear combination of the other predictors (no perfect multicollinearity).
In R, if you are able to get the OLS estimates from the lm command, that means both conditions are satisfied.
Fuel consumption data
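A sketch of fitting the fuel consumption model in R, assuming the data are mtcars:

# Fit the MLR model for the fuel consumption data, assuming mtcars.
fit <- lm(mpg ~ disp + hp + drat, data = mtcars)
summary(fit)   # coefficient estimates, standard errors, t-tests, R^2 and sigma hat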
Properties of least squares estimators
From now on, we will assume the OLS estimator for MLR is defined. Then,
• β̂ is an unbiased estimator of β:
E(β̂) = E{(X⊤X)⁻¹X⊤y} = (X⊤X)⁻¹X⊤E(y) = (X⊤X)⁻¹X⊤Xβ = Ipβ = β.
• Next, we find the variance-covariance matrix of β̂:
Var(β̂) = Var{(X⊤X)⁻¹X⊤y} = (X⊤X)⁻¹X⊤ Var(y) X(X⊤X)⁻¹
= (X⊤X)⁻¹X⊤(σ²In)X(X⊤X)⁻¹ = σ²(X⊤X)⁻¹.
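The variance formula can be checked numerically; a sketch assuming the mtcars fuel model, where vcov() returns the estimated matrix σ̂²(X⊤X)⁻¹:

# Check that Var(beta_hat) = sigma^2 (X'X)^{-1}, assuming mtcars.
fit <- lm(mpg ~ disp + hp + drat, data = mtcars)
X  <- model.matrix(fit)
s2 <- sigma(fit)^2                       # estimate of sigma^2 (the MSE, see later slides)
all.equal(vcov(fit), s2 * solve(crossprod(X)), tolerance = 1e-6)   # agree up to rounding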
Properties of the least squares estimator
• Finally, the OLS estimator β̂ is a linear combination of the y data:
β̂ = (X⊤X)⁻¹X⊤y = Ry,  with R = (X⊤X)⁻¹X⊤.
• Among all linear unbiased estimators, β̂ has the minimum variance (Gauss–Markov theorem).
• We say that the OLS estimator β̂ is the Best Linear Unbiased Estimator (BLUE).
- Note that this is only the "best" within the class of linear unbiased estimators.
- If we go outside of this class, we can possibly find a better estimator.
Fitted values and residuals
The fitted value for the ith observation is
ŷi = β̂0 + β̂1xi1 + β̂2xi2 + . . . + β̂p−1xi,p−1 = xi⊤β̂ = xi⊤(X⊤X)⁻¹X⊤y.
Putting all the fitted values together, we have the vector of fitted values
ŷ = X(X⊤X)⁻¹X⊤y = Hy,
where the n × n matrix H = X(X⊤X)⁻¹X⊤ is usually called the hat matrix. This matrix plays a very important role in the residual analysis for MLR.
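A short sketch of the hat matrix for the fuel model, assuming mtcars; it checks that Hy reproduces the fitted values and that the diagonal of H matches R's hatvalues():

# The hat matrix H = X (X'X)^{-1} X' for the fuel model, assuming mtcars.
fit <- lm(mpg ~ disp + hp + drat, data = mtcars)
X <- model.matrix(fit)
H <- X %*% solve(crossprod(X)) %*% t(X)          # n x n hat matrix

all.equal(as.vector(H %*% mtcars$mpg), unname(fitted(fit)), tolerance = 1e-6)  # y_hat = H y
all.equal(unname(diag(H)), unname(hatvalues(fit)), tolerance = 1e-6)           # leverages = diag(H)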
Fitted values and residuals
• The residual for the ith observation is
ei = yi − ŷi = yi − hi⊤y,
where hi⊤ denotes the ith row of H. The vector of residuals is therefore
e = y − Xβ̂ = y − ŷ = y − Hy = (In − H)y.
• From the normal equations, we have X⊤e = 0, which implies
∑_{i=1}^n ei = 0  and  ∑_{i=1}^n ei xij = 0, j = 1, . . . , p − 1.
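These orthogonality conditions are easy to verify numerically; a sketch assuming the mtcars fuel model:

# Residuals are orthogonal to every column of X (normal equations), assuming mtcars.
fit <- lm(mpg ~ disp + hp + drat, data = mtcars)
X <- model.matrix(fit)
e <- resid(fit)
crossprod(X, e)   # X'e: every entry is zero up to rounding error
sum(e)            # also ~ 0, from the intercept column of X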
Partitioning of variability
Similar to what we have in SLR, we have the following
decomposition of the total variability for MLR:
∑_{i=1}^n (yi − ȳ)² = ∑_{i=1}^n (yi − ŷi + ŷi − ȳ)²
= ∑_{i=1}^n (yi − ŷi)² + ∑_{i=1}^n (ŷi − ȳ)² + 2 ∑_{i=1}^n (yi − ŷi)(ŷi − ȳ).
Note that yi − ŷi = ei, so
∑_{i=1}^n (yi − ŷi)(ŷi − ȳ) = ∑_{i=1}^n ei(ŷi − ȳ) = ∑_{i=1}^n ei ŷi − ȳ ∑_{i=1}^n ei
= e⊤ŷ − ȳ e⊤1 = e⊤Xβ̂ − ȳ e⊤1 = 0,
since X⊤e = 0 and e⊤1 = ∑_{i=1}^n ei = 0.
Partitioning of variability
SST = SSE + SSR
SST = ∑_{i=1}^n (yi − ȳ)²   Total Sum of Squares (SST)
SSE = ∑_{i=1}^n (yi − ŷi)²  Sum of Squared Errors (SSE)
SSR = ∑_{i=1}^n (ŷi − ȳ)²   Regression Sum of Squares (SSR)
As a result, we have the following partition of variability for MLR:
SST        =   SSE        +   SSR
(n − 1) df     (n − p) df     (p − 1) df
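A numerical check of this decomposition for the fuel model, assuming mtcars:

# Verify SST = SSE + SSR for the fuel consumption model, assuming mtcars.
fit <- lm(mpg ~ disp + hp + drat, data = mtcars)
y <- mtcars$mpg
SST <- sum((y - mean(y))^2)
SSE <- sum(resid(fit)^2)
SSR <- sum((fitted(fit) - mean(y))^2)
c(SST = SST, SSE = SSE, SSR = SSR)
all.equal(SST, SSE + SSR)   # TRUE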
Extra (sequential) sum of squares
Because we have more than one covariate in MLR, we usually
further decompose SSR into smaller parts to measure the
contribution of each covariate.
Extra (sequential) sum of squares (ESS or Seq. SS) measures the increase in SSR when one or several new covariates are added to the model.
Assume we consider p − 1 potential covariates X1, . . . , Xp−1.
• Regardless of how many covariates are in the model, SST does not change (why?).
• Hence, an increase in SSR always corresponds to an equal decrease in SSE.
Extra (sequential) sum of squares
Hence, by the above partitioning of total variability, we have
SST = SSR(X1) + SSE(X1)
    = SSR(X1, X2) + SSE(X1, X2),
where inside the parentheses of each SSR and SSE is the list of covariates in the model. Therefore, the extra sum of squares of X2, given that X1 is already included in the model, denoted SSR(X2|X1), is
SSR(X2|X1) = SSR(X1, X2) − SSR(X1)
           = SSE(X1) − SSE(X1, X2).
Extra (sequential) sum of squares
By a similar argument, we can define quantities such as
SSR(X3|X1, X2) = SSR(X1, X2, X3) − SSR(X1, X2),
SSR(X3, X2|X1) = SSR(X1, X2, X3) − SSR(X1),
and so forth. As a result, for a model with (p − 1) covariates X1, . . . , Xp−1, we can write
SSR(X1, X2, . . . , Xp−1) = SSR(X1) + SSR(X2|X1) + SSR(X3|X2, X1) + . . . + SSR(Xp−1|Xp−2, . . . , X1).
Extra (sequential) sum of squares
• Each extra sum of squares has degrees of freedom equal to the number of covariates being added to the model: SSR(X1), SSR(X2|X1), . . . , SSR(Xp−1|X1, . . . , Xp−2) each have 1 df, and SSR(X3, X2|X1) has 2 df.
• Finally, the mean square is obtained by dividing the sum of squares by the corresponding df:
MSR(X1|X2) = SSR(X1|X2)/1,
MSR(X2, X3|X1) = SSR(X2, X3|X1)/2,
MSE(X1, X2, . . . , Xp−1) = SSE(X1, X2, . . . , Xp−1)/(n − p), . . .
Extra (sequential) sum of squares
We have the following partitioning of the total variability for an MLR with p − 1 covariates:
Source       df      SS                            MS
X1           1       SSR(X1)                       MSR(X1)
X2           1       SSR(X2|X1)                    MSR(X2|X1)
X3           1       SSR(X3|X2, X1)                MSR(X3|X2, X1)
. . .        . . .   . . .                         . . .
Xp−1         1       SSR(Xp−1|Xp−2, . . . , X1)    MSR(Xp−1|Xp−2, . . . , X1)
Regression   p − 1   SSR(X1, . . . , Xp−1)         MSR(X1, . . . , Xp−1)
Residuals    n − p   SSE(X1, . . . , Xp−1)         MSE(X1, . . . , Xp−1)
Total        n − 1   SST
All these SS and MS can be obtained from anova() applied to an lm object in R.
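For the fuel model, the sequential sums of squares can be read off the anova() output; a sketch assuming mtcars:

# Sequential (extra) sums of squares from anova(), assuming mtcars.
fit <- lm(mpg ~ disp + hp + drat, data = mtcars)
anova(fit)
# Row "disp" gives SSR(X1), row "hp" gives SSR(X2 | X1),
# row "drat" gives SSR(X3 | X1, X2), and "Residuals" gives SSE(X1, X2, X3);
# each Mean Sq is the corresponding SS divided by its df.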
Extra (sequential) sum of squares
Note that the order of the covariates matters for the ANOVA table, but not for the summary table.
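A quick illustration of this point, assuming mtcars:

# The sequential SS change with the order of the covariates,
# but the coefficient table from summary() does not (assuming mtcars).
fit1 <- lm(mpg ~ disp + hp + drat, data = mtcars)
fit2 <- lm(mpg ~ drat + hp + disp, data = mtcars)
anova(fit1)              # SSR(disp), SSR(hp | disp), SSR(drat | disp, hp)
anova(fit2)              # SSR(drat), SSR(hp | drat), SSR(disp | drat, hp)
coef(summary(fit1))      # same estimates, standard errors and t-tests
coef(summary(fit2))      # as fit1, only the rows are reordered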
Estimating σ²
Finally, we estimate the error variance σ² by
σ̂² = e⊤e / (n − p) = MSE(X1, . . . , Xp−1),
where again e denotes the vector of residuals. Note that the denominator is now n − p, since we have to estimate p coefficients before estimating σ². In other words, the degrees of freedom of the sum of squared residuals is n − p.
Fuel consumption
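A sketch of this estimate for the fuel consumption model, assuming mtcars; sigma() applied to the lm object returns the same quantity:

# Estimate of sigma^2 for the fuel consumption model, assuming mtcars.
fit <- lm(mpg ~ disp + hp + drat, data = mtcars)
n <- nrow(mtcars); p <- length(coef(fit))
sigma2_hat <- sum(resid(fit)^2) / (n - p)    # e'e / (n - p), i.e. the MSE
c(sigma2_hat = sigma2_hat, sigma_hat = sqrt(sigma2_hat), from_lm = sigma(fit))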
Model Inference
Model Inference
Similar to the case of simple linear regression, we will need to assume the normal MLR model to do any inference on the coefficients. In this case, the estimator β̂ is jointly normal, with the mean vector and variance-covariance matrix derived above:
β̂ ∼ Np(β, σ²V),
where V = (X⊤X)⁻¹. As a result, each component of β̂ is normally distributed:
β̂j ∼ N(βj, σ²vj+1,j+1),
where vj+1,j+1 denotes the (j + 1)th diagonal entry of V.
Inference for a single coefficient
Now, we replace σ² by the estimate σ̂², so we obtain the standard error of each β̂j,
se(β̂j) = σ̂ √vj+1,j+1,
and hence the t-statistic
Tj = (β̂j − βj) / se(β̂j) ∼ tn−p,
with tn−p being the t-distribution with n − p degrees of freedom. Therefore, a 100(1 − α)% confidence interval for βj is
β̂j ± t1−α/2, n−p se(β̂j).
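A sketch of these intervals in R, assuming mtcars; confint() uses exactly this formula:

# 95% confidence intervals for the coefficients, assuming mtcars.
fit <- lm(mpg ~ disp + hp + drat, data = mtcars)
confint(fit, level = 0.95)
# The same interval for one coefficient, computed from the formula above:
est <- unname(coef(fit)["hp"])
se  <- sqrt(vcov(fit)["hp", "hp"])
est + c(-1, 1) * qt(0.975, df = df.residual(fit)) * se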
Inference for a single coefficient
Next, hypothesis testing for a single coefficient is also done in the usual way. Let a be a given constant.
H0 : βj = a  vs  H1 : βj ≠ a, or H1 : βj > a, or H1 : βj < a.
Test statistic:
T = (β̂j − a) / √(σ̂² vj+1,j+1) ∼ tn−p under H0.
Compute the p-value:
             H1 : βj ≠ a       H1 : βj > a     H1 : βj < a
p-value      2P(tn−p > |T|)    P(tn−p > T)     P(tn−p < T)
Reject H0 if the p-value < α (the significance level).
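The summary() output only tests H0 : βj = 0; a sketch of the general test for an arbitrary value a, assuming mtcars (the value of a below is purely illustrative):

# Test H0: beta_hp = a against the three alternatives, assuming mtcars.
fit <- lm(mpg ~ disp + hp + drat, data = mtcars)
a   <- -0.03                                  # illustrative hypothesised value
est <- unname(coef(fit)["hp"])
se  <- sqrt(vcov(fit)["hp", "hp"])
T_stat <- (est - a) / se
df     <- df.residual(fit)                    # n - p
c(two.sided = 2 * pt(abs(T_stat), df, lower.tail = FALSE),
  greater   = pt(T_stat, df, lower.tail = FALSE),
  less      = pt(T_stat, df))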
Inference for a single coefficient
An important detail about all the previous inference for a single coefficient is that it is carried out under the condition that the model already contains all the other predictors.
• The variance-covariance matrix of β̂ is computed given that we have p − 1 predictors in the model.
• As a result, if the number of covariates changes, the inference for the same coefficient can change drastically.
For example, the confidence interval for β1 in the model yi = β0 + β1xi1 + β2xi2 + εi is not the same as the confidence interval for β1 in the model yi = β0 + β1xi1 + β2xi2 + β3xi3 + εi.
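A short illustration with the fuel data, assuming mtcars; here wt (weight) plays the role of the extra covariate X3:

# The interval for the disp coefficient depends on which other covariates
# are in the model (assuming mtcars; wt is used as the extra covariate).
fit_small <- lm(mpg ~ disp + hp, data = mtcars)
fit_large <- lm(mpg ~ disp + hp + wt, data = mtcars)
confint(fit_small, "disp")   # interval for beta_1 with two covariates
confint(fit_large, "disp")   # generally different with three covariates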