ALY-6020
Predictive Analytics:
Generalized Linear Models
Ajit Appari, Ph.D., M.Tech., B.Tech.
College of Professional Studies, Northeastern University
Email:
[email protected]
November 10, 202
Linear Regression: The Classic Bivariate Model
If the red line/curve is the fitted line, which one is a linear regression model?
A quick review of the linear regression model before discussing GLM
Multiple Linear Regression Model
For a multivariable problem (three or more variables), i.e.,
Response as a function of two or more predictors:
y_i = β1·x_i,1 + β2·x_i,2 + ⋯ + βp·x_i,p + e_i
where i = 1, …, n (sample size); e_i is the random error for the i-th observation
Often x_i,1 = 1, implying β1 is the constant (intercept) term
All n equations can be stacked together in matrix form: y = Xβ + e
Linear Regression: The Classic Model
Linear Regression Model
Parameters of Linear Regression Model
NOTE-1: Model is linear in parameters, irrespective of the nature of predictors.
NOTE-2: Normal distribution assumption is for error; NOT for Y
Quantitative inputs
Transformations, e.g., ln(x), sin(x)
Indicator [0/1] or categorical variables
Polynomial terms, e.g., x², x³
Interactions, e.g., X3 = X1 · X2
Linear Regression: The Classic Model
Model is linear in parameters
Predictors can have a non-linear relationship with the response variable, and such a relationship can still be estimated using linear regression
Error follows a Normal distribution with zero mean
This does not imply the response variable has to follow a normal distribution (a common mistake among analytics professionals)
Predictors are not related to the error
In practice this is difficult to guarantee because of omitted variables
Predictors are not correlated with each other {no multicollinearity}
In practice, most systems have correlated predictors
Errors are independent and identically distributed
Errors across observations are uncorrelated {cross-sectional/longitudinal}
Errors are homoscedastic {error variance does not vary with predictor values}
Homoscedastic Vs Heteroscedastic Errors
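The contrast can be illustrated with a minimal simulation (all values arbitrary): homoscedastic errors have constant spread across x, while heteroscedastic errors have a spread that varies with x.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 2000
x = rng.uniform(1.0, 10.0, n)

# Homoscedastic: error spread is constant across all values of x
e_homo = rng.normal(0.0, 1.0, n)
# Heteroscedastic: error spread grows with x (scale depends on x)
e_hetero = rng.normal(0.0, 0.5 * x)

# Compare residual spread on the low-x vs high-x halves of the data
low, high = x < 5.5, x >= 5.5
print("homoscedastic std (low x, high x):",
      round(e_homo[low].std(), 2), round(e_homo[high].std(), 2))
print("heteroscedastic std (low x, high x):",
      round(e_hetero[low].std(), 2), round(e_hetero[high].std(), 2))
```

The homoscedastic spreads come out nearly equal, while the heteroscedastic spread is clearly larger at high x.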
Linear Regression Estimation: The OLS Approach
Estimation Approach: Ordinary Least Squares (OLS)
Minimize RSS: ∑ e_i² = ∑ (y_i − ŷ_i)², summed over i = 1, …, n
Bivariate Linear Regression Model: Simple version
Residual or prediction error = deviation of observed values from their predicted values on the fitted line
Fitted line: ŷ_i = b0 + b1·x_i
Residual: e_i = y_i − ŷ_i = y_i − b0 − b1·x_i
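A minimal sketch of the closed-form OLS solution for the bivariate model; the simulated data and true coefficients below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(0.0, 10.0, n)
y = 2.0 + 3.0 * x + rng.normal(0.0, 1.0, n)   # true b0 = 2, b1 = 3

# Closed-form OLS estimates that minimize RSS
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

# Residuals and residual sum of squares from the fitted line
y_hat = b0 + b1 * x
rss = np.sum((y - y_hat) ** 2)
print(f"b0={b0:.3f}, b1={b1:.3f}, RSS={rss:.1f}")
```

With this sample size the estimates land very close to the true values 2 and 3.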
Linear Regression Estimation: The OLS Approach
Linear Regression Model with Two Covariates:
y_i = β0 + β1·x_i,1 + β2·x_i,2 + e_i
Systematic component: β0 + β1·x_i,1 + β2·x_i,2 (the linear predictor)
Random component: e_i (the error term)
Generalized Linear Models
Generalized Linear Model: Why Needed
OLS estimation fails if the random component follows a non-normal distribution
The maximum likelihood estimation approach is used instead to estimate such linear models
Potential scenarios of non-normal regression modeling:
Binary variable as response {0 or 1}
Modeled with logistic regression
Proportion of total cases as response {ranges from 0 to 1}
Modeled with a binomial distribution
If the number of cases is 1, this is the same as the binary case
Generalized Linear Model: Why Needed
Potential scenarios of non-normal regression modeling:
Count variable as response {non-negative integer}
Modeled as Poisson {variance = mean}
Modeled as negative binomial {variance > mean}
Poisson for rates {if the denominator/exposure variable is very large}
Positive continuous variable, e.g., rates, service time
Modeled as Gamma
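The variance-mean relationships that distinguish Poisson from negative binomial counts can be checked by simulation; the mean and dispersion values below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
mu = 4.0

# Poisson: variance equals the mean
pois = rng.poisson(mu, n)

# Negative binomial with the same mean but variance mu + mu^2/r > mu
r = 2.0                # dispersion (number-of-successes) parameter
p = r / (r + mu)       # numpy's parameterization: mean = r(1-p)/p
nbin = rng.negative_binomial(r, p, n)

print("Poisson mean/var:", round(pois.mean(), 2), round(pois.var(), 2))
print("NegBin  mean/var:", round(nbin.mean(), 2), round(nbin.var(), 2))
```

Both samples have mean near 4, but the negative binomial variance is about 12 (= 4 + 4²/2), the over-dispersed case.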
Generalized Linear Models
One of the oldest agricultural research centers; established in 1843
Birthplace of modern statistical theory and practice
GLM Framework
Generalized Linear Models (GLM) is a framework that:
Extends the ordinary linear regression model of a continuous response variable to cases of categorical or discrete response variables
Maximum likelihood estimation approach (default)
• Iteratively reweighted least squares method (Nelder & Wedderburn, 1972)
GLM has three components:
Random component: error component follows the exponential dispersion model family
Systematic component: linear predictor of the response
Link function: links the linear predictor to the expected mean of the response {unique to the GLM}
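The iteratively reweighted least squares method mentioned above can be sketched numerically; the example below fits a logistic GLM on simulated data, with all coefficient values invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 5000
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # intercept + one predictor
beta_true = np.array([-0.5, 1.5])
p_true = 1.0 / (1.0 + np.exp(-X @ beta_true))
y = rng.binomial(1, p_true)                            # 0/1 response

# Iteratively reweighted least squares for a logistic GLM (logit link)
beta = np.zeros(2)
for _ in range(25):
    eta = X @ beta                       # linear predictor
    mu = 1.0 / (1.0 + np.exp(-eta))      # inverse link: mean response
    W = mu * (1.0 - mu)                  # working weights
    z = eta + (y - mu) / W               # working response
    # Weighted least squares step: solve (X'WX) beta = X'Wz
    beta = np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (W * z))

print("estimated beta:", np.round(beta, 2))
```

Each iteration is an ordinary weighted least squares fit, which is why the method extends OLS so naturally.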
GLM Framework
Random Component (Error Distribution): Exponential
Dispersion Model Family.
The probability function f(y; θ, φ) is defined as
f(y; θ, φ) = a(y, φ) · exp{ [y·θ − κ(θ)] / φ }
θ is called the canonical parameter.
φ > 0 is the dispersion parameter {similar to σ²}. Should be very close to 1:
• under-dispersion if φ << 1; and
• over-dispersion if φ >> 1
κ(·) is a known function called the cumulant function.
a(y, φ) is a normalizing function that ensures f(y; θ, φ) is a probability function, i.e.,
• ∫ f(y; θ, φ) dy = 1 for continuous y, or
• ∑ f(y; θ, φ) = 1 for discrete y
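As a concrete instance (a standard derivation, not taken from the slides), the Poisson distribution fits this exponential dispersion form with canonical parameter θ = log µ:

```latex
% Poisson pmf rewritten in exponential dispersion form
% f(y;\theta,\phi) = a(y,\phi)\,\exp\{[y\theta - \kappa(\theta)]/\phi\}
f(y;\mu) = \frac{\mu^{y} e^{-\mu}}{y!}
         = \underbrace{\frac{1}{y!}}_{a(y,\phi)}
           \exp\Bigl\{\, y\,\underbrace{\log\mu}_{\theta}
               \;-\; \underbrace{\mu}_{\kappa(\theta)=e^{\theta}} \Bigr\},
\qquad \phi = 1 .
```

Because φ = 1 exactly, any empirical variance larger than the mean signals over-dispersion for Poisson data.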
GLM Framework
Members of Exponential Dispersion Model Family
Normal {e.g., Sample averages when sample sizes are sufficiently
large}
Bernoulli {e.g., Yes/No decisions}
Binomial {e.g., number of successes in n trials; the sum of n Bernoulli trials}
Categorical or Multinomial Logit {e.g., a customer's race/ethnicity}
Multinomial {e.g., customer counts in each race/ethnicity out of n customers}
Exponential {e.g., Waiting Time in a queue or interarrival time}
Gamma {e.g., amount of rainfall in reservoir, Waiting Time till k-th
customer served, customer life-time value, annual health expenditure}
Poisson {e.g., Number of customers in the queue}
Negative Binomial – for over-dispersed count variable {e.g., number of
hospital visits in a year}; generalization of Poisson distribution
GLM Framework
Systematic Component (Linear Predictor)
η = O + Xβ
Conditional expectation E[Y | x1, x2, …, xp] = µ
Parameter vector β = [β0, β1, β2, …, βp]
O is an offset, a parameter known a priori; it commonly occurs in Poisson GLMs but may appear in any GLM.
The offset is a measure of an exposure (a.k.a. denominator) variable.
• Annual birth counts across cities can be modeled as Poisson, but the expected annual count depends on each city's adult population, which serves as the offset or exposure
• The number of workers with lung disease in various coal mines depends on the number of workers and how long they have worked; the offset or exposure would be the number of person-years.
X is the model matrix [1, x1, x2, …, xp], with
• the first column fixed at "1" for the intercept parameter β0,
• the x's as explanatory variables that may include interactions (x3 = x1·x2), quadratic terms (x5 = x4·x4), or polynomial terms (x5 = x4·x4·⋯·x4)
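The model-matrix construction described above can be sketched as follows; the variable names, coefficients, and exposure values are all invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 6
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)

# Model matrix: intercept column, main effects, interaction, quadratic term
X = np.column_stack([
    np.ones(n),   # first column fixed at 1 for the intercept beta0
    x1,
    x2,
    x1 * x2,      # interaction term x3 = x1 * x2
    x1 ** 2,      # quadratic term
])

# Offset for a Poisson rate model: log of the exposure variable
population = rng.integers(10_000, 500_000, n)  # hypothetical city populations
offset = np.log(population)

beta = np.array([0.1, 0.2, -0.3, 0.05, 0.01])  # hypothetical coefficients
eta = offset + X @ beta                         # linear predictor with offset
mu = np.exp(eta)                                # expected count, always positive
print(X.shape, np.round(mu, 1))
```

The offset enters the linear predictor with a fixed coefficient of 1, which is what makes the model a rate model per unit of exposure.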
GLM Framework
Link Function g[µ]: connects the conditional mean of the response to the linear predictor
g[µ] = η = O + Xβ
Regression parameters are estimated using Maximum
Likelihood
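Two common link functions can be sketched directly; each g(µ) pairs with an inverse that maps the unbounded linear predictor back onto the valid range of the mean:

```python
import numpy as np

# Logit link for Bernoulli/binomial responses: maps (0,1) to the real line
def logit(mu):      return np.log(mu / (1.0 - mu))
def inv_logit(eta): return 1.0 / (1.0 + np.exp(-eta))

# Log link for Poisson responses: maps (0, inf) to the real line
def log_link(mu):   return np.log(mu)
def inv_log(eta):   return np.exp(eta)

mu = np.array([0.2, 0.5, 0.9])
print(inv_logit(logit(mu)))     # round-trip recovers the original means
lam = np.array([1.0, 4.0, 9.5])
print(inv_log(log_link(lam)))
```

The round trip g⁻¹(g(µ)) = µ is what lets maximum likelihood search over an unconstrained η while the fitted mean stays in its valid range.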