MANG 6554
Advanced Analytics
• Gain a basic understanding of Bayes theorem
• Get familiar with Bayesian classifier (Naïve Bayes and Bayesian belief
networks)
• Develop a philosophical understanding of Bayesian inference and its
relation to statistical model formulations.
Tan, Pang-Ning, Michael Steinbach, and Vipin Kumar. Introduction to Data Mining. Pearson Education India, 2016.
Bayes Classifier
A probabilistic framework for solving classification problems
Conditional probability:
P(X | Y) = P(X, Y) / P(Y)
P(Y | X) = P(X, Y) / P(X)
Bayes theorem:
P(Y | X) = P(X | Y) P(Y) / P(X)
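The two definitions and the theorem can be checked numerically. The joint and marginal probabilities below are made-up illustrative values, not taken from the slides:

```python
# Made-up joint and marginal probabilities for illustration only.
p_xy = 0.10          # P(X, Y)
p_y = 0.25           # P(Y)
p_x = 0.40           # P(X)

# Conditional probability from the joint:
p_x_given_y = p_xy / p_y     # P(X | Y) = 0.4
p_y_given_x = p_xy / p_x     # P(Y | X) = 0.25

# Bayes theorem recovers P(Y | X) from P(X | Y):
assert abs(p_y_given_x - p_x_given_y * p_y / p_x) < 1e-12
```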
Example of Bayes Theorem
Given:
• A salesperson knows that if a customer has already bought a keyboard, there is a 50% chance he/she will buy a mouse.
• The prior probability of any customer purchasing a keyboard is 1/50.
• The prior probability of any customer purchasing a mouse is 1/20.
If a customer has bought a mouse, what is the probability that he/she buys a keyboard?
P(k | m) = P(m | k) P(k) / P(m) = (0.5 × 1/50) / (1/20) = 0.2
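The same arithmetic, using the slide's values:

```python
# Keyboard/mouse example from the slide, checked numerically.
p_m_given_k = 0.5      # P(mouse | keyboard)
p_k = 1 / 50           # prior P(keyboard)
p_m = 1 / 20           # prior P(mouse)

# Bayes theorem: P(keyboard | mouse) = P(mouse | keyboard) P(keyboard) / P(mouse)
p_k_given_m = p_m_given_k * p_k / p_m
print(p_k_given_m)     # 0.2
```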
Using Bayes Theorem for Classification
Consider each attribute and class label as random variables
Given a record with attributes (X1, X2,…, Xd)
• Goal is to predict class Y
• Specifically, we want to find the value of Y that maximizes P(Y| X1,
X2,…, Xd )
Can we estimate P(Y| X1, X2,…, Xd ) directly from data?
Example Data
Given a test record:
X = (Refund = No, Marital Status = Divorced, Income = 120K)
Can we estimate
P(Evade = Yes | X) and P(Evade = No | X)?
In the following we will replace
Evade = Yes by Yes, and
Evade = No by No
Id  Refund  Marital Status  Taxable Income (K)  Evade (Y)
1   Yes     Single          125                 No
2   No      Married         100                 No
3   No      Single          70                  No
4   Yes     Married         120                 No
5   No      Divorced        95                  Yes
6   No      Married         60                  No
7   Yes     Divorced        220                 No
8   No      Single          85                  Yes
9   No      Married         75                  No
10  No      Single          90                  Yes
Using Bayes Theorem for Classification
Approach:
• compute posterior probability P(Y | X1, X2, …, Xd) using the
Bayes theorem
• Maximum a posteriori (MAP): choose the value of Y that maximizes
P(Y | X1, X2, …, Xd)
• Equivalent to choosing value of Y that maximizes
P(X1, X2, …, Xd|Y) P(Y)
How to estimate P(X1, X2, …, Xd | Y )?
P(Y | X1, X2, …, Xd) = P(X1, X2, …, Xd | Y) P(Y) / P(X1, X2, …, Xd)
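A minimal sketch of the MAP rule: since P(X1, …, Xd) is the same for every class, the Y maximizing P(Y | X) is the Y maximizing P(X | Y) P(Y). The numbers below are illustrative, not estimates from the slides' data:

```python
# MAP classification: argmax_Y P(X|Y) * P(Y); the evidence P(X) cancels.
def map_classify(likelihoods, priors):
    """likelihoods: {class: P(X | class)}; priors: {class: P(class)}."""
    return max(priors, key=lambda y: likelihoods[y] * priors[y])

# A class with a smaller likelihood can still win via a larger prior:
print(map_classify({"Yes": 0.0010, "No": 0.0024}, {"Yes": 0.3, "No": 0.7}))   # "No"
```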
Naïve Bayes Classifier
Assume independence among attributes Xi when class is given:
• P(X1, X2, …, Xd |Yj) = P(X1| Yj) P(X2| Yj)… P(Xd| Yj)
• Now we can estimate P(Xi| Yj) for all Xi and Yj combinations from the training
data
• A new point is classified as Yj if P(Yj) ∏ P(Xi | Yj) is maximal.
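The decision rule above can be sketched directly. The probability tables here are illustrative stand-ins, not estimates from the slides' data:

```python
import math

# Naive Bayes decision rule: pick the class Yj maximizing
# P(Yj) * prod_i P(Xi | Yj), computed in log space to avoid underflow.
def naive_bayes_predict(x, priors, cond):
    """x: {attribute: value}; cond[y][attribute][value] = P(attribute=value | y)."""
    def score(y):
        # sum of logs = log of the product
        return math.log(priors[y]) + sum(math.log(cond[y][a][v]) for a, v in x.items())
    return max(priors, key=score)

priors = {"Yes": 0.3, "No": 0.7}
cond = {   # illustrative conditional probabilities (kept nonzero for the log)
    "Yes": {"Refund": {"No": 1.0, "Yes": 0.05}},
    "No":  {"Refund": {"No": 4/7, "Yes": 3/7}},
}
print(naive_bayes_predict({"Refund": "No"}, priors, cond))   # "No"
```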
Naïve Bayes on Example Data
Given a test record:
X = (Refund = No, Marital Status = Divorced, Income = 120K)
P(X | Yes) =
P(Refund = No | Yes) x
P(Divorced | Yes) x
P(Income = 120K | Yes)
P(X | No) =
P(Refund = No | No) x
P(Divorced | No) x
P(Income = 120K | No)
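These two products can be evaluated end to end from the 10-record table, assuming (as the following slides do) that Taxable Income is normally distributed within each class, with mean and sample variance estimated from the data:

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# Categorical factors counted from the table; Gaussian factors for income.
# Yes incomes: 95, 85, 90 -> mean 90, sample sd 5.
# No incomes: 125, 100, 70, 120, 60, 220, 75 -> mean 110, sample variance 2975.
p_x_yes = (3/3) * (1/3) * normal_pdf(120, 90, 5)
p_x_no  = (4/7) * (1/7) * normal_pdf(120, 110, math.sqrt(2975))

score_yes = p_x_yes * 3/10   # multiply by the class priors
score_no  = p_x_no * 7/10
print("No" if score_no > score_yes else "Yes")   # the test record is classified as No
```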
Estimate Probabilities from Data
Class prior: P(Y) = Nc/N, where Nc is the number of training records in class Y and N is the total number of records
– e.g., P(No) = 7/10, P(Yes) = 3/10
For categorical attributes:
P(Xi | Yk) = |Xik| / Nc
– where |Xik| is the number of
instances with attribute
value Xi that belong to
class Yk
– Examples:
P(Status=Married|No) = 4/7
P(Refund=Yes|Yes)=0
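The counting estimate |Xik| / Nc can be reproduced from the training table (columns Refund, Marital Status, Evade):

```python
# The 10 training records from the slides, as (Refund, Marital Status, Evade).
records = [
    ("Yes", "Single",   "No"),  ("No", "Married", "No"),
    ("No",  "Single",   "No"),  ("Yes", "Married", "No"),
    ("No",  "Divorced", "Yes"), ("No", "Married", "No"),
    ("Yes", "Divorced", "No"),  ("No", "Single",  "Yes"),
    ("No",  "Married",  "No"),  ("No", "Single",  "Yes"),
]

def cond_prob(attr_idx, value, y):
    """P(attribute = value | class y) = |Xik| / Nc, estimated by counting."""
    in_class = [r for r in records if r[2] == y]
    return sum(r[attr_idx] == value for r in in_class) / len(in_class)

print(cond_prob(1, "Married", "No"))   # 4/7, as on the slide
print(cond_prob(0, "Yes", "Yes"))      # 0.0
```

The second result illustrates the zero-probability problem: one unseen attribute/class combination makes the whole naive Bayes product vanish.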
Estimate Probabilities from Data
For continuous attributes:
– Discretization: partition the range into bins and replace each
continuous value with its bin value
(the attribute changes from continuous to ordinal)
– Probability density estimation:
assume the attribute follows a normal distribution,
use the data to estimate the parameters of the distribution
(e.g., mean and standard deviation),
and once the distribution is known, use it to estimate the conditional probability P(Xi | Y)
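The density-estimation approach can be sketched for Taxable Income within class No, using the incomes from the training table and the sample variance:

```python
import math

# Incomes (in K) of the seven class-No records from the training table.
incomes_no = [125, 100, 70, 120, 60, 220, 75]

n = len(incomes_no)
mean = sum(incomes_no) / n                                  # 110.0
var = sum((x - mean) ** 2 for x in incomes_no) / (n - 1)    # sample variance: 2975.0
sd = math.sqrt(var)

def normal_pdf(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# Normal density at Income = 120K, used in place of P(Income = 120K | No).
print(round(normal_pdf(120, mean, sd), 4))   # ≈ 0.0072
```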