ECE 368
Question 1: Classifying Spam vs non-Spam Emails
We want to solve a binary classification problem for detecting spam vs non-spam
emails.
We have a training set containing N emails, and each email n is represented by
{xn, yn}, n = 1, 2, . . . , N , where yn is the class label which takes the value
\[
y_n =
\begin{cases}
1 & \text{if email } n \text{ is spam,} \\
0 & \text{if email } n \text{ is non-spam (also called ham),}
\end{cases}
\]
and xn is a feature vector of the email n.
Let W = {w1, w2, . . . , wD} be the set of the words (called the vocabulary) that
appear at least once in the training set.
The feature vector xn is defined as a D-dimensional vector
xn = [xn1, xn2, . . . , xnD], where each entry xnd, d = 1, 2, . . . , D is the number of
occurrences of word wd in email n. Thus the total number of words in email n can
be expressed as ln = xn1 + xn2 + . . .+ xnD.
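A minimal Python sketch of building these count vectors, assuming the emails have already been split into lowercase word tokens (the function and variable names here are illustrative, not part of the lab handout):

```python
import numpy as np

def build_vocabulary(tokenized_emails):
    """Collect every word that appears at least once in the training set (the set W)."""
    vocab = sorted({word for email in tokenized_emails for word in email})
    return {word: d for d, word in enumerate(vocab)}

def count_features(tokenized_email, word_to_index):
    """Return the D-dimensional count vector x_n for one email."""
    x = np.zeros(len(word_to_index), dtype=int)
    for word in tokenized_email:
        if word in word_to_index:            # words outside the vocabulary are ignored
            x[word_to_index[word]] += 1
    return x

# Tiny example: two "emails" already tokenized into words.
train_emails = [["win", "money", "now"], ["meeting", "at", "noon", "now"]]
word_to_index = build_vocabulary(train_emails)
X = np.array([count_features(e, word_to_index) for e in train_emails])
print(X.sum(axis=1))   # l_n: the total number of words in each email
```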
Question 1: Classifying Spam vs non-Spam Emails
What is the Naïve Bayes Classifier
Recall the Bayes rule
\[
p(y \mid x) = \frac{p(x \mid y)\, p(y)}{p(x)}
\]
We assume that the prior class distribution p(yn) is modeled as
p(yn = 1) = π,
p(yn = 0) = 1− π,
where π is a fixed parameter (e.g., 0.5).
p(x|y) is unknown in this formula and we need to learn it from the data.
Assumption
\[
p\big(x = [x_{n1}, x_{n2}, \ldots, x_{nD}] \mid y\big) = p(x_1 \mid y)\, p(x_2 \mid y) \cdots p(x_D \mid y).
\]
We have a discrete probability space. Why?
We want to learn P (x = wi|y = j) for i ∈ {1, . . . , D} and j ∈ {0, 1} from the
training data.
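A minimal sketch of the maximum-likelihood estimates of these word probabilities, assuming X is the N-by-D count matrix and y the vector of 0/1 labels built as above (names are illustrative):

```python
import numpy as np

def ml_word_probabilities(X, y):
    """ML estimate of p(x = w_d | y = j) for j in {0, 1} under the multinomial model.

    X: (N, D) word-count matrix, y: (N,) array of 0/1 labels.
    Row j of the result sums to 1.
    """
    probs = np.zeros((2, X.shape[1]))
    for j in (0, 1):
        counts = X[y == j].sum(axis=0)        # total occurrences of each word in class j
        probs[j] = counts / counts.sum()      # divide by the total word count of class j
    return probs
```

Note that a word that never occurs in one class gets an estimate of exactly zero here, which is the issue Laplace smoothing addresses below.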
What is the probabilistic model: Multinomial Distribution
\[
p(x_n \mid y_n) = \frac{(x_{n1} + x_{n2} + \cdots + x_{nD})!}{x_{n1}!\, x_{n2}! \cdots x_{nD}!} \prod_{d=1}^{D} p(w_d \mid y_n)^{x_{nd}}.
\]
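A minimal sketch of evaluating this log-likelihood with scipy, where word_probs_for_class is a hypothetical length-D probability vector for one class:

```python
import numpy as np
from scipy.stats import multinomial

def class_log_likelihood(x, word_probs_for_class):
    """log p(x_n | y_n) under the multinomial model, for one class."""
    return multinomial.logpmf(x, n=int(x.sum()), p=word_probs_for_class)
```

Since the multinomial coefficient (the factorial term) does not depend on the class, it cancels when the posteriors of the two classes are compared.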
Objectives:
1 You want to use maximum likelihood estimates for learning p(x = wi|y = j) for
i ∈ {1, . . . , D} and j ∈ {0, 1}.
2 The maximum likelihood estimates are not the most appropriate to use when the
probabilities are very close to 0 or to 1. For example, some words that occur in one
class may not occur at all in the other class. In this problem, we use the technique
of Laplace smoothing to deal with this problem.
3 What is the technique of Laplace smoothing?
4 After learning p(x = wi|y = j) for i ∈ {1, . . . , D} and j ∈ {0, 1} we want to use it
for classification of the test set.
5 The classification is based on the MAP rule (a sketch follows this list):
\[
\hat{y}_n =
\begin{cases}
1 & \text{if } p(y = 1 \mid x) \ge p(y = 0 \mid x), \\
0 & \text{if } p(y = 1 \mid x) < p(y = 0 \mid x).
\end{cases}
\]
6 There are two types of errors in classifying unlabeled emails: a Type 1 error is the event that a spam email is misclassified as ham; a Type 2 error is the event that a ham email is misclassified as spam. How do we trade off these two errors?
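A minimal sketch of Laplace smoothing and the resulting MAP classifier, assuming the count matrix X and labels y from before; the smoothing constant alpha = 1 and all names are illustrative rather than prescribed by the lab:

```python
import numpy as np

def laplace_word_probabilities(X, y, alpha=1.0):
    """Laplace-smoothed estimate of p(x = w_d | y = j): add alpha to every word count."""
    D = X.shape[1]
    probs = np.zeros((2, D))
    for j in (0, 1):
        counts = X[y == j].sum(axis=0)
        probs[j] = (counts + alpha) / (counts.sum() + alpha * D)   # never exactly 0 or 1
    return probs

def map_classify(x, probs, pi=0.5):
    """MAP rule: compare the two log-posteriors (the multinomial coefficient cancels)."""
    log_post_spam = np.log(pi) + x @ np.log(probs[1])
    log_post_ham = np.log(1.0 - pi) + x @ np.log(probs[0])
    return 1 if log_post_spam >= log_post_ham else 0
```

One way to trade off the two error types is to move the decision threshold, e.g. declare spam only when the spam log-posterior exceeds the ham log-posterior by some margin (equivalently, change the prior π): a stricter threshold reduces Type 2 errors at the cost of more Type 1 errors.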
Question 2: Linear/Quadratic Discriminant Analysis for Height/Weight Data
We want to solve a binary classification problem.
Let xn = [hn, wn] be the feature vector, where hn denotes the height and wn
denotes the weight of a person indexed by n. Let yn denote the class label. Here
yn = 1 is male, and yn = 2 is female. We model the class prior as p(yn = 1) = π
and p(yn = 2) = 1− π. For this problem, let π = 0.5.
When the feature vector is real-valued (instead of binary), a Gaussian vector model
is appropriate, i.e.,
\[
p(x \mid y_n = c) \propto \frac{1}{|\Sigma_c|^{1/2}} \exp\!\left( -\frac{1}{2} (x - \mu_c)^{T} \Sigma_c^{-1} (x - \mu_c) \right), \qquad c \in \{f, m\}. \quad (1)
\]
For LDA, a common covariance matrix is shared by both classes, which is denoted
by Σ; for QDA, different covariance matrices are used for male and female, which
are denoted by Σm and Σf , respectively.
For LDA: estimate µm,µf , and Σ.
For QDA: estimate µm,µf ,Σm, and Σf .
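A minimal sketch of these ML estimates, assuming X is an (N, 2) array of [height, weight] rows and y holds the labels 1 (male) and 2 (female); the names are illustrative:

```python
import numpy as np

def estimate_lda_qda(X, y):
    """ML estimates: class means, per-class covariances (QDA), pooled covariance (LDA)."""
    N, D = X.shape
    params = {}
    pooled = np.zeros((D, D))
    for c in (1, 2):
        Xc = X[y == c]
        mu = Xc.mean(axis=0)
        centered = Xc - mu
        Sigma_c = centered.T @ centered / len(Xc)    # ML: divide by N_c, not N_c - 1
        params[c] = {"mu": mu, "Sigma": Sigma_c}
        pooled += centered.T @ centered
    params["Sigma_pooled"] = pooled / N              # shared covariance used by LDA
    return params
```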
LDA and QDA
Training: We want to use maximum likelihood (ML) to estimate the LDA/QDA parameters.
Based on the Bayes classifier, we then want to plot the decision boundary in both
cases. What is the difference between LDA and QDA?
Testing: Compute the misclassification rate for both cases.
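A minimal sketch of the Bayes classifier and the misclassification rate, reusing the hypothetical params dictionary from the previous sketch and π = 0.5. LDA plugs the shared (pooled) covariance into both classes, which makes the decision boundary linear; QDA uses the per-class covariances, which makes it quadratic:

```python
import numpy as np

def log_gaussian(x, mu, Sigma):
    """Log of the Gaussian density in (1), up to an additive constant."""
    diff = x - mu
    return -0.5 * np.log(np.linalg.det(Sigma)) - 0.5 * diff @ np.linalg.solve(Sigma, diff)

def classify(x, params, pi=0.5, shared=False):
    """Return 1 (male) or 2 (female) by comparing the two log-posteriors."""
    scores = {}
    for c, prior in ((1, pi), (2, 1.0 - pi)):
        Sigma = params["Sigma_pooled"] if shared else params[c]["Sigma"]
        scores[c] = np.log(prior) + log_gaussian(x, params[c]["mu"], Sigma)
    return max(scores, key=scores.get)

def misclassification_rate(X_test, y_test, params, shared=False):
    preds = np.array([classify(x, params, shared=shared) for x in X_test])
    return float(np.mean(preds != y_test))
```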