CSE 250A FINAL EXAM

Question                                    Points
Academic integrity statement                     *
1. d-separation                                 10
2. Polytree inference                           10
3. Naïve Bayes vs. logistic regression          10
4. Nonnegative random variables                  5
5. EM algorithm                                 10
6. Inference in HMMs                             8
7. Most likely hidden states                     7
8. Gaussian random variables                     6
9. Policy improvement                            7
10. The simplest MDP                             3
11. Noisy parity model                           9
12. Gamer rating engine                         15
Total                                          100
The exam is expected to take one full afternoon, but you may work as long as needed until the deadline. Be sure to
sign and submit the statement of academic integrity with your completed exam. Exams without signed statements
will not be graded.
You are not required to typeset your solutions. We do expect your writing to be legible and your final answers clearly
indicated. Also, please allow sufficient time to upload your solutions.
You are allowed to check your answers with programs in Matlab, Mathematica, Maple, NumPy, etc. But this should
not be necessary, and be aware that these programs may not produce the intermediate steps needed to receive credit.
The later parts of problems often do not depend on the results of earlier ones; therefore it is a good strategy to attempt
all parts of every problem.
An asterisk may indicate the need for a somewhat more laborious calculation. If you do not see the solution quickly,
you may wish to return to these parts later.
If something is unclear, state the assumptions that seem most natural to you and proceed under those assumptions. Out
of fairness, we will not be answering questions about the technical content of the exam on Piazza or by email.
Academic Integrity Statement
This take-home, open-book exam represents work that I have completed on my own. In particular,
during the twenty-four hour period of the exam, I hereby affirm¹ the following:
(i) I have not communicated with other students in the class about these problems.
(ii) I have not consulted other knowledgeable researchers or acquaintances.
(iii) I have not contracted for help on any parts of the exam over the Internet.
I understand that any of the above actions, if proven, will result in a failing grade on the exam. In
addition, I understand the following:
(i) I am obligated to report any violations of academic integrity by others that come to my
attention during the exam.
(ii) All other students in the course (past and present) are under the same obligation.
Signature: ______________________________        Date: ______________
¹ If you do not have a printer, you may simply copy, sign, and date the following statement with your solutions: I affirm the statement of academic integrity on page 2 of the exam.
1. d-separation (10 pts)
Consider the following statements of marginal or conditional independence in the belief network
shown below. For each statement, indicate whether it is true or false, and then justify your answer
as shown in the provided examples. Note that credit will only be awarded to answers that are
supported by further explanation.
[Figure: belief network over nodes A, B, C, D, E, F, G]
Examples:
(i) P(F|A,C) ?= P(F|C)

True. There are four paths from node A to node F. They are

A → C → F,
A → E ← C → F,
A → C → G ← D → F,
A → E ← C → G ← D → F,

and each of these paths is blocked by node C. (Some of these paths are also blocked by other
nodes, but the answer is already complete as written.)

(ii) P(C,D|G) ?= P(C|G)P(D|G)

False. The path C → G ← D is not blocked.
Problems:
(a) P(B|F) ?= P(B|E,F)
(b) P(B,E) ?= P(B)P(E)
(c) P(A|B,F) ?= P(A|F)
(d) P(A,B|E) ?= P(A|E)P(B|E)
(e) P(C|A,E,F,G) ?= P(C|A,B,E,F,G)
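Since numerical checks are permitted, the sketch below tests a claimed (conditional) independence by exhaustive enumeration over a small binary network. The parent sets are only partially reconstructed from the example paths above (the arrows incident to node B, in particular, are a guess and must be read off the figure), and the CPTs are random: a failed equality refutes the independence, while equality under random CPTs merely supports it.

import itertools
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parent sets: the entries for B are a guess -- read the
# actual arrows off the exam's figure before trusting any verdict.
parents = {'A': [], 'B': [], 'D': [], 'C': ['A'],
           'E': ['A', 'C'], 'F': ['C', 'D'], 'G': ['C', 'D']}
nodes = list(parents)

# Random CPTs: cpt[n][pa] = P(n=1 | parents take values pa).
cpt = {n: {pa: rng.uniform(0.1, 0.9)
           for pa in itertools.product([0, 1], repeat=len(parents[n]))}
       for n in nodes}

def joint(a):
    # P(full assignment) = product of the local conditional probabilities.
    p = 1.0
    for n in nodes:
        p1 = cpt[n][tuple(a[q] for q in parents[n])]
        p *= p1 if a[n] == 1 else 1.0 - p1
    return p

def prob(event, given):
    # P(event | given) by summing the joint over all assignments.
    num = den = 0.0
    for vals in itertools.product([0, 1], repeat=len(nodes)):
        a = dict(zip(nodes, vals))
        if all(a[k] == v for k, v in given.items()):
            den += joint(a)
            if all(a[k] == v for k, v in event.items()):
                num += joint(a)
    return num / den

# Statement (a): P(B|F) ?= P(B|E,F).
print(np.isclose(prob({'B': 1}, {'F': 1}),
                 prob({'B': 1}, {'E': 1, 'F': 1})))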
2. Polytree inference (10 pts)
[Figure: polytree belief network over nodes A, B, C, D (top row), F, G, H (middle row), and E (bottom row)]
For the belief network shown above, consider how to efficiently compute the conditional probability
P(G|A,C,F,H). This can be done in four consecutive steps, in which some later steps rely on the
results of earlier ones.

Complete the procedure below for this inference by showing how to compute the following probabilities.
Make each computation as efficient as possible, and briefly justify each step in your
solutions for full credit. Your answers should be expressed in terms of the CPTs of the belief
network and (as needed) the results of previous steps.
(a) Compute P(E|H). (2 pts)
(b) Compute P(D|H). (2 pts)
(c) Compute P(B|A,F). (3 pts)
(d) Compute P(G|A,C,F,H). (3 pts)
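For orientation, the early steps here are Bayes-rule inversions of the network's CPTs. As a generic sketch only (the actual parent sets must be read off the figure): if X is a root node with child Y, then

P(X=x \mid Y=y) \;=\; \frac{P(Y=y \mid X=x)\, P(X=x)}{\sum_{x'} P(Y=y \mid X=x')\, P(X=x')},

and when Y has a second, unobserved root parent, the likelihood P(Y=y \mid X=x) is first obtained by summing the CPT over that parent's values, weighted by its prior.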
3. Naïve Bayes versus logistic regression (10 pts)

[Figure: naïve Bayes model with class node Y and children X1, X2, X3, ..., Xd]
(a) Naïve Bayes model (2 pts)

Consider the belief network of discrete random variables shown above. Show how to compute the conditional probability

P(y \mid x_1, x_2, \ldots, x_d)

in terms of the belief network's probability tables for P(y) and P(x_i|y). Justify your steps to receive full credit.
(b) Log-odds (2 pts)

Consider the special case where all the variables in this belief network are binary-valued. For this special case, compute the log-odds

\log \frac{P(Y=1 \mid x_1, x_2, \ldots, x_d)}{P(Y=0 \mid x_1, x_2, \ldots, x_d)}

in terms of the belief network's probability tables. Simplify your answer as much as possible; this will be helpful for the next parts of the problem.
(c*) Linear decision boundary (3 pts)

Show that the log-odds from part (b) is a linear function of the values of x_1, x_2, ..., x_d. In particular, show that it can be written in the form

\log \frac{P(Y=1 \mid x_1, x_2, \ldots, x_d)}{P(Y=0 \mid x_1, x_2, \ldots, x_d)} \;=\; a_0 + \sum_{i=1}^{d} a_i x_i

for appropriately chosen values of a_0, a_1, ..., a_d. Your solution should express these values in terms of the belief network's probability tables.

Hint: since each x_i is equal to either 0 or 1, it may be a useful notation to write

\log \frac{P(x_i \mid Y=1)}{P(x_i \mid Y=0)} \;=\; x_i \log \frac{P(X_i=1 \mid Y=1)}{P(X_i=1 \mid Y=0)} \;+\; (1-x_i) \log \frac{P(X_i=0 \mid Y=1)}{P(X_i=0 \mid Y=0)}.
(d) Logistic regression (2 pts)

Consider whether it is possible to express your result for P(Y=1|x_1, x_2, ..., x_d) in the form of a logistic regression. In particular, are there parameters w_0, w_1, ..., w_d such that

P(Y=1 \mid x_1, x_2, \ldots, x_d) \;=\; \sigma\!\left( w_0 + \sum_{i=1}^{d} w_i x_i \right),

where σ(z) = (1 + e^{−z})^{−1} is the sigmoid function? If yes, show how to choose these parameters so that this is true. If not, explain why not.

Hint: What is the inverse of the sigmoid function?
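Parts (b)-(d) can be sanity-checked numerically. The sketch below builds random CPTs for the binary naïve Bayes model, computes P(Y=1|x) directly by Bayes rule, and compares it against a sigmoid of a linear function whose weights are log-ratios of CPT entries (the form that parts (c) and (d) ask you to derive). The variable names are ours, not the exam's.

import numpy as np

rng = np.random.default_rng(1)
d = 5

# Random naive Bayes CPTs (all variables binary).
p_y1 = rng.uniform(0.1, 0.9)                 # P(Y=1)
p_x1_y = rng.uniform(0.1, 0.9, size=(2, d))  # p_x1_y[y, i] = P(X_i=1 | Y=y)

def posterior(x):
    # P(Y=1 | x) by direct application of Bayes rule.
    lik = lambda y: np.prod(np.where(x == 1, p_x1_y[y], 1 - p_x1_y[y]))
    num = p_y1 * lik(1)
    return num / (num + (1 - p_y1) * lik(0))

# Logistic-regression weights from the log-odds derivation:
# w_i is the gap between the x_i=1 and x_i=0 log-ratios; w_0 absorbs the rest.
w = (np.log(p_x1_y[1] / p_x1_y[0])
     - np.log((1 - p_x1_y[1]) / (1 - p_x1_y[0])))
w0 = (np.log(p_y1 / (1 - p_y1))
      + np.log((1 - p_x1_y[1]) / (1 - p_x1_y[0])).sum())

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
x = rng.integers(0, 2, size=d)
print(np.isclose(posterior(x), sigmoid(w0 + w @ x)))  # expect True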
4. Nonnegative random variables (5 pts)
Let µ > 0. In this problem you will derive some elementary but useful properties of the exponential distribution

P(z) = \frac{1}{\mu} \exp\!\left( -\frac{z}{\mu} \right)

for a continuous, nonnegative random variable with mean µ. (You will be exploiting these properties to solve the last problem on the exam.)
(a) Log-likelihood (1 pt)

Let {z_1, z_2, ..., z_T} be an i.i.d. data set of nonnegative values. Assuming each value was drawn from an exponential distribution, compute the log-likelihood

\mathcal{L}(\mu) = \sum_{t=1}^{T} \log P(z_t)

in terms of the distribution's parameter µ.
(b) Maximum likelihood estimation (1 pt)
Show that the maximum likelihood estimate for µ is given by the sample mean of the data.
(c) Cumulative distribution (1 pt)

Calculate the cumulative distribution, given by

P(Z < a) = \int_0^a dz\, P(z),

when Z is exponentially distributed with mean µ.
(d) Comparison (2 pts)

Suppose that Z_1 and Z_2 are independent, exponentially distributed random variables with means µ_1 and µ_2, respectively. Show that

P(Z_1 > Z_2) = \frac{\mu_1}{\mu_1 + \mu_2},

where the left side denotes the probability that Z_1 exceeds Z_2 in value.

Hint: note that P(Z_1 > Z_2) = \int_0^\infty da\, P(Z_1\!=\!a)\, P(Z_2 < a), and use your result from the previous part of this problem.
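The claims in parts (b) and (d) are easy to check by simulation; a minimal sketch follows (NumPy's exponential sampler is parameterized directly by the mean):

import numpy as np

rng = np.random.default_rng(2)
mu1, mu2, n = 3.0, 5.0, 1_000_000

z1 = rng.exponential(mu1, n)   # exponential samples with mean mu1
z2 = rng.exponential(mu2, n)

# Part (b): the maximum likelihood estimate of mu is the sample mean.
print(z1.mean())                          # close to mu1 = 3.0
# Part (d): P(Z1 > Z2) should equal mu1/(mu1+mu2) = 0.375 here.
print((z1 > z2).mean(), mu1 / (mu1 + mu2))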
5. EM algorithm (10 pts)
[Figure: belief network over observed nodes A, B, E, F and hidden nodes C, D]

Consider the belief network shown above, with observed nodes {A, B, E, F} and hidden nodes {C, D}.
On the problems below, simplify your answers as much as possible, and briefly justify your steps
to receive full credit.
(a) Hidden node C (2 pts)

Show how to compute the posterior probability P(C|A,B,E,F) in terms of the conditional probability tables (CPTs) of the belief network.

(b) Hidden node D (2 pts)

Show how to compute the posterior probability P(D|A,B,E,F) in terms of the CPTs of the belief network.

(c) Both hidden nodes (1 pt)

Show how to compute the posterior probability P(C,D|A,B,E,F) in terms of your previous results.
(d) Log-likelihood (2 pts)

Consider a data set of T partially labeled examples \{a_t, b_t, e_t, f_t\}_{t=1}^{T} over the observed nodes of the network. The log-likelihood of the data set is given by

\mathcal{L} = \sum_t \log P(A=a_t, B=b_t, E=e_t, F=f_t).

Compute this expression in terms of the CPTs of the belief network.
(e) EM algorithm (3 pts)

Consider the EM updates for the CPTs of this belief network that maximize the log-likelihood in part (d). Provide these updates for the following (the generic form of such updates is sketched below):

(i) P(C=c)
(ii) P(D=d|A=a, B=b)
(iii) P(E=e|B=b, C=c)

For this part of the problem, it is not necessary to justify your steps; it is only necessary to state the correct updates.
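For reference, EM updates for the CPTs of a discrete belief network take the standard ratio-of-expected-counts form, with the posteriors over hidden nodes computed as in parts (a)-(c). A sketch of this pattern for update (i):

P(C=c) \;\leftarrow\; \frac{1}{T} \sum_{t=1}^{T} P(C=c \mid A=a_t, B=b_t, E=e_t, F=f_t).

Updates (ii) and (iii) follow the same pattern, with the plain average replaced by a ratio of posterior-weighted counts restricted to the examples that match the conditioned values.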
6. Inference in HMMs (8 pts)
Consider a discrete HMM with the belief network shown below. As usual, let s_t ∈ {1, 2, ..., n} and o_t ∈ {1, 2, ..., m} denote, respectively, the hidden state and observation at time t; also, let

\pi_i = P(S_1 = i),
a_{ij} = P(S_{t+1} = j \mid S_t = i),
b_{ik} = P(O_t = k \mid S_t = i),

denote the initial distribution over hidden states, the transition matrix, and the emission matrix. In your answers you may also use b_i(k) to denote the matrix element b_{ik}.

[Figure: hidden Markov model with hidden states S1, S2, ..., ST and observations O1, O2, ..., OT]
(a) Inference (4 pts)

Let a^t denote the t-th power (via matrix multiplication) of the transition matrix, and assume that a^0 denotes the identity matrix. Prove by induction or otherwise that

P(S_t = i) = \sum_{k=1}^{n} \pi_k \, (a^{t-1})_{ki}.
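This identity is quick to confirm numerically before attempting the proof; a minimal sketch with a random chain:

import numpy as np

rng = np.random.default_rng(3)
n, t = 4, 6

# Random initial distribution pi and transition matrix a (rows sum to 1).
pi = rng.dirichlet(np.ones(n))
a = rng.dirichlet(np.ones(n), size=n)

# Claim: P(S_t = i) = sum_k pi_k (a^{t-1})_{ki}, i.e. pi @ a^{t-1}.
claim = pi @ np.linalg.matrix_power(a, t - 1)

# Check by propagating the marginal one step at a time.
p = pi.copy()
for _ in range(t - 1):
    p = p @ a
print(np.allclose(claim, p))  # expect True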
(b) More inference (4 pts)

The forward-backward algorithm in discrete HMMs computes the probabilities

\alpha_{it} = P(o_1, o_2, \ldots, o_t, S_t = i),
\beta_{it} = P(o_{t+1}, o_{t+2}, \ldots, o_T \mid S_t = i).

In terms of these probabilities (which you may assume to be given) and the parameters (a_{ij}, b_{ik}, \pi_i) of the HMM, show how to compute the conditional probability

P(o_1, o_2, \ldots, o_T \mid S_t = i, S_{t+1} = j)

for times 1 ≤ t < T. Show your work for full credit, justifying each step in your derivation, and simplifying your answer as much as possible. Hint: your result from part (a) may be useful.
7. Most likely hidden states (7 pts)
[Figure: belief network with hidden nodes S1, S2, ..., ST and observation nodes O1, O2, ..., OT−1, where each Ot has parents St and St+1]
Consider the belief network shown above, where s_t ∈ {1, 2, ..., n} and o_t ∈ {1, 2, ..., m} denote, respectively, the hidden state and observation at time t. In this problem you will derive an efficient algorithm for computing the most likely sequence of hidden states

\{s_1^*, s_2^*, \ldots, s_T^*\} = \arg\max_{s_1, s_2, \ldots, s_T} P(s_1, s_2, \ldots, s_T \mid o_1, o_2, \ldots, o_{T-1}).

Note that this is not a hidden Markov model: in particular, every observation node has two parents, every hidden node has none, and the joint distribution is given by

P(S_1, S_2, \ldots, S_T, O_1, O_2, \ldots, O_{T-1}) = \prod_{t=1}^{T} P(S_t) \prod_{t=1}^{T-1} P(O_t \mid S_t, S_{t+1}).
(a) Forward pass (4 pts)

Given a sequence of T−1 observations, the most likely hidden states are most easily found by computing the T-column matrix with elements

\ell_{it}^* = \max_{s_1, s_2, \ldots, s_{t-1}} \log P(s_1, s_2, \ldots, s_{t-1}, S_t = i, o_1, o_2, \ldots, o_{t-1}).

Note that the log probability for \ell_{it}^* in this network only includes observations up to time t−1. Give an efficient algorithm to compute these matrix elements for 1 ≤ i ≤ n and 1 ≤ t ≤ T.

(b) Backward pass (3 pts)

Show how to efficiently derive the sequence of most likely hidden states from the results of the forward pass in part (a).
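A minimal NumPy sketch of one such forward-backward recursion, under the factorization above; the arrays prior (one distribution per hidden node) and emit are hypothetical stand-ins for the CPTs P(St) and P(Ot|St, St+1), not the exam's notation:

import numpy as np

rng = np.random.default_rng(4)
n, m, T = 3, 4, 6

# Hypothetical model: prior[t] is the distribution of S_{t+1}, and
# emit[i, j, k] = P(O_t = k | S_t = i, S_{t+1} = j); obs holds o_1..o_{T-1}.
prior = rng.dirichlet(np.ones(n), size=T)
emit = rng.dirichlet(np.ones(m), size=(n, n))
obs = rng.integers(0, m, size=T - 1)

# Forward pass: l[i, t] as defined in part (a), plus backpointers.
l = np.full((n, T), -np.inf)
back = np.zeros((n, T), dtype=int)
l[:, 0] = np.log(prior[0])
for t in range(1, T):
    # scores[i, j] = l[i, t-1] + log P(o_t | S_t = i, S_{t+1} = j)
    scores = l[:, t - 1, None] + np.log(emit[:, :, obs[t - 1]])
    back[:, t] = scores.argmax(axis=0)
    l[:, t] = scores.max(axis=0) + np.log(prior[t])

# Backward pass: trace the argmaxes from the best final state.
s = [int(l[:, T - 1].argmax())]
for t in range(T - 1, 0, -1):
    s.append(int(back[s[-1], t]))
s.reverse()
print(s)  # most likely hidden state sequence (0-indexed)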
8. Gaussian random variables (6 pts)
[Figure: two-node belief network z → x, with z ∼ N(0, I) and x ∼ N(Λz, I)]

Consider the belief network of multivariate Gaussian random variables shown above, where z ∈ \mathbb{R}^d is hidden and x ∈ \mathbb{R}^D is observed. Their distributions are given by

P(z) = \frac{1}{(2\pi)^{d/2}} \exp\!\left\{ -\frac{1}{2} z^\top z \right\},

P(x \mid z) = \frac{1}{(2\pi)^{D/2}} \exp\!\left\{ -\frac{1}{2} (x - \Lambda z)^\top (x - \Lambda z) \right\}.
Here, the parameter Λ is a D × d matrix. Note that both these multivariate Gaussian distributions
have an identity covariance matrix.
(a) Posterior mode (2 pts)

Let z^* = \arg\max_z P(z \mid x) denote the mode of the posterior distribution (i.e., the location of its global maximum). Prove that we may also compute this mode from

z^* = \arg\max_z \log\!\left[ P(z)\, P(x \mid z) \right].
(b*) Maximization (3 pts)

Compute z^* by explicitly maximizing \log[P(z)\,P(x \mid z)] as suggested in part (a). Your answer should express z^* in terms of the observed value of x and the matrix parameter Λ. (A numerical check is sketched below.)
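One quick way to check a closed-form answer for part (b) is to minimize the negative log of P(z)P(x|z) numerically and compare; a minimal sketch with a random Λ and x:

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(5)
d, D = 3, 5
Lam = rng.normal(size=(D, d))   # the D x d matrix parameter
x = rng.normal(size=D)          # an observed value of x

# Negative log of P(z)P(x|z), with z-independent constants dropped.
neg_log = lambda z: 0.5 * z @ z + 0.5 * np.sum((x - Lam @ z) ** 2)

z_star = minimize(neg_log, np.zeros(d)).x
print(z_star)   # compare against your closed-form expression for z*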
(c) Posterior mean (1 pt)

Consider the following assertion:

The posterior mean E[z|x], defined by the multidimensional integral

E[z \mid x] = \int_{z \in \mathbb{R}^d} dz\, P(z \mid x)\, z,

is equal to the posterior mode z^* = \arg\max_z P(z \mid x) that was obtained (much more simply) by differentiation in part (b).

Is the above statement true or false? If true, explain by appealing to the properties of normal distributions; if false, provide a counterexample. (You are not meant to evaluate the integral.)
9. Policy improvement (7 pts)
Consider the Markov decision process (MDP) with two states s ∈ {0, 1}, two actions a ∈ {↓, ↑}, and discount factor γ = 2