CS 5100: Foundations of Artificial Intelligence
Q-Learning
1 Q-Learning [10 points]
(Modified from Reinforcement Learning: An Introduction, exercises 3.12, 3.13, 3.17, and 3.25–3.29. Note
that there is only one problem in this assignment, with eight sub-questions. Sub-questions (c)
and (f) are worth 2 points each; the others are worth 1 point each.)
Recall that in PS6 we stated the Bellman equation for deterministic policies and state-only rewards:

V^π(s) = R(s) + γ Σ_{s′} T(s, π(s), s′) V^π(s′)
and the more general version of the Bellman equation for stochastic policies and general reward functions:

V^π(s) = Σ_a π(a|s) Σ_{s′} T(s, a, s′) [R(s, a, s′) + γ V^π(s′)]
In this question, you will derive similar mathematical identities for Qπ, Q∗, and π∗ for stochastic
policies π(a|s) and general reward functions R(s, a, s′).
Qπ(s, a) represents the “action value” function for a given policy π - in other words, Qπ(s, a) is the
expected return of taking action a in state s and following policy π thereafter.
π∗(s) represents the optimal policy for state s.
Q∗(s, a) represents the optimal Q-value (or action-value) function - in other words, the Qπ
function for the optimal policy.
Note: Not all identities are recursive / “Bellman-like”.
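To make these quantities concrete before deriving the identities, here is a small numerical sketch (all numbers are made up for illustration) that evaluates V^π for a two-state, two-action MDP by iterating the general Bellman equation stated above. The names T, R, pi, and gamma mirror the notation of the problem statement.

```python
# Toy MDP (hypothetical values) illustrating iterative policy evaluation with
# the general Bellman equation:
#   V^pi(s) = sum_a pi(a|s) * sum_s' T(s,a,s') * [R(s,a,s') + gamma * V^pi(s')]

gamma = 0.9
n_states, n_actions = 2, 2

# T[s][a][s2] = transition probability, R[s][a][s2] = reward (made-up numbers)
T = [[[0.8, 0.2], [0.1, 0.9]],
     [[0.5, 0.5], [0.3, 0.7]]]
R = [[[1.0, 0.0], [0.0, 2.0]],
     [[0.5, 0.5], [1.0, 0.0]]]

# A stochastic policy: pi[s][a] = pi(a|s)
pi = [[0.6, 0.4], [0.5, 0.5]]

# Repeatedly apply the Bellman equation; the iterates converge to V^pi
# because the update is a gamma-contraction.
V = [0.0] * n_states
for _ in range(1000):
    V = [sum(pi[s][a] * sum(T[s][a][s2] * (R[s][a][s2] + gamma * V[s2])
                            for s2 in range(n_states))
             for a in range(n_actions))
         for s in range(n_states)]

print([round(v, 3) for v in V])  # the fixed point V^pi for this toy MDP
```

The same loop structure carries over to the Q-based identities you will derive below: replacing the outer state-value table with a table indexed by (s, a) turns policy evaluation for V^π into policy evaluation for Q^π.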
(a) Give an equation for V π in terms of Qπ and π(a|s).
Solution:
(b) Give an equation for Qπ in terms of V π, T (s, a, s′), and R(s, a, s′).
Solution:
(c) Derive the general Bellman equation for Qπ, in terms of π(a′|s′), T(s, a, s′), R(s, a, s′), and Qπ(s′, a′).
Solution:
(d) Give an equation for V ∗ in terms of Q∗.
Solution:
(e) Give an equation for Q∗ in terms of V ∗, T (s, a, s′), and R(s, a, s′).
Solution:
(f) Derive the general Bellman optimality equation for Q∗, in terms of T (s, a, s′), R(s, a, s′), and
Q∗(s′, a′).
Solution:
(g) Give an equation for π∗ in terms of Q∗.
Solution:
(h) Give an equation for π∗ in terms of V ∗, T (s, a, s′), and R(s, a, s′).