CS 443 RL Homework
• Please submit a single PDF file that includes both the written part and the coding part: convert the IPython notebook (containing both your code and its outputs) to PDF, and append it to the PDF of your written work.
1. (8 pts) (Programming) See the python file in the assignment attachments.
2. (7 pts) Importance Sampling. Let $s_1, a_1, r_1, s_2, a_2, r_2, \dots, s_H, a_H, r_H$ be a random trajectory generated by sampling $s_1 \sim d_0$ and taking actions $a_t \sim \pi_b(\cdot \mid s_t)$ for all $t \ge 1$, where $\pi_b$ is a stochastic behavior policy. For simplicity, assume that the process always terminates in $H$ steps.
Both $d_0$ and $\pi_b$ are exploratory, in the sense that: (1) $d_0(s) > 0$, $\forall s \in \mathcal{S}$; (2) $\pi_b(a \mid s) > 0$, $\forall s \in \mathcal{S}, a \in \mathcal{A}$. All expectations in this problem are w.r.t. the distribution of this random trajectory.
Let $V$ and $Q$ be an arbitrary state-value function and Q-value function, respectively. Consider each of the following objective functions over $V$ or $Q$. Let $\pi$ be some other policy, which may assign probability 0 to some actions. Let $\rho_t = \frac{\pi(a_t \mid s_t)}{\pi_b(a_t \mid s_t)}$ and let $\rho_{1:t} = \prod_{t'=1}^{t} \rho_{t'}$ be the cumulative product.
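Before the list of objectives, here is a minimal sketch (not part of the graded problem) of how the per-step ratios $\rho_t$ and cumulative products $\rho_{1:t}$ could be computed from a logged trajectory; the tabular policy representation (`pi[s, a]`, `pi_b[s, a]`) and the index-array inputs are assumptions made purely for illustration.

```python
import numpy as np

def importance_ratios(states, actions, pi, pi_b):
    """Per-step ratios rho_t = pi(a_t|s_t) / pi_b(a_t|s_t) and cumulative
    products rho_{1:t}, for a trajectory given as integer index arrays."""
    rho = np.array([pi[s, a] / pi_b[s, a] for s, a in zip(states, actions)])
    rho_cum = np.cumprod(rho)  # rho_{1:t} = rho_1 * rho_2 * ... * rho_t
    return rho, rho_cum
```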
1. (1 pt) $L(V) = \mathbb{E}\!\left[\left(V(s_1) - \sum_{t=1}^{H} \gamma^{t-1} \rho_{1:t}\, r_t\right)^2\right]$.
2. (2 pts) $L(V) = \mathbb{E}\!\left[\left(V(s_1) - \rho_1 \cdot \sum_{t=1}^{H} \gamma^{t-1} r_t\right)^2\right]$.
3. (2 pts) $L(Q) = \mathbb{E}\!\left[\left(Q(s_1, a_1) - \rho_2 \cdot \sum_{t=1}^{H} \gamma^{t-1} r_t\right)^2\right]$.
4. (2 pts) $L(Q) = \mathbb{E}\!\left[\rho_1 \cdot \left(Q(s_1, a_1) - \sum_{t=1}^{H} \gamma^{t-1} r_t\right)^2\right]$.
Express the $V$ (or $Q$) that minimizes each objective using symbols from the following list (here "$(\cdot)$" can be either $\pi$ or $\pi_b$):
$$s,\ a,\ d_0,\ \pi_b,\ \pi,\ V^{(\cdot)},\ Q^{(\cdot)},\ T^{(\cdot)}.$$
If the minimizer is not unique and can take multiple values at a state (or state-action pair for $Q$), you can write "arbitrary". Provide a brief derivation/justification for your answers; the use of mathematical language is encouraged, but you are not required to give a full proof.
Example. Consider the objective function $L(V) = \mathbb{E}\!\left[\left(V(s_1) - \sum_{t=1}^{H} \gamma^{t-1} r_t\right)^2\right]$. For illustration purposes, let us assume for now that $d_0$ is not exploratory, i.e., $\exists s$ such that $d_0(s) = 0$. The answer is
$$V(s) = \begin{cases} V^{\pi_b}(s), & \text{if } d_0(s) > 0, \\ \text{arbitrary}, & \text{if } d_0(s) = 0. \end{cases}$$
Justification: on any $s$ where $d_0(s) = 0$, the value of $V(s)$ does not affect the objective, so the minimizer $V(s)$ is arbitrary. Otherwise, the minimizer should be $\mathbb{E}\!\left[\sum_{t=1}^{H} \gamma^{t-1} r_t \,\middle|\, s_1 = s\right] = V^{\pi_b}(s)$.
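As an optional numerical companion to the worked example, the sketch below checks that the constant minimizing the empirical squared loss at a fixed start state is the sample mean of the discounted returns, i.e., a Monte Carlo estimate of $V^{\pi_b}(s_1)$. The toy MDP, uniform behavior policy, and all constants are hypothetical and unrelated to the assignment.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy MDP (3 states, 2 actions); none of these numbers come
# from the assignment.
nS, nA, gamma, H, n_traj = 3, 2, 0.9, 15, 20000
P = rng.dirichlet(np.ones(nS), size=(nS, nA))  # P[s, a, :] is a distribution over s'
R = rng.uniform(0.0, 1.0, size=(nS, nA))       # R[s, a]

def sampled_return(s0):
    """Discounted return of one length-H trajectory started at s0 under pi_b."""
    s, G = s0, 0.0
    for t in range(H):
        a = rng.integers(nA)            # uniform behavior policy pi_b
        G += gamma ** t * R[s, a]
        s = rng.choice(nS, p=P[s, a])
    return G

G = np.array([sampled_return(0) for _ in range(n_traj)])

# The constant v minimizing the empirical loss mean((v - G)^2) coincides with
# the sample mean of G, i.e., a Monte Carlo estimate of V^{pi_b}(s_1).
grid = np.linspace(G.min(), G.max(), 401)
losses = np.array([np.mean((v - G) ** 2) for v in grid])
print("grid argmin:", grid[losses.argmin()], " sample mean:", G.mean())
```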
3. (5 pts) Policy Gradient with Softmax Policy. Consider the following MDP $M = \{\mathcal{S} = \{s_1, s_2\}, \mathcal{A} = \{a_1, a_2\}, P, R, \gamma = 0.8\}$, where $s_1$ is the fixed initial state, and $P$ and $R$ are defined below:

Transition matrix $P(s' \mid s, a)$ (columns are the next state $s'$):
                 s_1    s_2
  (s_1, a_1)      1      0
  (s_1, a_2)      0      1
  (s_2, a_1)      1      0
  (s_2, a_2)      0      1

Reward function $R(s, a)$:
  (s_1, a_1)    0.5
  (s_1, a_2)    0
  (s_2, a_1)    0
  (s_2, a_2)    1.0

Table 1: Transition matrix (top) and reward function (bottom)
The softmax policy is defined as
$$\pi_\theta(a \mid s) = \frac{\exp(\theta_{s,a})}{\sum_{a' \in \mathcal{A}} \exp(\theta_{s,a'})}, \quad \forall s \in \mathcal{S}, a \in \mathcal{A},$$
where the policy is parameterized by $\theta \in \mathbb{R}^{|\mathcal{S}| \times |\mathcal{A}|} = \mathbb{R}^4$ and $\theta_{s,a}$ denotes the component of $\theta$ indexed by $s$ and $a$.
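As an illustration of the definition above, here is a small sketch of a tabular softmax policy and its score function $\nabla_\theta \log \pi_\theta(a \mid s)$, which appears in the gradient expression of part 1 below; the array layout `theta[s, a]` with integer-indexed states and actions is an assumption.

```python
import numpy as np

def softmax_policy(theta, s):
    """pi_theta(.|s) for a tabular parameter theta of shape (|S|, |A|)."""
    z = np.exp(theta[s] - theta[s].max())  # subtract max for numerical stability
    return z / z.sum()

def score(theta, s, a):
    """grad_theta log pi_theta(a|s); only row s of the gradient is nonzero."""
    g = np.zeros(np.shape(theta))
    g[s] = -softmax_policy(theta, s)       # -pi_theta(a'|s) for every a'
    g[s, a] += 1.0                         # +1 at the taken action a
    return g
```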
1. (2 pts) Initialize with $\theta_0 = (0, 0, 0, 0)^\top$. Compute $J(\pi_{\theta_0}) = V^{\pi_{\theta_0}}(s_1)$ and
$$\nabla_\theta J(\pi_\theta)\big|_{\theta=\theta_0} = \frac{1}{1-\gamma}\, \mathbb{E}_{(s,a)\sim d^{\pi_{\theta_0}}}\!\left[Q^{\pi_{\theta_0}}(s, a)\, \nabla_\theta \log \pi_\theta(a \mid s)\big|_{\theta=\theta_0}\right],$$
where $d^\pi$ is the normalized discounted state-action occupancy and $Q^\pi$ is the Q-function. (Hint: you can use the following fact to compute the occupancy:
$$d^\pi(s) = (1-\gamma)\, d_0(s) + \gamma \sum_{s' \in \mathcal{S}, a \in \mathcal{A}} \pi(a \mid s')\, P(s \mid s', a)\, d^\pi(s'), \quad \forall s \in \mathcal{S},$$
where $d_0(s)$ is the initial state distribution; a small numerical sketch of this recursion appears after this list.)
2. (2 pts) Perform one policy-gradient step with learning rate $\alpha = 1.0$ on $\theta_{s_1,a_1}$ and $\theta_{s_1,a_2}$ (while fixing $\theta_{s_2,a_1}$ and $\theta_{s_2,a_2}$). Is $J(\pi_\theta)$ improved after the update? (Hint: you do not need to compute $J(\pi)$ exactly after the update to make the comparison. Instead, you may treat $x := \pi(a_1 \mid s_1) \in (0, 1)$ as a variable, represent $J(\pi)$ as a function of $x$, and then examine how $J(\pi)$ changes after the update as a function of $x$.)
3. (1 pt) It is easy to see that the optimal softmax policy has $\pi(a_1 \mid s_1) \approx 0$ and $\pi(a_2 \mid s_2) \approx 1$ (note that these probabilities cannot be exactly 0 or 1 because $\exp(\cdot) > 0$). Show that for arbitrary $\alpha > 0$, after one policy-gradient step $\theta_1 = \theta_0 + \alpha \nabla_\theta J(\pi_\theta)\big|_{\theta=\theta_0}$, we always have $\pi_{\theta_1}(a_1 \mid s_1) > \pi_{\theta_0}(a_1 \mid s_1)$, i.e., the distance between $\pi_{\theta_1}(\cdot \mid s_1)$ and the optimal policy increases.
Provide a brief discussion of why/how this seeming "contradiction" happens: although the policy value is improved, the policy at some states becomes more different from the optimal policy.
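As referenced in the hint of part 1, here is a minimal sketch of how the occupancy recursion can be solved as a linear system for a generic tabular MDP; it is meant only as a numerical sanity check of a hand computation, and the argument names and array shapes (`P[s, a, s']`, `pi[s, a]`, `d0[s]`) are assumptions.

```python
import numpy as np

def discounted_state_occupancy(P, pi, d0, gamma):
    """Solve d(s) = (1-gamma) d0(s) + gamma * sum_{s',a} pi(a|s') P(s|s',a) d(s')."""
    nS = d0.shape[0]
    # State-to-state kernel under pi: P_pi[s, s'] = sum_a pi(a|s) * P(s'|s, a)
    P_pi = np.einsum("sa,sat->st", pi, P)
    # Rearranged as a linear system: (I - gamma * P_pi^T) d = (1 - gamma) * d0
    # The state-action occupancy used in the gradient is then d(s) * pi(a|s).
    return np.linalg.solve(np.eye(nS) - gamma * P_pi.T, (1.0 - gamma) * d0)
```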
(Space for Q3)
4. (Optional, 3 pts) Deterministic Policy Gradient Theorem. In class we derived the following stochastic policy gradient (PG) theorem (a stochastic policy has the form $\pi_\theta(a \mid s)$, i.e., $\pi$ takes a state as input and maps it to a distribution over the action space):
$$\nabla_\theta J(\pi_\theta) = \frac{1}{1-\gamma}\, \mathbb{E}_{(s,a) \sim d^{\pi_\theta}}\!\left[\left(\nabla_\theta \log \pi_\theta(a \mid s)\right) Q^{\pi_\theta}(s, a)\right].$$
Here you are asked to prove the deterministic policy gradient theorem (a deterministic policy has the form $\pi_\theta(s)$, i.e., $\pi$ takes a state as input and maps it to a single action):
$$\nabla_\theta J(\pi_\theta) = \frac{1}{1-\gamma}\, \mathbb{E}_{s \sim d^{\pi_\theta}}\!\left[\nabla_\theta \pi_\theta(s)\, \nabla_a Q^{\pi_\theta}(s, a)\big|_{a = \pi_\theta(s)}\right].$$
Here we typically have a continuous action space $\mathcal{A}$, where $\mathcal{A}$ is a subset of some Euclidean space (which is why we can take the gradient w.r.t. $a$, i.e., $\nabla_a$, above). You may also assume that $P(s' \mid s, a)$, $\nabla_a P(s' \mid s, a)$, $\pi_\theta(s)$, $\nabla_\theta \pi_\theta(s)$, $\nabla_a R(s, a)$, and $d(s)$ are sufficiently smooth in their input argument(s) $s, a, s'$.
Hint: Since $J(\pi_\theta) = \int_{\mathcal{S}} d_0(s)\, V^{\pi_\theta}(s)\, \mathrm{d}s$, it makes things easier to start with computing $\nabla_\theta V^{\pi_\theta}(s)$. You can apply the Bellman equation recursively when calculating this quantity.
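For reference, the Bellman equation invoked in the hint, written for a deterministic policy on a continuous state space, is the standard fixed-point identity (stated here only for convenience; carrying out the differentiation is still the task):
$$V^{\pi_\theta}(s) = R(s, \pi_\theta(s)) + \gamma \int_{\mathcal{S}} P(s' \mid s, \pi_\theta(s))\, V^{\pi_\theta}(s')\, \mathrm{d}s'.$$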