Artificial Intelligence
Assignment 2 - Reinforcement Learning
1 Problem context
Taxi Navigation with Reinforcement Learning: In this assignment,
you are asked to implement the Q-learning and SARSA methods for a taxi
navigation problem. To run your experiments and test your code, you should
use the Gym library (footnote 1), an open-source Python library for developing
and comparing reinforcement learning algorithms. You can install Gym
by running the following command in your command prompt:
pip install gym
In the taxi navigation problem, there are four designated locations in the
grid world indicated by R(ed), G(reen), Y(ellow), and B(lue). When the
episode starts, one taxi starts off at a random square and the passenger is
at a random location (one of the four specified locations). The taxi drives
to the passenger’s location, picks up the passenger, drives to the passenger’s
destination (another one of the four specified locations), and then drops off
the passenger. Once the passenger is dropped off, the episode ends. To show
the taxi grid world environment, you can use the following code:
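import gym
# Create the Taxi-v3 environment with text (ANSI) rendering.
env = gym.make("Taxi-v3", render_mode="ansi").env
# In Gym 0.26 and later, reset() returns (observation, info).
state, info = env.reset()
rendered_env = env.render()
print(rendered_env)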
To render the environment, there are three modes: “human”, “rgb_array”,
and “ansi”. The “human” mode visualizes the environment in a way suitable
for human viewing, and the output is a graphical window that displays the
current state of the environment (see Fig. 1). The “rgb_array” mode
provides the environment’s state as an RGB image, and the output is a
NumPy array representing that image. The “ansi” mode provides a
text-based representation of the environment’s state, and the output is a
string that represents the current state of the environment using ASCII
characters (see Fig. 2).
Figure 1: “human” mode presentation for the taxi navigation problem in
the Gym library.
You are free to choose either the “human” or the “ansi” presentation
mode, but for simplicity we recommend “ansi”. The environment has six
discrete, deterministic actions, which are listed in Table 1.
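As a small illustration, the action numbering in Table 1 can be given readable names so that agent code does not rely on bare integers. This is only a sketch: the class name TaxiAction and the member names are hypothetical and are not part of Gym.

```python
from enum import IntEnum


class TaxiAction(IntEnum):
    """Action indices from Table 1 (hypothetical helper, not a Gym class)."""
    SOUTH = 0
    NORTH = 1
    EAST = 2
    WEST = 3
    PICKUP = 4
    DROPOFF = 5


# IntEnum members compare equal to plain ints; int(action) gives the raw index.
MOVEMENT_ACTIONS = [TaxiAction.SOUTH, TaxiAction.NORTH,
                    TaxiAction.EAST, TaxiAction.WEST]
```

If an API insists on a plain integer, int(TaxiAction.PICKUP) yields the raw action number.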
For this assignment, you need to implement the Q-learning and SARSA
algorithms for the taxi navigation environment. The main objective is for
the agent (taxi) to learn how to navigate the grid world and deliver the
passenger in the minimum possible number of steps. To accomplish
the learning task, you should empirically determine hyperparameters, e.g.,
the learning rate α, exploration parameters (such as ε or T), and discount
factor γ for your algorithm.

Figure 2: “ansi” mode presentation for the taxi navigation problem in the Gym
library. Gold represents the taxi location, blue is the pickup location, and
purple is the drop-off location.

Table 1: Six possible actions in the taxi navigation environment.
Action               Action number
Move South           0
Move North           1
Move East            2
Move West            3
Pickup Passenger     4
Drop off Passenger   5

Your agent should be penalized -1 per step it takes, receive a +20 reward
for delivering the passenger, and incur a -10
penalty for executing “pickup” and “drop-off” actions illegally. You should
try different exploration parameters to find the value that best balances
exploration and exploitation.
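The pieces described above (exploration with ε or T, and the Q-learning and SARSA updates with α and γ) can be sketched as follows. This is a minimal illustration under stated assumptions, not a reference solution: the function names are hypothetical, the Q-table is assumed to be a list of per-state lists of action values, and T is assumed to be the temperature of a Boltzmann (softmax) policy.

```python
import math
import random


def epsilon_greedy(q_row, epsilon, rng=random):
    """With probability epsilon pick a random action, otherwise a greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    best = max(q_row)
    # Break ties at random so an all-zero row does not always yield action 0.
    return rng.choice([a for a, q in enumerate(q_row) if q == best])


def boltzmann(q_row, temperature, rng=random):
    """Sample an action with probability proportional to exp(Q / T)."""
    top = max(q_row)  # subtracting the max keeps exp() numerically stable
    weights = [math.exp((q - top) / temperature) for q in q_row]
    return rng.choices(range(len(q_row)), weights=weights, k=1)[0]


def q_learning_update(Q, s, a, r, s_next, done, alpha, gamma):
    """Off-policy update: bootstrap from the best action value in s_next."""
    target = r if done else r + gamma * max(Q[s_next])
    Q[s][a] += alpha * (target - Q[s][a])


def sarsa_update(Q, s, a, r, s_next, a_next, done, alpha, gamma):
    """On-policy update: bootstrap from the action actually chosen in s_next."""
    target = r if done else r + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (target - Q[s][a])
```

The only difference between the two updates is the bootstrap term: Q-learning uses the maximum over next actions, while SARSA uses the value of the next action selected by the current exploration policy. Decaying ε or T over episodes is one common option you may want to compare against fixed values.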
As an outcome, you should plot the accumulated reward per episode and
the number of steps taken by the agent in each episode for at least 1000
learning episodes for both the Q-learning and SARSA algorithms. Examples
of these two plots are shown in Figures 3–6. Please note that the provided
plots are just examples and, therefore, your plots will not be exactly like the
provided ones, as the learning parameters will differ for your algorithm.
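One possible way to produce the two required plots is sketched below. It assumes you have collected two lists during training, one with the accumulated reward and one with the step count of each episode. The function names, the smoothing window, and the output file name are all hypothetical choices, and the plotting function assumes matplotlib is installed.

```python
def moving_average(values, window):
    """Trailing mean over at most `window` recent values; same length as input."""
    out, running = [], 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            running -= values[i - window]
        out.append(running / min(i + 1, window))
    return out


def plot_learning_curves(episode_rewards, episode_steps,
                         path="learning_curves.png", window=50):
    """Plot reward and steps per episode, raw and smoothed, and save to `path`."""
    import matplotlib.pyplot as plt  # imported lazily; only this function needs it

    fig, (ax_r, ax_s) = plt.subplots(2, 1, sharex=True)
    ax_r.plot(episode_rewards, alpha=0.3)
    ax_r.plot(moving_average(episode_rewards, window))
    ax_r.set_ylabel("Accumulated reward")
    ax_s.plot(episode_steps, alpha=0.3)
    ax_s.plot(moving_average(episode_steps, window))
    ax_s.set_ylabel("Steps")
    ax_s.set_xlabel("Episode")
    fig.savefig(path)
    plt.close(fig)
```

Smoothing is optional; it only makes the trend easier to see when the per-episode values are noisy during exploration.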
After training your algorithm, you should save your Q-values. Based on
your saved Q-table, your algorithms will be tested on at least 100 random
grid-world scenarios with the same characteristics as the taxi environment for
both the Q-learning and SARSA algorithms using the greedy action selection