The figure above shows the domain of a more complex game. There are 25 grid locations the agent
could be in. A prize could be on one of the corners, or there could be no prize. When the agent
lands on a prize, it receives a reward of 10 and the prize disappears. When there is no prize, for
each time step there is a probability that a prize appears on one of the corners. Monsters can
appear at any time on one of the locations marked M. The agent gets damaged if a monster
appears on the square the agent is on. If the agent is already damaged, it receives a reward of
-10. The agent can get repaired (i.e., so it is no longer damaged) by visiting the repair station.
In this example, the state consists of four components: ⟨X,Y,P,D⟩, where X is the X-coordinate of
the agent, Y is the Y-coordinate of the agent, P is the position of the prize (P=0 if there is a prize
on P0, P=1 if there is a prize on P1, similarly for 2 and 3, and P=4 if there is no prize), and D is
Boolean and is true when the agent is damaged. Because the monsters are transient, it is not
necessary to include them as part of the state. There are thus 5×5×5×2 = 250 states. The
environment is fully observable, so the agent knows what state it is in. But the agent does not
know the meaning of the states; it has no idea initially about being damaged or what a prize is.
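Since a tabular learner indexes its Q-table by state number, the four-component state must be mapped to a single integer. A minimal sketch of one such encoding, assuming X and Y range over 0..4, P over 0..4, and D is 0 or 1 (the function names are illustrative, not from the text):

```python
def state_to_index(x, y, p, d):
    """Pack <X, Y, P, D> into a unique integer in 0..249."""
    return ((x * 5 + y) * 5 + p) * 2 + d

def index_to_state(i):
    """Inverse mapping: recover <X, Y, P, D> from the index."""
    d = i % 2
    i //= 2
    p = i % 5
    i //= 5
    y = i % 5
    x = i // 5
    return x, y, p, d
```

The mixed-radix factors 5, 5, 5, 2 multiply to exactly 250, so every state gets a distinct index and the mapping is invertible.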
The agent has four actions: up, down, left, and right. Each action moves the agent one step,
usually in the direction indicated by the action's name but sometimes in one of the other directions. If
the agent crashes into an outside wall or one of the interior walls (the thick lines near the
location R), it remains where it was and receives a reward of -1.
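The movement dynamics alone can be sketched as below. This ignores prizes, monsters, and the interior walls, and the slip probability is an assumption (the text says moves "sometimes" go in another direction without giving a number); the coordinate convention of y increasing upward is also an assumption:

```python
import random

# Action -> (dx, dy); y increases upward (an assumed convention).
ACTIONS = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}

def move(x, y, action, slip=0.3, rng=random):
    """One movement attempt on the 5x5 grid.

    With probability `slip` (an illustrative value) the agent slips and
    moves in one of the other three directions instead. Returns
    (new_x, new_y, reward); crashing into an outside wall leaves the
    agent in place with reward -1.
    """
    if rng.random() < slip:
        action = rng.choice([a for a in ACTIONS if a != action])
    dx, dy = ACTIONS[action]
    nx, ny = x + dx, y + dy
    if 0 <= nx < 5 and 0 <= ny < 5:
        return nx, ny, 0
    return x, y, -1  # crashed into an outside wall
```

A full simulator would layer the prize, monster, and repair-station logic on top of this step function.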
The agent does not know any of the story given here. It just knows there are 250 states and 4
actions, which state it is in at each time step, and what reward it received at each step. You need to:
(i) build a simulator that replicates the above behaviour of the agent moving in the grid;
(ii) then use Q-learning on the simulator built in (i) to learn the best policy for the
agent to move in this environment.
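Task (ii) can be sketched as tabular, epsilon-greedy Q-learning over the 250 states and 4 actions. Here the simulator is abstracted as a `step(s, a) -> (next_state, reward)` function to be supplied by task (i); the values of alpha, gamma, epsilon, and the episode lengths are illustrative choices, not from the text:

```python
import random

N_STATES, N_ACTIONS = 250, 4

def q_learn(step, episodes=1000, steps=100, alpha=0.1, gamma=0.9, eps=0.1):
    """Learn a Q-table from a simulator step(s, a) -> (next_state, reward).

    Uses epsilon-greedy exploration and the standard Q-learning update:
    Q[s][a] += alpha * (r + gamma * max_a' Q[s'][a'] - Q[s][a]).
    """
    Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    for _ in range(episodes):
        s = random.randrange(N_STATES)        # arbitrary start state
        for _ in range(steps):
            if random.random() < eps:         # explore
                a = random.randrange(N_ACTIONS)
            else:                             # exploit current estimate
                a = max(range(N_ACTIONS), key=lambda i: Q[s][i])
            s2, r = step(s, a)
            Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
            s = s2
    return Q
```

The greedy policy is then `argmax_a Q[s][a]` in each state; plugging in the simulator from task (i) in place of the abstract `step` completes the exercise.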