Non-Markovian variant of Sutton's Markovian maze task, with changing policy environment. The external environment consists of a two-dimensional grid with $9 \times 6$ fields. $F(i,j)$ denotes the field in the $j$-th row and the $i$-th column. The following fields are blocked by obstacles: $F(3,3)$, $F(3,4)$, $F(3,5)$, $F(6,2)$, $F(8,4)$, $F(8,5)$, $F(8,6)$. In the beginning, an artificial ``animat'' is placed on $F(1,4)$ (the start field). In addition to the 17 general primitives from Table 1 (not counting input/output primitives), there are four problem-specific primitives with obvious meaning: one-step-north(), one-step-south(), one-step-east(), one-step-west(). The system cannot execute actions that would lead outside the grid or into an obstacle. Again, the following values are written into special cells in the input area whenever they change: IP, sp, remainder. Another input cell is filled with a 1 whenever the animat is on the goal field; otherwise it is filled with a 0. There is only one way the animat can partially observe aspects of the external world: four additional input cells are rewritten after each execution of a problem-specific primitive -- the first (second, third, fourth) cell is filled with a 1 if the field to the north (south, east, west) of the animat is blocked or does not exist; otherwise the cell is filled with a 0. Whenever the animat reaches $F(9,6)$ (the goal field), the system receives a constant payoff of 100, and the animat is transferred back to $F(1,4)$ (the start field). Parameters for storage size etc. are the same as for the previous task, and time is measured in the same way. Clearly, to maximize cumulative payoff, the system has to find short paths from start to goal.
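For concreteness, the following Python sketch models the external part of the task as just described. The names (blocked, observe, step) and the row-numbering convention (rows counted from the bottom, so that the goal $F(9,6)$ is the top-right corner) are illustrative assumptions, not part of the original system.
\begin{verbatim}
OBSTACLES = {(3, 3), (3, 4), (3, 5), (6, 2), (8, 4), (8, 5), (8, 6)}
START, GOAL = (1, 4), (9, 6)   # F(1,4) and F(9,6)
WIDTH, HEIGHT = 9, 6           # columns x rows
PAYOFF = 100

# The four problem-specific primitives. Rows are assumed numbered
# from the bottom, so "north" increases the row index (assumption).
MOVES = {"north": (0, 1), "south": (0, -1),
         "east": (1, 0), "west": (-1, 0)}

def blocked(i, j):
    """True if F(i,j) is an obstacle or does not exist."""
    return not (1 <= i <= WIDTH and 1 <= j <= HEIGHT) \
        or (i, j) in OBSTACLES

def observe(pos):
    """The four observation cells (order: north, south, east, west):
    1 if the adjacent field is blocked or does not exist, else 0."""
    i, j = pos
    return tuple(int(blocked(i + di, j + dj))
                 for di, dj in MOVES.values())

def step(pos, direction):
    """Execute a movement primitive. Moves into obstacles or off the
    grid are not executed. Returns (new position, payoff)."""
    di, dj = MOVES[direction]
    i, j = pos
    if not blocked(i + di, j + dj):
        pos = (i + di, j + dj)
    if pos == GOAL:
        return START, PAYOFF   # payoff, then transfer back to start
    return pos, 0
\end{verbatim}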
Why this task is non-trivial. Unlike previous reinforcement learning systems, the system does not come with a smart built-in strategy for temporal credit assignment -- it has to develop such strategies on its own. Unlike with Sutton's original set-up, the system does not see a built-in unique representation of its current position on the grid: its interface to the environment is non-Markovian. This is one reason why most traditional reinforcement learning systems do not have a sound strategy for solving this task. Another is the changing policy environment; see the next paragraph.
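To see the perceptual aliasing concretely: under the sketch above, distinct fields can produce identical observations -- for instance, $F(2,2)$ and $F(5,3)$ (two example fields under the assumed obstacle layout) both yield an all-zero input vector, so the animat cannot identify its position from the current observation alone.
\begin{verbatim}
# Two different positions, same observation -- the hallmark of a
# non-Markovian (partially observable) interface:
assert observe((2, 2)) == observe((5, 3)) == (0, 0, 0, 0)
\end{verbatim}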
Changing policy environment. The fact that the obstacles remain where they are and that the animat is occasionally reset to its start position does not imply that the policy environment does not change. In fact, many actions executed by the system change the contents of storage cells, which are never reset: as mentioned above, storage cells are like an external notebook and have to be viewed as part of the policy environment -- there are no exactly repeatable trials with identical initial conditions.
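A minimal sketch of this point, continuing the code above (the number of storage cells is a placeholder): resetting the animat does not reset the storage, so successive trials never start from identical conditions.
\begin{verbatim}
storage = [0] * 16     # hypothetical "notebook"; never reset
pos = START
storage[0] = 1         # some primitive writes to a storage cell
pos, payoff = step(pos, "east")   # may reset pos to START at the
                                  # goal, but storage keeps its
                                  # contents: the next trial starts
                                  # under different conditions
\end{verbatim}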