
EXPERIMENT: SOLVING A DIFFICULT POE TASK

Environment. Figure 1 (below) shows a partially observable environment (POE) with $600 \times 800$ pixels. This POE has many more states and obstacles than most reported in the POE literature -- for instance, Littman et al.'s largest problem [Littman:94] involves fewer than 1000 states. There are two SMP/SSA-based agents, A and B. Each is circular, with a diameter of 30 pixels, and at any given time faces one of eight directions. The state space size far exceeds $10^{13}$, not even taking into account the internal states (IP values) of the agents.

There are also two keys, key A (only useful for agent A) and key B (only for agent B), and two locked doors, door A and door B, the only entries to room A and room B, respectively. Door A (B) can be opened only with key A (B). At the beginning of each ``trial'', both agents are randomly rotated and placed near the northwest corner, all doors are closed, key A is placed in the southeast corner, and key B is placed in room A (see Figure 1).

Task. The goal of each agent is to reach the goal in room B. This requires cooperation: (1) agent A must first find and take key A (by touching it); (2) then agent A must go to door A and open it (by touching it) for agent B; (3) then agent B must enter through door A, find and take key B; (4) then agent B must go to door B to open it (to free the way to the goal); (5) then at least one of the agents must reach the goal position. For each turn or move there is a small penalty (immediate reward -0.0001).

Positive reward is generated only if one of the agents touches the goal. This agent's reward is 5.0; the other's is 3.0 (for its cooperation -- note that asymmetric reward introduces competition). Only then does a new trial start. There is no maximum trial length, and the agents have no a priori concept of a trial. Time is not reset when a new trial starts.

Instruction set. Both agents share the same design. Each is equipped with limited ``active'' sight: by executing certain instructions, it can sense obstacles, its own key, the corresponding door, or the goal, within up to 10 ``steps'' in front of it. The step size is 5 pixel widths. The agent can also move forward, turn around, or turn relative to its key, its door, or the goal. Directions are represented as integers in $\{0, \ldots, 7\}$: 0 for north, 1 for northeast, 2 for east, etc. Each agent has m=52 program cells and $n_{ops}=13$ instructions (including the JumpHome/PLA/evaluation instructions B0, B1, B2 from section 3):

B3:
Move(n) -- If $0\leq n\leq 9$ then move n-3 steps forward (if $n\geq 3$) or -n steps backward (if n<3) in the current direction.

B4:
Turn(d) -- If $0\leq d\leq 7$ then change current direction D to (D+d+4) mod 8.

B5:
TurnRelativeToKey(d) -- If $0\leq d\leq 7$, turn to the direction that best matches the line connecting the centers of the agent and its key, then Turn(d).

B6:
TurnRelativeToDoor(d) -- (analogous to B5).

B7:
TurnRelativeToGoal(d) -- (analogous to B5).

B8:
LookForKey(n) -- If $0\leq n\leq 9$: if the agent's key is not within n+1 steps in front of the agent then increase IP by 4 (this is a limited kind of conditional jump). If IP>m-1 then set IP=0.

B9:
LookForDoor(n) -- (analogous to B8).

B10:
LookForObstacle(n) -- (analogous to B8).

B11:
LookForGoal(n) -- (analogous to B8).

B12:
CondJumpBack(a1) -- If a1 = 0 and the agent does not hold its key, or if a1 = 1 and the corresponding door is closed, or if a1 = 2 and the agent does not touch the goal, then set the agent's IP to the instruction following the last executed CondJumpBack (if there is no such instruction, set IP=0).

Each instruction occupies 2 successive program cells except for IncProb, which occupies 4. Limited obstacle perception makes this problem a difficult POE -- unlike key and door, obstacles are not visible from far away. Note that our agents may use memory (embodied by their IPs) to disambiguate inputs.
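To make the semantics of these instructions concrete, here is a minimal Python sketch of how an interpreter might execute a few of them (B4, B5, B8, B12). The Agent class, its fields, and the key_direction/key_visible/has_key arguments are illustrative assumptions rather than the implementation used in the experiments; the normal advancement of IP after each instruction is omitted.

M = 52       # program cells per agent
N_OPS = 13   # instructions B0..B12 (B13 is added later)

class Agent:
    def __init__(self):
        self.ip = 0                       # instruction pointer; part of the agent's internal state
        self.direction = 0                # 0 = north, 1 = northeast, ..., 7 = northwest
        self.last_cond_jump_back = None   # IP of the cell following the last executed CondJumpBack

    def turn(self, d):
        # B4: Turn(d) -- change current direction D to (D + d + 4) mod 8.
        if 0 <= d <= 7:
            self.direction = (self.direction + d + 4) % 8

    def turn_relative_to_key(self, d, key_direction):
        # B5: face the direction best matching the agent-to-key line (supplied by
        # the simulator as key_direction in {0,...,7}), then Turn(d).
        if 0 <= d <= 7:
            self.direction = key_direction
            self.turn(d)

    def look_for_key(self, n, key_visible):
        # B8: if the key is not within n+1 steps in front of the agent,
        # skip 4 program cells (a limited conditional jump); wrap IP if needed.
        if 0 <= n <= 9 and not key_visible(n + 1):
            self.ip += 4
            if self.ip > M - 1:
                self.ip = 0

    def cond_jump_back(self, a1, has_key, door_closed, touches_goal):
        # B12: jump back to the cell following the last executed CondJumpBack
        # (or to cell 0) while the selected condition is still unsatisfied.
        fails = ((a1 == 0 and not has_key) or
                 (a1 == 1 and door_closed) or
                 (a1 == 2 and not touches_goal))
        if fails:
            self.ip = self.last_cond_jump_back if self.last_cond_jump_back is not None else 0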

Results without learning. If we switch off SMP's self-modifying capabilities (IncProb has no effect), then the average trial length is 330,000 basic cycles (random behavior).

Results with Q-Learning. Q-learning assumes that the environment is fully observable; otherwise it is not guaranteed to work. Still, some authors occasionally apply Q-learning variants to partially observable environments, sometimes even successfully [Crites:96]. To test whether our problem is indeed too difficult for Q-learning, we tried to solve it using various $TD(\lambda)$ Q-variants [Lin:93]. We first used primitive actions and perceptions similar to SMP's instructions. There are 33 possible Q-actions. The first 32 are ``turn to one of the 8 different directions relative to the agent's key/door/current direction/goal, and move 3 steps forward''. The 33rd action allows for turning without moving: ``turn 45 degrees to the right''. These actions are more powerful than SMP's instructions (most combine two SMP-like actions into one). There are $2 \times 5 = 10$ different inputs, telling the agent whether or not it holds its key, and whether the closest object whose nearest part lies 10, 20, 30, 40, or 50 pixels in front of the agent is an obstacle, key, door, goal, or non-existent. All of this corresponds to 10 rows and 33 columns in the Q-tables. Q-learning's parameters are $\lambda = 0.9$, $\gamma = 1.0$, and learning rate 0.001 (these worked well for smaller problems Q-learning was able to solve). For each executed action there is a small immediate penalty of -0.0001, to make Q-learning favor shortest paths.
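For concreteness, here is a minimal Python/NumPy sketch of such a tabular $TD(\lambda)$ Q-learner: a 10 x 33 Q-table with accumulating eligibility traces and the parameters listed above. The environment interface and the epsilon-greedy exploration rate are assumptions; the text does not specify the exploration scheme or the exact trace variant of [Lin:93].

import numpy as np

N_STATES, N_ACTIONS = 10, 33            # 10 input rows, 33 Q-actions
LAM, GAMMA, ALPHA = 0.9, 1.0, 0.001     # lambda, discount, learning rate from the text
EPSILON = 0.1                           # assumed exploration rate (not given in the text)

Q = np.zeros((N_STATES, N_ACTIONS))     # Q-table
trace = np.zeros_like(Q)                # eligibility traces

rng = np.random.default_rng()

def select_action(s):
    # Epsilon-greedy action selection over the current Q-table row.
    if rng.random() < EPSILON:
        return int(rng.integers(N_ACTIONS))
    return int(np.argmax(Q[s]))

def q_lambda_step(s, a, r, s_next):
    # One Q(lambda) update with accumulating traces; r already includes
    # the -0.0001 per-action penalty mentioned above.
    a_next = int(np.argmax(Q[s_next]))  # greedy successor action
    delta = r + GAMMA * Q[s_next, a_next] - Q[s, a]
    trace[s, a] += 1.0
    Q[:] += ALPHA * delta * trace       # update all traced entries
    trace[:] *= GAMMA * LAM             # decay the traces
    return a_next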

This Q-learning variant, however, utterly failed to achieve significant performance improvement. We then tried to make the problem easier (more observable) by extending the agent's sensing capabilities. Now each possible input tells the agent uniquely whether or not it holds its key, and, for each of three viewing directions (straight ahead, 45 degrees to the right, 45 degrees to the left), whether the closest object there (obstacle, key, door, or goal) is non-existent or, if it exists, whether it is 10, 20, 30, 40, or 50 pixels away. All this can be efficiently coded by $2 \times (4 \times 5 + 1)^3 = 18522$ different inputs, corresponding to 18522 different rows in the Q-tables (with a total of 611226 entries). This indeed worked a bit better than the simpler Q-variant. Still, we were not able to make Q-learning achieve very significant performance improvement -- see Figure 2.
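As a sanity check of this input coding, here is a short Python sketch that maps each extended observation to a unique row index in 0..18521; the particular ordering of object types and distances is an arbitrary assumption, only the counting matters.

OBJECTS = ("obstacle", "key", "door", "goal")   # 4 object types
DISTANCES = (10, 20, 30, 40, 50)                # 5 distances in pixels

def cell_code(obj, dist):
    # Code for one viewing direction: 0 means nothing seen within 50 pixels,
    # 1..20 encode the (object type, distance) pair -- 4*5 + 1 = 21 values.
    if obj is None:
        return 0
    return 1 + OBJECTS.index(obj) * len(DISTANCES) + DISTANCES.index(dist)

def state_index(has_key, ahead, right45, left45):
    # Combine the key bit with three direction codes into one of
    # 2 * 21**3 = 18522 distinct Q-table rows.
    idx = 1 if has_key else 0
    for obj, dist in (ahead, right45, left45):
        idx = idx * 21 + cell_code(obj, dist)
    return idx

assert state_index(True, ("goal", 50), (None, None), ("obstacle", 10)) < 2 * 21**3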

Results with SMP/SSA. After $10^9$ basic cycles (about 130,000 trials), average trial length was 5500 basic cycles (mean of 4 simulations). This is about 60 times faster than the initial random behavior, and roughly $\frac{1}{4}$ to $\frac{1}{3}$ of the optimal speed estimated by hand-crafting a solution (due to the POE setting and the random trial initializations it is very hard to calculate the optimal average speed). Typical results are shown in Figures 2 and 3.

Q-learning as an instruction for SMP. Q-learning is not designed for POEs. This does not mean, however, that Q-learning cannot be plugged into SMP/SSA as a sometimes useful instruction. To examine this issue, we add $TD(\lambda)$ Q-learning [Lin:93] to the instruction set of both agents' SMPs:

B13:
Q-learning(a1) -- With probability $\frac{a1}{200 \cdot n_{ops}}$, keep executing actions according to the Q-table until there is non-zero reward, and update the Q-table according to the $TD(\lambda)$ Q-learning rules [Lin:93]. Otherwise, execute only a single action according to the current Q-table.
Interestingly, this Q-plug-in leads to even better results near system death (see Figure 2). Essentially, the SMPs learn when to trust the Q-table.
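A hedged sketch of such a B13 plug-in, reusing Q and q_lambda_step from the Q-learning sketch above: the env interface, the instruction-set size of 14 once B13 is included, and the reading of ``non-zero reward'' as reward other than the tiny per-step penalty are all assumptions.

import random
import numpy as np

N_OPS_WITH_B13 = 14        # assumed instruction-set size once B13 is added
STEP_PENALTY = -0.0001     # per-action penalty from the task definition

def q_plugin(a1, env, agent):
    # B13 sketch: with probability a1 / (200 * n_ops), keep acting greedily on the
    # Q-table and updating it until the reward differs from the step penalty;
    # otherwise execute a single greedy action without the learning loop.
    if random.random() < a1 / (200.0 * N_OPS_WITH_B13):
        s = env.observe(agent)                   # assumed environment interface
        a = int(np.argmax(Q[s]))
        while True:
            s_next, r = env.execute(agent, a)
            a = q_lambda_step(s, a, r, s_next)   # TD(lambda) update from the sketch above
            s = s_next
            if r != STEP_PENALTY:                # 'non-zero' reward reached (assumption)
                break
    else:
        s = env.observe(agent)
        env.execute(agent, int(np.argmax(Q[s])))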


  
Figure 1: Traces of interacting learning agents A and B in a partially observable environment with huge state space. Initially, both agents are randomly rotated and placed near the northwest corner. Agent A first moves towards the southeast corner to grab its key, then moves north (dotted trace) to open door A (grey). Simultaneously, agent B moves east to door A and waits for agent A to open it. Then B moves in, grabs key B, turns, and heads towards door B to open it, while agent A also heads southwest towards the goal (dotted trace). This time, however, agent B is the one who touches the goal first, because A fails to quickly circumvent the obstacle in the center. All this complex, partly stochastic behavior is learned solely by self-modifications (generated according to the policy itself via IncProb calls), although the strongly delayed reward is provided only if one of the agents touches the goal.


    
Figure 2: Typical performance of SMP/SSA compared to one of the best Q-learning variants (reward intake sampled at intervals of $10^6$ instruction cycles, mean of 4 simulations). Q-learning hardly improves, while SMP makes rather quick, substantial progress. Interestingly, adding Q-learning to SMP's instruction set tends to improve late-life performance a little bit (trace of crosses) -- essentially, SMP learns when to trust/ignore the Q-table.
Figure 3: Number of executed self-modifying instructions (IncProbs) plotted against time, sampled at intervals of $2\times 10^6$ instruction cycles (mean of 3 simulations). In the beginning the agents use self-modifications to rapidly increase self-modification frequency. Towards death they learn that there is not as much to learn any more, and decrease this frequency, thus creating their own ``annealing schedules''.

Observations. Final stack sizes never exceeded 250, corresponding to about 250 surviving SMP-modifications. This is just a small fraction of the roughly $3.5 \times 10^4$ self-modifications executed during each agent's life. It is interesting to observe how the agents use self-modifications to adapt the self-modification frequency itself. Towards death they learn that there is not as much to learn any more, and decrease this frequency (a self-generated annealing schedule, see Figure 3). It should be mentioned that the adaptive distribution of self-modifications is highly non-uniform: it often temporarily focuses on currently ``promising'' individual SMP-components. In other words, the probabilistic self-modifying learning algorithm itself occasionally changes based on previous experience.

