
Experiment 1: A Big Partially Observable Maze (POM)

The current section is a prelude to section 3.5, which will address inductive transfer issues. Here we only show that LS by itself can be useful for POE (partially observable environment) problems. See also [Wiering:96levin].

Task. Figure 1 shows a $39 \times 38$-maze with 952 free fields, a single start position (S) and a single goal position (G). The maze has more fields and obstacles than mazes used by previous authors working on POMs -- for instance, McCallum's maze has only 23 free fields [McCallum:95]. The goal is to find a program that makes an agent move from S to G.

Figure: An apparently complex, partially observable $39 \times 38$-maze with a low-complexity shortest path from start S to goal G involving 127 steps. Despite the relatively large state space, the agent can implicitly perceive only one of three highly ambiguous types of input, namely ``front is blocked or not'', ``left field is free or not'', ``right field is free or not'' (compare list of primitives). Hence, from the agent's perspective, the task is a difficult POE problem. The $S$ and the arrow indicate the agent's initial position and rotation.

Instructions. Programs can be composed from 9 primitive instructions. These instructions represent the initial bias provided by the programmer (in what follows, superscripts will indicate instruction numbers). The first 8 instructions have the following syntax: REPEAT step forward UNTIL condition $Cond$, THEN rotate towards direction $Dir$.
Instruction 1 : $Cond$ = front is blocked, $Dir$ = left.
Instruction 2 : $Cond$ = front is blocked, $Dir$ = right.
Instruction 3 : $Cond$ = left field is free, $Dir$ = left.
Instruction 4 : $Cond$ = left field is free, $Dir$ = right.
Instruction 5 : $Cond$ = left field is free, $Dir$ = none.
Instruction 6 : $Cond$ = right field is free, $Dir$ = left.
Instruction 7 : $Cond$ = right field is free, $Dir$ = right.
Instruction 8 : $Cond$ = right field is free, $Dir$ = none.
Instruction 9 is: Jump(address, nr-times). It has two parameters: nr-times $\in \{1, 2, \ldots, MAXR\}$ (with the constant MAXR representing the maximum number of repetitions), and address $\in \{1, 2, \ldots, top\}$, where $top$ is the highest address in the current program. Jump uses an additional hidden variable nr-times-to-go, which is initially set to nr-times. The semantics are: If nr-times-to-go $> 0$, continue execution at the target address. If $0 <$ nr-times-to-go $< MAXR$, decrement nr-times-to-go. If nr-times-to-go $= 0$, set nr-times-to-go back to nr-times (execution then falls through to the next instruction). Note that nr-times $= MAXR$ may cause an infinite loop. The Jump instruction is essential for exploiting the possibility that solutions may consist of repeatable action sequences and ``subprograms'', thus having low algorithmic complexity [Kolmogorov:65, Chaitin:69, Solomonoff:64]. LS' incrementally growing time limit automatically deals with those programs that do not halt, by preventing them from consuming too much time.
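To make these semantics concrete, the following Python sketch interprets programs built from the instructions above. It is only a sketch, not the implementation used here: the Agent helper (with step_forward, rotate, front_blocked, left_free, right_free, collided and at_goal) is hypothetical, REPEAT ... UNTIL is read in the usual step-then-test sense, and the step accounting anticipates the ``Measuring time'' paragraph below.

    # Minimal interpreter sketch for the 9 primitives described above.
    # The Agent helper is hypothetical; it is not part of the paper.
    # Instructions 1-8: (condition that ends the forward walk, rotation).
    PRIMITIVES = {
        1: ('front_blocked', 'left'),
        2: ('front_blocked', 'right'),
        3: ('left_free', 'left'),
        4: ('left_free', 'right'),
        5: ('left_free', None),
        6: ('right_free', 'left'),
        7: ('right_free', 'right'),
        8: ('right_free', None),
    }

    def run_program(program, agent, maxr, max_steps):
        """Execute `program` until goal, collision, or step budget exhausted.
        Program elements are an integer 1..8 or ('jump', address, nr_times).
        Returns (solved, steps_used)."""
        steps = 0
        # Hidden per-Jump counter nr-times-to-go, initially set to nr-times.
        nr_times_to_go = {a: ins[2] for a, ins in enumerate(program, 1)
                          if isinstance(ins, tuple)}
        pc = 1                                    # 1-based program counter
        while pc <= len(program) and steps <= max_steps:
            ins = program[pc - 1]
            if isinstance(ins, tuple):            # instruction 9: Jump(address, nr_times)
                _, address, nr_times = ins
                steps += 1                        # one time step per executed Jump
                if nr_times_to_go[pc] > 0:
                    if nr_times_to_go[pc] < maxr:
                        nr_times_to_go[pc] -= 1   # nr-times == MAXR never decrements
                    pc = address
                    continue
                nr_times_to_go[pc] = nr_times     # counter exhausted: reset, fall through
            else:                                 # instructions 1-8
                cond, rotation = PRIMITIVES[ins]
                while True:                       # REPEAT step forward UNTIL cond
                    agent.step_forward()
                    steps += 1                    # each forward step costs one time step
                    if agent.collided():          # collisions with walls halt the program
                        return False, steps
                    if agent.at_goal():           # hitting the goal yields a solution
                        return True, steps
                    if steps > max_steps:
                        return False, steps
                    if getattr(agent, cond)():
                        break
                if rotation is not None:
                    agent.rotate(rotation)
                    steps += 1                    # rotations also cost one time step
            pc += 1
        return agent.at_goal(), steps

In this encoding, the 127-step solution reported below would correspond to the program [5, ('jump', 1, 3), 3, ('jump', 1, 5)].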

As mentioned in section 3.1, the probability of a program is the product of the probabilities of its constituents. To deal with the probabilities of the two Jump parameters, we introduce two additional variable matrices, $\bar{P}$ and $\hat{P}$. For a program with $l \leq k$ instructions, the conditional probability $\bar{P}_{ij}$ of a jump to address $a_j$, given that the instruction at address $a_i$ is Jump ($i \in \{1, \ldots, l\}$, $j \in \{1, \ldots, l\}$), is obtained by first normalizing the entries $\bar{P}_{i1}, \bar{P}_{i2}, \ldots, \bar{P}_{il}$ (this ensures that the relevant entries sum to 1). Provided the instruction at address $a_i$ is Jump, $\hat{P}_{ij}$ specifies the probability of the nr-times parameter being set to $j$, for $i \in \{1, \ldots, k\}$, $j \in \{1, \ldots, MAXR\}$. Both $\bar{P}$ and $\hat{P}$ are initialized uniformly and are adapted by ALS just like $P$ itself.
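For illustration, a program's probability under $P$, $\bar{P}$ and $\hat{P}$ could be computed as in the following sketch. The list-of-lists layout of the matrices and the indexing are assumptions; only the product rule and the normalization of the relevant $\bar{P}$ entries follow the text.

    # Sketch: probability of a program as the product of the probabilities
    # of its constituents.  P[i][j] is read as the probability of choosing
    # instruction j+1 at address i+1, P_bar[i][j] as an (unnormalized) weight
    # for a jump from address i+1 to address j+1, and P_hat[i][j] as the
    # probability that nr-times equals j+1; this layout is an assumption.

    def program_probability(program, P, P_bar, P_hat):
        """Program elements are an integer 1..8 or ('jump', address, nr_times)."""
        l = len(program)                              # number of instructions (l <= k)
        prob = 1.0
        for i, ins in enumerate(program, 1):
            if isinstance(ins, tuple):                # Jump at address i
                _, address, nr_times = ins
                prob *= P[i - 1][9 - 1]               # probability of picking instruction 9
                row = P_bar[i - 1][:l]                # only addresses 1..l are legal targets
                prob *= row[address - 1] / sum(row)   # normalize so entries sum to 1
                prob *= P_hat[i - 1][nr_times - 1]    # probability of this nr-times value
            else:
                prob *= P[i - 1][ins - 1]             # primitive instruction 1..8
        return prob

With uniformly initialized matrices ($1/9$ per instruction, $1/4$ per jump address once the four relevant $\bar{P}$ entries are normalized, $1/6$ per nr-times value for $MAXR = 6$), the four-instruction solution reported below would receive probability of roughly $2.65 \cdot 10^{-7}$, matching the value of $D_P(q)$ given there.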

Restricted LS-variant. Note that the instructions above are not sufficient to build a universal programming language -- the experiments in this paper are confined to a restricted version of LS. From the instructions above, however, one can build programs for solving any maze in which it is not necessary to completely reverse the direction of movement (rotation by 180 degrees) in a corridor. Note that it is mainly the Jump instruction that allows for composing low-complexity solutions from ``subprograms'' (LS provides a sound way for dealing with infinite loops).

Rules. Before LS generates, runs and tests a new program, the agent is reset to its start position. Collisions with walls halt the program -- this makes the problem hard. A path generated by a program that makes the agent hit the goal is called a solution. The agent is not required to stop at the goal -- there are no explicit halt instructions.

Why is this a POE problem? Because the instructions above are not sufficient to tell the agent exactly where it is: at any given time, the agent can perceive only one of three highly ambiguous types of input (by executing the appropriate primitive): ``front is blocked or not'', ``left field is free or not'', ``right field is free or not'' (compare list of primitives at the beginning of this section). Some sort of memory is required to disambiguate apparently equal situations encountered on the way to the goal. Q-learning, for instance, is not guaranteed to solve POEs. Our agent, however, can use memory implicit in the state of the execution of its current program to disambiguate ambiguous situations.

Measuring time. The computational cost of a single Levin search call in between two Adapt calls is essentially the sum of the costs of all the programs it tests. To measure the cost of a single program, we simply count the total number of forward steps and rotations during program execution (this number is of the order of total computation time). Note that instructions often cost more than 1 step. To detect infinite loops, LS also measures the time consumed by Jump instructions (one time step per executed Jump). In a realistic application, however, the time consumed by a robot move would far exceed the time consumed by a Jump instruction, so we omit this (negligible) Jump cost in the experimental results.
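Putting the pieces together, the cost of one LS call could be accumulated as in the sketch below, which reuses run_program from the interpreter sketch and takes a probability function such as program_probability. Only the cost definition (summing the steps of all tested programs) is taken from the text; the per-phase budget rule $P(q) \cdot 2^{Phase}/c$, the candidate enumerator generate_candidates and the make_agent helper are assumptions standing in for the exact LS procedure of section 3.1.

    # Sketch of one Levin search call with an incrementally growing time limit.
    # The budget rule (time proportional to a program's probability, doubling
    # per phase, scaled by 1/c) is a generic Levin-search-style assumption;
    # generate_candidates and make_agent are hypothetical helpers.

    def levin_search_call(generate_candidates, make_agent, prob_of,
                          c=0.005, maxr=6, max_phases=30):
        total_cost = 0                                # cost of this LS call
        for phase in range(1, max_phases + 1):        # growing time limit per phase
            for q in generate_candidates(phase):      # candidate programs for this phase
                budget = int(prob_of(q) * 2 ** phase / c)
                if budget < 1:
                    continue                          # too improbable for this phase
                agent = make_agent()                  # reset agent to its start position
                solved, used = run_program(q, agent, maxr, budget)
                total_cost += used                    # sum of steps of all tested programs
                if solved:
                    return q, total_cost              # first program that hits the goal
        return None, total_cost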

Comparison. We compare LS to three variants of Q-learning [WatkinsDayan:92] and to random search. Random search repeatedly and randomly selects and executes one of the instructions (1-8) until the goal is hit (as with Levin search, the agent is reset to its start position whenever it hits a wall). Since random search (unlike LS) has no time limit for testing, it is not allowed to use the Jump instruction; this prevents it from wandering into infinite loops. The first Q-variant uses the same 8 instructions but has the advantage of being able to distinguish all possible states (952 possible inputs); this actually makes the task much easier, because it is no longer a POE problem. The first Q-variant was tested only to see how much more difficult the problem becomes in the POE setting. The second Q-variant can only observe whether the four surrounding fields are blocked or not (16 possible inputs), and the third Q-variant receives as input a unique representation of the five most recently executed instructions (37449 possible inputs -- this requires a gigantic Q-table!). After a few initial experiments with the second Q-variant, we noticed that it could not use its input to prevent collisions (the agent always walks for a while and then rotates; in front of a wall, every instruction will cause a collision -- compare the instruction list at the beginning of this section). To improve the second Q-variant's performance, we altered its instructions accordingly: each instruction consists of one of the 3 types of rotation followed by one of the 3 types of forward walk (thus the total number of instructions is 9; for the same reason as with random search, the Jump instruction cannot be used). Q-learning's reward is 1.0 for finding the goal and -0.01 for each collision. The parameters of the Q-learning variants were first coarsely optimized on a number of smaller mazes which they were able to solve. We set $c = 0.005$, which means that in the first phase ($Phase = 1$ in the LS procedure) a program may execute up to 200 steps before being stopped. We set $MAXR = 6$.
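For completeness, the tabular Q-learning baselines use the standard one-step update; the sketch below shows it with the reward scheme from the text (1.0 at the goal, -0.01 per collision). The learning rate, discount factor and epsilon-greedy exploration are placeholders, since the paper only states that the parameters were coarsely tuned on smaller mazes.

    # Standard tabular Q-learning update for the baselines.  The rewards
    # (1.0 for reaching the goal, -0.01 per collision) follow the text;
    # alpha, gamma and the epsilon-greedy exploration are placeholder choices.
    import random
    from collections import defaultdict

    Q = defaultdict(float)          # Q[(observation, action)], zero-initialized

    def q_update(Q, s, a, r, s_next, n_actions, alpha=0.1, gamma=0.95):
        best_next = max(Q[(s_next, b)] for b in range(n_actions))
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

    def choose_action(Q, s, n_actions, eps=0.1):
        if random.random() < eps:                               # explore
            return random.randrange(n_actions)
        return max(range(n_actions), key=lambda b: Q[(s, b)])   # exploit

The three variants differ only in what they use as the observation s: the agent's true position (952 values), the local view of the four surrounding fields (16 values), or the sequence of the five most recently executed instructions (37449 values).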

Typical result. In the easy, totally observable case, the first Q-learning variant took on average 694,933 steps to solve the maze in Figure 1 (10 simulations were conducted). However, as expected, in the difficult, partially observable cases, neither of the two remaining Q-learning variants nor random search was ever able to solve the maze within 1,000,000,000 steps (5 simulations were conducted). In contrast, LS was indeed able to solve the POE problem: LS required 97,395,311 steps to find a program $q$ computing a 127-step shortest path to the goal in Figure 1. LS' low-complexity solution $q$ involves two nested loops:

1) REPEAT step forward UNTIL left field is free$^5$
2) Jump (1, 3)$^9$
3) REPEAT step forward UNTIL left field is free, rotate left$^3$
4) Jump (1, 5)$^9$

In words: Repeat the following action sequence 6 times: go forward until you see the fifth consecutive opening to the left; then rotate left. We have $D_P(q) = \frac{1}{9} \cdot \frac{1}{9} \cdot \frac{1}{4} \cdot \frac{1}{6} \cdot \frac{1}{9} \cdot \frac{1}{9} \cdot \frac{1}{4} \cdot \frac{1}{6} \approx 2.65 \cdot 10^{-7}$: one factor $\frac{1}{9}$ for each of the four instruction choices, plus one factor $\frac{1}{4}$ (jump address among the 4 addresses) and one factor $\frac{1}{6}$ (nr-times value, $MAXR = 6$) for each of the two Jumps.
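As a quick arithmetic check of this value:

    # Quick check of D_P(q): 1/9 per instruction choice, 1/4 per jump
    # address, 1/6 per nr-times value (MAXR = 6).
    factors = [1/9,               # 1) REPEAT forward UNTIL left free
               1/9, 1/4, 1/6,     # 2) Jump(1, 3): instruction, address, nr-times
               1/9,               # 3) REPEAT forward UNTIL left free, rotate left
               1/9, 1/4, 1/6]     # 4) Jump(1, 5): instruction, address, nr-times
    p = 1.0
    for f in factors:
        p *= f
    print(p)                      # ~2.646e-07, i.e. about 2.65 * 10^-7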

Similar results were obtained with many other mazes having non-trivial solutions with low algorithmic complexity. Such experiments illustrate that smart search through program space can be beneficial in cases where the task appears complex but actually has low-complexity solutions. Since LS has a principled way of dealing with non-halting programs and time limits (unlike, e.g., ``Genetic Programming'' (GP)), LS may also be of interest for researchers working in GP [Cramer:85, Dickmanns:87, Koza:92, Fogel:66].

ALS: single tasks versus multiple tasks. If we use the adaptive LS extension (ALS) for a single task such as the one above (by repeatedly applying LS to the same problem and changing the underlying probability distribution in between successive calls according to section 3.2), then the probability matrices rapidly converge such that late LS calls find the solution almost immediately. This is not very interesting, however: once the solution to a single problem has been found (and there are no additional problems), there is no point in investing additional effort into probability updates, unless such updates lead to an improved solution, which would be relevant only if we did not stop LS after the first solution has been found. ALS is more interesting in cases where there are multiple tasks, and where the solution to one task conveys some but not all of the information helpful for solving additional tasks (inductive transfer). This is what the next section is about.

