Experiments: Towers of Hanoi and Context-Free Symmetry

Given are disks of different sizes, stacked in decreasing size on the first of three pegs. Moving some peg's top disk to the top of another (possibly empty) peg, one disk at a time, but never a larger disk onto a smaller, transfer all disks to the third peg. Remarkably, the fastest way of solving this famous problem requires moves .

Untrained humans find it hard to solve instances .
Anderson [1]
applied traditional reinforcement learning methods
and was able to solve instances up to , solvable within
at most 7 moves.
Langley [5] used
learning production systems and was able to solve Hanoi instances up to ,
solvable within at most 31 moves.
Traditional **non**learning planning procedures
systematically explore all possible move combinations.
They also fail to solve Hanoi problem instances with ,
due to the exploding search space (Jana Koehler, IBM Research,
personal communication, 2002).
OOPS, however, is searching in program space instead of
raw solution space. Therefore, in principle it should be able to solve
arbitrary instances by discovering
the problem's elegant recursive solution:
given and three pegs (source peg, auxiliary peg, destination peg),
define procedure

The -th task is to solve all Hanoi instances up
to instance . We represent the
dynamic environment for task on the -th task tape,
allocating addresses for each peg,
to store its current disk positions and
a pointer to its top disk (0 if there isn't any).
We represent pegs by numbers 1, 2, 3, respectively.
That is, given an instance of size , we push onto *ds* the values
. By doing so *we insert substantial, nontrivial
prior knowledge* about problem size and the fact
that it is useful to represent each peg by a symbol.

We add three instructions to the 68 instructions of our
FORTH-like programming language:
*mvdsk()* assumes that
are represented by the first three elements
on *ds* above the current base pointer ,
and moves a disk from peg to peg .
Instruction *xSA()* exchanges the representations of and ,
*xAD()* those of and (combinations may create arbitrary
peg patterns).
Illegal moves cause the current program prefix to halt.
Overall success is easily verifiable since
our objective is achieved once the first two pegs are empty.

Within reasonable time (a week) on an off-the-shelf
personal computer (1.5 GHz) the system was not able to solve
instances involving more than 3 disks.
This gives us a welcome opportunity to demonstrate
its **incremental** learning abilities: we
first trained it on an additional, easier task, to
teach it something about recursion, hoping that this would help
to solve the Hanoi problem as well.
For this purpose we used a seemingly unrelated
**symmetry problem** based on
the context free language : given input
on the data stack *ds*, the goal is to
place symbols on the auxiliary stack *Ds* such that
the topmost elements are 1's followed by 2's.
We add two more instructions to the initial programming language:
instruction *1toD()* pushes 1 onto *Ds*, instruction *2toD()* pushes 2.
Now we have a total of five task-specific instructions (including those
for Hanoi), with instruction numbers 1, 2, 3, 4, 5, for
*1toD*, *2toD*, *mvdsk*, *xSA*, *xAD*, respectively.

So we first boost
(Section 3)
instructions *c1, c2*
for the first training phase where
the -th task
is to solve all symmetry problem
instances up to .
Then we undo the symmetry-specific boosts of *c1, c2*
and boost instead
the Hanoi-specific ``instruction number pushers''
for the subsequent training phase where
the -th task (again
)
is to solve all Hanoi instances up to .

**Results.**
Within roughly 0.3 days, OOPS found and
froze code solving the symmetry problem.
Within 2 more days it also found a
universal Hanoi solver, exploiting the benefits of
incremental learning ignored by
**non**incremental HSEARCH and LSEARCH.
It is instructive to study the sequence of intermediate solutions.
In what follows we will transform integer sequences discovered
by OOPS back into readable programs
(to fully understand them, however, one needs to know all
side effects, and which instruction has got which number).

For the symmetry problem, within less than a second,
OOPS found silly but
working code for .
Within less than 1 hour
it had solved the 2nd, 3rd, 4th, and 5th instances,
always simply prolonging the previous code without changing
the start address . The code found so far was unelegant:
*(defnp 2toD grt c2 c2 endnp boostq delD delD
bsf 2toD fromD delD delD delD fromD
bsf by2 bsf by2 fromD delD delD fromD cpnb bsf).*
But it does solve all of the first 5 instances.
Finally, after 0.3 days, OOPS had
created and tested a new, elegant, recursive
program (no prolongation of the previous one)
with a new increased start address
, solving all instances up to 6:
*(defnp c1 calltp c2 endnp).*
That is, it was cheaper to solve all instances up to 6 by discovering
and applying this new program to all instances so far, than just prolonging
old code on instance 6 only.
In fact, the program turns out to be a universal symmetry problem solver.
On the stack, it
constructs a 1-argument procedure that returns nothing
if its input argument is 0,
otherwise calls the instruction *1toD* whose code is 1, then calls
itself with a decremented input argument, then calls *2toD* whose code is 2,
then returns. Using this program,
within an additional 20 milliseconds,
OOPS had also solved the remaining 24 symmetry tasks up to .

Then OOPS switched to the Hanoi problem.
1 *ms* later
it had found trivial code for :
*(mvdsk).* After a day or so
it had found fresh yet bizarre
code (new start address ) for :
*(c4 c3 cpn c4 by2 c3 by2 exec)*.
Finally, after 3 days it had found fresh code (new ) for :
*(c3 dec boostq defnp c4 calltp c3 c5 calltp endnp).*
This already is an optimal universal Hanoi solver.
Therefore, within 1 additional day
OOPS was able to solve
the remaining 27 tasks for up to 30, reusing
the same program
again and again.
Recall that the optimal solution for takes *mvdsk*
operations, and that for each *mvdsk* several other instructions need to be
executed as well!

The final Hanoi solution profits from the
earlier recursive solution to the symmetry
problem. How?
The prefix *(c3 dec boostq)* (probability 0.003)
temporarily rewrites the search procedure (this illustrates the
benefits of **metasearching!**)
by exploiting previous code: Instruction *c3* pushes 3; *dec*
decrements this; *boostq* takes the result 2 as an argument
and thus boosts the probabilities of all components of the 2nd
frozen program, which happens to be the previously found universal
symmetry solver.
This leads to an online bias shift that
greatly increases the probability that
*defnp, calltp, endnp,* will appear
in the suffix of the online-generated program.
These instructions in turn are helpful for building
(on the data stack *ds*)
the double-recursive procedure
generated by the suffix
*(defnp c4 calltp c3 c5 calltp endnp),* which essentially
constructs a 4-argument procedure that returns nothing if its input argument is 0,
otherwise decrements the top input argument, calls the instruction
*xAD* whose code is 4,
then calls itself,
then calls *mvdsk* whose code is 5,
then calls *xSA* whose code is 3,
then calls itself again, then returns
(compare the standard Hanoi solution).

The total probability of the final solution, given the previous
codes, is
.
On the other hand, the probability of the essential Hanoi code
*(defnp c4 calltp c3 c5 calltp endnp)*, given nothing,
is only , which explains why it was not
quickly found without the help of an easier task.
So in this particular setup the **incremental** training due to
the simple recursion for the symmetry problem
indeed provided useful training for the more complex Hanoi recursion,
speeding up the search by a factor of roughly 1000.

The entire 4 day search tested
93,994,568,009 prefixes corresponding to
345,450,362,522 instructions costing
678,634,413,962 time steps
(some instructions cost more than 1 step, in particular, those
making copies of strings with length , or those increasing the
probabilities of more than one instruction).
**Search time of an optimal solver is a natural measure
of initial bias.**
Clearly, most tested prefixes are short -- they either halt or get interrupted soon.
Still, some programs do run for a long time; the longest measured
runtime exceeded 30 billion steps.
The stacks of recursive invocations of
**Try** for storage management (Section 2.1)
collectively never held more than 20,000 elements though.

Different initial bias will yield different results. E.g., we could set to zero the initial probabilities of most of the 73 initial instructions (most are unnecessary for our two problem classes), and then solve all tasks more quickly (at the expense of obtaining a non-universal initial programming language). The point of this experimental section, however, is not to find the most reasonable initial bias for particular problems, but to illustrate the general functionality of the first general near-bias-optimal incremental learner. In ongoing research we are equipping OOPS with neural network primitives and are applying it to robotics. Since OOPS will scale to larger problems in essentially unbeatable fashion, the hardware speed-up factor of expected for the next 30 years appears promising.

Back to OOPS home page