
RESULTS WITH SELF-MODIFICATIONS.

At system death (at time $10^{10}$), average trial length (including time required for SSA's pushing and popping processes, of course) was down to 56.6 time steps -- more than 200 times faster than without self-improvement, and with an ongoing tendency to decrease. Comparable results were obtained in additional runs.

As with the previous task, performance did not improve smoothly. The history of broken records reflects the history of performance improvements (the results from the run reported below are slightly different from those reported in [32], where a slightly different implementation led to different calls of the random number generator).

First, there was a rather quick sequence of improvements that lasted until time $4.6*10^7$. By then (after 155,741 payoff events), the shortest trial so far had taken 44 time steps. Then the ``current record'' did not improve for a comparatively long time interval: $1.14 * 10^8$ time steps -- the length of this ``boring'' interval far exceeded the entire previous learning time.

Sudden improvement speed-up. Then, quite unexpectedly for the observer, the system began a new sequence of additional improvements around time step $1.6 * 10^8$. At time $1.6 * 10^8$, the record was down to 42. At time $1.7 * 10^8$, the record was down to 40. At time $1.9 * 10^8$, the record was down to 38. At time $2.0 * 10^8$, the record was down to 36. Then performance improvement slowed down again.

Similarly, much later, at time $2.4 * 10^9$, the record was 26. Then, not much happened during the next $10^9$ time steps. Suddenly, a new sequence of additional improvements started around time step $3.4 * 10^9$. At time $3.429 * 10^9$, the record was down to 24. At time $3.432 * 10^9$, the record was down to 22. At time $3.486 * 10^9$, the record was down to 20. Throughout this flurry of broken records, the number of stack entries steadily increased to 199. Apparently, the system made a little ``revolutionary'' discovery that permitted a sequence of smoother, additional directed self-mutations.

Final system state. At system death (time step $10^{10}$), the record was down to 19. The system's average payoff intake per time interval still had a tendency to increase. All in all, there were nearly $5*10^6$ SSMs during system life. The effects of only 200 of them, however, turned out to be worth keeping (the others were countermanded by SSA): these 200 computed a total of 235 valid probability modifications (corresponding to only 235 stack entries). Why so few? Again, the reason is that ``useful'' SSMs are rare, because each has to lead to ``better'' results (faster reinforcement intake) than all previous ones.
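To make the ``better than all previous ones'' condition concrete, the following is a minimal sketch (in Python, with assumed names and data layout -- not the paper's implementation) of the kind of stack check SSA performs at a checkpoint: entries are popped, and the probability values they overwrote are restored, while the most recent surviving modification does not mark a higher average reinforcement intake than the modification below it. For brevity, the sketch compares each entry only with its immediate predecessor rather than with all earlier ones.

    def ssa_check(stack, policy, total_reward, now):
        # stack: list of (t, r, saved) tuples, oldest first, where t is the
        # time of a still-valid self-modification, r the cumulative
        # reinforcement at that time, and saved the probability values that
        # the modification overwrote.  Assumes now > t for every entry.
        while stack:
            t_top, r_top, saved = stack[-1]
            rate_top = (total_reward - r_top) / (now - t_top)
            if len(stack) > 1:
                t_prev, r_prev, _ = stack[-2]
                rate_prev = (total_reward - r_prev) / (now - t_prev)
            else:
                rate_prev = total_reward / now    # intake over the entire life
            if rate_top > rate_prev:
                break                             # success-story criterion holds
            for key, old_value in saved.items():  # countermand the modification
                policy[key] = old_value
            stack.pop()

Popping stops at the first entry that still marks an acceleration of reinforcement intake; entries that have already survived many such checks therefore tend to remain on the stack, which matches the stability observation below.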

Messy storage environment. In the end, the storage cells (which are part of the environment) looked very messy. Almost all work cells were filled with integers (some of them large) quite different from their initial zero values. Still, without storage cells, the system could not have achieved its dramatic performance improvement. It actually learned to make use of the changing policy environment.

Evidence of ``learning how to learn''? As with the previous task, many useful SSMs directly computed valid modifications of probabilities of future SSMs (``learning to learn''). Compare the final paragraph of section 3.1.2.

Stability of probability modifications. In the experiments conducted so far, the top level hardly ever countermanded probability modifications other than the 10 most recent valid ones. For instance, once there were 120 stack entries, the 100 oldest entries appeared extremely safe and had a good chance of surviving for the entire system life.

Revolutions. In the tasks above, unexpected temporary speed-ups of performance improvement were observed. Even if the system appears to be stuck for a long time, the external observer can never be sure that it will not suddenly discover a new, ``revolutionary'' shift of bias that forms the basis for additional, smoother performance improvements. This is analogous to the history of science itself. Informally, a ``revolution'' corresponds to a self-improvement with high ``conceptual jump size'' (an expression coined by Solomonoff [46]). One nice thing about open-ended incremental self-improvement is that there is no significant theoretical limit to the nature of the revolutions and to what the system may learn. This is, of course, due to the general nature of the underlying programming language.

Inserting prior bias. The few experiments above were designed to illustrate basic principles of the paradigm. They were based on low-level, assembler-like instructions (making even apparently simple tasks difficult -- additional experiments using such low-level instructions can be found in [11,53,40,42]). They are certainly not meant to convince the reader that from now on, he should combine the incremental self-improvement paradigm with the low-level programming language from section 2 and apply it to real-world problems: of course, with large-scale problems, it is desirable to insert prior knowledge into the system (if such knowledge is indeed available). With incremental self-improvement, a priori knowledge resides in the programmer's selection of primitives with problem-specific built-in bias (and in the payoff function he chooses). There is no reason why certain primitives should not be complex, time-consuming programs by themselves, such as statistical classifiers, neural net learning algorithms, logic programs, etc. For instance, using different primitives for the navigation task from section 3.2 can greatly reduce the time required to achieve near-optimal trials. This paper, however, is not a study of the effects of different kinds of initial bias.
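As a purely illustrative sketch of this point (hypothetical names, not the instruction set of section 2), prior knowledge would enter simply through the table of primitives that the self-modifying policy selects from; nothing prevents a primitive from being a hand-coded macro or a wrapper around another learning algorithm:

    # All names below are hypothetical stand-ins used only for illustration.
    def add(env, a, b):
        # A low-level, assembler-like primitive: add one work cell to another.
        env["work"][a] += env["work"][b]

    def move_toward_goal(env):
        # A high-level primitive with built-in navigation bias: take one step
        # toward the goal along each axis.
        step = lambda a, g: a + (1 if g > a else -1 if g < a else 0)
        ax, ay = env["agent"]
        gx, gy = env["goal"]
        env["agent"] = (step(ax, gx), step(ay, gy))

    # The programmer's bias is the choice of this table (plus the payoff
    # function); the learner only adjusts the probabilities of selecting and
    # parameterizing these primitives.
    PRIMITIVES = [add, move_toward_goal]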

