
RESULTS WITH SELF-MODIFICATIONS.

At system death, total payoff was about $1.16 \times 10^8$ (recall that the theoretical optimum for a non-learning system with optimal initial bias would be $1.5 \times 10^8$). To find out whether the incremental self-improvement paradigm did indeed lead to incremental self-improvement, let us have a look at the learning history (the results differ slightly from those reported in [32], where a slightly different implementation led to different calls of the random number generator).
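In other words, the self-modifying learner collected a fraction of roughly $1.16 \times 10^8 / (1.5 \times 10^8) \approx 0.77$ of the payoff achievable by a non-learning system with optimal initial bias.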

Self-generated reduction of the number of probability modifications. In the beginning, the system computed a lot of probability modifications, but later it preferred to decrease the number of probability modifications per time interval. After $10^9$ time steps, there were about 350,000 probability modifications per $10^8$ time steps. After $2 \times 10^9$ time steps, there were about 40,000 probability modifications per $10^8$ time steps. Towards system death, there were about 20,000 probability modifications per $10^8$ time steps. Most of the useful SSMs computed either one or two probability modifications.

Speed-up of payoff intake. After $10^8$ time steps, the system already behaved much more deterministically than in the beginning. Average payoff per payoff event had increased from 1.4 to 15.8 (the optimal value being 30.0, of course), and the stack had 70 entries. These entries corresponded to 66 modifications of single-cell probability distributions, computed by 45 SSMs -- each being more ``useful'' than all the previous ones. Storage already looked very messy. For instance, almost all cells in the work area were filled with integers (some quite large) quite different from their initial values. Recall that the storage is never re-initialized and has to be viewed as part of the policy environment.
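The phrase ``each being more `useful' than all the previous ones'' is the stack discipline enforced at each evaluation: an entry survives only as long as the payoff per time step since its creation exceeds that of every earlier surviving entry. The following minimal Python sketch shows the implied backtracking check; the names (StackEntry, success_story_check), the undo callback, and the exact bookkeeping are our assumptions for illustration, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class StackEntry:
    """One checkpoint per block of probability modifications (fields assumed)."""
    t: float                  # time step at which the modifications were computed
    R: float                  # cumulative payoff up to time t
    undo: Callable[[], None]  # restores the probabilities this entry changed

def success_story_check(stack: List[StackEntry], t_now: float, R_now: float) -> None:
    """Pop (and undo) entries from the top of the stack until the topmost
    surviving entry has yielded a strictly higher payoff per time step since
    its creation than the entry below it (the bottom entry is compared
    against the payoff rate over the entire lifetime)."""
    while stack:
        top = stack[-1]
        rate_top = (R_now - top.R) / (t_now - top.t)
        if len(stack) > 1:
            below = stack[-2]
            rate_below = (R_now - below.R) / (t_now - below.t)
        else:
            rate_below = R_now / t_now  # payoff rate over the whole lifetime
        if rate_top > rate_below:
            break      # criterion satisfied: keep all remaining entries
        stack.pop()    # modification not useful in hindsight ...
        top.undo()     # ... so revert the probability changes it made
```

Under such a discipline, the entries surviving at any given time are exactly those modifications that, in hindsight, each sped up payoff intake relative to all earlier surviving ones.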

First maximal payoff. After 1,436,383 payoff events, the system had correctly written all 30 variables for the first time, and received the maximal payoff 30.0. Due to remaining non-determinism in the system, the current average payoff per payoff event (measured shortly afterwards, at time step 1,500,000,000) was about 21.7.

After 3,000,000 payoff events, the current average payoff per payoff event was 25.6, but the stack had only 206 entries (corresponding to 174 ``useful'' SSMs). After 5,000,000 payoff events (at ``system death''), the current average was about 26.0, with an ongoing tendency to increase. By then, there were 224 stack entries. They corresponded to 192 SSMs, each being more ``useful'' than all the previous ones.

Temporary speed-ups of performance improvement. Performance did not increase smoothly during the lifetime of the system. Sometimes, no significant improvement took place for a time interval comparable to the entire learning time so far. Such ``boring'' time intervals were sometimes ended by unexpected sequences of rather quick improvements. Then progress slowed down again. Such temporary speed-ups of performance improvement indicate useful shifts of inductive bias, which may later be replaced by the inductive bias created by the next ``breakthrough''.

Evidence of ``learning how to learn''? A look at the stack entries revealed that many (but far from all) useful probability modifications focused on a few program cells. Often, SSMs directly changing the probabilities of future SSMs were considered useful. For instance, 9 of the 224 stack entries at time step $5 \times 10^9$ corresponded to ``useful'' probability modifications of the (self-referential) $IncP$ action of the second program cell. Numerous entries corresponded to ``useful'' modifications of the EndSelfMod probability of various cells. Such stack entries may be interpreted as results of ``adjusting the prior on the space of solution candidates'', ``fine-tuning search space structure'', ``learning to create directed mutations'', or ``learning how to learn''.
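To make the self-referential loop concrete, here is a toy Python sketch of an $IncP$-style primitive acting on a program cell's probability distribution. The cell layout, the increment-and-renormalize rule, and all names other than IncP and EndSelfMod are illustrative assumptions, not the paper's implementation.

```python
import random

class ProgramCell:
    """Toy program cell: a probability distribution over the instruction set.
    The paper's instruction set includes the self-referential IncP and the
    EndSelfMod action; the layout and numbers here are illustrative only."""
    def __init__(self, n_actions: int):
        self.p = [1.0 / n_actions] * n_actions  # initially uniform

    def sample_action(self) -> int:
        # draw the next instruction according to the cell's current distribution
        return random.choices(range(len(self.p)), weights=self.p)[0]

def inc_p(cell: ProgramCell, j: int, stack: list, delta: float = 0.1) -> None:
    """IncP-style primitive (increment rule assumed): raise the probability of
    action j, renormalize, and push undo information onto the stack so the
    modification can be reverted if it is later judged not ``useful''."""
    old_p = list(cell.p)                  # snapshot for a possible later undo
    cell.p[j] += delta
    total = sum(cell.p)
    cell.p = [q / total for q in cell.p]  # renormalize to a proper distribution
    stack.append((cell, old_p))           # stack entry: what was changed, and how
```

If $j$ happens to index IncP itself (or EndSelfMod) in some cell, such a call changes the probability of future self-modifications. Raising EndSelfMod probabilities, for example, would tend to shorten self-modification sequences, which would be consistent with the observed reduction in probability modifications per time interval.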

