The final 10-token Hanoi solution demonstrates the benefits of incremental learning: it greatly profits from rewriting the search procedure with the help of information conveyed by the earlier recursive solution to the 1^n2^n-problem. How?
The prefix (c3 dec boostq) (probability 0.003) prepares the foundations: instruction c3 pushes 3; dec decrements this; boostq takes the result 2 as an argument (interpreted as an address) and thus boosts the probabilities of all components of the 2nd frozen program, which happens to be the previously found universal 1^n2^n-solver. This causes an online bias shift on the space of possible suffixes: it greatly increases the probability that defnp, calltp, endnp will appear in the remainder of the online-generated program. These instructions in turn are helpful for building (on the data stack ds) the double-recursive procedure generated by the suffix (defnp c4 calltp c3 c5 calltp endnp). This suffix essentially constructs (on ds) a 4-argument procedure that returns nothing if its top input argument is 0; otherwise it decrements the top input argument, calls the instruction xAD whose code is 4, calls itself on a copy of the top 4 arguments, calls mvdsk whose code is 5, calls xSA whose code is 3, calls itself on another copy of the top 4 arguments, makes yet another (unnecessary) argument copy, and returns (compare the standard Hanoi solution).
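For comparison, the standard double-recursive Hanoi solution has exactly the shape of the constructed procedure: a zero test, a first recursive call, a single disk move (cf. mvdsk), and a second recursive call. A minimal Python sketch of the textbook recursion (not OOPS's Forth-like code):

```python
def hanoi(n, src, aux, dst, moves):
    """Move n disks from src to dst using aux; records (from, to) pairs."""
    if n == 0:                           # zero test: nothing to move
        return
    hanoi(n - 1, src, dst, aux, moves)   # first recursive call: park n-1 disks on aux
    moves.append((src, dst))             # move the largest disk (cf. mvdsk)
    hanoi(n - 1, aux, src, dst, moves)   # second recursive call: finish the n-1 disks on dst

moves = []
hanoi(3, 'S', 'A', 'D', moves)
print(len(moves))  # 7, i.e. 2**3 - 1
```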
The total probability of the final solution, given the previous codes, is calculated as follows: given the boosts of c3, c4, c5, by2, dec, boostq, the prefix (c3 dec boostq) has probability 0.003; since this prefix further boosts defnp, c1, calltp, c2, endnp, the suffix (defnp c4 calltp c3 c5 calltp endnp) obtains a correspondingly increased probability, and the product of the two yields the probability of the complete 10-symbol code. On the other hand, the probability of the essential Hanoi-specific suffix (defnp c4 calltp c3 c5 calltp endnp), given just the initial boosts, is roughly three orders of magnitude smaller, which explains why it was not quickly found without the help of the solution to an easier problem set. (Without any initial boosts its probability would actually have been similar.) The lower suffix probability would correspond to a search time of several years, as opposed to a few days.
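The arithmetic behind such claims is simply a product over token probabilities: under a bias-optimal search, expected search time scales inversely with program probability, so boosting the suffix tokens translates directly into a multiplicative speedup. A sketch, where the per-token probabilities are invented placeholders, not the paper's actual values:

```python
# Hypothetical sketch: a program's probability is the product of its
# tokens' probabilities. The numbers below are made up for illustration.
def program_probability(tokens, probs):
    p = 1.0
    for t in tokens:
        p *= probs[t]
    return p

suffix = ['defnp', 'c4', 'calltp', 'c3', 'c5', 'calltp', 'endnp']
unboosted = {t: 1.0 / 73 for t in suffix}   # uniform over all 73 instructions
boosted = {t: 1.0 / 20 for t in suffix}     # assumed post-boost values (made up)

p_un = program_probability(suffix, unboosted)
p_bo = program_probability(suffix, boosted)
print(p_bo / p_un)  # search-time speedup factor = probability ratio
```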
So in this particular setup the simple recursion for the 1^n2^n-problem indeed provided useful incremental training for the more complex Hanoi recursion, speeding up the search by a factor of 1000 or so.
On the other hand, the search for the universal solver for all 1^n2^n-problems (first found while working on one of the instances) did not profit at all from solutions to earlier solved tasks (although smaller instances did profit).
Note that the time spent by the final 10-token Hanoi solver on increasing the probabilities of certain instructions and on constructing executable code on the data stack (less than 50 time steps) quickly becomes negligible as the Hanoi instance size grows. In this particular application, most time is spent on executing the code, not on constructing it.
Once the universal Hanoi solver was discovered, why did the solution of the remaining Hanoi tasks substantially increase the total time (by roughly 25%)? Because the sheer runtime of the discovered, frozen, near-optimal program on the remaining tasks was already comparable to the search time previously consumed in finding this program. This is due to the very nature of the Hanoi task: recall that a solution for n = 30 takes more than a billion mvdsk operations, and that for each mvdsk several other instructions need to be executed as well. Note that experiments with traditional reinforcement learners rarely involve problems whose solution sizes exceed a few thousand steps.
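The quoted magnitude follows from the classic Hanoi move count: an n-disk instance requires exactly 2^n - 1 disk moves, so execution alone dominates for large n:

```python
def mvdsk_count(n):
    """Minimal number of disk moves for an n-disk Towers of Hanoi instance."""
    return 2 ** n - 1

print(mvdsk_count(30))  # 1073741823, i.e. more than a billion moves
```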
Note also that we could continue to solve Hanoi tasks of even larger instance size. The execution time required to solve such instances with an optimal solver greatly exceeds the search time required for finding the solver itself. For such instances it does not matter much whether OOPS already starts with a prewired Hanoi solver or first has to discover one, since the initial search time for the solver becomes negligible anyway.
Of course, a different initial bias can yield dramatically different results. For example, using hindsight we could set to zero the probabilities of all 73 initial instructions (most are unnecessary for the 30 Hanoi tasks) except for the 7 instructions used by the Hanoi-solving suffix, then make those 7 instructions equally likely, and thus obtain a comparatively high Hanoi solver probability. This would allow for finding the solution to the 10-disk Hanoi problem in less than an hour, without having to learn easier tasks first (at the expense of obtaining a nonuniversal initial programming language). The point of this experimental section, however, is not to find the most reasonable initial bias for particular problems, but to illustrate the basic functionality of the first general, near-bias-optimal, incremental learner.
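As a rough check on that hindsight bias (the text gives no numeric value here, so the numbers below are illustrative): treating the suffix as 7 tokens drawn from 7 equally likely allowed instructions gives probability (1/7)^7 ≈ 1.2e-6, versus a uniform choice among all 73 instructions:

```python
# Illustrative arithmetic only; not the paper's elided figure.
k, length = 7, 7                      # 7 allowed instructions, 7-token suffix
p_restricted = (1.0 / k) ** length    # (1/7)**7, about 1.2e-6
p_uniform73 = (1.0 / 73) ** length    # uniform over the full 73-instruction set
print(p_restricted / p_uniform73)     # bias-induced speedup factor
```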