Next: Other Experiments with IS: Up: Implementation 2: Incremental Self-Improvement Previous: Policy and Program Execution

## Experiment 3: Function Approximation / Inductive Transfer

This case study demonstrates that IS can learn successfully in a changing environment where the tasks to be solved become increasingly difficult over time (inductive transfer).

Task sequence. Our system is exposed to a sequence of increasingly complex function approximation problems. The functions to be learned are $f_1, f_2, f_3, f_4, f_5$, each mapping three real-valued inputs $x, y, z$ to a real-valued target, with $f_5$ the most complex.

Trials. The system's single life is decomposable into successive trials $T_1, T_2, \ldots$ (but the learner has no a priori concept of a trial); the first trial begins at system birth, the last ends at system death. In a given trial we first select one of the functions $f_1, \ldots, f_5$. As the trial number increases, so does the probability of selecting a more complex function. In early trials the focus is on $f_1$; in late trials the focus is on $f_5$. In between there is a gradual shift in task difficulty: using a function pointer $j$ (initially 1) and an integer counter $c$ (initially 100), in each trial we select $f_{j+1}$ with a fixed probability and $f_j$ otherwise. If the reward acceleration during the most recent two trials exceeds a certain threshold (0.05), then $c$ is decreased by 1. If $c$ becomes 0 then $j$ is increased by 1, and $c$ is reset to 100. This is repeated until $j = 5$. From then on, $f_5$ is always selected.
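The task-shifting schedule above can be sketched as follows. The probability of sampling the next-harder task is an assumption (the text does not fix the mixing probability here), and the reward-acceleration computation is abstracted into an argument:

```python
import random

def make_task_scheduler(num_tasks=5, threshold=0.05, counter_init=100,
                        p_advance=0.5):
    """Sketch of the gradual task shift: a function pointer j and a
    counter c decide which task f_j is posed in each trial.
    p_advance (chance of sampling the next-harder task) is an assumed
    placeholder value."""
    state = {"j": 1, "c": counter_init}

    def select_task(reward_acceleration):
        # Fast recent progress "uses up" the counter ...
        if reward_acceleration > threshold:
            state["c"] -= 1
            if state["c"] == 0:          # ... and advances the pointer.
                state["j"] = min(state["j"] + 1, num_tasks)
                state["c"] = counter_init
        if state["j"] == num_tasks:      # final task: always f_5
            return num_tasks
        # Otherwise mix the current task with the next-harder one.
        return state["j"] + 1 if random.random() < p_advance else state["j"]

    return select_task
```

With sustained progress the pointer needs 100 counter decrements per advance, so it reaches the final task after 400 fast trials in this sketch.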

Once a function $f$ is selected, randomly generated real values $x$, $y$ and $z$ are put into work cells 0, 1, 2, respectively. The contents of an arbitrarily chosen work cell (we always use cell 6) are interpreted as the system's response. If the response is sufficiently close to $f(x, y, z)$, then the trial ends and the current reward becomes a fixed positive value; otherwise the current reward is 0.0.
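A single trial's setup and reward test can be sketched like this; the tolerance `tol` and the reward magnitude `reward_value` are assumed placeholders, since the exact closeness condition and reward amount are not given here:

```python
import random

def run_trial(f, program, tol=1e-4, reward_value=1.0):
    """One trial, sketched: random inputs go into work cells 0-2, the
    program writes its response into cell 6, and reward is paid only if
    the response is close enough to f(x, y, z).  tol and reward_value
    are assumptions, not the experiment's exact settings."""
    cells = [0.0] * 8
    cells[0], cells[1], cells[2] = (random.uniform(-1, 1) for _ in range(3))
    program(cells)                      # may read/write any work cell
    response = cells[6]
    target = f(cells[0], cells[1], cells[2])
    return reward_value if abs(response - target) < tol else 0.0
```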

Instructions. Instruction sequences can be composed from the following primitive instructions (compare section 4.1):
Add, Sub, Mul, Mov, IncProb, EndSelfMod, JumpHome, each taking up to 3 work-cell parameters. Each instruction occupies 4 successive program cells (some of them unused if the instruction has fewer than 3 parameters).
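The 4-cell encoding can be illustrated with a small decoder. The opcode assignments for Sub (1), Mul (2) and JumpHome (4) are consistent with the most probable integer code shown later in this section; the remaining assignments are assumptions for illustration:

```python
# Sketch of the 4-cell instruction layout: one opcode cell followed by
# up to three parameter cells (unused cells are simply ignored).
# Opcodes 1 = Sub, 2 = Mul, 4 = JumpHome match the most probable
# integer code reported in the experiment; the rest are assumptions.
OPS = {0: "Add", 1: "Sub", 2: "Mul", 3: "Mov",
       4: "JumpHome", 5: "IncProb", 6: "EndSelfMod"}

def decode(cells, ip):
    """Return (mnemonic, parameter triple) for the instruction at ip."""
    return OPS[cells[ip]], tuple(cells[ip + 1: ip + 4])
```

Applied to the integer sequence below, `decode` recovers the Sub/Sub/Mul structure of the learned program.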

Evaluation Condition. SSA is called after every 5th consecutive non-zero reward signal following the end of each SMS.
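A minimal sketch of this evaluation condition, assuming that a zero reward resets the streak of consecutive non-zero rewards:

```python
def make_ssa_trigger(k=5):
    """Sketch of the evaluation condition: fire SSA after every k-th
    consecutive non-zero reward (k = 5 in the experiment).  The
    reset-on-zero behavior is an assumption implied by 'consecutive'."""
    count = {"n": 0}

    def observe(reward):
        if reward != 0.0:
            count["n"] += 1
            if count["n"] == k:
                count["n"] = 0
                return True       # call SSA now
        else:
            count["n"] = 0        # streak broken
        return False

    return observe
```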

Huge search space. Given the primitives above, random search would require an enormous number of trials on average to find a solution for $f_5$: the search space is huge. The gradual shift in task complexity, however, helps IS to learn much faster, as will be seen below.

Results. After sufficiently many instruction cycles (and trials), the system is able to compute $f_5$ almost perfectly, given arbitrary real-valued inputs. The corresponding speed-up factor over (infeasible) random or exhaustive search is enormous (compare the paragraph "Huge search space" above). The solution (see Figure 5) involves 21 strongly modified probability distributions of the policy (after learning, the correct instructions had extreme probability values). At the end, the most probable code is given by the following integer sequence:

1 2 1 6 1 0 6 6 2 6 6 6 2 6 6 6 2 6 6 6 4 ...

The corresponding "program" and the (very high) probabilities of its instructions and parameters are shown in Table 4.

Table 4: The final, most probable "program" and the corresponding probabilities.

| No. | Probabilities | Instruction | Parameters | Semantics |
|-----|------------------------------|-------------|------------|-----------|
| 1 | (0.994, 0.975, 0.991, 0.994) | Sub | (2, 1, 6) | |
| 2 | (0.994, 0.981, 0.994, 0.994) | Sub | (0, 6, 6) | |
| 3 | (0.994, 0.994, 0.994, 0.994) | Mul | (6, 6, 6) | |
| 4 | (0.994, 0.994, 0.994, 0.994) | Mul | (6, 6, 6) | |
| 5 | (0.869, 0.976, 0.994, 0.994) | Mul | (6, 6, 6) | |
| 6 | (0.848, --, --, --) | JumpHome | (-, -, -) | IP := 0 |
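Under an assumed operand convention (Sub(a, b, r) writes cell a minus cell b into cell r, and Mul analogously), the tabulated program amounts to forming a signed combination of the inputs and squaring it three times. This is a sketch under that assumption, not the paper's stated semantics:

```python
def run_final_program(x, y, z):
    """Trace of the most probable program from Table 4 under an ASSUMED
    operand convention: Sub(a, b, r) sets cell[r] = cell[a] - cell[b],
    and Mul(a, b, r) sets cell[r] = cell[a] * cell[b]."""
    cell = [0.0] * 8
    cell[0], cell[1], cell[2] = x, y, z
    cell[6] = cell[2] - cell[1]          # 1. Sub(2, 1, 6)
    cell[6] = cell[0] - cell[6]          # 2. Sub(0, 6, 6)
    cell[6] = cell[6] * cell[6]          # 3. Mul(6, 6, 6)
    cell[6] = cell[6] * cell[6]          # 4. Mul(6, 6, 6)
    cell[6] = cell[6] * cell[6]          # 5. Mul(6, 6, 6)
    return cell[6]                       # 6. JumpHome resets IP to 0
```

Under this convention the response equals $(x + y - z)^8$; a different operand order changes the sign pattern inside the parentheses, but the repeated-squaring structure is the same.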

Evolution of self-modification frequencies. During its life the system generates many self-modifications to compute the strongly modified policy, including changes of the probabilities of the self-modifying instructions themselves. It is quite interesting (and also quite difficult) to find out to what extent the system uses self-modifying instructions to learn how to use self-modifying instructions. Figure 6 gives a rough idea of what is going on by showing a typical plot of the frequency of IncProb instructions during system life (sampled at fixed intervals of basic cycles). Soon after its birth, the system found it useful to dramatically increase the frequency of IncProb; near its death (when there was nothing left to learn) it significantly reduced this frequency. This is reminiscent of Schwefel's work (1974) on self-adjusting mutation rates. One major novelty is the adaptive, highly non-uniform distribution of self-modifications across "promising" individual policy components.
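An IncProb-style self-modification can be sketched as nudging one entry of a cell's probability distribution upward and renormalizing; the increment parameter `gamma` and the exact update rule are assumptions:

```python
def inc_prob(dist, index, gamma=0.1):
    """Sketch of an IncProb-style self-modification: increase the
    probability of one alternative in a cell's distribution, then
    renormalize.  gamma is an assumed increment parameter."""
    old = list(dist)                     # saved so the change can be undone
    dist[index] += gamma
    total = sum(dist)
    for i in range(len(dist)):
        dist[i] /= total
    return old
```

Returning the pre-modification distribution lets the learner record each change for possible later restoration.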

Stack evolution. The temporary ups and downs of the stack reflect that, as the tasks change, the system selectively keeps still-useful old modifications (corresponding to information conveyed by previous tasks that remains valuable for solving the current task), but deletes modifications that are too tailored to previous tasks. In the end, there are only about 200 stack entries corresponding to only 200 valid probability modifications, a small number compared to the total number of self-modifications executed during system life.
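The stack mechanism can be sketched as follows; the success-story criterion itself is abstracted into a predicate `still_valid`, a placeholder for SSA's actual reward-acceleration test:

```python
def pop_invalid(stack, policy, still_valid):
    """Sketch of the modification stack: each entry records how to undo
    one probability modification.  Entries that no longer satisfy the
    success-story criterion (abstracted here as still_valid) are popped
    and undone; entries that still pay off survive."""
    while stack and not still_valid(stack[-1]):
        cell_index, old_dist = stack.pop()
        policy[cell_index] = old_dist    # restore pre-modification state
```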

Juergen Schmidhuber 2003-02-25
