Storage / Instructions. The learner makes use of an assembler-like programming language similar to, but not quite as general as, the one in (Schmidhuber, 1995). It has $n_w$ addressable work cells with addresses ranging from 0 to $n_w - 1$. The variable, real-valued contents of the work cell with address $i$ are denoted $c_i$. Processes in the external environment occasionally write inputs into certain work cells. There also are $n_p$ addressable program cells with addresses ranging from 0 to $n_p - 1$. The variable, integer-valued contents of the program cell with address $i$ are denoted $d_i$. An internal variable Instruction Pointer (IP) with range $\{0, \ldots, n_p - 1\}$ always points to one of the program cells (initially to the first one). There also is a fixed set $B$ of integer values, which sometimes represent instructions and sometimes represent arguments, depending on the position of IP. IP and the work cells together represent the system's internal state (see section 2). For each value in $B$ there is an assembler-like instruction with integer-valued parameters. In the following incomplete list of instructions to be used in experiment 3, the symbols $a_1, a_2, a_3$ stand for parameters that may take on integer values between 0 and $n_w - 1$ (later we will encounter additional instructions):
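The storage layout described above can be sketched in a few lines of Python; the cell counts and value-set size (N_WORK, N_PROG, N_OPS) are hypothetical placeholders chosen for illustration, not values from the paper:

```python
# Hedged sketch of the learner's storage (not the authors' implementation).
N_WORK = 8   # hypothetical number of work cells, addresses 0 .. N_WORK-1
N_PROG = 16  # hypothetical number of program cells, addresses 0 .. N_PROG-1
N_OPS = 5    # hypothetical size of the fixed integer value set B

work = [0.0] * N_WORK  # real-valued work cell contents c_i
prog = [0] * N_PROG    # integer-valued program cell contents d_i
ip = 0                 # Instruction Pointer, initially at the first program cell
```

Each value in `prog` is later interpreted either as an instruction or as an argument, depending on where IP happens to point.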
Instruction probabilities / Current policy. For each program cell $i$ there is a variable probability distribution $P_i$ on $B$. For every possible $j \in B$, $P_{ij}$ specifies for cell $i$ the conditional probability that, when pointed to by IP, its contents will be set to $j$. The set of all current $P_{ij}$-values defines a probability matrix $P$ with columns $P_i$. $P$ is called the learner's current policy. In the beginning of the learner's life, all $P_{ij}$ are equal (maximum entropy initialization). If IP $= i$, the contents of cell $i$, namely $d_i$, will be interpreted as an instruction (such as Add or Mul), and the contents of the cells that immediately follow $i$ will be interpreted as the instruction's arguments, to be selected according to the corresponding $P$-values. See Figure 4.
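A minimal sketch of the policy matrix with maximum-entropy initialization, assuming hypothetical sizes N_PROG and N_OPS; `select_contents` is an illustrative helper, not code from the paper:

```python
import random

# One distribution per program cell over the value set B,
# initialized to maximum entropy: all entries of each row are equal.
N_PROG, N_OPS = 16, 5  # hypothetical sizes
P = [[1.0 / N_OPS] * N_OPS for _ in range(N_PROG)]

def select_contents(i, rng=random.random):
    """When IP points to cell i, draw its contents according to row P[i]."""
    r, acc = rng(), 0.0
    for j, p in enumerate(P[i]):
        acc += p
        if r < acc:
            return j
    return N_OPS - 1  # guard against floating-point rounding
```

The same sampling step serves both roles: the first drawn value is read as an instruction, the following draws as its arguments.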
Self-modifications. To obtain a learner that can explicitly modify its own policy (by running its own learning strategies), we introduce a special self-modification instruction, IncProb, not mentioned in the list above:
In conjunction with other primitives, IncProb may be used in instruction sequences that compute directed policy modifications. Calls of IncProb represent the only way of modifying the policy.
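One plausible form of such a self-modification primitive can be sketched as follows, under the assumption that IncProb shifts a fraction gamma of the remaining probability mass onto the chosen entry and then renormalizes the row; the exact update rule and the value of gamma are not specified in this excerpt:

```python
def inc_prob(P, i, j, gamma=0.25):
    """Hedged sketch of IncProb: increase P[i][j], then renormalize row P[i].

    The update rule (move a fraction gamma of the remaining mass onto
    entry j) and the default gamma are illustrative assumptions.
    """
    P[i][j] += gamma * (1.0 - P[i][j])
    total = sum(P[i])
    P[i] = [p / total for p in P[i]]
```

Because the row is renormalized, raising one entry necessarily lowers the others, so repeated calls concentrate probability on the boosted value.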
Self-delimiting self-modification sequences (SMSs). SMSs are subsequences of the lifelong action sequence. The first IncProb after the learner's ``birth'' or after each SSA call (see section 2) begins an SMS. The SMS ends by execution of another primitive not mentioned above:
Some of the (initially highly random) action subsequences executed during system life will indeed be SMSs. Depending on the nature of the other instructions, SMSs can compute almost arbitrary sequences of modifications of $P_{ij}$ values. This may result in almost arbitrary modifications of the context-dependent probabilities of future action subsequences, including future SMSs. Policy changes can be generated only by SMSs. SMSs form the basis for ``metalearning'': SMSs are generated according to the policy, and they may change the policy. Hence the policy can essentially change itself, and also the way it changes itself, and so on.
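This self-referential loop can be illustrated with a toy interpreter. The encoding below (action value 0 plays the role of IncProb and strengthens the distribution of the next cell) and the update fraction gamma are assumptions made for the sketch, not the paper's specification:

```python
import random

def sample(dist, r):
    """Draw an index from a discrete distribution, given r in [0, 1)."""
    acc = 0.0
    for j, p in enumerate(dist):
        acc += p
        if r < acc:
            return j
    return len(dist) - 1

def run(P, steps, rng=random.random, gamma=0.25):
    """Toy interpreter: each step draws an action from the current policy;
    action 0 (standing in for IncProb) boosts the next cell's drawn value,
    so the policy that generates the actions is itself being rewritten."""
    ip, trace = 0, []
    for _ in range(steps):
        j = sample(P[ip], rng())
        trace.append(j)
        if j == 0:  # self-modification: strengthen the next cell's draw
            cell = (ip + 1) % len(P)
            val = sample(P[cell], rng())
            P[cell][val] += gamma * (1.0 - P[cell][val])
            s = sum(P[cell])
            P[cell] = [p / s for p in P[cell]]
        ip = (ip + 1) % len(P)
    return trace
```

Running this on a maximum-entropy policy shows the key property: the trace of actions and the final policy both depend on self-modifications the policy itself sampled.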
SMSs can influence the timing of backtracking processes, because they can influence the times at which the EVALUATION CRITERION will be met. Thus SMSs can temporarily protect the learner from performance evaluations and policy restorations.
Plugging SMSs into SSA. We replace step 1 in the basic cycle (see section 2) by the following procedure:
We also change step 3 in the SSA cycle as follows: