Outline. The ``evolutionary'' system in this section (see also [32]) implements the ideas from section 1, in particular those in paragraph (**) on ``incremental selfimprovement''. To improve/speed up its own (initially very dumb, highly random) learning strategy, the system policy makes use of an assemblerlike programming language suitable to modify the policy itself. The policy actually is a set of conditional, modifiable probability distributions (initially maximum entropy distributions) on a set of assemblerlike instructions. These distributions are required to compute the probability of the assemblerlike instruction to be executed next (the system executes a lifelong sequence of such instructions). The PMP from section 1 are selfdelimiting ``sequences of selfmodifications'' (SSMs). SSMs are special instruction subsequences (generated according to the current policy) that define their own beginning and their own end (by executing special instructions). With the help of special ``selfreferential'' instructions, SSMs can compute almost arbitrary policy modifications (which actually are sequences of probability modifications). Policy modifications can be computed only by SSMs. SSMs affect the probabilities of future SSMs, which affect the probabilities of future SSMs, etc. These recursive effects are automatically taken care of by SSA: following the SSA principle, whenever an SSM computes a policy modification, information required to restore original probability distributions and to compute reinforcement/time ratios for currently valid modifications is pushed on a stack. After each instruction executed outside an SSM, the system runs a popping process (see section 1). It can be guaranteed that the system, after each action executed outside an SSM, will achieve SSC ( ``learning how to learn''). In principle, the system can learn to deal with arbitrary reward delays by determining the lengths of its SSMs  SSMs define checkpoints (see section 1) by themselves.
The system is ``evolutionary'' in the sense that it lets survive only those (initially highly random) policy modifications that were followed by longterm reinforcement acceleration. The system has been implemented and tested on (nonMarkovian) toy tasks in changing environments. The experiments illustrate the theoretical predictions: the system computes action subsequences leading to faster and faster reinforcement intake (section 3). Due to the generality of the approach, however, no reasonable statements can be made about improvement speed, which indeed highly depends on the nature of the environment and the choice of initial primitive instructions.
Storage. The system's storage is a single array of cells. Each cell has an integer address in the interval . Max is a positive integer. Min is a negative integer. denotes the cell with address . The variable contents of are denoted by , and are of type integer as well ( ; ). Special addresses, , , , and , are used to further divide storage into segments: Min InputStart InputEnd RegisterStart ProgramStart Max. The input area is the set of input cells . The register area is the set of register cells . ``Registers'' are convenient for indirectaddressing purposes. The program area is the set of program cells . Integer sequences in the program area are interpreted as executable code. The work area is the set of work cells . Instructions executed in the program area may read from and write to the work area. Both register area and input area are subsets of the work area.
Environmental inputs. At every time step, new inputs from the environment may be written into the input cells.
Primitives. The assemblerlike programming language is partly inspired by the one in [33,38]. There are primitive instructions ( ). Each ``primitive'' is represented by a unique number in the set (due to the code being written in C). The primitive with number is denoted by . Primitives may have from zero to three arguments, each of which has a value in . The semantics of the primitives and their corresponding arguments are given in Table 1. The nonselfreferential (``normal'') primitives include actions for comparisons, and conditional jumps, for copying storage contents, for initializing certain storage cells with small integers, and for adding, subtracting, multiplying, and dividing. They also include output actions for modifying the environment, and input actions for perceiving environmental states. The ``selfreferential'' primitives will be described in detail below.
Primitive and argument probabilities. For each cell in the program area, there is a discrete probability distribution over the set of possible cell contents. There is a variable InstructionPointer (IP) which always points to one of the program cells (initially to the first one). If IP and , then denotes the probability of selecting primitive as the next instruction. The restriction is needed to leave room for the instruction's possible arguments should it require any. Once is selected: . If has a first argument, then is the probability of being chosen as its actual value, for . Once some is selected: . Analoguously, if has a second argument, then is the probability of being chosen as its actual value, for . Once some is selected: . And finally, if has a third argument, then is the probability of being chosen as its actual value, for . Once some is selected: .
Double indexed addressing. Although program cells can take on only few values, the use of doubleindexed addressing allows for addressing the entire storage: a parameter in a program cell may point to a work cell with a small address. The content of this work cell, however, does not have essential limits, and may point to any address in storage. See [11,53,42,40,39], however, for alternative implementations without double indexed addressing.
Current policy. The set of all current values defines the system's current policy.
Instruction cycle. A single step of the interpreter works as follows: if IP points to program cell , a primitive and the corresponding arguments are chosen randomly according to the current probability distributions, as already described. They are sequentially written onto the program area, starting from . Syntax checks are performed. Rules for syntactic correctness are given in the caption of Table 1. If syntactically correct, the instruction gets executed. This may result in modifications of IP and/or environment and/or storage. If the instruction did not change the value of IP (e.g. by causing a jump), IP is set to the address of the cell following the last argument of the current instruction. If the instruction is syntactically incorrect, IP is reset to the first program cell. This instruction cycle represents the basic operation of the system.
System life. At time step 0, storage is initialized with zeros. The probability distributions of all program cells are initialized with maximum entropy distributions [44]. That is, all values are initialized to the same value, so that there is no bias for a particular value in any cell. After initialization, the instruction cycle is repeated over and over again until system death at time . Recall that does not have to be known in advance.
Storage as part of the environment. After initialization at time 0, the storage cells are never reinitialized: the storage has to be viewed as part of the total enviroment of the system policy. The storage is like an external notebook. Notes written into the notebook may influence future actions^{2}. Notes may represent useful memories, or may have harmful long term effects.
Selfreferential primitives. Two special primitives, DecP and IncP, may be used to address and modify the current probability distribution of any program cell (see Table 1). With the action DecP, the value for a particular cell/value pair can be decreased by some factor in , , , . The probabilities for that cell are then normalized. Likewise, with the action IncP, the complement () of the value for a particular cell and value can be decreased by a factor in (and the cell probabilities are again renormalized). DecP and IncP have no effect if the indirectly addressed cell contents (see Table 1) are not an integer between 1 and 99, or if the corresponding probability modification would lead to at least one value below MinP (a small positive constant).
The primitive GetP can be used to write scaled versions of current probability values into work cells. GetP is potentially useful for purposes of introspection.
There is also another selfreferential instruction, EndSelfMod(), that helps to group a sequence of probabilitymodifying instructions and other instructions into selfdelimiting selfmodification sequences (see below).

Selfdelimiting sequences of selfmodifications (SSMs). They correspond to the PMP from section 1. The beginning of the first or after an EndSelfMod() action or after system ``birth'' begins an SSM. The SSM ends itself by executing EndSelfMod(). Some of the (initially highly random) action subsequences executed during system life will indeed be SSMs. They can compute almost arbitrary sequences of modifications of values, resulting in almost arbitrary modifications of contextdependent probabilities of future action subsequences, including future SSMs. However, due to MinP being positive, the probability of selecting and executing a particular instruction at a particular time cannot entirely vanish. In agreement with section 1, the th SSM is called PMP.
Keeping track with a stack. Following the SSA principle, whenever the system uses or to modify one of its probability distributions, the following values are pushed onto a stack : the current time, total reinforcement so far, the address of the modified program cell, its old probability distribution right before the current modification (represented by real values), and a pointer to the stack entry corresponding to the first probability modification computed by the current SSM. This information is needed by the SSA popping process to be described later. More formally:
The th entry of , , is denoted . consists of the following variables: , , , (a vector of variables), and . The variable points to the current topmost stack entry ( at system startup). The zeroth stack entry, which cannot be popped, is initialized as follows: ; ; . The remaining values are undefined. If some SSM modifies some probability distribution at time (by using or ), is incremented, ; ; ; before the modification (represented by real values). If marks the beginning of an SSM then . Otherwise (in between beginning and end of an SSM), .
SSA popping processes. Popping occurs right before the first pushing process of each SSM, and also after each instruction except when an SSM is running (these are the possible additional checkpoints referred to in section 1): ``old'' probability distributions are successively popped and restored, until SSC (see section 1) is satisfied. Of course, the computation time required for pushing and popping probability distributions etc. is taken into account when reinforcement/time ratios are computed. More formally ( denotes the current time):
While and
do: .
Note: increases during execution of the while loop.
Lemma 1. According to the SSA principle (section 1), after each instruction (except during the execution of an SSM), an SSA popping process ensures that the beginning of each completed SSM that computed still valid probability modifications has been followed by faster reinforcement intake than the beginnings of all previous such SSMs.
Proof sketch. Induction over the stack contents after popping, in analogy to proof of Theorem 1, section 1.
Arbitrary reward delays. The EndSelfMod() instruction allows the system to delay SSA's evaluations of probability modifications arbitrarily (the expectation of the delay remains finite, however, due to being positive). These checkpoint setting capabilities are important, for two reasons: (1) In general, reinforcement events will be separated by long (unknown) time lags. Hence, novel probability modifications are not necessarily bad if they do not lead to immediate reinforcement. In principle, the system itself can learn how much time to spend on waiting for first reinforcement events. (2) Two successive modifications of two particular probability distributions may turn out to be beneficial, while each by itself may be harmful. For this reason, the system is allowed to compute arbitrary sequences of probability modifications, before facing SSA's evaluations.
Delaying evaluations does cost time, though, which is taken into account by SSA. In the long run, the system is encouraged to create useful SSMs of the appropriate size.
Accelerating reinforcement intake. Lemma 1 shows: the scheme keeps only modifications caused by SSMs followed by faster and faster reinforcement intake, thus satisfying SSC from section 1. Due to nonvanishing probabilities of arbitrary action sequences (recall that MinP is positive), the system will eventually discover a way to improve its current performance, if there is any.
Learning how to learn. Performance improvements include improvements that make future improvements more likely: SSMs can prove their long term usefulness by setting the stage for additional, useful SSMs, which potentially include SSMs executing known (and not yet known) learning algorithms. The success of an SSM recursively depends on the success of all later SSMs. SSA automatically takes care of this, thus recursively encouraging ``learning how to learn how to learn ...''. This represents an essential difference to previous approaches to ``continual'' learning, see [24].
Improvement speed? Due to the generality of the approach, no reasonable statements can be made about improvement speed, which indeed highly depends on the nature of the environment and the choice of initial primitive instructions. This lack is shared by almost all other reinforcement learning approaches, though. Note, however, that unlike previous, less general systems, the novel system in principle can exploit almost arbitrary environmental regularities [14,5,45,21] (if there are any) to speed up performance improvement, simply because it can run almost arbitrary learning algorithms. For instance, in principle, unlike previous evolutionary and genetic algorithms [23,43,13,12,15], the system can learn to focus its modifications on interfaces between useful ``subprograms'' (``divide and conquer''), instead of mutating the subprograms themselves (if this proves to be beneficial in a given environment), thus creating a higherlevel, more abstract search space ( directed mutations as opposed to random mutations). Just as evolution ``discovered'' that having the ``genetic crossover operator'' was a ``good thing'', the system is potentially able to discover that various more directed search strategies are ``good things''. There is just no feasible way of predicting the precise nature of such speedups.
How many environments are regular? The last paragraph mentioned that the novel system in principle can exploit almost arbitrary environmental regularities, if there are any. One might ask: how likely is it that a given environment contains regularities at all? How many environments do indeed allow for successful generalization from previous experience? A recent debate in the machinelearning community highlighted a fact that appears discouraging at first glance: in general, generalization cannot be expected, inductive inference is impossible, and nothing can be learned. See, e.g., [8,26,52,31,38,37]. Paraphrasing from a previous argument [38,37]: let the task be to learn some relation between finite bitstrings and finite bitstrings. Somehow, a training set is chosen. In almost all cases, the shortest algorithm computing a (nonoverlapping) test set essentially has the same size as the whole test set. This is because most computable objects are irregular and incompressible [14,5]. The shortest algorithm computing the test set, given the training set, isn't any shorter. In other words, the relative algorithmic complexity of the test set, given the training set, is maximal, and the mutual algorithmic information between test set and training set is zero (ignoring an additive constant independent of the problem  see [14,5,45,21]). Therefore, in almost all cases, (1) knowledge of the training set does not provide any clues about the test set, (2) there is no hope for generalization, and (3) inductive inference does not make any sense. Similarly, almost all possible environments are ``hostile'' in the sense that they don't provide regularities exploitable by any learning algorithm. This may seem discouraging.
Atypical real world. Apparently, however, generalization and inductive inference do make sense in the real world! One reason for this may be that the real world is run by a short algorithm. See, e.g., [37]. Anyway, problems that humans consider to be typical are atypical when compared to the general set of all welldefined problems. Otherwise, things like ``learning by analogy'', ``learning by chunking'', ``incremental learning'', ``continual learning'', ``learning from invariances'', ``learning by knowledge transfer'' etc. would not be possible, and experience with previous problems could not help to sensibly adjust the prior distribution of solution candidates in the search space for a new problem (shift of inductive bias, e.g. [48]).
To repeat: an interesting feature of incremental selfimprovement is that its theoretical potential for exploiting environmental regularities, if there are any, exceeds the corresponding potential of previous learning systems.