Overview. This section describes a concrete implementation of the principles above. SMP is a set of variable probability distributions over a set of assembler-like instructions including SMP-modifying instructions. The learner executes a lifelong instruction sequence generated according to its SMP. We describe details of its SSA implementation.
Architecture. There are m addressable program cells with addresses ranging from 0 to m-1. The variable, integer-valued contents of the program cell with address i are denoted di. An internal variable Instruction Pointer (IP) with range {0, ..., m-1} always points to one of the program cells (initially to the first one). IP's current position represents the internal state S (and can be used to disambiguate inputs). Instructions and their parameters are encoded by a fixed set I of nops integer values. For each value j in I, there is an instruction Bj with Nj integer-valued parameters.
Stochastic, self-modifying policy. For each program cell i there is a variable probability distribution SMPi over I. For every possible j in I (0 ≤ j ≤ nops-1), SMPij specifies for cell i the conditional probability that, when pointed to by IP, its contents will be set to j. The set of all current SMPij-values defines a probability matrix SMP (the learner's current stochastic policy) with columns SMPi. If IP = i, then the contents di of cell i will be interpreted as instruction Bdi, and the contents of the cells immediately following i will be interpreted as Bdi's arguments, selected according to the corresponding SMP-values. See Basic Cycle below.
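The selection step can be sketched as follows. This is a minimal illustration, not the original implementation: the names m, n_ops, smp, and sample_cell, and the choice of two arguments, are assumptions made for the example.

```python
import random

# Hypothetical sketch of instruction selection under the stochastic policy.
# smp[i] plays the role of column SMPi: a distribution over the nops codes.
m, n_ops = 8, 4                      # illustrative sizes, not from the text
smp = [[1.0 / n_ops] * n_ops for _ in range(m)]

def sample_cell(i):
    """Set cell i's contents to j with probability SMPij and return j."""
    return random.choices(range(n_ops), weights=smp[i])[0]

ip = 0                               # instruction pointer
j = sample_cell(ip)                  # contents d_ip, interpreted as code of Bj
# Contents of the following cells serve as Bj's arguments (2 assumed here).
args = [sample_cell((ip + k) % m) for k in range(1, 3)]
```

Each column must remain a proper probability distribution for the selection step to be well-defined.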
Normal instructions. Normal instructions are those that do not change SMP. An example is B0 (later we will introduce many additional, problem-specific instructions for interaction with an environment):
Primitive learning algorithms (PLAs). By definition, a PLA is an instruction that modifies SMP. PLAs such as IncProb below and other instructions may be composed to form complex (probabilistic) learning algorithms. In what follows, the symbols a1, a2, a3 stand for instruction parameters in I.
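An IncProb-style PLA might be sketched as below. The exact update rule is not restated in this section, so the multiplicative-style increment with renormalization (and the parameter gamma) is an assumption made for illustration only.

```python
# Hedged sketch of a PLA that modifies SMP: shift probability mass toward
# entry (i, j) of the policy matrix, then renormalize column i so it
# remains a probability distribution. Names and update rule are illustrative.
def inc_prob(smp, i, j, gamma=0.5):
    """Increase SMPij, then renormalize column i."""
    smp[i][j] += gamma
    total = sum(smp[i])
    smp[i] = [p / total for p in smp[i]]

n_ops = 4
smp = [[1.0 / n_ops] * n_ops for _ in range(3)]
inc_prob(smp, 0, 2)   # make instruction code 2 more likely in cell 0
```

Because IncProb is itself selected according to SMP, such updates let the learner modify the very distributions that generate its future instructions.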
Evaluation instruction. By executing the instruction ``PrepareEvaluation'', the learner can influence the timing of the next checkpoint, at which the learner's lifelong performance so far is evaluated:
Initialization. At time 0 (system birth) we initialize SMP with maximum entropy distributions (all SMPij equal), and set IP = 0. We introduce an initially empty stack that allows for variable-sized stack entries, with the conventional push and pop operations. The integer variable NumR (initially zero) is used to count down the non-zero reward signals remaining after the last execution of ``PrepareEvaluation''. The Boolean variable SMPchanged is initially false.
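The initial state can be summarized in a few lines. All names here are illustrative stand-ins for the variables just introduced, with arbitrary example sizes:

```python
# Sketch of the state at time 0 (system birth); m and n_ops are assumed sizes.
m, n_ops = 8, 4
smp = [[1.0 / n_ops] * n_ops for _ in range(m)]  # maximum-entropy columns
ip = 0               # IP points to the first program cell
stack = []           # empty; will hold variable-sized entries (saved policy
                     # components and checkpoint tags)
num_r = 0            # counts down remaining non-zero reward signals
smp_changed = False  # Boolean flag SMPchanged
```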
Basic cycle of operations. Starting at time 0, until time T (system death), the system repeats the following basic instruction cycle over and over again in real time:
ELSE pop off all stack entries above the one for tag (t', R(t')) (these entries are former SMP-components saved during an earlier execution of step 2), and use them to restore SMP as it was before time t'. Then also pop off the tag (t', R(t')). Go to SSA.1.
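The pop-and-restore step can be sketched as follows. The entry formats ("tag", t, R) and ("saved", i, column) are assumptions made for this illustration; the original only specifies that the stack holds checkpoint tags and saved SMP-components.

```python
# Hedged sketch of the SSA restore branch: undo every policy modification
# made since the topmost checkpoint tag, then pop the tag itself.
def pop_and_restore(stack, smp):
    """Restore SMP as it was before time t' of the topmost tag."""
    while stack and stack[-1][0] == "saved":
        _, i, old_column = stack.pop()
        smp[i] = old_column          # former SMP-component saved in step 2
    if stack and stack[-1][0] == "tag":
        stack.pop()                  # also pop off the tag (t', R(t'))

# Illustrative use: one column was modified after the checkpoint at t' = 5.
n_ops = 4
smp = [[0.25] * n_ops for _ in range(3)]
smp[1] = [0.7, 0.1, 0.1, 0.1]
stack = [("tag", 5, 2.0), ("saved", 1, [0.25] * n_ops)]
pop_and_restore(stack, smp)
```

After the call, column 1 is back to its pre-checkpoint value and the tag is gone, so control can return to SSA.1 with a shorter surviving history.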
Each step above will consume some amount of system life-time. Note a few differences from ``Genetic Programming'': (1) no program populations, (2) self-modification ability, (3) stochastic policy, (4) instructions generated and executed online, (5) SSA calls evaluating the entire life so far.
Lifelong reward acceleration. After each call of SSA the history of surviving modifications is guaranteed to be a life-time success-story (in the worst case an empty one). We don't have to assume fully observable environments -- compare SSC and generalization assumption in section 2.
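The success-story criterion referred to above (defined in section 2, not restated here) can be sketched as a check that each surviving checkpoint marks the start of a period of faster average reward intake than all earlier surviving checkpoints. The function name and argument format below are illustrative:

```python
# Hedged sketch of a success-story check over surviving checkpoints.
# checkpoints: list of (t_k, R(t_k)) with t_1 < t_2 < ... < t (current time).
def ssc_holds(checkpoints, t, R_t):
    """True iff reward/time ratios strictly increase along the history:
    R(t)/t < (R(t)-R(t1))/(t-t1) < (R(t)-R(t2))/(t-t2) < ..."""
    ratios = [R_t / t] + [(R_t - r_k) / (t - t_k) for t_k, r_k in checkpoints]
    return all(a < b for a, b in zip(ratios, ratios[1:]))
```

A history that violates this chain at some checkpoint would trigger the restore branch above, popping that checkpoint's modifications; in the worst case every checkpoint is popped and the success-story is empty.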
Speed? Due to the generality of the approach, no reasonable statements can be made about improvement speed, which indeed depends strongly on the nature of E and of the instructions in I. This lack of quantitative convergence results is shared by almost all other, less general RL schemes.