**Overview.**
This section
describes a concrete implementation of the principles
above. SMP is a set of variable probability distributions over
a set of assembler-like instructions including SMP-modifying
instructions. The learner executes a lifelong instruction sequence
generated according to its SMP.
We describe the details of the corresponding SSA implementation.

**Architecture.**
There are *m* addressable *program cells*
with addresses ranging from 0 to *m*-1.
The variable, integer-valued contents of the program cell with address *i*
are denoted *d*_{i}.
An internal variable
*Instruction Pointer* (*IP*)
with range {0, ..., *m*-1}
always points
to one of the program cells (initially to the first one).
*IP*'s current position represents the internal state *S*
(and can be used to disambiguate inputs).
Instructions and their parameters are encoded by
a fixed set *I* = {0, 1, ..., *n*_{ops}-1} of *n*_{ops} integer values.
For each value *j* in *I*, there
is an instruction *B*_{j} with *N*_{j} integer-valued parameters.
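
A minimal Python sketch of this storage layout, under stated assumptions (the concrete sizes and the arity table below are illustrative, not given in the text):

```python
# Sketch of the program storage: m integer cells plus an instruction pointer.
m = 50        # number of program cells (illustrative; the text leaves m open)
n_ops = 3     # |I|: here only B_0, B_1, B_2 as defined below

d = [0] * m   # d[i]: integer contents of program cell i
IP = 0        # Instruction Pointer, always in {0, ..., m - 1}

# Arity table: N[j] is the number of integer parameters of instruction B_j.
N = {
    0: 0,  # B_0: JumpHome
    1: 3,  # B_1: IncProb(a1, a2, a3)
    2: 1,  # B_2: PrepareEvaluation(a1)
}
```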

**Stochastic, self-modifying policy.**
For each program cell *i* there is a variable probability
distribution SMP_{i} over *I*.
For every possible *j* (*j* ∈ *I*),
SMP_{ij} specifies for cell *i* the conditional
probability that, when pointed to by
*IP*, its contents will be set to *j*.
The set of all current
SMP_{ij}-values defines a probability
matrix SMP
(the learner's *current stochastic policy*)
with columns SMP_{i}.
If *IP* = *i*, then
the contents of *i*, namely *d*_{i},
will be interpreted as instruction *B*_{d_{i}},
and the contents of cells immediately
following *i* will be interpreted as *B*_{d_{i}}'s arguments,
selected according to the corresponding SMP-values.
See *Basic Cycle* below.
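
A sketch of the policy matrix and of drawing a cell's contents from it; the uniform start anticipates the *Initialization* paragraph below, and all concrete sizes are again illustrative:

```python
import numpy as np

m, n_ops = 50, 3                  # illustrative sizes, as in the sketch above
rng = np.random.default_rng()

# SMP[i] is the distribution over I for program cell i (a column of the
# matrix SMP); maximum-entropy start: all SMP_ij equal.
SMP = np.full((m, n_ops), 1.0 / n_ops)

def sample_cell(i: int) -> int:
    """Draw new contents j for cell i according to column SMP_i."""
    return int(rng.choice(n_ops, p=SMP[i]))
```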

**Normal instructions.**
Normal instructions are those that do not change SMP.
An example is *B*_{0} (later we will introduce many additional,
problem-specific instructions for interaction with an environment):

*B*_{0}: *JumpHome* -- set *IP* = 0 (jump back to the 1st program cell).

**Primitive learning algorithms (PLAs).**
By definition,
a PLA is an instruction that modifies SMP.
PLAs such as *IncProb* below
and other instructions
may be composed to form complex (probabilistic)
learning algorithms.
In what follows, the symbols *a*1, *a*2, *a*3 stand for instruction
parameters in *I*.

*B*_{1}: *IncProb(a1, a2, a3)* -- Set *i* := (*a*1 * *n*_{ops} + *a*2)/*k* (integer division); set *j* := *a*3. If *i* is a valid cell address, increase SMP_{ij} by γ percent, and renormalize SMP_{i} (but prevent SMP-components from falling below a minimal value ε, to avoid near-determinism). We will use *k* = 3.
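
A sketch of *IncProb* under stated assumptions: the increment γ and the floor ε did not survive in this copy and are placeholders here, and the validity test on *i* is our reading of the elided condition:

```python
import numpy as np

GAMMA = 0.15   # placeholder for the "increase by gamma percent" constant
EPS = 0.001    # placeholder floor keeping SMP away from near-determinism
K = 3          # k = 3, as stated above

def inc_prob(SMP: np.ndarray, a1: int, a2: int, a3: int, n_ops: int) -> None:
    """Sketch of B_1 = IncProb(a1, a2, a3)."""
    m = SMP.shape[0]
    i = (a1 * n_ops + a2) // K   # integer division, as in the definition
    j = a3
    if i >= m:                   # assumed check: i must address a program cell
        return
    SMP[i, j] *= 1.0 + GAMMA     # boost the selected component ...
    SMP[i] /= SMP[i].sum()       # ... and renormalize column i.
    # Approximate floor: clip to EPS and renormalize once more (a sketch,
    # not necessarily the paper's exact renormalization scheme).
    SMP[i] = np.maximum(SMP[i], EPS)
    SMP[i] /= SMP[i].sum()
```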

**Evaluation instruction.**
By executing the instruction *``PrepareEvaluation''*
the learner can influence the timing of the next checkpoint,
at which its lifelong performance so far will be evaluated:

*B*_{2}: *PrepareEvaluation(a1)* -- Temporarily disable PLAs (e.g., *IncProb*) by preventing them from executing further SMP-modifications, until *a*1 + 1 additional, positive reward signals have been received (this will define a checkpoint and trigger an SSA call -- see basic cycle below).
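
The bookkeeping this instruction implies might look as follows (a sketch; the flag and function names are our own, and re-enabling the PLAs at the checkpoint is our reading of ``temporarily disable''):

```python
class EvalState:
    """Illustrative bookkeeping for B_2 = PrepareEvaluation."""
    def __init__(self):
        self.plas_disabled = False  # while True, PLAs leave SMP untouched
        self.NumR = 0               # rewards still to count before the checkpoint

def prepare_evaluation(state: EvalState, a1: int) -> None:
    """Disable SMP-modifications until a1 + 1 further reward signals arrive."""
    state.plas_disabled = True
    state.NumR = a1 + 1

def on_reward(state: EvalState, r: float) -> bool:
    """Count down NumR on non-zero reward; True means: checkpoint reached."""
    if state.NumR > 0 and r != 0:
        state.NumR -= 1
        if state.NumR == 0:
            state.plas_disabled = False  # an SSA call follows (see basic cycle)
            return True
    return False
```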

**Initialization.**
At time 0 (system birth) we
initialize SMP with maximum
entropy distributions (all SMP_{ij} equal), and set *IP*=0.
We introduce an initially empty stack
that allows for variable-sized stack entries,
and the conventional *push* and *pop* operations.
The integer variable *NumR* (initially zero)
is used to count down the non-zero reward signals
still to be received before the next checkpoint,
as determined by the most recent execution of *``PrepareEvaluation''*.
The Boolean variable *SMPchanged* is initially FALSE.
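
Collecting the above, a sketch of the complete state at system birth (names and sizes are assumptions):

```python
import numpy as np

def birth_state(m: int = 50, n_ops: int = 3) -> dict:
    """Illustrative state at time 0: maximum-entropy SMP, IP = 0,
    empty stack, NumR = 0, SMPchanged = FALSE."""
    return {
        "SMP": np.full((m, n_ops), 1.0 / n_ops),  # all SMP_ij equal
        "d": [0] * m,     # program cell contents
        "IP": 0,
        "stack": [],      # variable-sized entries: SMP backups and tags
        "NumR": 0,
        "SMPchanged": False,
    }
```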

**Basic cycle of operations.**
Starting at time 0, until time *T* (system death),
the system repeats the following basic instruction cycle over and
over again in real time:

**1.** Randomly generate an integer *j* ∈ *I* according to matrix column SMP_{IP} (the probability distribution of the program cell pointed to by *IP*). Set program cell content *d*_{IP} := *j*. Translate *j* into the corresponding current instruction *B*_{j}. Look up the number *N*_{j} of cells required to store *B*_{j}'s parameters. **IF** *IP* > *m* - *N*_{j} - 2, **THEN** set *IP* to 0 and go to step **1**. **ELSE** generate instruction arguments for the *N*_{j} cells immediately following *IP*, according to their probability distributions SMP_{IP+1}, ..., SMP_{IP+N_j}, and set *IP* := *IP* + *N*_{j} + 1.

**2.** **IF** *B*_{j} is a PLA, and not currently disabled by a previous *``PrepareEvaluation''* instruction, **THEN** push copies of those SMP_{i} to be modified by *B*_{j} onto the stack.

**3.** Execute instruction *B*_{j}. This may change (1) the environment, (2) *IP*, (3) SMP itself. In case (3), set *SMPchanged* = TRUE. **IF** *B*_{j} is *``PrepareEvaluation''* and *SMPchanged* = TRUE, **THEN** set *NumR* equal to *B*_{j}'s actual parameter plus one.

**4.** **IF** *NumR* > 0 and non-zero reward occurred during the current cycle, **THEN** decrement *NumR*. **IF** *NumR* = 0, **THEN** call SSA, which achieves the success-story criterion by backtracking as follows (a code sketch follows this list):

**SSA.0.** Set variable *t* equal to the current time (*t* is a checkpoint).

**SSA.1.** **IF** there is no *``tag''* stored anywhere in the stack (tags are pairs of checkpoints in *V* (see section 2) and rewards, pushed in earlier SSA calls), **THEN** push the tag (*t*, *R*(*t*)) onto the stack, set *SMPchanged* = FALSE, and go to **1** (this ends the SSA call).

**SSA.2.** Denote the topmost tag in the stack by (*t*', *R*(*t*')). Denote the one below it by (*t*'', *R*(*t*'')) (if there is no tag below, set variable *t*'' = 0 -- recall *R*(*t*'') = *R*(0) = 0).

**SSA.3.** **IF** (*R*(*t*) - *R*(*t*')) / (*t* - *t*') > (*R*(*t*) - *R*(*t*'')) / (*t* - *t*''), **THEN** push (*t*, *R*(*t*)) as a new tag, set *SMPchanged* = FALSE, and go to **1** (this ends the SSA call). **ELSE** pop off all stack entries above the one for tag (*t*', *R*(*t*')) (these entries will be former SMP-components saved during an earlier execution of step **2**), and use them to restore SMP as it was before time *t*'. Then also pop off the tag (*t*', *R*(*t*')). Go to **SSA.1**.
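
A minimal Python sketch of the SSA backtracking loop (steps **SSA.0**--**SSA.3**), assuming each stack entry is either a tag `("tag", t_i, R_i)` or a saved policy column `("backup", i, column)` pushed in step **2**, and that a hypothetical callback `restore_smp(i, column)` writes a saved column back into the matrix:

```python
def ssa_call(t: float, R_t: float, stack: list, restore_smp) -> None:
    """Backtrack until the surviving tags form a success story, then push
    the new tag (t, R(t)). The caller resets SMPchanged to FALSE afterwards."""
    while True:
        tags = [e for e in stack if e[0] == "tag"]
        # SSA.1: no tag yet -- push (t, R(t)) and return.
        if not tags:
            stack.append(("tag", t, R_t))
            return
        # SSA.2: topmost tag (t', R(t')) and the one below it, (t'', R(t'')).
        _, t1, R1 = tags[-1]
        _, t2, R2 = tags[-2] if len(tags) >= 2 else ("tag", 0.0, 0.0)
        # SSA.3: keep t' only if average reward per time has accelerated.
        if (R_t - R1) / (t - t1) > (R_t - R2) / (t - t2):
            stack.append(("tag", t, R_t))
            return
        # Otherwise undo all SMP-modifications saved since t' ...
        while stack[-1][0] != "tag":
            _, i, column = stack.pop()
            restore_smp(i, column)
        stack.pop()  # ... and pop the invalidated tag (t', R(t')) itself.
```

On each pass the loop either accepts *t* as a new tag or removes one earlier tag together with the SMP-components saved since it, so it terminates after at most as many passes as there are stored tags.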

Each step above will consume various amounts of system life-time. Note a few differences from ``Genetic Programming'': (1) no program populations, (2) self-modification ability, (3) stochastic policy, (4) instructions generated and executed online, (5) SSA calls evaluating the entire life so far.

**Lifelong reward acceleration.**
After each call of SSA the history of surviving
modifications is *guaranteed*
to be a life-time success-story (in the worst case
an empty one). We don't have to assume fully
observable environments --
compare the SSC and the generalization assumption in section 2.
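
For reference, our restatement of the SSC in the notation above: at each checkpoint *t*, the surviving checkpoints *v*_1 < *v*_2 < ... < *v*_k in *V* must satisfy

```latex
\frac{R(t)}{t}
\;<\; \frac{R(t)-R(v_1)}{t-v_1}
\;<\; \frac{R(t)-R(v_2)}{t-v_2}
\;<\; \cdots
\;<\; \frac{R(t)-R(v_k)}{t-v_k},
```

i.e., each surviving modification marks the beginning of a period of faster average reward intake.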

**Speed?**
Due to the generality of the approach, no reasonable
statements can be made about improvement *speed*, which
indeed depends strongly on the nature of *E* and of the instruction set *I*.
This lack of quantitative
convergence results is shared by almost all other, less general
RL schemes.
