
AN IMPLEMENTATION


Overview. This section describes a concrete implementation of the principles above. The SMP is a set of variable probability distributions over a set of assembler-like instructions, including SMP-modifying instructions. The learner executes a lifelong instruction sequence generated according to its SMP. We describe the details of the corresponding SSA implementation.

Architecture. There are m addressable program cells with addresses ranging from 0 to m-1. The variable, integer-valued contents of the program cell with address i are denoted di. An internal variable Instruction Pointer (IP) with range $\{0, \ldots, m-1 \}$ always points to one of the program cells (initially to the first one). IP's current position represents the internal state S (and can be used to disambiguate inputs). Instructions and their parameters are encoded by a fixed set I of $n_{ops}$ integer values $\{0, \ldots, n_{ops}-1 \}$. For each value j in I, there is an instruction Bj with Nj integer-valued parameters.

Stochastic, self-modifying policy. For each program cell i there is a variable probability distribution SMPi over I. For every possible $j \in I$ ( $0\leq j \leq n_{ops}-1$), SMPij specifies for cell i the conditional probability that, when pointed to by IP, its contents will be set to j. The set of all current SMPij-values defines a probability matrix SMP (the learner's current stochastic policy) with columns SMPi ( $0\leq i \leq m-1$). If IP = i, then the contents of i, namely di, will be interpreted as instruction $B_{d_i}$, and the contents of the cells immediately following i will be interpreted as $B_{d_i}$'s arguments, selected according to the corresponding SMP-values. See Basic Cycle below.
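
For concreteness, here is a minimal sketch of these data structures in Python (not part of the system described here). The toy values of m and n_ops, the parameter table N, and the helper name sample_cell are our own illustrative assumptions; the uniform SMP columns, the cell contents di, and IP follow the text.

import random

m = 50                           # number of addressable program cells (toy value)
n_ops = 4                        # size of the instruction set I = {0, ..., n_ops-1} (toy value)
N = {0: 0, 1: 3, 2: 1, 3: 0}     # N_j: number of parameters of instruction B_j (toy table)

# SMP: one variable probability distribution per program cell, initially maximum entropy.
SMP = [[1.0 / n_ops] * n_ops for _ in range(m)]
d = [0] * m                      # d_i: integer contents of program cell i
IP = 0                           # instruction pointer; its position is the internal state S

def sample_cell(i):
    """Draw an integer j in I according to column SMP_i and write it into cell i (d_i := j)."""
    j = random.choices(range(n_ops), weights=SMP[i])[0]
    d[i] = j
    return j

Step 1 of the basic cycle below then amounts to calling sample_cell for cell IP and for the Nj cells immediately following it.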

Normal instructions. Normal instructions are those that do not change SMP. An example is B0 (later we will introduce many additional, problem-specific instructions for interaction with an environment):

B0:
JumpHome -- set IP = 0 (jump back to 1st program cell).

Primitive learning algorithms (PLAs). By definition, a PLA is an instruction that modifies SMP. PLAs such as IncProb below and other instructions may be composed to form complex (probabilistic) learning algorithms. In what follows, the symbols a1,a2,a3 stand for instruction parameters in $\{ 0, 1, \ldots, n_{ops}-1 \}$.

B1:
IncProb(a1, a2, a3) -- Set i:=(a1 * $n_{ops}$ + a2)/k (integer division); set j:=a3. If $0\leq i \leq m-1$, increase SMPij by $\gamma$ percent, and renormalize SMPi (but prevent SMP-components from falling below a minimal value $\epsilon$, to avoid near-determinism). We will use $\gamma = 15$, $\epsilon = 0.001$, k=3.
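
For illustration, a hedged Python sketch of IncProb follows, assuming the list-of-columns representation of SMP from the sketch above; the function name and the Boolean return value are our own conventions, and the double renormalization is only a simple approximation of the $\epsilon$ floor described in the text.

GAMMA, EPSILON, K = 15, 0.001, 3     # gamma (percent), epsilon, k from the text above

def inc_prob(SMP, a1, a2, a3, n_ops):
    """IncProb(a1, a2, a3): raise SMP[i][j] by GAMMA percent, then renormalize column i
    while keeping all components at or above EPSILON (avoids near-determinism)."""
    m = len(SMP)
    i = (a1 * n_ops + a2) // K       # integer division, as in the text
    j = a3
    if not 0 <= i <= m - 1:
        return False                 # address out of range: SMP left unchanged
    SMP[i][j] *= 1.0 + GAMMA / 100.0
    total = sum(SMP[i])
    SMP[i] = [max(p / total, EPSILON) for p in SMP[i]]   # renormalize, floor at EPSILON
    total = sum(SMP[i])
    SMP[i] = [p / total for p in SMP[i]]                 # renormalize once more
    return True                      # SMP was changed
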
Although our PLAs do not allow for composing arbitrary probabilistic learning algorithms, they do allow for a wide variety of them (in [Schmidhuber, 1994] a universal programming language is used to create the potential for even more powerful learning algorithms). Later we will integrate Q-learning [Watkins and Dayan, 1992; Lin, 1993] into SMP as a primitive instruction.

Evaluation instruction. By executing the instruction ``PrepareEvaluation'' the learner can influence the timing of the next checkpoint, at which the learner's lifelong performance so far will be evaluated:

B2:
PrepareEvaluation(a1) -- Temporarily disable PLAs (e.g., IncProb) by preventing them from executing further SMP-modifications, until a1+1 additional, positive reward signals have been received (this will define a checkpoint and trigger an SSA call -- see basic cycle below).
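
A minimal sketch of the corresponding bookkeeping in Python, assuming two global variables as in the initialization below (the names plas_disabled and prepare_evaluation are ours); note that step 3 of the basic cycle only arms the reward counter if SMP was actually modified:

plas_disabled = False    # are PLAs such as IncProb currently blocked?
NumR = 0                 # remaining positive reward signals until the next checkpoint

def prepare_evaluation(a1, SMPchanged):
    """PrepareEvaluation(a1): block further SMP-modifications until a1+1 more
    positive reward signals have been received (then a checkpoint triggers SSA)."""
    global plas_disabled, NumR
    plas_disabled = True
    if SMPchanged:           # see step 3 of the basic cycle
        NumR = a1 + 1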

Initialization. At time 0 (system birth) we initialize SMP with maximum entropy distributions (all SMPij equal), and set IP=0. We introduce an initially empty stack $\cal S$ that allows for variable-sized stack entries, and the conventional push and pop operations. The integer variable NumR (initially zero) is used to count down the remaining non-zero reward signals since the last execution of ``PrepareEvaluation''. The Boolean variable SMPchanged is initially false.
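
In the Python sketches above, this initial state corresponds to the following assignments (the representation of the stack as a Python list is our assumption; the variable names follow the text):

m, n_ops = 50, 4                                   # toy values as in the first sketch
SMP = [[1.0 / n_ops] * n_ops for _ in range(m)]    # maximum entropy: all SMP_ij equal
IP = 0                                             # IP points to the first program cell
stack_S = []          # the stack S: append() is push, pop() is pop; entries may vary in size
NumR = 0              # countdown of reward signals since the last PrepareEvaluation
SMPchanged = False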

Basic cycle of operations. Starting at time 0, until time T (system death), the system repeats the following basic instruction cycle over and over again in real time:

1.
Randomly generate an integer $j \in I$ according to matrix column $SMP_{IP}$ (the probability distribution of the program cell pointed to by IP). Set program cell content $d_{IP} := j$. Translate j into the corresponding current instruction Bj. Look up the number Nj of cells required to store Bj's parameters. IF $IP > m - N_j - 2$, THEN set IP to 0 and go to step 1. ELSE generate instruction arguments for the Nj cells immediately following IP, according to their probability distributions $SMP_{IP+1}, \ldots, SMP_{IP+N_j}$, and set IP $\leftarrow$ IP + Nj + 1.

2.
IF Bj is a PLA, and not currently disabled by a previous ``PrepareEvaluation'' instruction, THEN push copies of those SMPi to be modified by Bj onto $\cal S$.

3.
Execute instruction Bj. This may change (1) the environment, (2) IP, (3) SMP itself. In case (3), set SMPchanged = TRUE. IF Bj is ``PrepareEvaluation'' and SMPchanged = TRUE, THEN set NumR equal to Bj's actual parameter plus one.

4.
IF NumR>0 and non-zero reward occurred during the current cycle, THEN decrement NumR. IF NumR=0, THEN call SSA to achieve the success-story criterion by backtracking as follows (a code sketch of this SSA call is given after the cycle description):

SSA.0.
Set variable t equal to current time (t is a checkpoint).

SSA.1.
IF there is no ``tag'' stored anywhere in $\cal S$ (tags are pairs of checkpoints in V (see section 2) and rewards, pushed in earlier SSA calls), THEN push the tag (t, R(t)) onto $\cal S$, set SMPchanged = FALSE, and go to step 1 of the basic cycle (this ends the SSA call).

SSA.2.
Denote the topmost tag in $\cal S$ by (t', R(t')). Denote the one below by (t'', R(t'')) (if there isn't any tag below, set variable t''=0 -- recall R(t'')= R(0) = 0).

SSA.3.
IF $\frac{R(t) - R(t')}{t - t'} > \frac{R(t) - R(t'')}{t - t''}$, THEN push (t, R(t)) as a new tag, set SMPchanged = FALSE, and go to step 1 of the basic cycle. This ends the SSA call.

ELSE pop off all stack entries above the one for tag (t', R(t')) (these entries will be former SMP-components saved during an earlier execution of step 2), and use them to restore SMP as it was before time t'. Then also pop off tag (t', R(t')). Go to SSA.1.
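
The following Python sketch condenses the SSA call of step 4 (SSA.0-SSA.3) under assumptions of our own: the stack holds tuples, either ('tag', t, R_t) for tags or ('smp', i, column) for the SMP-columns saved in step 2, R_t stands for the cumulative reward R(t) at the current checkpoint t, and checkpoints are assumed to occur at distinct times. The entry layout and function name are illustrative; the backtracking logic follows the text.

def ssa_call(stack, SMP, t, R_t):
    """Backtrack until every surviving tag marks a strictly higher reward/time ratio
    than the tag before it (success-story criterion), then push the new tag (t, R(t)).
    Returns the new value of SMPchanged, which is always FALSE after an SSA call."""
    while True:
        tag_positions = [k for k, entry in enumerate(stack) if entry[0] == 'tag']
        # SSA.1: no tag on the stack yet -> push (t, R(t)) and end the SSA call
        if not tag_positions:
            stack.append(('tag', t, R_t))
            return False
        # SSA.2: topmost tag (t', R(t')) and the one below it (t'', R(t'')), else (0, 0)
        top = tag_positions[-1]
        _, t1, R1 = stack[top]
        if len(tag_positions) > 1:
            _, t2, R2 = stack[tag_positions[-2]]
        else:
            t2, R2 = 0, 0.0
        # SSA.3: reward per time since t' must exceed reward per time since t''
        if (R_t - R1) / (t - t1) > (R_t - R2) / (t - t2):
            stack.append(('tag', t, R_t))
            return False
        # ELSE: restore the SMP-columns saved since t' (pushed in step 2), pop tag (t', R(t'))
        while len(stack) > top + 1:
            _, i, old_column = stack.pop()
            SMP[i] = list(old_column)
        stack.pop()          # remove tag (t', R(t')); the loop corresponds to "go to SSA.1"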

Each of the steps above consumes a certain amount of system life-time. Note a few differences from ``Genetic Programming'': (1) no program populations, (2) self-modification ability, (3) stochastic policy, (4) instructions generated and executed online, (5) SSA calls evaluating the entire life so far.

Lifelong reward acceleration. After each call of SSA the history of surviving modifications is guaranteed to be a life-time success-story (in the worst case an empty one). We don't have to assume fully observable environments -- compare SSC and generalization assumption in section 2.

Speed? Due to the generality of the approach, no reasonable statements can be made about improvement speed, which depends strongly on the nature of E and $\cal A$. This lack of quantitative convergence results is shared by almost all other, less general RL schemes.



