
AN IMPLEMENTATION


Overview. This section describes a concrete implementation of the principles above. The SMP is a set of variable probability distributions over a set of assembler-like instructions, including SMP-modifying instructions. The learner executes a lifelong instruction sequence generated according to its SMP. We describe the details of the corresponding SSA implementation.

Architecture. There are m addressable program cells with addresses ranging from 0 to m-1. The variable, integer-valued contents of the program cell with address i are denoted di. An internal variable Instruction Pointer (IP) with range $\{0, \ldots, m-1 \}$ always points to one of the program cells (initially to the first one). IP's current position represents the internal state S (and can be used to disambiguate inputs). Instructions and their parameters are encoded by a fixed set I of $n_{ops}$ integer values $\{0, \ldots, n_{ops}-1 \}$. For each value j in I, there is an instruction Bj with Nj integer-valued parameters.

Stochastic, self-modifying policy. For each program cell i there is a variable probability distribution SMPi over I. For every possible $j \in I$ ( $0\leq j \leq n_{ops}-1$), SMPij specifies for cell i the conditional probability that, when pointed to by IP, its contents will be set to j. The set of all current SMPij-values defines a probability matrix SMP (the learner's current stochastic policy) with columns SMPi ( $0\leq i \leq m-1$). If IP = i, then the contents of i, namely di, will be interpreted as instruction $B_{d_i}$, and the contents of the cells immediately following i will be interpreted as $B_{d_i}$'s arguments, selected according to the corresponding SMP-values. See Basic Cycle below.
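
For concreteness, here is a minimal sketch of these data structures in Python (not part of the system described here). The toy values of m and n_ops, the parameter table N, and the helper name sample_cell are our own illustrative assumptions; the uniform SMP columns, the cell contents di, and IP follow the text.

import random

m = 50                           # number of addressable program cells (toy value)
n_ops = 4                        # size of the instruction set I = {0, ..., n_ops-1} (toy value)
N = {0: 0, 1: 3, 2: 1, 3: 0}     # N_j: number of parameters of instruction B_j (toy table)

# SMP: one variable probability distribution per program cell, initially maximum entropy.
SMP = [[1.0 / n_ops] * n_ops for _ in range(m)]
d = [0] * m                      # d_i: integer contents of program cell i
IP = 0                           # instruction pointer; its position is the internal state S

def sample_cell(i):
    """Draw an integer j in I according to column SMP_i and write it into cell i (d_i := j)."""
    j = random.choices(range(n_ops), weights=SMP[i])[0]
    d[i] = j
    return j

Step 1 of the basic cycle below then amounts to calling sample_cell for cell IP and for the Nj cells immediately following it.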

Normal instructions. Normal instructions are those that do not change SMP. An example is B0 (later we will introduce many additional, problem-specific instructions for interaction with an environment):

B0:
JumpHome -- set IP = 0 (jump back to 1st program cell).

Primitive learning algorithms (PLAs). By definition, a PLA is an instruction that modifies SMP. PLAs such as IncProb below and other instructions may be composed to form complex (probabilistic) learning algorithms. In what follows, the symbols a1,a2,a3 stand for instruction parameters in $\{ 0, 1, \ldots, n_{ops}-1 \}$.

B1:
IncProb(a1, a2, a3) -- Set i:=(a1 * $n_{ops}$ + a2)/k (integer division); set j:=a3. If $0\leq i \leq m-1$, increase SMPij by $\gamma$ percent, and renormalize SMPi (but prevent SMP-components from falling below a minimal value $\epsilon$, to avoid near-determinism). We will use $\gamma = 15$, $\epsilon = 0.001$, k=3.
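
For illustration, a hedged Python sketch of IncProb follows, assuming the list-of-columns representation of SMP from the sketch above; the function name and the Boolean return value are our own conventions, and the double renormalization is only a simple approximation of the $\epsilon$ floor described in the text.

GAMMA, EPSILON, K = 15, 0.001, 3     # gamma (percent), epsilon, k from the text above

def inc_prob(SMP, a1, a2, a3, n_ops):
    """IncProb(a1, a2, a3): raise SMP[i][j] by GAMMA percent, then renormalize column i
    while keeping all components at or above EPSILON (avoids near-determinism)."""
    m = len(SMP)
    i = (a1 * n_ops + a2) // K       # integer division, as in the text
    j = a3
    if not 0 <= i <= m - 1:
        return False                 # address out of range: SMP left unchanged
    SMP[i][j] *= 1.0 + GAMMA / 100.0
    total = sum(SMP[i])
    SMP[i] = [max(p / total, EPSILON) for p in SMP[i]]   # renormalize, floor at EPSILON
    total = sum(SMP[i])
    SMP[i] = [p / total for p in SMP[i]]                 # renormalize once more
    return True                      # SMP was changed
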
Although our PLAs do not allow for composing arbitrary probabilistic learning algorithms, they do allow for a wide variety of them (in [Schmidhuber, 1994] a universal programming language is used to create the potential for even more powerful learning algorithms). Later we will integrate Q-learning [Watkins and Dayan, 1992; Lin, 1993] into SMP as a primitive instruction.

Evaluation instruction. By executing the instruction ``PrepareEvaluation'' the learner can influence the timing of the next checkpoint, at which the learner's lifelong performance so far will be evaluated:

B2:
PrepareEvaluation(a1) -- Temporarily disable PLAs (e.g., IncProb) by preventing them from executing further SMP-modifications, until a1+1 additional, positive reward signals have been received (this will define a checkpoint and trigger an SSA call -- see basic cycle below).
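
A minimal sketch of the corresponding bookkeeping in Python, assuming two global variables as in the initialization below (the names plas_disabled and prepare_evaluation are ours); note that step 3 of the basic cycle only arms the reward counter if SMP was actually modified:

plas_disabled = False    # are PLAs such as IncProb currently blocked?
NumR = 0                 # remaining positive reward signals until the next checkpoint

def prepare_evaluation(a1, SMPchanged):
    """PrepareEvaluation(a1): block further SMP-modifications until a1+1 more
    positive reward signals have been received (then a checkpoint triggers SSA)."""
    global plas_disabled, NumR
    plas_disabled = True
    if SMPchanged:           # see step 3 of the basic cycle
        NumR = a1 + 1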

Initialization. At time 0 (system birth) we initialize SMP with maximum entropy distributions (all SMPij equal), and set IP=0. We introduce an initially empty stack $\cal S$ that allows for variable-sized stack entries, and the conventional push and pop operations. The integer variable NumR (initially zero) is used to count down the remaining non-zero reward signals since the last execution of ``PrepareEvaluation''. The Boolean variable SMPchanged is initially false.
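
In the Python sketches above, this initial state corresponds to the following assignments (the representation of the stack as a Python list is our assumption; the variable names follow the text):

m, n_ops = 50, 4                                   # toy values as in the first sketch
SMP = [[1.0 / n_ops] * n_ops for _ in range(m)]    # maximum entropy: all SMP_ij equal
IP = 0                                             # IP points to the first program cell
stack_S = []          # the stack S: append() is push, pop() is pop; entries may vary in size
NumR = 0              # countdown of reward signals since the last PrepareEvaluation
SMPchanged = False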

Basic cycle of operations. Starting at time 0, until time T (system death), the system repeats the following basic instruction cycle over and over again in real time:

1.
Randomly generate an integer $j \in I$ according to matrix column $SMP_{IP}$ (the probability distribution of the program cell pointed to by IP). Set program cell content $d_{IP} := j$. Translate j into the corresponding current instruction Bj. Look up the number Nj of cells required to store Bj's parameters. IF $IP > m - N_j - 2$, THEN set IP to 0 and go to step 1. ELSE generate instruction arguments for the Nj cells immediately following IP, according to their probability distributions $SMP_{IP+1}, \ldots, SMP_{IP+N_j}$, and set IP $\leftarrow$ IP + Nj + 1.

2.
IF Bj is a PLA, and not currently disabled by a previous ``PrepareEvaluation'' instruction, THEN push copies of those SMPi to be modified by Bj onto $\cal S$.

3.
Execute instruction Bj. This may change (1) the environment, (2) IP, (3) SMP itself. In case (3), set SMPchanged = TRUE. IF Bj is ``PrepareEvaluation'' and SMPchanged = TRUE, THEN set NumR equal to Bj's actual parameter plus one.

4.
IF NumR>0 and non-zero reward occurred during the current cycle, THEN decrement NumR. IF NumR=0, THEN call SSA to achieve the success-story criterion by backtracking as follows (a code sketch of this SSA call is given after the cycle description):

SSA.0.
Set variable t equal to current time (t is a checkpoint).

SSA.1.
IF there is no ``tag'' stored anywhere in $\cal S$ (tags are pairs of checkpoints in V (see section 2) and rewards, pushed in earlier SSA calls), THEN push the tag (t, R(t)) onto $\cal S$, set SMPchanged = FALSE, and go to step 1 of the basic cycle (this ends the SSA call).

SSA.2.
Denote the topmost tag in $\cal S$ by (t', R(t')). Denote the one below by (t'', R(t'')) (if there isn't any tag below, set variable t''=0 -- recall R(t'')= R(0) = 0).

SSA.3.
IF $\frac{R(t) - R(t')}{t - t'} > \frac{R(t) - R(t'')}{t - t''}$, THEN push (t, R(t)) as a new tag, set SMPchanged = FALSE, and go to step 1 of the basic cycle. This ends the SSA call.

ELSE pop off all stack entries above the one for tag (t', R(t')) (these entries will be former SMP-components saved during an earlier execution of step 2), and use them to restore SMP as it was before time t'. Then also pop off tag (t', R(t')). Go to SSA.1.
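
The following Python sketch condenses the SSA call of step 4 (SSA.0-SSA.3) under assumptions of our own: the stack holds tuples, either ('tag', t, R_t) for tags or ('smp', i, column) for the SMP-columns saved in step 2, R_t stands for the cumulative reward R(t) at the current checkpoint t, and checkpoints are assumed to occur at distinct times. The entry layout and function name are illustrative; the backtracking logic follows the text.

def ssa_call(stack, SMP, t, R_t):
    """Backtrack until every surviving tag marks a strictly higher reward/time ratio
    than the tag before it (success-story criterion), then push the new tag (t, R(t)).
    Returns the new value of SMPchanged, which is always FALSE after an SSA call."""
    while True:
        tag_positions = [k for k, entry in enumerate(stack) if entry[0] == 'tag']
        # SSA.1: no tag on the stack yet -> push (t, R(t)) and end the SSA call
        if not tag_positions:
            stack.append(('tag', t, R_t))
            return False
        # SSA.2: topmost tag (t', R(t')) and the one below it (t'', R(t'')), else (0, 0)
        top = tag_positions[-1]
        _, t1, R1 = stack[top]
        if len(tag_positions) > 1:
            _, t2, R2 = stack[tag_positions[-2]]
        else:
            t2, R2 = 0, 0.0
        # SSA.3: reward per time since t' must exceed reward per time since t''
        if (R_t - R1) / (t - t1) > (R_t - R2) / (t - t2):
            stack.append(('tag', t, R_t))
            return False
        # ELSE: restore the SMP-columns saved since t' (pushed in step 2), pop tag (t', R(t'))
        while len(stack) > top + 1:
            _, i, old_column = stack.pop()
            SMP[i] = list(old_column)
        stack.pop()          # remove tag (t', R(t')); the loop corresponds to "go to SSA.1"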

Each of the steps above consumes a certain amount of system life-time. Note a few differences from ``Genetic Programming'': (1) no program populations, (2) self-modification ability, (3) stochastic policy, (4) instructions generated and executed online, (5) SSA calls evaluating the entire life so far.

Lifelong reward acceleration. After each call of SSA the history of surviving modifications is guaranteed to be a life-time success-story (in the worst case an empty one). We don't have to assume fully observable environments -- compare SSC and generalization assumption in section 2.

Speed? Due to the generality of the approach, no reasonable statements can be made about improvement speed, which depends strongly on the nature of E and $\cal A$. This lack of quantitative convergence results is shared by almost all other, less general RL schemes.



