
Policy and Program Execution

Storage / Instructions. The learner uses an assembler-like programming language similar to, but not quite as general as, the one in [#!Schmidhuber:95kol!#]. It has $n$ addressable work cells with addresses ranging from 0 to $n-1$. The variable, real-valued contents of the work cell with address $k$ are denoted $c_{k}$. Processes in the external environment occasionally write inputs into certain work cells. There are also $m$ addressable program cells with addresses ranging from 0 to $m-1$. The variable, integer-valued contents of the program cell with address $i$ are denoted $d_{i}$. An internal variable Instruction Pointer (IP) with range $\{0, \ldots, m-1 \}$ always points to one of the program cells (initially to the first one). There is also a fixed set $I$ of $n_{ops}$ integer values $\{0, \ldots, n_{ops}-1 \}$, which represent either instructions or arguments, depending on the position of IP. IP and work cells together represent the system's internal state $\cal I$ (see section 2). For each value $j$ in $I$, there is an assembler-like instruction $b_j$ with $n_j$ integer-valued parameters. In the following incomplete list of instructions ($b_0, \ldots, b_4$) to be used in experiment 3, the symbols $w_1,w_2,w_3$ stand for parameters that may take on integer values between $0$ and $n-1$ (later we will encounter additional instructions); a short code sketch of the storage and these primitives follows the list:

$b_0$:
Add($w_1,w_2,w_3$) : $c_{w_3}\leftarrow c_{w_1} + c_{w_2}$ (add the contents of work cell $w_1$ and work cell $w_2$, write the result into work cell $w_3$ ).

$b_1$:
Sub($w_1,w_2,w_3$) : $c_{w_3}\leftarrow c_{w_1} - c_{w_2}$.

$b_2$:
Mul($w_1,w_2,w_3$) : $c_{w_3}\leftarrow c_{w_1} * c_{w_2}$.

$b_3$:
Mov($w_1, w_2$) : $c_{w_2}\leftarrow c_{w_1}$.

$b_4$:
JumpHome: IP$\leftarrow 0$ (jump back to 1st program cell).
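
To make the storage layout concrete, here is a minimal Python sketch of the work cells, program cells, IP, and the primitives $b_0, \ldots, b_4$. The class and variable names (Machine, c, d, ip) and the default sizes are ours, chosen for illustration only, and are not part of the paper's formalism.

# Illustrative sketch (not from the paper) of the storage and primitives b_0..b_4.
class Machine:
    def __init__(self, n=8, m=50):
        self.c = [0.0] * n      # work cells c_0..c_{n-1}, real-valued
        self.d = [0] * m        # program cells d_0..d_{m-1}, integer-valued
        self.ip = 0             # Instruction Pointer, initially the first program cell

    # b_0..b_3 operate on work cells; w1, w2, w3 are addresses in {0, ..., n-1}
    def add(self, w1, w2, w3): self.c[w3] = self.c[w1] + self.c[w2]   # b_0
    def sub(self, w1, w2, w3): self.c[w3] = self.c[w1] - self.c[w2]   # b_1
    def mul(self, w1, w2, w3): self.c[w3] = self.c[w1] * self.c[w2]   # b_2
    def mov(self, w1, w2):     self.c[w2] = self.c[w1]                # b_3
    def jump_home(self):       self.ip = 0                            # b_4

# e.g. the program fragment 1 4 2 6 of Figure 4 corresponds to Machine().sub(4, 2, 6)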

Instruction probabilities / Current policy. For each program cell $i$ there is a variable probability distribution $P_i$ on $I$. For each $j \in I$ ($0\leq j \leq n_{ops}-1$), $P_{ij}$ specifies for cell $i$ the conditional probability that, when pointed to by IP, its contents will be set to $j$. The set of all current $P_{ij}$-values defines a probability matrix $P$ with columns $P_i$ ($0\leq i \leq m-1$). $P$ is called the learner's current policy. At the beginning of the learner's life, all $P_{ij}$ are equal (maximum entropy initialization). If IP $= i$, the contents of $i$, namely $d_i$, will be interpreted as instruction $b_{d_i}$ (such as Add or Mul), and the contents of the cells that immediately follow $i$ will be interpreted as $b_{d_i}$'s arguments, to be selected according to the corresponding $P$-values. See Figure 4.
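
As an illustration of the maximum entropy initialization and of how program cell contents are drawn, the policy can be represented as a list of $m$ distributions over $I$. This is a hedged sketch; the sizes, variable names, and the use of Python's random.choices are our own choices, not the paper's.

import random

n_ops, m = 7, 50    # illustrative sizes: |I| = n_ops possible values, m program cells

# Maximum entropy initialization: every P_ij = 1/n_ops.
P = [[1.0 / n_ops] * n_ops for _ in range(m)]    # P[i] plays the role of column P_i

def sample(P_i):
    """Draw an integer j in I according to the distribution P_i of one program cell."""
    return random.choices(range(len(P_i)), weights=P_i, k=1)[0]

d_0 = sample(P[0])   # e.g. the contents of the cell IP points to at system birth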

Figure: Snapshot of parts of policy and storage. IP currently points to program cell 28. The integer sequence 1 4 2 6 (generated according to the policy's current probability distributions) will be interpreted as Sub(4, 2, 6) -- subtract the contents of work cell 2 from the contents of work cell 4 and put the result into work cell 6.
[Figure 4: image NewCells.ps not reproduced here.]

Self-modifications. To obtain a learner that can explicitly modify its own policy (by running its own learning strategies), we introduce a special self-modification instruction IncProb not yet mentioned above:

$b_5$:
IncProb($w_1,w_2,w_3$) : Increase $P_{ij}$ by $\gamma$ percent, where $i=w_1*n_{ops}+w_2$ and $j=w_3$ (this construction allows for addressing a broad range of program cells), and renormalize $P_i$ (but prevent $P$-values from falling below a minimal value $\epsilon$, to avoid near-determinism). Parameters $w_1,w_2,w_3$ may take on integer values between $0$ and $n_{ops}-1$. In the experiments, we will use $\gamma = 15$, $\epsilon = 0.001$.

In conjunction with other primitives, IncProb may be used in instruction sequences that compute directed policy modifications. Calls of IncProb represent the only way of modifying the policy.
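
The following sketch shows one plausible reading of IncProb with $\gamma = 15$ and $\epsilon = 0.001$. The exact renormalization scheme (normalize, clamp to $\epsilon$, normalize again) is our assumption, since the text only states that $P_i$ is renormalized and kept above $\epsilon$; the sketch builds on the policy matrix P from the previous sketch.

GAMMA, EPSILON = 15.0, 0.001    # gamma (percent) and epsilon used in the experiments

def inc_prob(P, w1, w2, w3, n_ops):
    """IncProb(w1, w2, w3): raise P_ij by GAMMA percent and renormalize column P_i,
    keeping all entries at or above EPSILON (one plausible reading of the text)."""
    i, j = w1 * n_ops + w2, w3                       # addressed cell and instruction
    P[i][j] *= 1.0 + GAMMA / 100.0                   # increase by GAMMA percent
    total = sum(P[i])
    P[i] = [max(p / total, EPSILON) for p in P[i]]   # renormalize with floor EPSILON
    s = sum(P[i])
    P[i] = [p / s for p in P[i]]                     # make the column sum to 1 again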

Self-delimiting self-modification sequences (SMSs). SMSs are subsequences of the lifelong action sequence. The first IncProb after the learner's ``birth'' or after each SSA call (see section 2) begins an SMS. The SMS ends with the execution of another primitive not mentioned so far:

$b_6$:
EndSelfMod($w_1$) : Temporarily disable IncProb by preventing future IncProb instructions from causing any probability modifications until $w_1$ ($1\leq w_1 \leq n_{ops}-1$) additional non-zero reward signals have been received -- this will satisfy the EVALUATION CRITERION in the basic cycle (section 2).

Some of the (initially highly random) action subsequences executed during system life will indeed be SMSs. Depending on the nature of the other instructions, SMSs can compute almost arbitrary sequences of modifications of $P_{ij}$ values. This may result in almost arbitrary modifications of context-dependent probabilities of future action subsequences, including future SMSs. Policy changes can be generated only by SMSs. SMSs form the basis for ``metalearning'': SMSs are generated according to the policy, and may change the policy. Hence the policy can essentially change itself, and also the way it changes itself, and so on.

SMSs can influence the timing of backtracking processes, because they can influence the times at which the EVALUATION CRITERION will be met. Thus SMSs can temporarily protect the learner from performance evaluations and policy restorations.

Plugging SMSs into SSA. We replace step 1 in the basic cycle (see section 2) by the following procedure (a code sketch follows the listed steps):

1.
REPEAT the following UNTIL the EVALUATION CRITERION is satisfied or the Boolean variable MODIFICATION-CRITERION (initially FALSE) is TRUE:

1.1.
Randomly generate an integer $j \in I$ according to matrix column $P_{IP}$ (the distribution of the program cell currently pointed to by IP; at system birth, IP $= 0$). Set program cell contents $d_{IP} := j$. Translate $j$ into the corresponding current instruction $b_j$. Look up the number $n_j$ of cells required to store $b_j$'s parameters. If IP $> m-n_j-2$, reset IP to 0 and go to step 1. Otherwise generate instruction arguments for the $n_j$ cells immediately following IP according to their probability distributions $P_{IP+1}, \ldots, P_{IP+n_j}$, and set IP to $IP+n_j+1$.

1.2.
IF $b_j$ is a learning instruction and not currently disabled by a previous EndSelfMod instruction, THEN set MODIFICATION-CRITERION $= TRUE$, exit the current REPEAT loop, and go to step 2 of the basic cycle.

1.3.
Execute $b_j$. IF $b_j$ is EndSelfMod and the topmost entry in the stack $\cal S$ is not a ``tag'', THEN set the integer variable $n_{NZR}$ equal to the first parameter of $b_j$ plus one (this will influence the time at which EVALUATION CRITERION will be reached).

1.4.
IF there is a new environmental input, THEN let it modify $\cal I$.

1.5.
IF $n_{NZR}>0$ and non-zero reward occurred during the current cycle, THEN decrement $n_{NZR}$. IF $n_{NZR}$ is zero, THEN set EVALUATION CRITERION $= TRUE$.
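
Put together, the replaced step 1 can be read as the following schematic loop. It builds on the Machine and sample sketches above; the callables execute and get_reward, the instruction codes, and the parameter-count table N_PARAMS are our own stand-ins, and the test of the topmost ``tag'' on the stack $\cal S$ (step 1.3) is omitted.

# Schematic of the modified step 1; builds on Machine, P and sample from above.
# N_PARAMS[j] is n_j, the number of parameter cells of instruction b_j.
N_PARAMS = {0: 3, 1: 3, 2: 3, 3: 2, 4: 0, 5: 3, 6: 1}   # Add .. EndSelfMod

def step_1(machine, P, self_mod_enabled, n_nzr, execute, get_reward):
    """REPEAT until the EVALUATION CRITERION or the MODIFICATION-CRITERION holds."""
    m = len(machine.d)
    while True:
        # 1.1: sample d_IP and the following n_j argument cells from the policy
        j = sample(P[machine.ip])
        machine.d[machine.ip] = j
        n_j = N_PARAMS[j]
        if machine.ip > m - n_j - 2:
            machine.ip = 0
            continue                                  # "go to step 1"
        args = [sample(P[machine.ip + k]) for k in range(1, n_j + 1)]
        for k, a in enumerate(args, start=1):
            machine.d[machine.ip + k] = a
        machine.ip += n_j + 1
        # 1.2: an enabled learning instruction sets MODIFICATION-CRITERION
        if j == 5 and self_mod_enabled:               # 5 = IncProb
            return 'MODIFICATION-CRITERION', j, args, n_nzr
        # 1.3: execute b_j; EndSelfMod arms the non-zero-reward counter
        #      (the test of the topmost "tag" on the stack S is omitted here)
        execute(machine, j, args)
        if j == 6:                                    # 6 = EndSelfMod
            n_nzr = args[0] + 1
        # 1.4: new environmental inputs would be written into work cells here
        # 1.5: count down non-zero rewards; reaching zero is the EVALUATION CRITERION
        if n_nzr > 0 and get_reward() != 0:
            n_nzr -= 1
            if n_nzr == 0:
                return 'EVALUATION-CRITERION', j, args, n_nzr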

We also change step 3 in the SSA cycle as follows (a corresponding sketch follows the step below):

3.
IF MODIFICATION-CRITERION $= TRUE$, THEN push copies of those $Pol_i$ to be modified by $b_j$ (from step 1.2) onto $\cal S$, and execute $b_j$.
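
Correspondingly, the changed step 3 might look as follows. This is a sketch building on the inc_prob function above; how SSA later pops the stack $\cal S$ to restore backed-up columns is not shown, and the representation of stack entries is our own choice.

def step_3(P, stack, j, args, n_ops, modification_criterion):
    """Changed step 3: back up the policy column that IncProb(w1, w2, w3) is about
    to modify by pushing a copy onto the stack S, then execute the instruction."""
    if modification_criterion:
        w1, w2, w3 = args
        i = w1 * n_ops + w2               # the column P_i addressed by IncProb
        stack.append((i, list(P[i])))     # copy of P_i, so SSA can later restore it
        inc_prob(P, w1, w2, w3, n_ops)    # only now is b_j actually executed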

