next up previous


Task. The external environment consists of an array of 30 variables $V_0, V_1, \ldots, V_{29}$. The $i$-th variable is denoted by $V_i$. Its current contents are denoted by $C_i \in [ -Maxint, Maxint]$. Time is measured in discrete time steps. At time step 0, all variables are initialized with zeros. Every 1000 time steps, the number of variables whose values equal their addresses is counted and written into a special input cell. This number is the current payoff. Then, all variables are re-initialized with zeros. The goal is to maximize cumulative payoff.

Details. In addition to the 17 general primitives from Table 1 (not counting input/output primitives), there are two problem-specific primitives. Each has two integer arguments: $Write(a1, a2)$ writes the contents of the storage cell indirectly addressed by the first argument into the variable indirectly addressed by the second argument. $Read(a1, a2)$ writes the contents of the variable indirectly addressed by the second argument into the work cell indirectly addressed by the second argument. See Table 2. Between two payoff events, each variable may be written only once (additional write operations have no effect). Write and read operations outside the valid ranges halt the current run.

Table 2: Semantics of problem specific primitives and their parameters. Again, double-indexed indirect addressing is employed. See text for rules for syntactic correctness. Compare with Table 1.
Primitive Semantics
Write(a1, a2) $C_{c_{a2}} \leftarrow c_{c_{a1}}$
Read(a1, a2) $c_{c_{a1}} \leftarrow C_{c_{a2}}$

Since $n_{ops} = 17 + 2 = 19$, all initial probabilities of all possible contents of all program cells are equal to $\frac{1}{19}$. Parameters for storage size etc. are: $Min = -1000$, $Max = 100$, $ProgramStart = bottom(\frac{n_{ops}}{2})$, $MinP = 0.001$, $StackSize = 10,000$, $Maxint = 100,000$. To inform the system about what is going on, the following values are written into special input cells whenever they change: IP, sp, and the remainder of $t / Maxint$ (integer division, where $t$ denotes the current time).

Measuring time. By definition, each computation that requires the consideration of all $n_{ops}$ probabilities of some program cell (such as selecting an instruction, selecting a parameter, pushing or popping probability distributions, etc.) costs one time step. Other computations do not cost anything. This ensures that measured time is of the order of total cpu-time. The somewhat unelegant way of measuring time was introduced because measuring cpu-time directly on our multi-user machine turned out to be somewhat unreliable.

How difficult is this task? For a number of reasons, the task is non-trivial -- the system does not appear to have much built-in bias towards the task: (1) Only one of the 19 primitives ($Write$) may affect variable contents at all. But initially, the system does not even have such seemingly trivial knowledge -- there is no built-in idea about which actions may lead to payoff. Therefore, it has to find out on its own. (2) The values referred to by the two arguments of $Write$ have to be identical and within the required ranges to lead to a useful result. (3) There are 30 different variables with 30 different values. Only one of them, namely $V_0$, is correctly re-initialized with its own address after each payoff event. (4) The most significant difficulty, however, is the continually changing policy environment.

Changing policy environment. The fact that the external variables are occasionally reset does not imply that the policy environment does not change. In fact, many actions executed by the system change the contents of storage cells, which are never reset: as mentioned above, storage cells are like an external notebook and have to be viewed as part of the environment -- typically, there are no exactly repeatable trials with identical initial conditions. And the storage cells are not just a negligible nuisance -- they are essential for computing parameters for (e.g., self-modifying) instructions. To achieve significant performance improvement, the environmental conditions have to change due to actions executed by the system itself. This is an aspect that cannot be dealt with by any traditional reinforcement learning system, not even by naive exhaustive search, in a way that is sound.

Comparison. Performance was measured with and without self-modification capabilities. In the latter case, the primitives $IncP$ and $DecP$ had no effect. Both versions were run for $5*10^9$ time steps, corresponding to $5*10^6$ payoff events. Note that the optimal cumulative payoff is $1.5 * 10^8$. This value can be achieved only by a system with ``optimal'' prior bias -- starting at birth, such a system keeps executing optimal actions without having to learn anything.

next up previous
Juergen Schmidhuber 2003-02-19