Now we attack the non-Markovian maze task from section 3.2 with multiple, very simple, non-self-modifying learning agents. Each agent is in fact just a connection (whose current weight represents its current policy) in a fully recurrent neural net. A by-product of this research is a general reinforcement learning algorithm for such nets.

Viewing each connection as a separate agent may seem unconventional: standard AI research typically deals with more complex agents equipped with full-sized knowledge bases, etc. For the purposes of this paper, however, this is irrelevant: SSA does not care about the complexity of the agents.

Again, it should be emphasized that the following preliminary experiment in no way represents a systematic experimental analysis. The only purpose of the current section is to illustrate paragraph (*) (in the theoretical section 1).

Basic set-up. There is a fully recurrent neural net with 5 input, 4 output, and 5 hidden units. The variable activation of the $i$-th unit is denoted $o_i$ (initialized at time 0 with 0.0). $w_{ij} \in [-2.0,2.0]$ denotes the real-valued, randomly initialized weight (a variable) on the connection $(i,j)$ from unit $j$ to unit $i$. There is a lifelong sequence of ``cycles''. A cycle involves the following computations: the activation of the first (second, third, fourth) input unit is set to 0.0 if the field to the north (south, east, west) of the animat is blocked or does not exist, and to 1.0 otherwise. Each noninput unit $i$ updates its activation $o_i$ as follows: $o_i \leftarrow 1.0$ with probability $f(net_i)$, $o_i \leftarrow 0.0$ otherwise, where $f(x) = \frac{1}{1 + e^{-x}}$, and the $net_i$ (initially 0) are sequentially updated according to $net_i \leftarrow net_i + \sum_j w_{ij} o_j$. If $i$ stands for an output unit, then $i$'s current value is defined as $v_i \leftarrow o_i + 0.05$. For each output unit $j$, there is an action $a_j$ (the possible actions are one-step-north, one-step-south, one-step-east, one-step-west). $a_j$ is selected with probability $v_j/\sum_i v_i$. The selected action is executed. Whenever the animat hits the goal, there is a constant reinforcement of 1.0, the animat is transferred back to the start position (but the activations and $net$-values of the recurrent net are not reset), and the activation of the fifth input unit is set to 1.0 (0.0 otherwise).
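The cycle just described can be sketched as follows. This is a minimal illustration, not the original implementation: the function and variable names (`cycle`, `w`, `o`, `net`) are hypothetical, and the net is represented by plain Python lists.

```python
import math
import random

def sigmoid(x):
    # f(x) = 1 / (1 + e^{-x})
    return 1.0 / (1.0 + math.exp(-x))

def cycle(w, o, net, inputs, n_in=5, n_out=4):
    """One cycle of the fully recurrent net (hypothetical names).
    w[i][j] is the weight on the connection from unit j to unit i;
    o holds the activations; net holds the accumulated net inputs.
    Units 0..n_in-1 are inputs; the next n_out units are outputs.
    Returns the index of the selected action (0..n_out-1)."""
    n = len(o)
    # clamp the input activations (1.0 = free field, 0.0 = blocked)
    for i in range(n_in):
        o[i] = inputs[i]
    # each noninput unit accumulates net_i and samples a binary activation
    for i in range(n_in, n):
        net[i] += sum(w[i][j] * o[j] for j in range(n))
        o[i] = 1.0 if random.random() < sigmoid(net[i]) else 0.0
    # output values v_j = o_j + 0.05; select action a_j with prob. v_j / sum_i v_i
    v = [o[i] + 0.05 for i in range(n_in, n_in + n_out)]
    r = random.random() * sum(v)
    for a, val in enumerate(v):
        r -= val
        if r <= 0:
            return a
    return n_out - 1
```

The 0.05 offset ensures every action keeps a nonzero selection probability even when all output activations are 0.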

Weight stacks. The current weight of a connection represents its current policy. For each connection $(i,j)$, there is a stack. Following the SSA principle (section 1), whenever a weight is modified (see below), the following values are pushed onto $(i,j)$'s stack: the current time, the total cumulative reinforcement so far, and the weight before the modification.
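The push operation amounts to the following one-liner; the helper name `push_before_modify` is hypothetical, and each stack entry is just a (time, cumulative reinforcement, old weight) tuple:

```python
def push_before_modify(stack, t, cum_reinf, old_weight):
    """Per the SSA principle, record the current time, the total cumulative
    reinforcement so far, and the weight before the modification on the
    connection's stack."""
    stack.append((t, cum_reinf, old_weight))
```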

Weight restoration. After the goal is hit, or if the goal has not been hit for 10,000 cycles (in which case the animat is transferred back to the start position), right before the next cycle each connection sequentially pops entries from its stack and restores the corresponding previous weight values, until its SSC (see section 1) is satisfied.
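A sketch of the restoration loop, assuming the SSC from section 1 requires that the reinforcement-per-time rate since each surviving modification exceeds the rate since the modification below it on the stack (the rate since time 0 for the bottom entry). The function name `restore_until_ssc` and the time/reinforcement arguments `t`, `R` are hypothetical:

```python
def restore_until_ssc(stack, weight, t, R):
    """Pop entries and restore previous weight values until the
    success-story criterion (SSC) is satisfied for this connection.
    stack entries are (time, cumulative reinforcement, old weight)."""
    while stack:
        t_top, R_top, w_old = stack[-1]
        # reinforcement intake per time step since the topmost modification
        rate_top = (R - R_top) / max(t - t_top, 1)
        # ... versus the rate since the modification below it (or since time 0)
        t_prev, R_prev = (stack[-2][0], stack[-2][1]) if len(stack) > 1 else (0, 0.0)
        rate_prev = (R - R_prev) / max(t - t_prev, 1)
        if rate_top > rate_prev:
            break  # SSC holds: the topmost modification sped up reinforcement intake
        stack.pop()
        weight = w_old  # undo the modification
    return weight
```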

Weight modification. So far, preliminary experiments were conducted only with very simple weight modification processes (corresponding to the PMP$_i$ from section 1). Weight modification is executed after each weight restoration (see above) and works as follows. For each weight $w_{ij}$: with small probability (0.05), replace $w_{ij}$ by a randomly chosen value in $[-2.0, 2.0]$. Of course, many alternative weight modification processes are possible, but this is irrelevant for the purposes of this paper.
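The random modification process, combined with the SSA push, might look as follows. This is a sketch under the stated assumptions: `mutate_weights` is a hypothetical name, weights and stacks are kept in dictionaries keyed by connection $(i,j)$, and `t` and `R` denote the current time and cumulative reinforcement:

```python
import random

def mutate_weights(weights, stacks, t, R, p_mut=0.05, lo=-2.0, hi=2.0):
    """Simple weight modification process (corresponding to a PMP_i):
    with probability p_mut, replace each weight by a uniform random value
    in [lo, hi], first pushing (time, cumulative reinforcement, old weight)
    onto the connection's stack, per the SSA principle."""
    for key in weights:
        if random.random() < p_mut:
            stacks[key].append((t, R, weights[key]))
            weights[key] = random.uniform(lo, hi)
```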

Measuring time. By definition, each computation involving some weight -- such as pushing, popping, computing unit activations, etc. -- costs one time step. Other computations cost nothing. Still, measured time is of the order of the total CPU time.

Where is the changing environment? Although the animat is occasionally reset to its start position, the recurrent net activations and the $net$-values are not: the past may influence the future. Apart from this, each single connection's environment continually changes, simply because all the other connections in its environment keep changing. However, all connections receive the same global reinforcement signal. Hence, for each connection, the only way to speed up its local reinforcement intake is to contribute to speeding up global reinforcement intake. Since no connection can solve the task by itself, this enforces learning to cooperate.

Results without SSA. Again, for purposes of comparison, the system was first run with different weight initializations and with the weight-changing mechanism switched off. Often, the average number of cycles required to reach the goal exceeded 100,000.

Among the different weight initializations, one was picked that led to an average trial length of 2660 cycles (measured over 20,000,000 cycles). This initialization was used for SSA (see below), because the corresponding average trial length is of the same order of magnitude as the one obtained by using Sutton's original set-up.

Results with SSA. With multi-weight SSA switched on, after about 10,000 ``trials'' (more precisely: 11,339,740 cycles -- about $2.2 \times 10^9$ time steps), the average trial length was down to 89 cycles. This corresponds to a speed-up factor of about 30. The shortest trial ever took 14 cycles, which is the theoretical optimum for Sutton's task. In the end, no weight stack had more than 7 entries (each followed by faster reinforcement intake than all the previous ones).

Ongoing work. It is intended to replace the random weight modification process by a process that may be strongly influenced by the current weights themselves. Following [30], the idea is to use activation patterns across special output units to address and modify the network's own weights. Then the recurrent net will be theoretically able to evolve its own weight modification strategies, to replace the random mutations by more directed self-mutations.

Juergen Schmidhuber 2003-02-19