In all cases, the discrete time version of the algorithm described above was employed. 32-bit floating point arithmetic was used for the simulations.

*XOR-problems.*
Any algorithm for
learning sequential tasks
should also allow the learning of static pattern association,
since static learning tasks can
be viewed as sequential tasks where
inputs and desired outputs do not change over time.

In a preliminary experiment we tested whether the NBB is capable of adjusting a network such that it solves a static non-linearily separable task. The classical example for such a task is the XOR-problem.

The network was of the feed-forward type: A layer of three input units was connected to a predefined competitive subset of three hidden units and a predefined competitive subset of two output units. The subset of hidden units also was connected to the subset of output-units.

At the beginning of each cycle all unit activations were reset to , and one of the four binary XOR input patterns was randomly chosen. During the cycle this pattern was presented to the first two input units for a period of 6 time ticks. The activation of the remaining input unit was always set to , in order to provide a modifiable bias for every non-input unit in the network.

The task for the network was to switch on the first output unit if the XOR of the input pattern was , and to switch on the second output unit otherwise. The task was formulated as a reinforcement learning task: At each time tick the environment gave a payoff (playing a role similar to the role of a reinforcement signal) to if unit was an output unit and if it was switched on correctly at time t. In all other cases was set equal to . (Recall that payoff can be considered as a bit of weight substance which has to be distributed in an appropriate way by the NBB algorithm.)

The network was said to correctly classify a single pattern if it switched on the corresponding output unit during the last three time ticks of a cycle, without the weight changing mechanism being employed. The network was said to have solved the problem if it correctly classified the four input patterns. To test whether the network had already solved the problem, after each training cycle the weight changing mechanism became disengaged, and the network's classification capabilities were tested on each of the four input patterns.

During 20 test runs the network needed an average of 619 pattern presentations to find a solution. So each of the four patterns had to be presented for about 155 times, which corresponds to the notion of 155 `epochs'.

Most of the 20 solutions were brittle in the sense that further training did not necessarily stabilize them. For instance, after a solution had been found, another 5 training cycles could lead to worse performance again. So we measured the number of cycles needed to achieve a stable solution.

A solution was considered to be stable if 100 additional pattern presentations did not disturb the performance of the network. The precise testing procedure was as follows: After each random pattern presentation the weight changing mechanism became disengaged. In a second phase a new pattern was randomly selected, and it was tested whether the network could classify it correctly. Then the bucket brigade mechanism was switched on again. This procedure was reiterated until the network produced 100 correct classifications (in the second phase) in a row. During 10 test runs it was found that each pattern had to be presented for about 674 times in order to reach this criterion.

Using two instead of three hidden units, an average of 160 presentations per pattern was required to find a solution for the problem. However, it was not possible to obtain stable solutions, according to the criterion above.

The XOR problem was also tested with a different network architecture: Instead of using straight-through connections from the three input units to the output units, the input units were connected only to two hidden competitive subsets, each containing two units. Both hidden subsets were connected to the output subset. Both hidden subsets also received input from a unit which was always on. In 10 test runs the network failed twice to find a solution within 4000 random pattern presentations. During the remaining 8 test runs an average of 263 presentations per pattern was required to solve the problem. Similarily, for 8 out of 10 test runs an average of 911 presentations per pattern was required to find a stable solution, according to the criterion above.

*Encoding-problems.*
Another task that had to be solved was an `encoding problem'.
Eight 8-dimensional binary patterns, each having unit length,
had to be associated with themselves. A bottleneck of hidden
units made the task non-trivial: 8 input units were connected
to three hidden competitive subsets containing two units each.
These were connected to a competitive subset of 8 output units.
(An additional unit that was always on was connected to all
non-input units.)
Note that since each of the hidden subsets could have only two
different states, an extreme solution was required:
It was necessary to fully exhaust
the representation capacity of the bottleneck.

The learning mechanism employed for this problem illustrates
how the bucket brigade mechanism can be employed in a more
supervised manner: Unlike with the XOR-problems above, the
connections leading to the
output unit which *should* have been activated in response to a
given input pattern received external payoff, even if the output
unit erroneously had not been activated. Besides this
modification, the learning and testing procedures were the
same as with the
XOR-problems.

During 10 test runs, an average of 1364 presentations per pattern was necessary to obtain a solution. However, it was not possible to obtain stable complete solutions.

Widening the bottleneck to 4 hidden subsets allowed the algorithm to find stable solutions. About 2460 presentations per pattern sufficed to satisfy the criterion for stability (10 test runs were conducted).

We also conducted experiments where the input patterns for successive cycles were not chosen randomly but in periodical sequential order. Here we found that the average time to find a solution increased, and that the solutions tended to be more brittle in the sense that it took much longer to achieve stable solutions. This suggests that the random element in the process of pattern selection introduces a stabilizing effect. One might suppose that similar stabilizing effects could be achieved by using stochastic activation rules. However, this has not been tested.

Where can instabilities arise from? The brittleness of the first solutions was attributed to the empirically observed fact that competing units within a competitive subset often had very similar net inputs. This in turn was attributed to a property of the NBB algorithm: Weights of connections leading to units that loose a competition remain the same. Consider a unit that does not participate in a bucket brigade chain causing a correct classification of pattern . This means that 's weights do not change. If the weights of are slightly increased during the presentation of another correctly classified pattern, this modification may also lead to a winning situation for during the next presentation of . This may be the case if the net input of the competitor of who usually won during 's presentation was only slightly larger than 's net input. This may cause an incorrect classification of . The interplay of these effects may lead to instabilities.

*Sequence generation.*
Important *raisons d'être* for the NBB are given by
time-varying inputs or outputs.
The task described next required oscillatory behavior of certain
outputs in response to a stationary input.

Two input units were connected to a competitive subset of three output units, which were fully interconnected. The task was to switch on the first and the second output unit in an alternating manner as long as the first input unit was switched on. The second input unit served to provide stop-signals: Its activation had to be answered by a stationary output of the third output unit.

The learning procedure for this problem demonstrates how `teacher forcing' (applied by Williams and Zipser to a similar problem [Williams and Zipser, 1988] ) can be incorporated into the NBB in a straight-forward manner: Instead of using the actual activations of the output units at time for computing the outputs at time , the desired activations at time were used.

While the network was continually running, the input units were activated randomly: The probability that the first input was switched on at a given time tick was 75 percent, the probability that the second input unit was switched on was 25 percent. Payoff was given whenever the correct output unit was switched on at a given time tick. Within less than 30 time ticks the sytem found stable solutions for this task. However, similar to the quite different algorithm employed by Williams and Zipser, without teacher forcing the task could not be reliably learned.

*Sequence recognition.*
One of the simplest tasks involving non-stationary
environments may be to
recognize different kinds of motion.
We conducted a simple experiment with time-varying perceptions.
A one dimensional `retina' consisting of 5 input units (plus one
additional unit which was always turned on) was fully
connected to a competitive subset of two ouput units. This subset of
output units was completely connected to itself, in order
to allow
recurrency. The task for the network was
to switch on the first
output unit after an illumination point has wandered across the
retina from the left to the right (within 5 time ticks), and
to switch on the first
output unit after the illumination point has wandered
from the right to the left.

During one cycle one of the two sequences (which had been chosen randomly) was presented to the network twice. Payoff was given as described for the stationary XOR experiments. In 1 out of 10 test runs the network did not find a stable solution within 3000 cycles (according to a criterion analogue to the one used for the stationary experiment). In the remaining 9 test runs an average of 223 cycles per sequence was needed to achieve a stable solution.

The experiments described above share a rather simple nature. It remains to be seen how well the NBB can deal with more difficult problems, like the learning of motor control for autonomous agents in a changing environment.

Back to Reinforcement Learning Economy page

Back to Recurrent Neural Networks page