
The Neural Bucket Brigade (NBB)

In this section we propose a combination of principles of the bucket brigade algorithm with principles of neural networks. Competition can be introduced naturally into neural networks by a mechanism of lateral inhibition. What we still need is a mechanism analogous to the process of bidding and paying in classifier systems. This mechanism must establish recursive dependencies `through time'. We introduce a local method for shifting `weight substance' (initially provided by the environment) from weights that are allowed to transport activation information at a certain time to those weights that were `setting the stage' one time tick earlier.

The basic network structure is an arbitrary (possibly cyclic) directed graph, where the nodes are familiar processing units. Some units are used for input purposes, others serve as outputs and may be coupled with effectors that may change the environment, which in turn may change the current input. Thus we have external and internal feedback.

The set of non-input units is partitioned into predefined `competitive subsets'. All non-input units synchronously try to get activated by summing their weighted inputs at each time tick. All members of a predefined competitive subset laterally inhibit each other (by some `winner-take-all' mechanism), thus competing for being active. Unlike with most other approaches to goal-directed learning, the basic building blocks of the network are not simple units but winner-take-all subsets, each of which should have at least two members.

All weights are randomly initialized with a positive real value, and are modifiable. Initially we will assume that there is instant decay: A unit active at time $t$ manages to send its contributions to connected units that try to get activated at $t+1$; then the sender is switched off instantly.

All units active at time $t$ take away a fraction of the positive weights of their outgoing connections (if there are any) that lead to winners active at time $t+1$, and distribute this `weight-substance' proportionally to the respective contributions among the incoming connections (if there are any) coming from winners (or input units) active at time $t-1$. Since the weights determine the context-dependent strength of a unit, winners `get paid' for setting the stage for their successors. Input units do not have any incoming connections that they could strengthen; they get activated by the environment, thus representing holes through which the weight-substance of the system is leaking. The environment's influence is completed by sometimes rewarding (or punishing) the connections to currently active units in the case of useful output behavior. (An external critic decides what kind of behavior is useful.) The sum of all positive weights in the system remains constant, except for the weight-substance that is leaking through the input units and the new substance that is entering the system in the case of payoff. Thus we have a dissipative system which is consuming weight-substance provided by the environment.

More formally, at time $t$ we denote the activation of the $j$th unit by $x_{j}(t)$, the weight on the directed connection between units $i$ and $j$ by $w_{ij}(t)$, and the contribution of some connection by $c_{ij}(t)=x_{i}(t-1)w_{ij}(t-1)$.
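These definitions translate directly into code. The following minimal sketch (the function name and the dictionary layout for weights and activations are illustrative assumptions, not from the paper) computes all contributions for one time tick:

```python
def contributions(w, x_prev):
    """Compute c_ij(t) = x_i(t-1) * w_ij(t-1) for every connection.

    w      : dict mapping a connection (i, j) to its weight at time t-1
    x_prev : dict mapping a unit name to its activation at time t-1
    """
    return {(i, j): x_prev[i] * w_ij for (i, j), w_ij in w.items()}
```

With binary activations, a connection contributes its full weight when its source unit was active and nothing otherwise.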

Figure 1: Weight substance given by the environment in case of successful behavior is flowing through an agent living in a changing environment. The direction of weight flow is opposite to the direction of activation flow originating from perceptions. Credit assignment (appropriate weight changes) is done by local computations only. (See text for full explanation).

The activation rule works as follows: Unit $j$ gets activated at time $t$ if it is an input unit and receives a perception, or if it wins the competition between the units in the competitive subset it belongs to by having the largest positive net input $net_{j}(t)=\sum_{i}c_{ij}(t)$. We assume the simplest case: $x_{j}(t)$ equals $1$ if unit $j$ is active, and $0$ otherwise. (For instance, a conventional Boolean unit with two possible activation states may be implemented by a competitive subset with two members.)
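The activation rule for non-input units can be sketched as follows; the function name and data layout are illustrative assumptions, with the contributions $c_{ij}(t)$ passed in precomputed:

```python
def nbb_step_activations(subsets, c):
    """Winner-take-all activation rule for non-input units.

    Within each competitive subset, the unit with the largest positive
    net input net_j = sum_i c_ij gets activation 1; all others get 0.

    subsets : list of lists of unit names (the competitive subsets)
    c       : dict mapping a connection (i, j) to its contribution c_ij(t)
    """
    x = {}
    for subset in subsets:
        # Net input of each member of the subset.
        nets = {j: sum(cv for (i, k), cv in c.items() if k == j)
                for j in subset}
        winner = max(nets, key=nets.get)
        for j in subset:
            # Only a strictly positive net input activates the winner.
            x[j] = 1 if j == winner and nets[winner] > 0 else 0
    return x
```

If no member of a subset receives a positive net input, the whole subset stays inactive, which matches the requirement that the winner must have the largest *positive* net input.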

If non-input unit $j$ is active then its weights change according to

$$\Delta w_{ij}(t) = - \lambda c_{ij}(t) + \sum_{k \, wins}\lambda c_{jk}(t) + Ext_{ij}(t)$$
where $0<\lambda <1$ determines how much of its weight a particular connection has to pay to those connections that were responsible for setting the stage at the previous time step. $Ext_{ij}(t)$ is the `external payoff' that the environment gives to $w_{ij}$ at time $t$, and may be computed as follows: If the external critic does not know at time $t$ whether useful behavior took place, then $ Ext_{ij}(t) = 0 $. Else, if the critic notices a useful action, and if unit $j$ was active at time $t$, then $ Ext_{ij}(t) = \eta c_{ij}(t) $, with $\eta$ being a proportionality factor. As will be demonstrated in the section describing the experiments, there is much room for more or less supervised strategies for determining $Ext_{ij}$: every unit might get instructed at every time step, or just a few units at certain isolated time steps, etc.
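The weight change for a single connection into an active unit can be transcribed literally; the function below is an illustrative sketch, not the paper's implementation, and the arguments stand for the terms of the formula:

```python
def delta_w(lam, c_in, c_out_to_winners, ext=0.0):
    """Weight change for one connection (i, j) into an active unit j:

        Delta w_ij = -lam * c_ij + lam * sum_{k wins} c_jk + Ext_ij

    lam              : payment fraction, 0 < lam < 1
    c_in             : c_ij, the contribution arriving over this connection
    c_out_to_winners : the contributions c_jk that unit j sends to
                       winning successor units k
    ext              : external payoff Ext_ij (0 unless the critic pays)
    """
    return -lam * c_in + lam * sum(c_out_to_winners) + ext
```

The connection pays a $\lambda$-fraction of its own contribution back toward the units that set the stage, and collects $\lambda$-fractions of the contributions its owner sends onward to winners, plus any external payoff.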

The weights of the system (as opposed to the activations in Hopfield networks or feedback-BP) have reached a stable state when every connection at any time gives back as much weight-substance as it receives during the next time step. This means that (parallel) chains of units and connections cooperating in time have evolved.

It is important to see the local character of this method. No book-keeping of past activations is required, not even the accumulative computation of, say, a weighted sum of past activations. Each weight and each unit in principle performs the same operation at each time tick. No such things as `epoch boundaries' are required during training.

Juergen Schmidhuber 2003-02-21
