A SELF-ORGANIZING MULTI-LEVEL PREDICTOR HIERARCHY

Using the principle of history compression we can build a self-organizing hierarchical neural `chunking' system. The system detects causal dependencies in the temporal input stream and learns to attend to unexpected inputs instead of focussing on every input. It learns to reflect both the relatively local and the relatively global temporal regularities contained in the input stream.

The basic task can be formulated as a prediction task: at a given time step, the goal is to predict the next input from the previous inputs. If there are external target vectors at certain time steps, they are simply treated as another part of the input to be predicted.
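As a minimal illustration (not taken from the paper; the function name and dimensions are assumptions of this sketch), the following Python fragment shows one way to fold external target vectors into the prediction task: wherever a target exists it is simply concatenated to the input vector, so a single next-input prediction objective covers both.

    import numpy as np

    def build_prediction_stream(inputs, targets, target_dim):
        # inputs: list of input vectors; targets: dict mapping time step -> target vector.
        # Steps without an external target get a zero target part, so every element
        # of the returned stream has the same dimensionality.
        stream = []
        for t, x in enumerate(inputs):
            y = targets.get(t, np.zeros(target_dim))
            stream.append(np.concatenate([x, y]))  # the target is just more input to predict
        return stream

    # Example: three 2-dimensional inputs, an external target only at step 2.
    xs = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])]
    ts = {2: np.array([1.0])}
    print(build_prediction_stream(xs, ts, target_dim=1))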

The architecture is a hierarchy of predictors; the input to each level of the hierarchy comes from the level below. $P_i$ denotes the $i$th-level network, which is trained to predict its own next input from its previous inputs. We take $P_i$ to be a conventional dynamic recurrent neural network [Robinson and Fallside, 1987][Williams and Zipser, 1989][Williams and Peng, 1990][Schmidhuber, 1991d]; however, it might be some other adaptive sequence processing device as well.
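To make the setting concrete, here is a minimal Python sketch of such a predictor. It is a stand-in, not the networks used in the cited work: a simple recurrent net whose step function updates the hidden state and returns a prediction of the next input, with a deliberately crude output-weight learning rule and a flag for switching weight changes off. All names and the simplified learning rule are assumptions of this sketch.

    import numpy as np

    class RecurrentPredictor:
        # A plain recurrent network trained online to predict its next input vector.
        def __init__(self, n_in, n_hidden, lr=0.1, seed=0):
            rng = np.random.default_rng(seed)
            self.W_in = rng.normal(0.0, 0.1, (n_hidden, n_in))       # input -> hidden
            self.W_rec = rng.normal(0.0, 0.1, (n_hidden, n_hidden))  # hidden -> hidden
            self.W_out = rng.normal(0.0, 0.1, (n_in, n_hidden))      # hidden -> prediction
            self.h = np.zeros(n_hidden)
            self.lr = lr
            self.frozen = False   # set to True to switch the weight-changing mechanism off

        def reset(self):
            self.h = np.zeros_like(self.h)

        def step(self, x):
            # Update the hidden state with input x and return the prediction of the next input.
            self.h = np.tanh(self.W_in @ x + self.W_rec @ self.h)
            return self.W_out @ self.h

        def learn(self, prediction, actual_next):
            # One gradient step on the output weights only -- a simplification; the paper
            # assumes full recurrent learning algorithms such as RTRL or BPTT.
            if not self.frozen:
                error = prediction - actual_next
                self.W_out -= self.lr * np.outer(error, self.h)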

At each time step, the input of the lowest-level recurrent predictor $P_0$ is the current external input. We create a new higher-level adaptive predictor $P_{s+1}$ whenever the adaptive predictor at the previous level, $P_s$, stops improving its predictions. When this happens, the weight-changing mechanism of $P_s$ is switched off (to exclude potential instabilities caused by ongoing modifications of the lower-level predictors). If at a given time step $P_s$ ($s \geq 0$) fails to predict its next input (or if we are at the beginning of a training sequence, which usually is not predictable either), then $P_{s+1}$ will receive as input the concatenation of this next input of $P_s$ plus a unique representation of the corresponding time step; the activations of $P_{s+1}$'s hidden and output units will be updated. Otherwise $P_{s+1}$ will not perform an activation update. This procedure ensures that $P_{s+1}$ is fed with an unambiguous reduced description of the input sequence observed by $P_s$. This is theoretically justified by the principle of history compression.
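The following Python sketch illustrates this update rule for two levels, using the RecurrentPredictor interface from the sketch above (or any object offering reset, step and learn). The prediction-failure threshold and the one-hot time code are assumptions of this sketch; the creation of further levels and the freezing of $P_s$'s weights once it stops improving are omitted for brevity.

    import numpy as np

    def time_code(t, max_len):
        # A unique (here: one-hot) representation of time step t.
        code = np.zeros(max_len)
        code[t] = 1.0
        return code

    def run_two_levels(p0, p1, sequence, tol=0.1):
        # Feed `sequence` to the low-level predictor p0.  Whenever p0's prediction
        # misses the actual input by more than `tol` (or at the very first step,
        # which cannot be predicted), update p1 with the unpredicted input plus a
        # unique time code; otherwise p1 performs no activation update at all.
        p0.reset(); p1.reset()
        max_len = len(sequence)
        pred0, pred1 = None, None
        for t, x in enumerate(sequence):
            unexpected = pred0 is None or np.max(np.abs(pred0 - x)) > tol
            if unexpected:
                higher_input = np.concatenate([x, time_code(t, max_len)])
                if pred1 is not None:
                    p1.learn(pred1, higher_input)  # P_1 also predicts its own next input
                pred1 = p1.step(higher_input)      # P_1 sees only the reduced description
            if pred0 is not None:
                p0.learn(pred0, x)                 # P_0 is trained on every step
            pred0 = p0.step(x)

    # Example (dimensions must match: p1's input is p0's input plus the time code):
    # p0 = RecurrentPredictor(n_in=2, n_hidden=8)
    # p1 = RecurrentPredictor(n_in=2 + 10, n_hidden=8)
    # run_two_levels(p0, p1, [np.array([1.0, 0.0])] * 10)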

In general, $P_{s+1}$ will receive fewer inputs over time than $P_s$. With existing learning algorithms, the higher-level predictor should therefore have less difficulty than the lower-level predictor in learning to predict the critical inputs, because $P_{s+1}$'s `credit assignment paths' will often be short compared to those of $P_s$. This will be the case whenever the incoming inputs carry global temporal structure that has not yet been discovered by $P_s$.

This method is a simplification and an improvement of the recent chunking method described by [Schmidhuber, 1991a].

Often a multi-level predictor hierarchy will be the fastest way of learning to deal with sequences exhibiting multi-level temporal structure (e.g., speech). Experiments have shown that multi-level predictors can quickly learn tasks that are practically unlearnable by conventional recurrent networks; see, e.g., [Hochreiter, 1991]. One disadvantage of a predictor hierarchy, however, is that it is not known in advance how many levels will be needed. Another disadvantage is that the levels are explicitly separated from each other. It is possible, however, to collapse the hierarchy into a single network, as described next.



