The initial state vector $y^p(0)$ is the same for all sequences $p$. The input at time $t$ of sequence $p$ is the concatenation of the external input $x^p(t)$ and the last internal state $y^p(t-1)$. The output is the new internal state $y^p(t)$ itself.
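As a minimal sketch of this input construction (the dimensions, the shared zero initial state, and the names `step_input`, `s0` are illustrative assumptions, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_code = 4, 3

# the same initial state vector for all sequences
s0 = np.zeros(n_code)

def step_input(x_t, s_prev):
    """Network input at time t: the external input concatenated
    with the last internal state."""
    return np.concatenate([x_t, s_prev])

x_t = rng.normal(size=n_in)   # external input of one sequence at t = 1
u_1 = step_input(x_t, s0)     # the first step always starts from s0
```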

We minimize and maximize essentially the same
objective functions as described above. That is,
for the $i$-th module *which now needs
recurrent connections to itself and the other modules*,
there is again an adaptive predictor $P_i$ *which need not be recurrent*.
$P_i$'s input at time $t$ is the concatenation
of the outputs $y_k(t)$ of all units $k \neq i$.
$P_i$'s one-dimensional output $P_i(t)$
is trained to equal the expectation of the
output $y_i(t)$, given the outputs of the other units,

$$E\left( y_i(t) \mid \{ y_k(t),\ k \neq i \} \right),$$

by defining $P_i$'s error function as

$$E_{P_i} = \sum_p \sum_t \left( P_i(t) - y_i(t) \right)^2 .$$
In addition, all units are trained to take on values that
maximize the predictors' prediction errors,
where the units' objective function is defined analogously to the respective stationary cases.
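The two opposing objectives can be sketched as follows. The linear predictors, the learning rate, and the random stand-in for the code units' outputs are assumptions for illustration only; the full system would also propagate the (sign-flipped) error gradient back into the representational modules:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 200, 3                       # training patterns, code units
y = rng.normal(size=(n, k))         # stand-in for the code units' outputs

# one adaptive predictor per unit: a linear map from the other units
W = np.zeros((k, k - 1))
lr = 0.1

def predictor_error(i, y, W):
    """P_i's one-dimensional output and mean squared error for unit i."""
    others = np.delete(y, i, axis=1)   # outputs of all units except i
    p = others @ W[i]                  # P_i's output
    return p, np.mean((p - y[:, i]) ** 2)

# the predictors MINIMIZE their squared prediction errors ...
for i in range(k):
    others = np.delete(y, i, axis=1)
    p, _ = predictor_error(i, y, W)
    W[i] -= lr * 2 * others.T @ (p - y[:, i]) / n
# ... while the code units would be trained to MAXIMIZE the same
# errors (that half of the loop is omitted in this sketch).
```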

The only way a unit can protect itself from being predictable
from the other units
is to store properties of the input sequences that are independent
of aspects stored by the other units.
In other words, this method will tend to throw away redundant
temporal information, much as the systems in
(Schmidhuber, 1992a) and (Schmidhuber, 1992b).
For computing weight changes, each module
looks back only to the last time step. In the on-line case,
this implies an *entirely local* learning algorithm. Still,
even when there are long time lags, the algorithm theoretically
may learn unique representations of *extended* sequences,
as can be seen by induction over the length of the longest training
sequence:

*1. The network can learn
unique representations of the beginnings of all sequences.*

*2. Suppose all sequences and sub-sequences with length $n$ are
uniquely represented in the internal state. Then, by looking back only
one time step at a time, the network can learn unique representations of all
sub-sequences with length $n+1$.*

The argument neglects all on-line effects and possible cross-talk.

On-line variants of the system described above were implemented by Daniel Prelinger. Preliminary experiments were conducted with the resulting recurrent systems. These experiments demonstrated that there are entirely local sequence learning methods that allow for learning unique representations of all subsequences of non-trivial sequences (like a sequence consisting of 8 consecutive presentations of the same input pattern represented by the activation of a single input unit). Best results were obtained by introducing additional modifications (like other error functions than mean squared error for the representational modules). A future paper will elaborate on sequence learning by predictability minimization.
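A toy illustration of why feedback of the last internal state makes this possible even for the 8-identical-patterns example: with fixed (here random, untrained) weights, the internal state already differs from step to step, so each prefix length obtains its own representation that learning can then shape:

```python
import numpy as np

rng = np.random.default_rng(2)
n_in, n_code = 1, 4
Wx = rng.normal(size=(n_code, n_in))    # input weights (untrained)
Ws = rng.normal(size=(n_code, n_code))  # recurrent weights (untrained)

s = np.zeros(n_code)        # same initial state for every sequence
x = np.ones(n_in)           # a single input unit, active at every step
states = []
for t in range(8):          # 8 consecutive presentations of the same pattern
    s = np.tanh(Wx @ x + Ws @ s)   # new state sees input AND last state
    states.append(s.copy())
states = np.array(states)
```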
