A major contribution of this work is an adaptive method for removing redundant information from sequences. This principle can be implemented with the help of any of the methods mentioned in the introduction.

Consider a deterministic discrete time
predictor (not necessarily a neural network)
whose state at time $t$ of sequence $p$ is described by
an environmental input vector $x^p(t)$, an internal state vector $h^p(t)$, and
an output vector $z^p(t)$. The environment may be non-deterministic.
At time $0$, the predictor starts with $x^p(0)$
and an internal start state $h^p(0)$.
At time $t \geq 0$, the predictor computes

$$z^p(t) = f(x^p(t), h^p(t)).$$

At time $t > 0$, the predictor furthermore computes

$$h^p(t) = g(x^p(t-1), h^p(t-1)).$$

Here $z^p(t)$ is interpreted as a prediction of the next input $x^p(t+1)$.
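The recursion can be written down directly. The following Python sketch assumes toy stand-ins for $f$ and $g$, since the text leaves both maps abstract; the particular definitions are illustrative only:

```python
import math

# Toy stand-ins for the predictor's maps f and g (assumptions for this sketch;
# the text above leaves both functions abstract).

def f(x, h):
    """z^p(t) = f(x^p(t), h^p(t)); interpreted as a prediction of x^p(t+1)."""
    return [math.tanh(xi + hi) for xi, hi in zip(x, h)]

def g(x_prev, h_prev):
    """h^p(t) = g(x^p(t-1), h^p(t-1)): the internal state update."""
    return [0.5 * (xi + hi) for xi, hi in zip(x_prev, h_prev)]

def run_predictor(xs, h0):
    """Roll the predictor over x^p(0), x^p(1), ..., returning all outputs z^p(t)."""
    h, zs = h0, []
    for t in range(len(xs)):
        if t > 0:
            h = g(xs[t - 1], h)   # state update only for t > 0
        zs.append(f(xs[t], h))    # output at every t >= 0
    return zs
```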
Information about the observed input sequence can be compressed even further,
beyond just the unpredicted input vectors
$x^p(t)$ (those with $z^p(t-1) \neq x^p(t)$).
It suffices to know
only those *elements* of the vectors $x^p(t)$ that were
not correctly predicted.
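As a concrete illustration, here is a sketch of a hypothetical helper `compress`, parameterized by the abstract maps $f$ and $g$; it records the time step, component index, and value of every element that was not correctly predicted, where "correctly predicted" is taken to mean exact equality:

```python
def compress(xs, h0, f, g):
    """Keep only the elements of x^p(t) that were not correctly predicted.

    Returns (t, i, value) triples: time step, component index, and value.
    The first input x^p(0) is stored in full, since nothing predicts it.
    """
    residual = [(0, i, v) for i, v in enumerate(xs[0])]
    h, z = h0, f(xs[0], h0)                  # z^p(0) predicts x^p(1)
    for t in range(1, len(xs)):
        h = g(xs[t - 1], h)                  # h^p(t)
        for i, v in enumerate(xs[t]):
            if v != z[i]:                    # element was not predicted correctly
                residual.append((t, i, v))
        z = f(xs[t], h)                      # z^p(t) predicts x^p(t+1)
    return residual
```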

This observation implies that we can discriminate one sequence
from another
by knowing *just the unpredicted inputs
and the corresponding time steps at which they occurred*.
No information is lost
if we ignore the expected inputs.
We do not even have to know $f$ and $g$.
I call this *the principle of history compression*.
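That no information is lost can be checked mechanically: running the same predictor forward again, every expected element is taken from the prediction and every stored triple overrides it, recovering the original sequence. The helper below is a hypothetical counterpart to the `compress` sketch above; how much the stored residual shrinks depends entirely on the quality of the predictor, but the round trip is lossless regardless:

```python
def decompress(residual, length, dim, h0, f, g):
    """Rebuild the full input sequence from the unpredicted elements alone."""
    stored = {(t, i): v for t, i, v in residual}
    xs, h, z = [], h0, None
    for t in range(length):
        x = list(z) if t > 0 else [0.0] * dim   # start from the prediction z^p(t-1)
        for i in range(dim):
            if (t, i) in stored:
                x[i] = stored[(t, i)]            # restore the unpredicted elements
        xs.append(x)
        if t > 0:
            h = g(xs[t - 1], h)                  # same state update as the compressor
        z = f(xs[t], h)                          # next prediction
    return xs

# Round trip: decompress(compress(xs, h0, f, g), len(xs), len(xs[0]), h0, f, g)
# returns xs unchanged, even though only the mispredicted elements were stored.
```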

From a theoretical point of view it is important to know at what time an unexpected input occurs; otherwise there is a potential for ambiguity: two different input sequences may lead to the same shorter sequence of unpredicted inputs. For many practical tasks, however, there is no need to know the critical time steps (see section 5).
