*Dynamics of conventional recurrent nets.*
To save indices, I consider a *single* discrete
sequence of real-valued input vectors
$x(1), x(2), \ldots, x(t_{max})$, each with dimension $n_x$.
In what follows, if $z$ denotes a vector,
then $z_i$ denotes the $i$-th component of $z$.
If $u$ denotes a unit, then $u(t)$ denotes the activation of the
unit at time $t$. The directed connection from unit $v$ to unit $u$
is denoted by $w_{uv}$; $w_{uv}$'s real-valued
weight at time $t$ is denoted by $w_{uv}(t)$.
*Throughout the remainder of this paper, unquantified
variables are assumed to take on their maximal range.*
The dynamic evolution of a traditional discrete time recurrent net
(e.g. [2],
[7]) is given by

$$u(t) = f_u\Bigl(\sum_v w_{uv}(t)\, v(t-1)\Bigr) \qquad (1)$$

for each non-input unit $u$, where $f_u$ is a differentiable activation
function (e.g. the logistic sigmoid), the sum ranges over all units $v$
connected to $u$, the activations of the $n_x$ input units are clamped to
the components of the current input vector $x(t)$, and the activations of
non-input units at time 0 are initialized to zero. In a traditional net
the weights do not change while a sequence is being processed:
$w_{uv}(t) = w_{uv}(1)$ for all $t$.

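To make (1) concrete, here is a minimal numpy sketch of one update step and of processing a toy sequence. The function name `step`, the logistic choice of $f_u$, the full connectivity, and the zero initialization are illustrative assumptions, not prescriptions from this paper.

```python
import numpy as np

def step(W, y_prev, x_t, n_x):
    """One step of eq. (1). Units 0..n_x-1 are input units clamped to
    x(t); every other unit u applies f_u (here: logistic sigmoid) to the
    weighted sum of *all* activations from the previous time step."""
    y = np.empty_like(y_prev)
    y[:n_x] = x_t                                # clamp input units to x(t)
    net = W @ y_prev                             # net_u(t) = sum_v w_uv(t) v(t-1)
    y[n_x:] = 1.0 / (1.0 + np.exp(-net[n_x:]))   # u(t) = f_u(net_u(t))
    return y

# Processing a whole sequence: y(0) = 0, then iterate over x(1..t_max).
n, n_x = 8, 3
rng = np.random.default_rng(0)
W = 0.1 * rng.standard_normal((n, n))   # w_uv, held constant in a conventional net
y = np.zeros(n)
for x_t in rng.standard_normal((5, n_x)):        # a toy input sequence
    y = step(W, y, x_t, n_x)
```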
*Typical objective function.* There may exist specified
target values $d_k(t)$ for certain output units $k$ at certain times $t$.
We define

$$e_k(t) = \begin{cases} d_k(t) - k(t) & \text{if a target } d_k(t) \text{ exists at time } t,\\ 0 & \text{otherwise.} \end{cases}$$

The supervised sequence learning task is to minimize (via gradient descent)

$$E = \frac{1}{2} \sum_t \sum_k e_k(t)^2 \qquad (2)$$

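A direct transcription of (2), assuming targets are given sparsely as a map from time step to desired output values; the names `sequence_error` and `targets` are illustrative only.

```python
def sequence_error(ys, targets):
    """E = 1/2 * sum_t sum_k e_k(t)^2, with e_k(t) = d_k(t) - k(t) where a
    target d_k(t) is specified and e_k(t) = 0 elsewhere.
    ys[t-1] holds the activation vector at time t;
    targets maps t -> {unit index k: desired value d_k(t)}."""
    E = 0.0
    for t, y in enumerate(ys, start=1):
        for k, d in targets.get(t, {}).items():
            E += 0.5 * (d - y[k]) ** 2
    return E
```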
*Complexity of traditional recurrent net algorithms.*
Let $n_v$ be the number of time-varying variables for storing temporal
events. Let $R_{op}$
be the ratio between the number of operations per time
step (for an *exact* gradient based
supervised sequence learning algorithm)
and $n_v$. Let $R_{st}$ be the ratio between
the maximum number of storage cells necessary for learning arbitrary
sequences, and $n_v$.

The fastest known exact gradient based learning algorithm for
minimizing (2) with fully recurrent networks is `back-propagation
through time' (BPTT, e.g. [3]).
With BPTT, $n_v = n$ (the number of units), and
$R_{op} = O(n)$. BPTT's disadvantage is that
it requires $O(n \cdot t_{max})$ storage - which means that
$R_{st}$ is infinite: BPTT is not a *fixed-size storage* algorithm.
The most well-known *fixed-size storage* learning
algorithm for minimizing
(2) with
fully recurrent nets is RTRL
[2][8].
With RTRL, $n_v = n$,
$R_{op} = O(n^3)$ (much worse than with BPTT) and
$R_{st} = O(n^2)$ (much better than with BPTT).^{1}

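A back-of-the-envelope check of these ratios, assuming the standard per-time-step costs of $O(n^2)$ operations for BPTT and $O(n^4)$ for RTRL, and RTRL's $O(n^3)$ storage for the partial derivatives (constant factors dropped throughout):

```python
# n_v = n for both algorithms (the n unit activations).
for n in (10, 100, 1000):
    n_v = n
    bptt_Rop = n**2 // n_v          # O(n^2) ops/step -> R_op = O(n)
    rtrl_Rop = n**4 // n_v          # O(n^4) ops/step -> R_op = O(n^3)
    rtrl_Rst = n**3 // n_v          # O(n^3) storage  -> R_st = O(n^2)
    print(f"n={n}: BPTT R_op~{bptt_Rop} (R_st unbounded), "
          f"RTRL R_op~{rtrl_Rop}, R_st~{rtrl_Rst}")
```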
*Contribution of this paper.*
The contribution of this paper is an extension of the conventional
dynamics (as in equation (1)) plus a corresponding exact gradient based
learning algorithm with
$R_{op} = O(n)$ (as good as with BPTT)
and $R_{st} = O(n^2)$ (as good as with RTRL).
The basic idea is: the weights themselves are included
in the set of time-varying variables that can store temporal information
($n_v = O(n^2)$ instead of $n_v = n$).
Section 2 describes the novel network dynamics that
allow the network to actively and quickly manipulate
its own weights
(via so-called *intra-sequence* weight changes)
by creating certain appropriate internal activation patterns^{2}
during observation of an input sequence^{3}.
Section 3 derives an *exact* supervised sequence learning algorithm
(for creating *inter-sequence* weight changes
which affect only the *initial* weights
at the beginning of each training sequence)
forcing the network to use its association building capabilities
to minimize (2).
The gradient-based learning algorithm for inter-sequence weight
changes takes into account the fact
that intra-sequence weight changes at some point in time
may influence both activations and
intra-sequence weight changes at later points in time.
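
The precise self-modification mechanism and the exact gradient appear in Sections 2 and 3. Purely for orientation, the following sketch (reusing the `step` function from the sketch after equation (1)) shows the general shape of intra-sequence weight changes; the activation-driven outer-product rule and the gain `g` are made-up placeholders, not the paper's mechanism.

```python
def self_modifying_step(W, y_prev, x_t, n_x, g=0.01):
    """Extended dynamics: first a conventional activation update as in
    eq. (1), then an *intra-sequence* weight change produced by the
    activations themselves. The outer-product rule below is only a
    placeholder for the mechanism defined in Section 2."""
    y = step(W, y_prev, x_t, n_x)           # conventional update, eq. (1)
    W = W + g * np.outer(y, y_prev)         # hypothetical activation-driven change
    return W, y
```

In such a scheme the weight matrix itself carries temporal information; the inter-sequence algorithm of Section 3 adapts only the weights at the start of each sequence and must differentiate through all subsequent intra-sequence updates.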
