Dynamics of conventional recurrent nets.
To save indices, I consider a single discrete
sequence of real-valued input
, each with dimension .
In what follows, if denotes a vector
then denotes the -th component of .
If denotes a unit, then denotes the activation of the
unit at time .
weight at time is denoted by .
Throughout the remainder of this paper, unquantized
variables are assumed to take on their maximal range.
The dynamic evolution of a traditional discrete time recurrent net
) is given by
Typical objective function. There may exist specified
target values for certain outputs . We define
Complexity of traditional recurrent net algorithms. Let be the number of time-varying variables for storing temporal events. Let be the ratio between the number of operations per time step (for an exact gradient based supervised sequence learning algorithm), and . Let be the ratio between the maximum number of storage cells necessary for learning arbitrary sequences, and .
The fastest known exact gradient based learning algorithm for minimizing (2) with fully recurrent networks is `back-propagation through time' (BPTT, e.g. ). With BPTT, , and . BPTT's disadvantage is that it requires storage - which means that is infinite: BPTT is not a fixed-size storage algorithm. The most well-known fixed-size storage learning algorithm for minimizing with fully recurrent nets is RTRL . With RTRL, , (much worse than with BPTT) and (much better than with BPTT). 1
Contribution of this paper. The contribution of this paper is an extension of the conventional dynamics (as in equation (1)) plus a corresponding exact gradient based learning algorithm with (as good as with BPTT) and (as good as with RTRL). The basic idea is: The weights themselves are included in the set of time-varying variables that can store temporal information ( ). Section 2 describes the novel network dynamics that allow the network to actively and quickly manipulate its own weights (via so-called intra-sequence weight changes) by creating certain appropriate internal activation patterns 2during observation of an input sequence3. Section 3 derives an exact supervised sequence learning algorithm (for creating inter-sequence weight changes which affect only the initial weights at the beginning of each training sequence) forcing the network to use its association building capabilities to minimize . The gradient-based learning algorithm for inter-sequence weight changes takes into account the fact that intra-sequence weight changes at some point in time may influence both activations and intra-sequence weight changes at later points in time.