
# INTRODUCTION

**Architecture.** The basic architecture considered in this paper is that of a traditional fully recurrent sequence processing network. The network has $n$ non-input units and $m$ input units. Each input unit has a directed connection to each non-input unit. Each non-input unit has a directed connection to each non-input unit. Obviously there are $(m + n)n$ connections in the network. The $i$-th input unit is denoted by $x_i$. The $i$-th non-input unit is denoted by $y_i$. The $i$-th output unit is denoted by $o_i$ (the output units are a subset of the non-input units). The connection from unit $l$ to unit $k$ is denoted by $w_{kl}$. For instance, one of the names of the connection from the $i$-th input unit to the $j$-th output unit is $w_{o_j x_i}$.

**Dynamics of conventional recurrent nets.** To save indices, I consider a single discrete sequence of real-valued input vectors $x(t)$, $t = 1, \ldots, t_{max}$, each with dimension $m$. In what follows, if $v$ denotes a vector, then $v_i$ denotes the $i$-th component of $v$. If $u$ denotes a unit, then $u(t)$ denotes the activation of the unit at time $t$. Connection $w_{kl}$'s real-valued weight at time $t$ is denoted by $w_{kl}(t)$. Throughout the remainder of this paper, unquantized variables are assumed to take on their maximal range. The dynamic evolution of a traditional discrete-time recurrent net (e.g. [2], [7]) is given by

$$ y_k(t+1) = f_{y_k}\left( \sum_{l} w_{y_k l}(t)\; l(t) \right) \qquad (1) $$

where $f_{y_k}$ denotes the semi-linear activation function of unit $y_k$, where the sum ranges over all input and non-input units $l$, and where $w_{y_k l}(t)$ is constant for all $t$.
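As a minimal NumPy sketch of the forward dynamics of equation (1) (unit counts, the logistic choice of semi-linear activation, and the weight scale are all illustrative assumptions, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

n, m = 4, 3          # n non-input units, m input units (toy sizes)
t_max = 5            # length of the input sequence

def f(net):
    # logistic function, a common choice of semi-linear activation
    return 1.0 / (1.0 + np.exp(-net))

# W[k, l]: weight on the connection into non-input unit y_k; the first m
# columns come from the input units, the remaining n from non-input units.
W = rng.normal(scale=0.1, size=(n, m + n))

x = rng.normal(size=(t_max, m))   # input vectors x(t), each of dimension m
y = np.zeros(n)                   # non-input activations, initially zero

for t in range(t_max):
    # equation (1): y_k(t+1) = f( sum over all units l of W[k, l] * l(t) )
    y = f(W @ np.concatenate([x[t], y]))
```

Since the weights are constant over $t$ here, this is exactly the conventional dynamics that the novel dynamics of Section 2 generalize.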

**Typical objective function.** There may exist specified target values $d_k(t)$ for the activations $o_k(t)$ of certain output units at certain times. We define

$$ E = \frac{1}{2} \sum_{(t,k)} \bigl( d_k(t) - o_k(t) \bigr)^2 , \qquad (2) $$

where the sum runs over all pairs $(t, k)$ for which a target $d_k(t)$ exists.
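A small sketch of how the objective in (2) is evaluated when targets exist only for some output units at some times (the particular activation and target values below are made up for illustration; missing targets are marked with NaN):

```python
import numpy as np

# Activations o_k(t) of 2 output units over 4 time steps (rows = t).
o = np.array([[0.2, 0.8],
              [0.5, 0.1],
              [0.9, 0.4],
              [0.3, 0.6]])

# Target values d_k(t); NaN marks (t, k) pairs with no specified target.
d = np.array([[0.0, np.nan],
              [np.nan, np.nan],
              [1.0, 0.5],
              [np.nan, 1.0]])

mask = ~np.isnan(d)
# equation (2): half the summed squared error over all (t, k) with a target
E = 0.5 * np.sum((d[mask] - o[mask]) ** 2)
```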

**Complexity of traditional recurrent net algorithms.** Let $n_v$ be the number of time-varying variables available for storing temporal events. Let $R_{op}$ be the ratio between the number of operations per time step (for an exact gradient-based supervised sequence learning algorithm) and $n_v$. Let $R_{st}$ be the ratio between the maximum number of storage cells necessary for learning arbitrary sequences and $n_v$.

The fastest known exact gradient-based learning algorithm for minimizing (2) with fully recurrent networks is `back-propagation through time' (BPTT, e.g. [3]). With BPTT, $n_v = n$ and $R_{op} = O(n)$. BPTT's disadvantage is that it requires storage growing with the sequence length, which means that $R_{st}$ is unbounded: BPTT is not a fixed-size storage algorithm. The best-known fixed-size storage learning algorithm for minimizing (2) with fully recurrent nets is RTRL [2][8]. With RTRL, $n_v = n$, $R_{op} = O(n^3)$ (much worse than with BPTT), and $R_{st} = O(n^2)$ (much better than with BPTT).
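As a consistency check (a sketch assuming the standard per-time-step costs of the two algorithms for a fully recurrent net with $n$ non-input units and $O(n^2)$ connections), the ratios work out as follows:

```latex
% BPTT: O(n^2) operations per time step (one per connection), n_v = n,
% but the entire activation history must be stored:
R_{op}^{\mathrm{BPTT}} = \frac{O(n^2)}{n} = O(n), \qquad
R_{st}^{\mathrm{BPTT}} \ \text{unbounded (storage grows with } t_{max} \text{)}

% RTRL: O(n^4) operations per time step and O(n^3) storage cells
% (one partial derivative per activation/weight pair), n_v = n:
R_{op}^{\mathrm{RTRL}} = \frac{O(n^4)}{n} = O(n^3), \qquad
R_{st}^{\mathrm{RTRL}} = \frac{O(n^3)}{n} = O(n^2)
```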

**Contribution of this paper.** The contribution of this paper is an extension of the conventional dynamics (as in equation (1)), plus a corresponding exact gradient-based learning algorithm with $R_{op} = O(n)$ (as good as with BPTT) and $R_{st} = O(n^2)$ (as good as with RTRL). The basic idea is: the weights themselves are included in the set of time-varying variables that can store temporal information ($n_v = O(n^2)$). Section 2 describes the novel network dynamics that allow the network to actively and quickly manipulate its own weights (via so-called intra-sequence weight changes) by creating certain appropriate internal activation patterns during observation of an input sequence. Section 3 derives an exact supervised sequence learning algorithm (for creating inter-sequence weight changes, which affect only the initial weights at the beginning of each training sequence) forcing the network to use its association-building capabilities to minimize (2). The gradient-based learning algorithm for inter-sequence weight changes takes into account the fact that intra-sequence weight changes at some point in time may influence both activations and intra-sequence weight changes at later points in time.
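To illustrate only the basic idea (the weights themselves acting as additional time-varying variables), here is a minimal sketch in which the weight matrix is modified during sequence processing by the network's own activation pattern. The outer-product update rule below is a hypothetical placeholder for illustration; the actual intra-sequence weight-change mechanism is the subject of Section 2:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, t_max = 4, 3, 5               # toy sizes

def f(net):
    return 1.0 / (1.0 + np.exp(-net))   # semi-linear activation

# W plays the role of the time-varying weights w_kl(t); W0 is w_kl(1),
# the initial weights that inter-sequence learning would adjust.
W = rng.normal(scale=0.1, size=(n, m + n))
W0 = W.copy()
y = np.zeros(n)

for t in range(t_max):
    x_t = rng.normal(size=m)
    pre = np.concatenate([x_t, y])  # activation on each connection's source
    y = f(W @ pre)
    # Hypothetical intra-sequence weight change driven by the current
    # activation pattern (placeholder rule, not the dynamics of Section 2):
    W = W + 0.01 * np.outer(y, pre)
```

After processing the sequence, `W` differs from `W0`: the weights have stored information about the observed inputs, which is what raises $n_v$ from $n$ to $O(n^2)$.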

Juergen Schmidhuber 2003-02-21
