next up previous
Next: NOVEL DYNAMICS Up: ratio Previous: ratio


Architecture. The basic architecture considered in this paper is the one of a traditional fully recurrent sequence processing network. The network has $n$ non-input units and $n_x = O(n)$ input units. Each input unit has a directed connection to each non-input unit. Each non-input unit has a directed connection to each non-input unit. Obviously there are $(n_x + n) n = n_{conn} = O(n^2)$ connections in the network. The $k$-th input unit is denoted by $x_k$. The $k$-th non-input unit is denoted by $y_k$. The $k$-th output unit is denoted by $o_k$ (the $n_o$ output units are a subset of the non-input units). The connection from unit $j$ to unit $i$ is denoted by $w_{ij}$. For instance, one of the names of the the connection from the $j$-th input unit to the the $k$-th output unit is $w_{o_kx_j}$.

Dynamics of conventional recurrent nets. To save indices, I consider a single discrete sequence of real-valued input vectors $x(t)$, $t = 1, \ldots, n_{time}$, each with dimension $n_x$. In what follows, if $v(t)$ denotes a vector then $v_k(t)$ denotes the $k$-th component of $v(t)$. If $u$ denotes a unit, then $u(t)$ denotes the activation of the unit at time $t$. $w_{ij}$'s real-valued weight at time $t$ is denoted by $w_{ij}(t)$. Throughout the remainder of this paper, unquantized variables are assumed to take on their maximal range. The dynamic evolution of a traditional discrete time recurrent net (e.g. [2], [7]) is given by

x_k(t)\leftarrow environment,~~
y_k(t) = f...
net_{y_k}(t+1) = \sum_{units~l} w_{y_kl}(t)l(t),
\end{displaymath} (1)

where $f_i$ denotes the semi-linear activation function of unit $i$, and where $w_{y_kl}(t)$ is constant for all $t$.

Typical objective function. There may exist specified target values $d_k(t)$ for certain outputs $o_k(t)$. We define

e_k(t) = d_k(t) - o_k(t)~~if~d_k(t)~exists,~and~0~otherwise.

The supervised sequence learning task is to minimize (via gradient descent)
E^{total}(t) = \sum_{\tau = 1}^{t} E(\tau),~~where~
E(t) = \frac{1}{2} \sum_k (e_k(t))^2.
\end{displaymath} (2)

Complexity of traditional recurrent net algorithms. Let $m$ be the number of time-varying variables for storing temporal events. Let $R_{time}$ be the ratio between the number of operations per time step (for an exact gradient based supervised sequence learning algorithm), and $m$. Let $R_{space}$ be the ratio between the maximum number of storage cells necessary for learning arbitrary sequences, and $m$.

The fastest known exact gradient based learning algorithm for minimizing (2) with fully recurrent networks is `back-propagation through time' (BPTT, e.g. [3]). With BPTT, $m = n$, and $R_{time} = O(m)$. BPTT's disadvantage is that it requires $O(n_{time})$ storage - which means that $R_{space}$ is infinite: BPTT is not a fixed-size storage algorithm. The most well-known fixed-size storage learning algorithm for minimizing $E^{total}(n_{time})$ with fully recurrent nets is RTRL [2][8]. With RTRL, $m = n$, $R_{time} = O(m^3)$ (much worse than with BPTT) and $R_{space} = O(m^2)$ (much better than with BPTT). 1

Contribution of this paper. The contribution of this paper is an extension of the conventional dynamics (as in equation (1)) plus a corresponding exact gradient based learning algorithm with $R_{time} = O(m)$ (as good as with BPTT) and $R_{space} = O(m^2)$ (as good as with RTRL). The basic idea is: The $O(n^2)$ weights themselves are included in the set of time-varying variables that can store temporal information ( $m = n + O(n^2) = O(n^2)$). Section 2 describes the novel network dynamics that allow the network to actively and quickly manipulate its own weights (via so-called intra-sequence weight changes) by creating certain appropriate internal activation patterns 2during observation of an input sequence3. Section 3 derives an exact supervised sequence learning algorithm (for creating inter-sequence weight changes which affect only the initial weights at the beginning of each training sequence) forcing the network to use its association building capabilities to minimize $E^{total}(n_{time})$. The gradient-based learning algorithm for inter-sequence weight changes takes into account the fact that intra-sequence weight changes at some point in time may influence both activations and intra-sequence weight changes at later points in time.

next up previous
Next: NOVEL DYNAMICS Up: ratio Previous: ratio
Juergen Schmidhuber 2003-02-21

Back to Recurrent Neural Networks page