
SUPERVISED LEARNING ALGORITHM

The following algorithm for minimizing $E^{total}$ is partly inspired by conventional recurrent network algorithms (e.g., [2], [7]); the notation partly follows [8].

Derivation. Before training, all initial weights $w_{ab}(1)$ are initialized randomly. The chain rule yields the weight increments (to be applied after each training sequence) for all initial weights according to

\begin{displaymath}
w_{ab}(1) \leftarrow
w_{ab}(1) - \eta \frac{\partial E^{total}(n_{time})}
{\partial w_{ab}(1)} ,
\end{displaymath} (5)

where $\eta$ is a constant positive `learning rate'. Thus we obtain an exact gradient-based algorithm for minimizing $E^{total}$ under the dynamics given by (3) and (4).
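Update (5) is ordinary gradient descent on $E^{total}$, applied once per training sequence. The following minimal sketch shows only the shape of that outer loop; a quadratic surrogate error $E(W) = \frac{1}{2}\|W - W_{opt}\|^2$ stands in for the true gradient delivered by the recursions (6)-(11), and the names (`W_opt`, `eta`) and shapes are illustrative assumptions only.

```python
import numpy as np

# Sketch of update (5): w_ab(1) <- w_ab(1) - eta * dE^total(n_time)/dw_ab(1).
# The true gradient comes from recursions (6)-(11); here the gradient of the
# surrogate E(W) = 0.5 * ||W - W_opt||^2 stands in (assumption for illustration).
rng = np.random.default_rng(0)
W_opt = rng.normal(size=(3, 3))      # hypothetical optimum of the surrogate
W = rng.normal(size=(3, 3))          # randomly initialized initial weights w_ab(1)
eta = 0.1                            # constant positive learning rate
for _ in range(200):                 # one update per training sequence
    grad = W - W_opt                 # placeholder for dE^total(n_time)/dw_ab(1)
    W = W - eta * grad               # gradient-descent update (5)
```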

We write

\begin{displaymath}
q^{ij}_{ab}(t)
=
\frac{\partial w_{ij}(t)}
{\partial w_{ab}(1)},
~~and~for~all~variables~u:~ p_{ab}^u(t)=
\frac{\partial u(t)}
{\partial w_{ab}(1)}.
\end{displaymath} (6)

(Recall that unquantified variables are assumed to take on their maximal range.)

First note that

\begin{displaymath}
\frac{\partial E^{total}(1) }
{\partial w_{ab}(1)} = 0,~~~
\frac{\partial E^{total}(t) }
{\partial w_{ab}(1)} =
\frac{\partial E^{total}(t-1)}
{\partial w_{ab}(1)} -
\sum_k e_k(t) p_{ab}^{o_k}(t).
\end{displaymath} (7)
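The recursion in (7) is a single chain-rule step, assuming (the error definition is not restated in this section) the standard quadratic error $E^{total}(t) = E^{total}(t-1) + \frac{1}{2}\sum_k e_k(t)^2$ with $e_k(t) = d_k(t) - o_k(t)$:

\begin{displaymath}
\frac{\partial E^{total}(t)}{\partial w_{ab}(1)}
= \frac{\partial E^{total}(t-1)}{\partial w_{ab}(1)}
+ \sum_k e_k(t) \frac{\partial e_k(t)}{\partial w_{ab}(1)}
= \frac{\partial E^{total}(t-1)}{\partial w_{ab}(1)}
- \sum_k e_k(t) p_{ab}^{o_k}(t),
\end{displaymath}

since $\partial e_k(t) / \partial w_{ab}(1) = - \partial o_k(t) / \partial w_{ab}(1) = - p_{ab}^{o_k}(t)$ by (6).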

Therefore, the remaining problem is to compute the $p_{ab}^{o_k}(t)$, which can be done by incrementally computing all $p_{ab}^{z_k}(t)$ and $q^{ij}_{ab}(t)$:


\begin{displaymath}
p_{ab}^{z_k}(1)= 0, ~~
p_{ab}^{x_k}(t+1) = 0,
\end{displaymath} (8)


\begin{displaymath}
p_{ab}^{y_k}(t+1)= f_{y_k}'(net_{y_k}(t+1)) \sum_l
\frac{\partial}
{\partial w_{ab}(1)} \left[ l(t)w_{y_kl}(t) \right] =
f_{y_k}'(net_{y_k}(t+1)) \sum_l
\left[
w_{y_kl}(t) p_{ab}^{l}(t) +
l(t) q_{ab}^{y_kl}(t)
\right],
\end{displaymath} (9)

where
\begin{displaymath}
q^{ij}_{ab}(1) = 1~if~w_{ab}=w_{ij},~and~0~otherwise,
\end{displaymath} (10)


\begin{displaymath}
q^{ij}_{ab}(t+1) =
\sigma_{ij}' (w_{ij}(t+1))
\left[
q^{ij}_{ab}(t) +
h'(i(t+1)) p_{ab}^i(t+1)g(j(t)) +
h(i(t+1)) p_{ab}^j(t)g'(j(t))
\right].
\end{displaymath} (11)

According to equations (8)-(11), the variables holding the $p^j_{ab}(t)$ and $q^{ij}_{ab}(t)$ values can be updated incrementally at each time step. This implies that the gradient in (5) can be accumulated incrementally, too. With non-degenerate networks, the algorithm's storage complexity is dominated by the number of variables needed to store the $q^{ij}_{ab}(t)$ values. This number is independent of the sequence length and equals $O(n_{conn}^2)$. Since $m = O(n_{conn})$, $R_{space} = O(m^2)$ (as with RTRL). The computational complexity per time step is also $O(n_{conn}^2)$, essentially the same as that of RTRL. Since $m = O(n_{conn})$, however, $R_{time} = O(m)$ (as with time-efficient BPTT, and unlike RTRL's much worse $R_{time} = O(m^3)$).
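Since the dynamics (3)-(4) and the functions $h$, $g$, $\sigma_{ij}$ are not restated in this section, the following sketch specializes the recursions to ordinary time-invariant weights: then each $q^{ij}_{ab}(t)$ keeps its indicator value from (10), and (8)-(9) collapse to the standard RTRL forward-mode recursion for the $p$-variables, with the gradient accumulated as in (7). The function name `rtrl_gradient`, the $\tanh$ activation, and the network layout are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

def rtrl_gradient(W, xs, ds):
    """Forward-mode gradient of E^total = sum_t 0.5 * sum_k e_k(t)^2 for a
    fully recurrent net y(t+1) = tanh(W @ [y(t); x(t)]), e(t) = d(t) - y(t+1).

    With time-invariant weights, q^{ij}_{ab}(t) keeps its indicator value
    from (10), so only the p-variables p[k,a,b] = dy_k(t)/dW[a,b] need to be
    propagated, as in (8)-(9)."""
    n, m = W.shape                       # n units; m = n + number of inputs
    y = np.zeros(n)                      # activations y(1) = 0
    p = np.zeros((n, n, m))              # p[k,a,b] = dy_k(t)/dW[a,b]; (8): p(1) = 0
    grad = np.zeros_like(W)              # accumulates dE^total/dW, cf. (7)
    for x, d in zip(xs, ds):
        z = np.concatenate([y, x])       # unit activations followed by inputs
        net = W @ z
        # (9) specialized: p_new[k,a,b] = f'(net_k) * (delta_{ka} z_b
        #                                 + sum_l W[k,l] p[l,a,b])
        p_new = np.einsum('kl,lab->kab', W[:, :n], p)
        p_new[np.arange(n), np.arange(n), :] += z
        p = (1.0 - np.tanh(net) ** 2)[:, None, None] * p_new
        y = np.tanh(net)
        e = d - y                        # e_k(t) = d_k(t) - o_k(t)
        grad -= np.einsum('k,kab->ab', e, p)   # (7): subtract sum_k e_k p^{o_k}
    return grad
```

Note how the complexity claims above show up even in this specialization: the `p` array alone holds $n \cdot n \cdot m$ sensitivities regardless of sequence length, and in the general fast-weight case the $q^{ij}_{ab}(t)$ variables dominate with $O(n_{conn}^2)$ storage.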


Juergen Schmidhuber 2003-02-21

