The following algorithm for minimizing the total error is partly inspired by conventional recurrent network algorithms (e.g. [2], [7]). The notation is partly inspired by [8].
Derivation.
Before training, all initial weights are randomly initialized. The chain rule serves to compute weight increments (to be performed after each training sequence) for all initial weights according to
(5)
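To illustrate the kind of chain-rule update meant here, a generic gradient-descent increment for an initial weight can be sketched as follows; the learning rate $\eta$, the total error $E^{total}$, and the notation $w_{ab}(0)$ for an initial weight are assumptions for this sketch, not necessarily the paper's symbols:

\[
\Delta w_{ab}(0) \;=\; -\,\eta\,\frac{\partial E^{total}}{\partial w_{ab}(0)}
\;=\; -\,\eta \sum_{t} \frac{\partial E(t)}{\partial w_{ab}(0)} ,
\]

where the sum runs over the time steps of the training sequence.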
We write
(6)
First note that
(7)
(8)
(9)
(10)
(11)
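As a hedged illustration of such time-step-wise recursions, consider the well-known sensitivity update of conventional RTRL (a stand-in only; the notation $p^{k}_{ij}(t) = \partial y_k(t)/\partial w_{ij}$ is assumed here and differs from the paper's equations):

\[
p^{k}_{ij}(t+1) \;=\; f_k'\big(net_k(t+1)\big)\Big[\sum_{l} w_{kl}\, p^{l}_{ij}(t) \;+\; \delta_{ik}\, z_j(t)\Big],
\qquad p^{k}_{ij}(0) = 0 ,
\]

where $z_j(t)$ denotes the $j$-th input to the units at time $t$, $net_k$ the weighted input sum of unit $k$, $f_k$ its activation function, and $\delta_{ik}$ the Kronecker delta. Each variable at time $t+1$ is computed from quantities already available at time $t$.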
According to equations (8)-(11), the variables holding the required derivative values can be updated incrementally at each time step. This implies that (5) can be updated incrementally, too. With non-degenerate networks, the algorithm's storage complexity is dominated by the number of variables needed to store these values. This number is independent of the sequence length, so the storage complexity is fixed in advance and of the same order as RTRL's. The computational complexity per time step is also essentially the same as that of RTRL. Relative to the number of time-varying variables, however, the algorithm's complexity is comparable to that of time-efficient BPTT, unlike RTRL's much worse ratio.
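A short sketch can make the fixed-size storage argument concrete. The following Python fragment implements the conventional RTRL recursion given above, not the paper's equations (7)-(11); all names, sizes, and the squared-error loss are illustrative assumptions. The sensitivity array p is allocated once, so its size is independent of the sequence length T:

# Sketch of incremental derivative bookkeeping in the style of
# conventional RTRL, used here only as a point of comparison.
# All names (n, T, eta, targets, ...) are hypothetical.
import numpy as np

n = 5                                     # number of units (assumed)
T = 100                                   # sequence length (assumed)
eta = 0.01                                # learning rate (assumed)
rng = np.random.default_rng(0)

W = rng.normal(scale=0.1, size=(n, n))    # recurrent weights
y = np.zeros(n)                           # unit activations
# Sensitivities p[k, i, j] = d y_k / d w_ij.  Their number, n**3,
# is fixed in advance -- independent of the sequence length T.
p = np.zeros((n, n, n))
grad = np.zeros((n, n))                   # accumulated dE/dW
targets = rng.normal(size=(T, n))         # dummy target sequence

for t in range(T):
    net = W @ y
    y_new = np.tanh(net)
    fprime = 1.0 - y_new**2
    # Incremental sensitivity update:
    # p_new[k,i,j] = f'(net_k) * (sum_l W[k,l] p[l,i,j] + delta_{ki} y_j)
    p_new = np.einsum('kl,lij->kij', W, p)
    for k in range(n):
        p_new[k, k, :] += y               # delta_{ki} * y_j term
    p = fprime[:, None, None] * p_new
    # Incrementally accumulate the squared-error gradient,
    # analogous to updating (5) at each time step.
    e = y_new - targets[t]
    grad += np.einsum('k,kij->ij', e, p)
    y = y_new

W -= eta * grad                           # increment after the sequence

The point of the sketch is storage: p holds n**3 values regardless of T, mirroring the fixed-size storage claim, while each step's cost is dominated by the contraction of W with p (O(n^4) per time step for full RTRL).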