
3. INITIAL LEARNING ALGORITHM

The following algorithm for minimizing $E^{total}$ is partly inspired by (but more complex than) conventional recurrent network algorithms (e.g. Robinson and Fallside [2]).

Derivation of the algorithm. We use the chain rule to compute weight increments (to be performed after each training sequence) for all initial weights $w_{ab}(1)$ according to

\begin{displaymath}
w_{ab}(1) \leftarrow
w_{ab}(1) - \eta \frac{\partial E^{total}(n_rn_s)}
{\partial w_{ab}(1)} ,
\end{displaymath} (5)

where $\eta$ is a constant positive `learning rate'. Thus we obtain an exact gradient-based algorithm for minimizing $E^{total}$ under the `self-referential' dynamics given by (1)-(4). To simplify notation, I introduce some shorthand partly inspired by Williams [7]. For all units $u$ and all weights $w_{ab}$, $w_{ij}$, we write
\begin{displaymath}
p_{ab}^u(t) = \frac{\partial u(t)}{\partial w_{ab}(1)} , ~~~~
q^{ij}_{ab}(t) = \frac{\partial w_{ij}(t)}{\partial w_{ab}(1)} .
\end{displaymath} (6)
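
For concreteness, a minimal NumPy sketch of one possible bookkeeping for these quantities (and for the update (5)) is given below. The flattening of connections to a single index, the toy sizes, and the names `n_units', `n_conn', `p', `q', `gradient_step' are illustrative assumptions, not part of the algorithm's specification.

\begin{verbatim}
import numpy as np

n_units, n_conn = 8, 16   # toy sizes, purely illustrative

# p[u, c] stands for p_ab^u(t): derivative of unit u's activation with
# respect to the c-th initial weight w_ab(1).
# q[c, d] stands for q^{ij}_ab(t): derivative of the c-th current weight
# w_ij(t) with respect to the d-th initial weight w_ab(1).
p = np.zeros((n_units, n_conn))   # all unit derivatives start at zero, cf. (8)
q = np.eye(n_conn)                # q^{ij}_ab(1) = 1 iff w_ij = w_ab, cf. (11) below

def gradient_step(w_initial, grad_E_total, eta=0.01):
    """(5): after a training sequence, move every initial weight
    against the accumulated gradient of E^total."""
    return w_initial - eta * grad_E_total
\end{verbatim}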

To begin with, note that

\begin{displaymath}
\frac{\partial E^{total}(1)}{\partial w_{ab}(1)} = 0, ~~~~
\frac{\partial E^{total}(t+1)}{\partial w_{ab}(1)} =
\frac{\partial E^{total}(t)}{\partial w_{ab}(1)} -
\sum_k eval_k(t+1)~ p_{ab}^{o_k}(t).
\end{displaymath} (7)
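
In the same illustrative notation, (7) simply adds one contribution to the running gradient at every step for which targets exist. The sketch below assumes a hypothetical index set `output_rows' selecting the rows of `p' that belong to the output units $o_k$.

\begin{verbatim}
import numpy as np

def accumulate_gradient(grad_E_total, eval_next, p, output_rows):
    """(7): add -sum_k eval_k(t+1) * p_ab^{o_k}(t) to the running gradient."""
    return grad_E_total - eval_next @ p[output_rows]   # shape: (n_conn,)
\end{verbatim}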

Therefore, the remaining problem is to compute the $p_{ab}^{o_k}(t)$, which can be done by incrementally computing all $p_{ab}^{z_k}(t)$ and $q^{ij}_{ab}(t)$, as we will see. We have
\begin{displaymath}
p_{ab}^{z_k}(1) = 0; ~~~
p_{ab}^{x_k}(t+1) = 0; ~~~
p_{ab}^{eval_k}(t+1) = - p_{ab}^{o_k}(t) ~~\mbox{if}~ d_k(t) ~\mbox{exists, and}~ 0 ~\mbox{otherwise},
\end{displaymath} (8)


\begin{displaymath}
p^{val}_{ab}(t+1) =
\sum_{i,j} \Big\{ ~
q^{ij}_{ab}(t)~ g(\Vert ana(t) - adr(w_{ij})\Vert^2) ~+~
w_{ij}(t) \Big[ g'(\Vert ana(t) - adr(w_{ij})\Vert^2) ~
2 \sum_m (ana_m(t) - adr_m(w_{ij}))~ p_{ab}^{ana_m}(t) \Big]
~ \Big\}
\end{displaymath} (9)

(where $adr_m(w_{ij})$ is the $m$-th bit of $w_{ij}$'s address),
\begin{displaymath}
p_{ab}^{y_k}(t+1) =
f_{y_k}'(net_{y_k}(t+1)) \sum_l \left[ w_{y_kl}(t)~ p_{ab}^{l}(t) +
l(t)~ q^{y_kl}_{ab}(t) \right] ,
\end{displaymath} (10)

where
\begin{displaymath}
q^{ij}_{ab}(1) = 1 ~~\mbox{if}~ w_{ab} = w_{ij}, ~\mbox{and}~ 0 ~\mbox{otherwise},
\end{displaymath} (11)


\begin{displaymath}
\forall t>1: ~~~
q^{ij}_{ab}(t) =
q^{ij}_{ab}(t-1) +
p_{ab}^{\bigtriangleup}(t-1)~ g(\Vert mod(t-1) - adr(w_{ij})\Vert^2)
+ 2 \bigtriangleup(t-1)~
g'(\Vert mod(t-1) - adr(w_{ij})\Vert^2)
\sum_m [mod_m(t-1) - adr_m(w_{ij})]~ p_{ab}^{mod_m}(t-1) .
\end{displaymath} (12)
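
The recursions (8)-(12) can be sketched in NumPy as follows, again under purely illustrative conventions: connections are flattened to a single index, `p[u, c]' stands for $p_{ab}^u(t)$, `q[c, d]' for $q^{ij}_{ab}(t)$, `adr' holds one address bit pattern per connection, and `g' is only a placeholder for the differentiable address-matching function used by the dynamics (1)-(4). None of the function or variable names below are part of the algorithm itself.

\begin{verbatim}
import numpy as np

def g(x):          # placeholder for the differentiable address-matching function g
    return np.exp(-x)

def g_prime(x):    # its derivative g'
    return -np.exp(-x)

def apply_boundary_cases(p_next, p, input_rows, eval_rows, output_rows, target_exists):
    """(8): input units never depend on the initial weights; eval units inherit
    the negated output derivatives whenever a target d_k(t) exists."""
    p_next[input_rows] = 0.0
    p_next[eval_rows] = np.where(target_exists[:, None], -p[output_rows], 0.0)
    return p_next

def p_val_update(w, q, adr, ana, p_ana):
    """(9): derivative of the `val' unit, assuming
    val(t+1) = sum_ij w_ij(t) g(||ana(t) - adr(w_ij)||^2)."""
    diff = ana[None, :] - adr                  # ana_m(t) - adr_m(w_ij), per connection and bit
    dist2 = np.sum(diff ** 2, axis=1)          # ||ana(t) - adr(w_ij)||^2
    term1 = g(dist2) @ q                       # sum_ij q^{ij}_ab(t) g(.)
    term2 = (w * g_prime(dist2)) @ (2.0 * diff @ p_ana)   # sum_ij w_ij(t) g'(.) 2 sum_m (.) p^{ana_m}
    return term1 + term2                       # one entry per initial weight w_ab(1)

def p_yk_update(f_prime_net, w_in, acts, p_src, q_in):
    """(10) for one non-input unit y_k, assuming net_{y_k}(t+1) = sum_l w_{y_k l}(t) l(t);
    w_in, acts, p_src, q_in all refer to the incoming connections l."""
    return f_prime_net * (w_in @ p_src + acts @ q_in)

def q_update(q_prev, delta_prev, p_delta, mod_prev, p_mod, adr):
    """(11)-(12): q starts as the identity (q^{ij}_ab(1) = 1 iff (i,j) = (a,b));
    afterwards each step adds the two correction terms of (12)."""
    diff = mod_prev[None, :] - adr             # mod_m(t-1) - adr_m(w_ij)
    dist2 = np.sum(diff ** 2, axis=1)
    term1 = np.outer(g(dist2), p_delta)        # p_ab^Delta(t-1) g(.)
    term2 = 2.0 * delta_prev * g_prime(dist2)[:, None] * (diff @ p_mod)
    return q_prev + term1 + term2
\end{verbatim}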

According to (8)-(12), the $p_{ab}^{j}(t)$ and $q^{ij}_{ab}(t)$ can be updated incrementally at each time step. This implies that the gradient in (5) can be accumulated incrementally at each time step as well. The storage complexity is independent of the sequence length and equals $O(n_{conn}^2)$. The computational complexity per time step (for sequences of arbitrary length) is $O(n_{conn}^2 \log n_{conn})$.
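
One possible per-time-step ordering of the sketched updates (using the hypothetical helper names introduced above) is outlined below; note that the contribution (7) must be added before $p$ is advanced, since it uses $p_{ab}^{o_k}(t)$ rather than $p_{ab}^{o_k}(t+1)$.

\begin{verbatim}
# Sketch of one training sequence; every name is the illustrative one from above.
#
# p = zeros, q = identity, grad_E_total = zeros                              # (8), (11)
# for t = 1, ..., T-1:
#     run the network dynamics (1)-(4): activations, val, Delta, w(t+1)
#     grad_E_total = accumulate_gradient(grad_E_total, eval(t+1), p, output_rows)   # (7)
#     p_next = apply_boundary_cases(...)                                     # (8)
#     p_next[val_row] = p_val_update(...)                                    # (9)
#     p_next[y_k]     = p_yk_update(...)   for every other non-input unit    # (10)
#     q = q_update(...)                                                      # (12)
#     p = p_next
# w_initial = gradient_step(w_initial, grad_E_total, eta)                    # (5)
\end{verbatim}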

