The following algorithm for minimizing the error function is partly inspired by conventional recurrent network algorithms (e.g., [2], [7]). The notation is partly inspired by [8].

*Derivation.*
Before training, all *initial* weights
are randomly initialized.
The chain rule serves to compute weight increments
(to be performed *after* each training sequence)
for all initial weights
according to

(5)

We write

(6)

First note that

(7)

(8)

(9)

(10)

(11)

According to equations (8)-(11), the variables holding the values defined above can be updated incrementally at each time step. This implies that (5) can be updated incrementally, too. With non-degenerate networks, the algorithm's storage complexity is dominated by the number of variables needed to store these values. This number is independent of the sequence length, as with RTRL (and unlike with BPTT, whose storage grows with the sequence length). The computational complexity per time step is of the same order as this count, essentially the same as RTRL's in the general case; under the stated size conditions, however, it behaves like that of time-efficient BPTT and avoids RTRL's much worse per-step cost (O(n^4) for a fully recurrent network with n units).
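Since the rendered equations are not reproduced here, the incremental scheme can be illustrated with a generic RTRL-style sketch for an ordinary recurrent net (this is not the paper's exact algorithm; the network shape, the tanh activation, and all variable names are assumptions for illustration). The sensitivity tensor `P[i, j, k]`, holding the derivative of hidden unit `i` with respect to weight `W[j, k]`, has a fixed size independent of the sequence length and is updated at every step; the sequence gradient for the chain rule in (5) is accumulated on the fly and applied only after the sequence ends:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 3                          # hidden units, input size (illustrative)
W = rng.normal(0.0, 0.3, (n, n))     # recurrent weights being trained
U = rng.normal(0.0, 0.3, (n, m))     # input weights (held fixed in this sketch)

def rtrl_pass(W, xs, ys):
    """One pass over a training sequence, RTRL-style.

    P[i, j, k] holds dh_i/dW_jk.  Its size (n^3 here) is fixed,
    independent of the sequence length, and updated incrementally
    at each time step.  Returns (loss, dloss/dW) for the sequence.
    """
    h = np.zeros(n)
    P = np.zeros((n, n, n))
    grad, loss = np.zeros_like(W), 0.0
    for x, y in zip(xs, ys):
        h_new = np.tanh(W @ h + U @ x)
        d = 1.0 - h_new ** 2                     # tanh derivative
        # P_new[i,j,k] = d_i * (sum_l W_il P[l,j,k] + delta_ij * h_k(t-1))
        P_new = np.einsum('il,ljk->ijk', W, P)
        P_new[np.arange(n), np.arange(n), :] += h
        P = d[:, None, None] * P_new
        h = h_new
        e = h - y                                # dE(t)/dh(t), squared error
        loss += 0.5 * float(e @ e)
        grad += np.einsum('i,ijk->jk', e, P)     # chain rule, as in (5)
    return loss, grad

# toy sequence; the weight increment is performed *after* the sequence
xs = [rng.normal(size=m) for _ in range(10)]
ys = [np.zeros(n)] * 10
loss, grad = rtrl_pass(W, xs, ys)
```

The storage here is the `P` tensor plus the accumulated gradient, both of fixed size, which is the property the complexity discussion above relies on; BPTT would instead store one activation vector per time step.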
