The following algorithm for minimizing the total error is partly inspired by conventional recurrent network algorithms (e.g. [2], [7]). The notation is partly inspired by [8].
Derivation.
Before training, all initial weights are randomly initialized. The chain rule serves to compute weight increments (to be performed after each training sequence) for all initial weights according to
(5)
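To illustrate the kind of chain-rule update meant here, a generic gradient-descent increment for an initial weight can be sketched as follows; the learning rate $\eta$, the total error $E^{total}$, and the notation $w_{ab}(0)$ for an initial weight are assumptions for this sketch, not necessarily the paper's symbols:

\[
\Delta w_{ab}(0) \;=\; -\,\eta\,\frac{\partial E^{total}}{\partial w_{ab}(0)}
\;=\; -\,\eta \sum_{t} \frac{\partial E(t)}{\partial w_{ab}(0)} ,
\]

where the sum runs over the time steps of the training sequence.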
We write
(6)
First note that
(7)
(8)
(9)
(10)
(11)
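As a hedged illustration of such time-step-wise recursions, consider the well-known sensitivity update of conventional RTRL (a stand-in only; the notation $p^{k}_{ij}(t) = \partial y_k(t)/\partial w_{ij}$ is assumed here and differs from the paper's equations):

\[
p^{k}_{ij}(t+1) \;=\; f_k'\big(net_k(t+1)\big)\Big[\sum_{l} w_{kl}\, p^{l}_{ij}(t) \;+\; \delta_{ik}\, z_j(t)\Big],
\qquad p^{k}_{ij}(0) = 0 ,
\]

where $z_j(t)$ denotes the $j$-th input to the units at time $t$, $net_k$ the weighted input sum of unit $k$, $f_k$ its activation function, and $\delta_{ik}$ the Kronecker delta. Each variable at time $t+1$ is computed from quantities already available at time $t$.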
According to equations (8)-(11), the variables holding the required derivative values can be updated incrementally at each time step. This implies that (5) can be updated incrementally, too. With non-degenerate networks, the algorithm's storage complexity is dominated by the number of variables needed to store these values. This number is independent of the sequence length, so the storage complexity is fixed in advance and of the same order as RTRL's. The computational complexity per time step is also essentially the same as that of RTRL. Relative to the number of time-varying variables, however, the algorithm's complexity is comparable to that of time-efficient BPTT, unlike RTRL's much worse ratio.
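A short sketch can make the fixed-size storage argument concrete. The following Python fragment implements the conventional RTRL recursion given above, not the paper's equations (7)-(11); all names, sizes, and the squared-error loss are illustrative assumptions. The sensitivity array p is allocated once, so its size is independent of the sequence length T:

# Sketch of incremental derivative bookkeeping in the style of
# conventional RTRL, used here only as a point of comparison.
# All names (n, T, eta, targets, ...) are hypothetical.
import numpy as np

n = 5                                     # number of units (assumed)
T = 100                                   # sequence length (assumed)
eta = 0.01                                # learning rate (assumed)
rng = np.random.default_rng(0)

W = rng.normal(scale=0.1, size=(n, n))    # recurrent weights
y = np.zeros(n)                           # unit activations
# Sensitivities p[k, i, j] = d y_k / d w_ij.  Their number, n**3,
# is fixed in advance -- independent of the sequence length T.
p = np.zeros((n, n, n))
grad = np.zeros((n, n))                   # accumulated dE/dW
targets = rng.normal(size=(T, n))         # dummy target sequence

for t in range(T):
    net = W @ y
    y_new = np.tanh(net)
    fprime = 1.0 - y_new**2
    # Incremental sensitivity update:
    # p_new[k,i,j] = f'(net_k) * (sum_l W[k,l] p[l,i,j] + delta_{ki} y_j)
    p_new = np.einsum('kl,lij->kij', W, p)
    for k in range(n):
        p_new[k, k, :] += y               # delta_{ki} * y_j term
    p = fprime[:, None, None] * p_new
    # Incrementally accumulate the squared-error gradient,
    # analogous to updating (5) at each time step.
    e = y_new - targets[t]
    grad += np.einsum('k,kij->ij', e, p)
    y = y_new

W -= eta * grad                           # increment after the sequence

The point of the sketch is storage: p holds n**3 values regardless of T, mirroring the fixed-size storage claim, while each step's cost is dominated by the contraction of W with p (O(n^4) per time step for full RTRL).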