The following algorithm for minimizing the objective is partly inspired by conventional recurrent network algorithms (e.g., ). The notation is partly inspired by .
Before training, all initial weights are randomly initialized. The chain rule serves to compute the weight increments (to be performed after each training sequence) for all initial weights. First note that
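The per-sequence weight increment described above can be written generically as follows; the learning rate $\alpha$, total sequence error $E_{\text{total}}$, and index pair $(l,m)$ are assumptions not fixed by this excerpt:

\[
\Delta w_{lm} \;=\; -\,\alpha\,\frac{\partial E_{\text{total}}}{\partial w_{lm}}
\;=\; -\,\alpha \sum_{t} \frac{\partial E(t)}{\partial w_{lm}},
\]

where the sum runs over the time steps $t$ of the training sequence, so that each sequence contributes exactly one increment per weight.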
According to equations (8)-(11), the variables holding the and values can be updated incrementally at each time step. This implies that (5) can be updated incrementally, too. With non-degenerate networks, the algorithm's storage complexity is dominated by the number of variables needed to store these values. This number is independent of the sequence length and equals . Since , it is (as with RTRL). The computational complexity per time step is likewise , essentially the same as that of RTRL. Since , however, it is (as with time-efficient BPTT, and unlike RTRL's much worse ).
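To illustrate what such sequence-length-independent incremental derivative bookkeeping looks like, here is a minimal RTRL-style sketch in Python. It is an assumption-laden illustration, not the paper's exact algorithm: the fully recurrent tanh network, the squared-error loss, and all names (`run_sequence`, `P`, etc.) are chosen for this example only. The sensitivity tensor is updated at every time step, so its storage does not grow with the sequence, and the weight increment is applied once per training sequence.

```python
import numpy as np

def run_sequence(W, xs, targets, lr=0.1):
    """One training sequence for a fully recurrent tanh net (hypothetical
    example, not the paper's algorithm).  W is the n x n weight matrix;
    xs and targets are lists of length-n input and target vectors."""
    n = W.shape[0]
    h = np.zeros(n)
    P = np.zeros((n, n, n))      # P[i, j, k] = d h[i] / d W[j, k]
    grad = np.zeros_like(W)      # accumulated d E_total / d W

    for x, t in zip(xs, targets):
        h_new = np.tanh(W @ h + x)
        d = 1.0 - h_new ** 2     # tanh'(pre-activation)

        # Incremental sensitivity update (chain rule through one step):
        # d h_new[i]/d W[j,k] = d[i] * (delta_ij*h[k] + sum_l W[i,l]*P[l,j,k])
        P_new = np.einsum('il,ljk->ijk', W, P)
        for j in range(n):
            P_new[j, j, :] += h
        P = d[:, None, None] * P_new
        h = h_new

        # Accumulate the gradient of E_t = 0.5 * ||h - t||^2 incrementally.
        e = h - t
        grad += np.einsum('i,ijk->jk', e, P)

    # Weight increment performed once, after the whole training sequence.
    return W - lr * grad
```

Note the storage pattern: `P` has a fixed size of n^3 floats no matter how long the sequence is, whereas BPTT-style algorithms would store one activation vector per time step.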