For $t > 0$ we define $E^{total}(t) = \sum_{\tau = 1}^{t} E(\tau)$, where $E(\tau)$ denotes the network's error at time step $\tau$; we set $E^{total}(0) = 0$.
Furthermore we define $net_k(t)$ to be the net input of non-input unit $k$ at time step $t$, and $w_{ij}$ to be the weight on the directed connection from unit $j$ to unit $i$.
The algorithm is a cross between the BPTT algorithm and the RTRL algorithm. The description of the algorithm will be interleaved with its derivation and some comments concerning complexity. The basic idea is: Decompose the calculation of the gradient into successive blocks, each covering $h$ time steps. For each block perform $n+1$ BPTT-like passes: one pass for calculating error derivatives, and $n$ passes for calculating the derivatives of the net inputs to the $n$ non-input units at the end of the block. Perform RTRL-like calculations for integrating the results of these BPTT-like passes into the results obtained from previous blocks.
The algorithm starts by setting the variable $t_0 \leftarrow 1$. $t_0$ represents the beginning of the current block. Note that $\frac{\partial net_k(t_0 - 1)}{\partial w_{ij}} = \frac{\partial net_k(0)}{\partial w_{ij}} = 0$ for all possible $i, j, k$. The main loop of the algorithm consists of 5 steps.
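As a concrete sketch of the quantities the algorithm carries from block to block, the following NumPy fragment initializes them for $t_0 = 1$. The array names `G` and `P`, and the unit count, are illustrative choices of mine, not notation from the text:

```python
import numpy as np

n = 4                    # number of non-input units (illustrative choice)
W = np.zeros((n, n))     # hypothetical weight matrix; W[i, j] = w_ij (unit j -> unit i)

# Quantities valid for time step t0 - 1 = 0, carried across blocks:
G = np.zeros((n, n))     # G[i, j] = dE_total(t0 - 1)/dw_ij; zero since E_total(0) = 0
P = np.zeros((n, n, n))  # P[k, i, j] = dnet_k(t0 - 1)/dw_ij; zero at t0 = 1
t0 = 1                   # beginning of the current block
```

At the end of each pass through the main loop, `G` and `P` are updated so that the same invariant holds for the new $t_0$.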
STEP1: Set $h$ (I recommend: $h = n$).
At this point, $\frac{\partial E^{total}(t_0 - 1)}{\partial w_{ij}}$ is already known and so is $\frac{\partial net_k(t_0 - 1)}{\partial w_{ij}}$ for all appropriate $i, j, k$. There is an efficient way of computing the contribution of the time steps $t_0, \ldots, t_0 + h - 1$ to the change in these derivatives.
STEP2: Let the network run from time step $t_0$ to time step $t_0 + h - 1$ according to the activation dynamics specified in equation (1). If it turns out that the current training sequence has less than $t_0 + h - 1$ time steps (i.e., it ends at some time step $\tau < t_0 + h - 1$) then set $h \leftarrow \tau - t_0 + 1$. If $h = 0$ then EXIT.
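Equation (1) itself is not reproduced in this excerpt. As a stand-in, a minimal sketch of STEP2 under one common choice of activation dynamics, $net(t) = W\,a(t-1)$, $a(t) = \tanh(net(t))$, is given below (an assumption of mine, and the function name is mine as well):

```python
import numpy as np

def run_block(W, a_prev, h):
    """STEP2 sketch: run the net for h steps starting from a_prev = a(t0 - 1),
    under the assumed dynamics net(t) = W a(t-1), a(t) = tanh(net(t)).
    Returns a(t0 - 1), ..., a(t0 + h - 1); the BPTT-like passes of STEP3
    and STEP4 need exactly these stored block activations."""
    acts = [a_prev]
    for _ in range(h):
        acts.append(np.tanh(W @ acts[-1]))
    return acts
```

Note that only the $h$ activations of the current block must be stored, which is what keeps the storage requirement fixed instead of growing with the sequence length.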
STEP3: Perform a combination of a BPTT-like phase with an RTRL-like calculation for computing error derivatives as described next. We write

$$
\frac{\partial E^{total}(t_0 + h - 1)}{\partial w_{ij}} =
\frac{\partial E^{total}(t_0 - 1)}{\partial w_{ij}} +
\left. \frac{\partial \bigl( E^{total}(t_0 + h - 1) - E^{total}(t_0 - 1) \bigr)}{\partial w_{ij}} \right|_{net_k(t_0 - 1)\ \mathrm{fixed}} +
\sum_k \frac{\partial \bigl( E^{total}(t_0 + h - 1) - E^{total}(t_0 - 1) \bigr)}{\partial net_k(t_0 - 1)} \, \frac{\partial net_k(t_0 - 1)}{\partial w_{ij}} . \quad (2)
$$
The first term of (2) is already known. Consider the third term: the factors $\frac{\partial net_k(t_0 - 1)}{\partial w_{ij}}$ are known from the previous iteration of the main loop, and the factors $\frac{\partial ( E^{total}(t_0 + h - 1) - E^{total}(t_0 - 1) )}{\partial net_k(t_0 - 1)}$ are obtained by a single BPTT-like pass from time step $t_0 + h - 1$ back to time step $t_0 - 1$. The same pass also yields the second term of (2).
STEP4: To compute $\frac{\partial net_k(t_0 + h - 1)}{\partial w_{ij}}$ for all possible $i, j, k$, perform $n$ combinations of a BPTT-like phase with an RTRL-like calculation (one such combination for each non-input unit $k$) for computing

$$
\frac{\partial net_k(t_0 + h - 1)}{\partial w_{ij}} =
\left. \frac{\partial net_k(t_0 + h - 1)}{\partial w_{ij}} \right|_{net_l(t_0 - 1)\ \mathrm{fixed}} +
\sum_l \frac{\partial net_k(t_0 + h - 1)}{\partial net_l(t_0 - 1)} \, \frac{\partial net_l(t_0 - 1)}{\partial w_{ij}} . \quad (3)
$$
For a given $k$, the computation of (3) for all $w_{ij}$ requires $O(hn^2 + n^3)$ operations. Therefore STEP3 and STEP4 together require $O(hn^3 + n^4)$ operations spread over $h$ time steps. Since $h = n$ (as recommended), $O(n^4)$ computations are spread over $n$ time steps. This implies an average of $O(n^3)$ computations per time step.
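With a per-unit cost of $O(hn^2 + n^3)$ for one BPTT-like pass plus its RTRL-like combination, the arithmetic behind the average is:

```latex
n \cdot O(hn^2 + n^3) \;=\; O(hn^3 + n^4)
\;\overset{h = n}{=}\; O(n^4) \text{ per block}
\;\Longrightarrow\; \frac{O(n^4)}{n \text{ time steps}} \;=\; O(n^3) \text{ per time step}.
```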
The final step of the algorithm's main loop is
STEP5: Set $t_0 \leftarrow t_0 + h$ and go to STEP1.
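The main loop above can be sketched end to end for a small fully recurrent tanh network. The following is a minimal illustration of mine, not the paper's code: it assumes the dynamics $net(t) = W\,a(t-1)$, $a(t) = \tanh(net(t))$ with squared error at every step, and all names (`G`, `P`, `block_gradient`, ...) are my own. The gradient it returns can be checked against a naive finite-difference gradient of the total error:

```python
import numpy as np

def total_error(W, a0, targets):
    # Forward run over the full sequence; E_total = sum_t 0.5 * ||a(t) - target(t)||^2.
    a, E = a0, 0.0
    for tgt in targets:
        a = np.tanh(W @ a)
        E += 0.5 * np.sum((a - tgt) ** 2)
    return E

def block_gradient(W, a0, targets, h):
    """dE_total/dW via the block-wise BPTT/RTRL hybrid (blocks of h steps)."""
    n = W.shape[0]
    G = np.zeros_like(W)             # dE_total(t0-1)/dw_ij, zero at the start
    P = np.zeros((n, n, n))          # P[k] = dnet_k(t0-1)/dw_ij, zero at the start
    a_prev, T, t0 = a0, len(targets), 0
    while t0 < T:                    # STEP1: h is fixed, chosen by the caller
        hb = min(h, T - t0)          # STEP2: shrink the final block if needed
        a = [a_prev]                 # a[s] = activation at step t0+s (a[0] = a(t0-1))
        for _ in range(hb):
            a.append(np.tanh(W @ a[-1]))
        tgt = targets[t0:t0 + hb]
        # STEP3: one BPTT-like pass through the block for error derivatives.
        delta = [None] * (hb + 1)    # delta[s] = dE_block/dnet(t0-1+s)
        da = a[hb] - tgt[hb - 1]
        for s in range(hb, 0, -1):
            delta[s] = da * (1.0 - a[s] ** 2)
            da = W.T @ delta[s]
            if s > 1:
                da += a[s - 1] - tgt[s - 2]
        g = da * (1.0 - a[0] ** 2)   # dE_block/dnet_k(t0-1), all k (times P = 0 at t0 = 1)
        direct = sum(np.outer(delta[s], a[s - 1]) for s in range(1, hb + 1))
        G = G + direct + np.einsum('k,kij->ij', g, P)           # the three terms of (2)
        # STEP4: n BPTT-like passes, one per non-input unit k.
        P_new = np.empty_like(P)
        for k in range(n):
            psi = [None] * (hb + 1)  # psi[s] = dnet_k(t0+hb-1)/dnet(t0-1+s)
            psi[hb] = np.eye(n)[k]
            for s in range(hb - 1, -1, -1):
                psi[s] = (W.T @ psi[s + 1]) * (1.0 - a[s] ** 2)
            dir_k = sum(np.outer(psi[s], a[s - 1]) for s in range(1, hb + 1))
            P_new[k] = dir_k + np.einsum('l,lij->ij', psi[0], P)  # the two terms of (3)
        P, a_prev = P_new, a[hb]
        t0 += hb                     # STEP5
    return G
```

Setting `h` equal to the sequence length makes the hybrid degenerate to a single plain BPTT pass, so both settings must yield the same gradient.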
The off-line version of the algorithm waits until the end of an episode (which need not be known in advance) before performing weight changes. An on-line version performs weight changes each time STEP4 is completed.
As formulated above, the algorithm needs $O(hn^3 + n^4)$ computations at its peak, every $h$-th time step. Nothing prevents us, however, from distributing these computations more evenly over the time steps. One way of achieving this is to perform one of the $n$ BPTT-like phases of STEP4 at each time step of the next `block' of $h$ time steps.
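The effect of this rescheduling can be made visible with a schematic operation-count model of mine (one BPTT-like pass over a block of $h$ steps charged $hn^2 + n^3$ abstract operations; the function name and constants are illustrative only):

```python
def per_step_costs(n, h, blocks, spread):
    """Abstract per-time-step operation counts. Each block boundary performs
    the STEP3 pass; the n STEP4 passes are either done all at once at the
    boundary ("peak" schedule) or deferred, one per time step of the next
    block ("spread" schedule, feasible when h >= n, e.g. h = n)."""
    pass_cost = h * n ** 2 + n ** 3   # cost of one BPTT-like pass over a block
    costs = []
    for _ in range(blocks):
        for step in range(h):
            c = 0
            if step == h - 1:         # block boundary
                c += pass_cost        # STEP3 pass
                if not spread:
                    c += n * pass_cost  # all n STEP4 passes at once (the peak)
            if spread and step < n:   # one deferred STEP4 pass per step
                c += pass_cost
            costs.append(c)
    return costs
```

Both schedules perform the same total work; only the per-step maximum changes, which is the point of the remark above.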