For t > 0 we define the network's activation dynamics

(1)    net_k(t) = sum_l w_kl y_l(t-1),    y_k(t) = f_k(net_k(t)),

where y_l(t) denotes the activation of unit l at time t (equal to the external input for input units), w_kl is the weight on the connection from unit l to unit k, and f_k is a differentiable activation function.

Furthermore we define the error at time t

E(t) = 1/2 sum_k (d_k(t) - y_k(t))^2,

where d_k(t) is the target of output unit k at time t and the sum runs over all output units with a target at time t, the error signal e_k(t) = y_k(t) - d_k(t) (taken to be zero for units without a target at time t), and the total error

E_total(t', t) = sum_{τ=t'}^{t} E(τ).

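As a concrete illustration, the dynamics (1) and the error measure can be sketched in NumPy as follows (the unit count, the tanh activation function, and the function names `step` and `error` are my own assumptions for the sketch, not fixed by the text):

```python
import numpy as np

def step(W, y_prev):
    """One step of the dynamics (1): net_k(t) = sum_l w_kl y_l(t-1), y_k(t) = f_k(net_k(t))."""
    net = W @ y_prev           # net inputs at time t
    y = np.tanh(net)           # tanh as an example differentiable f_k
    return net, y

def error(y, d):
    """E(t) = 1/2 sum_k (d_k(t) - y_k(t))^2, summed over units with a target."""
    return 0.5 * np.sum((d - y) ** 2)

# a tiny fully recurrent net with n = 3 units
rng = np.random.default_rng(0)
W = rng.normal(scale=0.5, size=(3, 3))
net1, y1 = step(W, np.ones(3))
```

Iterating `step` produces the trajectory y(1), y(2), ... used by the algorithm below.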
The algorithm is a cross between the BPTT and the RTRL algorithm. The description of the algorithm will be interleaved with its derivation and some comments concerning complexity. The basic idea is: Decompose the calculation of the gradient into blocks, each covering h time steps. For each block perform n+1 BPTT-like passes, one pass for calculating error derivatives, and n passes for calculating derivatives of the net-inputs to the n non-input units at the end of each block. Perform RTRL-like calculations for integrating the results of these BPTT-like passes into the results obtained from previous blocks.

The algorithm starts by setting the variable t0 = 0. t0 represents the beginning of the current block. Note that ∂net_k(t0)/∂w_ij = 0 for all possible i, j, k. The main loop of the algorithm consists of 5 steps.

STEP1: Set the block length h
(I recommend:
h = n, where n is the number of non-input units).

The quantity
∂net_k(t0)/∂w_ij
is already known for all i, j, k, and
∂E_total(1, t0)/∂w_ij is known
for all appropriate i, j.
There is an efficient way of computing the contribution
of
∂E_total(1, t0)/∂w_ij
to the change in w_ij:

Δw_ij = -α ∂E_total(1, t0)/∂w_ij,

where α is a learning rate constant.

STEP2: Let the network run from time step t0 to time step t0 + h
according to the activation dynamics specified in equation (1).
If it turns out that the current training
sequence has less than t0 + h time steps
(i.e., it ends at some time t_end < t0 + h) then set
h = t_end - t0. If h = 0 then EXIT.

STEP3:
Perform a combination of a BPTT-like phase
with an RTRL-like calculation
for computing error
derivatives
∂E_total(1, t0+h)/∂w_ij
as described next. We write

(2)    ∂E_total(1, t0+h)/∂w_ij = ∂E_total(1, t0)/∂w_ij + sum_k ∂net_k(t0)/∂w_ij · δ_k(t0) + sum_{τ=t0+1}^{t0+h} δ_i(τ) y_j(τ-1)

The first term of (2) is already known. Consider the third term:

sum_{τ=t0+1}^{t0+h} δ_i(τ) y_j(τ-1),

where

δ_k(τ) = ∂E_total(t0+1, t0+h)/∂net_k(τ).

The quantities δ_k(τ) can be computed for all k and all τ with t0 ≤ τ ≤ t0 + h with a single h-step BPTT-pass of the order of O(n²h) operations:

δ_k(t0+h) = f_k'(net_k(t0+h)) e_k(t0+h),
δ_k(τ) = f_k'(net_k(τ)) ( e_k(τ) + sum_l w_lk δ_l(τ+1) )    for t0 ≤ τ < t0+h,

where the term e_k(τ) is omitted for τ = t0, since E(t0) was accounted for by the previous block.

What remains is the computation of the second term of (2) for all i, j, which requires O(n³) operations:

sum_k ∂net_k(t0)/∂w_ij · δ_k(t0),

where the factors ∂net_k(t0)/∂w_ij are known from the previous block.

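Under the same illustrative assumptions as before (tanh units, NumPy arrays; the function name `step3`, the array `p[k, i, j]` holding the derivatives of the net-inputs at the block start, and `grad_prev` holding the first term of (2) are my own conventions), STEP3 might be sketched as:

```python
import numpy as np

def step3(W, nets, ys, ds, p, grad_prev):
    """One STEP3 pass over a block of h steps.

    nets[m], ys[m]: net inputs and activations at time t0 + m, m = 0..h.
    ds[m]:          target vector at time t0 + m (None if there is none).
    p[k, i, j]:     derivative of net_k(t0) w.r.t. w_ij, known from the previous block.
    grad_prev:      derivative of E_total(1, t0) w.r.t. w_ij, the first term of (2).
    Returns (grad, delta0), where grad is the left-hand side of (2).
    """
    h = len(nets) - 1
    fprime = lambda net: 1.0 - np.tanh(net) ** 2          # f' for tanh units
    e = lambda m: ys[m] - ds[m] if ds[m] is not None else np.zeros_like(ys[m])
    grad = grad_prev.copy()
    delta = np.zeros_like(ys[0])
    for m in range(h, -1, -1):                            # single h-step BPTT pass
        err = e(m) if m > 0 else 0.0                      # E(t0) belongs to the previous block
        delta = fprime(nets[m]) * (err + W.T @ delta)     # delta_k(t0 + m)
        if m > 0:
            grad += np.outer(delta, ys[m - 1])            # third term of (2)
    grad += np.einsum('k,kij->ij', delta, p)              # second term of (2)
    return grad, delta
```

The backward loop is the single h-step BPTT pass; the final `einsum` is the O(n³) RTRL-like combination with the quantities inherited from the previous block.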
STEP4: To compute ∂net_k(t0+h)/∂w_ij for all possible i, j, k, perform n combinations of a BPTT-like phase with an RTRL-like calculation (one such combination for each non-input unit k) for computing ∂net_k(t0+h)/∂w_ij as follows:

(3)    ∂net_k(t0+h)/∂w_ij = sum_l ∂net_l(t0)/∂w_ij · ∂net_k(t0+h)/∂net_l(t0) + sum_{τ=t0+1}^{t0+h} ∂net_k(t0+h)/∂net_i(τ) · y_j(τ-1)

For a given k, a given τ with t0 ≤ τ ≤ t0+h, and for all l the quantity ∂net_k(t0+h)/∂net_l(τ) can be computed with a single h-step BPTT-like operation of the order O(n²h):

∂net_k(t0+h)/∂net_l(t0+h) = 1 if l = k, 0 otherwise,
∂net_k(t0+h)/∂net_l(τ) = f_l'(net_l(τ)) sum_m w_ml ∂net_k(t0+h)/∂net_m(τ+1)    for t0 ≤ τ < t0+h.

For a given k, the computation of
(3) for all i, j requires O(n³) operations. Therefore
STEP3 and STEP4 together require
O(n³h + n⁴) operations
*spread over h time steps.*
Since h = n,
O(n⁴) computations are spread over n time steps.
*This implies
an average of O(n³) computations per time step*.
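STEP4 can be sketched along the same lines (again the tanh units, the function name `step4`, and the `p[k, i, j]` layout for the derivatives of the net-inputs are assumptions of the sketch); each iteration of the outer loop is one of the n BPTT-like passes:

```python
import numpy as np

def step4(W, nets, ys, p):
    """Propagate p[k, i, j] = d net_k(t0) / d w_ij to the end of the block via (3).

    nets[m], ys[m]: net inputs and activations at time t0 + m, m = 0..h.
    Returns p_new[k, i, j] = d net_k(t0 + h) / d w_ij.
    """
    n = W.shape[0]
    h = len(nets) - 1
    fprime = lambda net: 1.0 - np.tanh(net) ** 2         # f' for tanh units
    p_new = np.zeros_like(p)
    for k in range(n):                                   # one BPTT-like pass per unit k
        nu = np.zeros(n)
        nu[k] = 1.0                                      # nu_l = d net_k(t0+h) / d net_l(t0+h)
        p_new[k] += np.outer(nu, ys[h - 1])              # direct term of (3), tau = t0 + h
        for m in range(h - 1, -1, -1):
            nu = fprime(nets[m]) * (W.T @ nu)            # nu_l at time t0 + m
            if m > 0:
                p_new[k] += np.outer(nu, ys[m - 1])      # direct term of (3), tau = t0 + m
        p_new[k] += np.einsum('l,lij->ij', nu, p)        # first term of (3)
    return p_new
```

A convenient sanity check is that one application of `step4` over h steps agrees with h applications of the classic one-step RTRL recursion.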

The final step of the algorithm's main loop is

STEP5: Set t0 equal to t0 + h
and go to STEP1.
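Putting STEP1 through STEP5 together, the main loop might be sketched end-to-end as follows (NumPy; tanh units, targets at every time step, block length h = n, and the function name `block_gradient` are illustrative assumptions of the sketch):

```python
import numpy as np

def fprime(net):
    return 1.0 - np.tanh(net) ** 2  # f' for tanh units

def block_gradient(W, targets, y0, h):
    """Accumulate d E_total / d w_ij over a sequence, one block of h steps at a time.

    targets[t] is the target vector at time t (1-based; targets[0] unused).
    Returns the gradient accumulated by STEP3 across all blocks.
    """
    n = W.shape[0]
    T = len(targets) - 1
    grad = np.zeros((n, n))              # d E_total(1, t0) / d w_ij, initially t0 = 0
    p = np.zeros((n, n, n))              # p[k,i,j] = d net_k(t0) / d w_ij
    net0, y = np.zeros(n), y0            # net(t0) is unused at t0 = 0 since p = 0
    t0 = 0
    while True:                          # main loop, STEP1..STEP5
        hb = min(h, T - t0)              # STEP2: truncate the final block
        if hb == 0:
            return grad                  # EXIT
        nets, ys = [net0], [y]           # STEP2: run the dynamics (1) for hb steps
        for m in range(hb):
            net = W @ ys[-1]
            nets.append(net)
            ys.append(np.tanh(net))
        # STEP3: one BPTT pass for delta, plus the RTRL-like combination (2)
        delta = np.zeros(n)
        for m in range(hb, -1, -1):
            e = ys[m] - targets[t0 + m] if m > 0 else 0.0
            delta = fprime(nets[m]) * (e + W.T @ delta)
            if m > 0:
                grad += np.outer(delta, ys[m - 1])      # third term of (2)
        grad += np.einsum('k,kij->ij', delta, p)        # second term of (2)
        # STEP4: n BPTT-like passes for the new p, via (3)
        p_new = np.zeros_like(p)
        for k in range(n):
            nu = np.zeros(n)
            nu[k] = 1.0
            p_new[k] += np.outer(nu, ys[hb - 1])
            for m in range(hb - 1, -1, -1):
                nu = fprime(nets[m]) * (W.T @ nu)
                if m > 0:
                    p_new[k] += np.outer(nu, ys[m - 1])
            p_new[k] += np.einsum('l,lij->ij', nu, p)   # first term of (3)
        # STEP5
        p, net0, y, t0 = p_new, nets[hb], ys[hb], t0 + hb
```

With h equal to the full sequence length this degenerates to plain BPTT, so running it with a small h and with h = T and comparing the results is a simple consistency check.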

The off-line version of the algorithm waits until the end of an episode (which need not be known in advance) before performing weight changes. An on-line version performs weight changes each time STEP4 is completed.

As formulated above, the algorithm needs O(n⁴) computations at its peak, every h-th time step. Nothing prevents us, however, from distributing these computations more evenly over time steps. One way of achieving this is to perform one of the n BPTT-like phases of STEP4 at each time step of the next `block' of h time steps.
