
## DETAILS OF THE 2-NET CHUNKING ARCHITECTURE

The system described below is the on-line version of one representative of a number of variations of the basic principle described in section 4.1. See [Schmidhuber, 1991c] for various modifications.

Table 1 gives an overview of various time-dependent activation vectors relevant for the description of the algorithm. Additional notation: $\circ$ is the concatenation operator; $target(t) = 1$ if the teacher provides a target vector $d(t)$ at time $t$, and $target(t) = 0$ otherwise. If $target(t) = 0$, then $d(t)$ takes on some default value, e.g. the zero vector.
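For concreteness, these conventions can be collected in one display. The symbol names are the ones adopted in Table 1 below; reading A's input $i(t)$ as the concatenation $x(t) \circ d(t)$ is an assumption, though one consistent with the input-prediction formulation discussed after the table.

```latex
% Notation of this section: folding the target d(t) into A's input
% i(t) makes target prediction a special case of input prediction.
\[
  i(t) \;=\; x(t) \circ d(t), \qquad
  target(t) \;=\;
  \begin{cases}
    1, & \text{if the teacher provides a target vector } d(t) \text{ at time } t,\\
    0, & \text{otherwise (then } d(t) \text{ takes a default value, e.g. the zero vector)}.
  \end{cases}
\]
```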


Table 1: Definitions of symbols representing time-dependent activation vectors. $\circ$ is the concatenation operator. $d^A(t)$ and $x^A(t)$ are based on the previous inputs $i(1), \ldots, i(t-1)$ and are computed without knowledge about $x(t)$ and $d(t)$.
| vector | description (referring to time $t$) | dimension |
|---|---|---|
| $x(t)$ | `normal' environmental input | $n_x$ |
| $d(t)$ | teacher-defined target | $n_d$ |
| $i(t) = x(t) \circ d(t)$ | A's input | $n_x + n_d$ |
| $h^A(t)$ | A's hidden activations | $n^A_h$ |
| $d^A(t)$ | A's prediction of $d(t)$ | $n_d$ |
| $x^A(t)$ | A's prediction of $x(t)$ | $n_x$ |
| $u(t)$ | unique representation of time $t$ | $n_u$ |
| $h^C(t)$ | C's hidden activations | $n^C_h$ |
| $d^C(t)$ | C's prediction of C's next target input | $n_d$ |
| $x^C(t)$ | C's prediction of C's next `normal' input | $n_x$ |
| $u^C(t)$ | C's prediction of C's next `time' input | $n_u$ |
| $s^A(t)$ | A's prediction of C's state $s^C(t)$ | $n^C_h + n_d + n_x + n_u$ |

A has $n_x + n_d$ input units, $n^A_h$ hidden units, and $(n_d + n_x) + (n^C_h + n_d + n_x + n_u)$ output units for $d^A(t)$, $x^A(t)$, and $s^A(t)$ (see Table 1). With pure prediction tasks, $n_d = 0$. C has $n^C_h$ hidden units and $n_d + n_x + n_u$ output units. All of A's input and hidden units have directed connections to all of A's hidden and output units. All input units of A have directed connections to all hidden and output units of C, because A's input units serve as input units for C at certain time steps. There are $n_u$ additional input units for C that provide unique representations of the current time step; these additional input units also have directed connections to all hidden and output units of C. All hidden units of C have directed connections to all hidden and output units of C.
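To make these unit counts and the connectivity pattern concrete, here is a minimal, shape-only sketch; all numeric sizes and names (`n_x`, `W_A`, and so on) are illustrative assumptions, and no training logic is implied.

```python
import numpy as np

# Illustrative sizes (assumptions, not values from the paper).
n_x, n_d, n_u = 4, 2, 8        # `normal' input, target, time-code widths
nA_h, nC_h = 12, 12            # hidden units of automatizer A, chunker C

n_i    = n_x + n_d             # A's input i(t) = x(t) concat d(t)
nC_out = n_d + n_x + n_u       # C's outputs: d^C, x^C, u^C
nC_st  = nC_h + nC_out         # C's state: hidden plus output units
nA_out = n_d + n_x + nC_st     # A's outputs: d^A, x^A, s^A

rng = np.random.default_rng(0)

# A: every input and hidden unit of A connects to every hidden and
# output unit of A (rows: receiving units, columns: sending units).
W_A = rng.normal(0.0, 0.1, size=(nA_h + nA_out, n_i + nA_h))

# C: A's input units plus the n_u time-code input units connect to all
# of C's hidden and output units; C's hidden units also connect to all
# of C's hidden and output units, making C recurrent.
W_C = rng.normal(0.0, 0.1, size=(nC_h + nC_out, n_i + n_u + nC_h))

print(W_A.shape, W_C.shape)
```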

A will try to make $d^A(t)$ equal to $d(t)$ if $target(t) = 1$, and it will try to make $x^A(t)$ equal to $x(t)$, thus trying to predict $i(t)$. Here again the target prediction problem is defined as a special case of an input prediction problem. C will try to make $d^C(t)$ equal to the externally provided teaching vector $d(t)$ if $target(t) = 1$ and if A failed to emit $d(t)$. Furthermore, it will always try to make $x^C(t)$ and $u^C(t)$ equal to the next non-teaching input to be processed by C, i.e. to the `normal' and `time' components of C's next input. This input may be many time steps ahead. Finally, and most importantly, A will try to make $s^A(t)$ equal to $s^C(t) = h^C(t) \circ d^C(t) \circ x^C(t) \circ u^C(t)$, thus trying to predict the state of C. The activations of C's output units are considered as part of its state.
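Written as one error sum (the quadratic error measure is an assumed, conventional choice; the text only fixes the prediction targets), A's tasks at time $t$ are:

```latex
% A's three prediction tasks at time t, collected in one squared-error
% sum; target(t) gates the teacher term.  The quadratic form is an
% assumption, the three targets are those stated in the text.
\[
  E^A(t) \;=\; target(t)\,\lVert d(t) - d^A(t) \rVert^2
        \;+\; \lVert x(t) - x^A(t) \rVert^2
        \;+\; \lVert s^C(t) - s^A(t) \rVert^2 .
\]
```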

Both C and A are trained simultaneously by a conventional algorithm for recurrent networks in an on-line fashion. Both the IID algorithm and BPTT are appropriate. In particular, computationally inexpensive variants of BPTT [Williams and Peng, 1990] are interesting: there are tasks with hierarchical temporal structure where only a few iterations of `back-propagation back into time' per time step are in principle sufficient to bridge arbitrary time lags (see section 5).
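To illustrate what `computationally inexpensive' means here, the fragment below sketches one weight update of truncated BPTT in the spirit of [Williams and Peng, 1990], assuming a plain $\tanh$ recurrent layer with weight matrix `W`; the layer type, the names, and the learning rate are illustrative assumptions, not the paper's setup.

```python
import numpy as np

def truncated_bptt_step(W, states, inputs, delta_out, h_steps=2, lr=0.1):
    """One on-line update with truncated BPTT: the error delta_out
    arriving at the newest hidden state is propagated back through
    only h_steps previous time steps.  Assumes the toy dynamics
    h_k = tanh(W @ concat(x_k, h_{k-1})); states[-1] is the newest
    hidden state, states[-h_steps-1] the oldest one needed."""
    dW = np.zeros_like(W)
    delta = delta_out                             # dE/dh at newest step
    for k in range(1, h_steps + 1):
        h_prev = states[-k - 1]                   # h_{t-k}
        z = np.concatenate([inputs[-k], h_prev])  # unit's incoming vector
        delta = delta * (1.0 - states[-k] ** 2)   # back through tanh
        dW += np.outer(delta, z)                  # accumulate gradient
        delta = W[:, -h_prev.size:].T @ delta     # pass error to h_{t-k}
    W -= lr * dW                                  # gradient-descent step
    return W
```

With `h_steps` fixed, the per-step cost is constant in sequence length; the claim above is that one or two backward iterations can suffice for the hierarchical tasks of section 5.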

I now describe the (quite familiar) procedure for updating activations in a net.

Repeat for a constant number of iterations (typically one or two): every non-input unit computes its new activation by applying a differentiable activation function (e.g. the logistic function) to the sum of the activations of all units it receives directed connections from, each multiplied by the corresponding connection weight.

I now specify the input-output behavior of the chunker and the automatizer, as well as the details of error injection.
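Since the remaining specification is procedural, a compact sketch of one time step may make the interplay concrete. Everything named below is a stand-in: `A` and `C` are assumed recurrent nets with placeholder methods, `time_code(t)` is a hypothetical generator of the unique time representation $u(t)$, and the threshold `eps` is one possible way of deciding that A `failed'. This is an illustrative reading of the text above, not the paper's literal procedure.

```python
import numpy as np

def chunker_step(t, x_t, d_t, target_t, A, C, time_code, eps=0.05):
    """One on-line time step of the chunker/automatizer pair.
    A and C are stand-in recurrent nets assumed to expose:
      predict()       -> current output vectors,
      inject_error(e) -> store an error signal for the next update,
      consume(v)      -> update activations on input v,
      state()         -> hidden plus output activations (C only).
    time_code(t) is a hypothetical unique representation u(t)."""
    # A's predictions refer to time t but are computed before x(t)
    # and d(t) become visible.
    dA, xA, sA = A.predict()

    # Error injection for A: predict the current input i(t).
    A.inject_error(xA - x_t)
    if target_t:
        A.inject_error(dA - d_t)

    # C is fed (and trained) only where A failed to predict; the
    # threshold test is one possible definition of `failure'.
    a_failed = np.max(np.abs(xA - x_t)) > eps or (
        target_t and np.max(np.abs(dA - d_t)) > eps)
    if a_failed:
        dC, xC, uC = C.predict()   # emitted at C's previous update step
        # C's standing target: its next non-teaching input, which is
        # arriving only now, possibly many steps after the prediction.
        C.inject_error(xC - x_t)
        C.inject_error(uC - time_code(t))
        if target_t:               # teaching vector that A failed on
            C.inject_error(dC - d_t)
        C.consume(np.concatenate([x_t, d_t, time_code(t)]))

    # Finally, and most importantly: A tries to predict C's state.
    A.inject_error(sA - C.state())
    A.consume(np.concatenate([x_t, d_t]))
```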

