
## DETAILS OF THE 2-NET CHUNKING ARCHITECTURE

The system described below is the on-line version of one representative of a number of variations of the basic principle described in section 4.1. See [Schmidhuber, 1991c] for various modifications.

Table 1 gives an overview of various time-dependent activation vectors relevant for the description of the algorithm. Additional notation: $\circ$ is the concatenation operator; $target(t) = 1$ if the teacher provides a target vector $d(t)$ at time $t$, and $target(t) = 0$ otherwise. If $target(t) = 0$, then $d(t)$ takes on some default value, e.g. the zero vector.
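For concreteness, these conventions can be collected in one display. The symbol names are the ones adopted in Table 1 below; reading A's input $i(t)$ as the concatenation $x(t) \circ d(t)$ is an assumption, though one consistent with the input-prediction formulation discussed after the table.

```latex
% Notation of this section: folding the target d(t) into A's input
% i(t) makes target prediction a special case of input prediction.
\[
  i(t) \;=\; x(t) \circ d(t), \qquad
  target(t) \;=\;
  \begin{cases}
    1, & \text{if the teacher provides a target vector } d(t) \text{ at time } t,\\
    0, & \text{otherwise (then } d(t) \text{ takes a default value, e.g. the zero vector)}.
  \end{cases}
\]
```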


Table 1: Definitions of symbols representing time-dependent activation vectors. $\circ$ is the concatenation operator. $d^A(t)$ and $x^A(t)$ are based on the previous inputs $i(1), \ldots, i(t-1)$ and are computed without knowledge about $x(t)$ and $d(t)$.
| vector | description (referring to time $t$) | dimension |
|---|---|---|
| $x(t)$ | `normal' environmental input | $n_x$ |
| $d(t)$ | teacher-defined target | $n_d$ |
| $i(t) = x(t) \circ d(t)$ | A's input | $n_x + n_d$ |
| $h^A(t)$ | A's hidden activations | $n^A_h$ |
| $d^A(t)$ | A's prediction of $d(t)$ | $n_d$ |
| $x^A(t)$ | A's prediction of $x(t)$ | $n_x$ |
| $u(t)$ | unique representation of time $t$ | $n_u$ |
| $h^C(t)$ | C's hidden activations | $n^C_h$ |
| $d^C(t)$ | C's prediction of C's next target input | $n_d$ |
| $x^C(t)$ | C's prediction of C's next `normal' input | $n_x$ |
| $u^C(t)$ | C's prediction of C's next `time' input | $n_u$ |
| $s^A(t)$ | A's prediction of C's state $s^C(t)$ | $n^C_h + n_d + n_x + n_u$ |

A has $n_x + n_d$ input units, $n^A_h$ hidden units, and $(n_d + n_x) + (n^C_h + n_d + n_x + n_u)$ output units for $d^A(t)$, $x^A(t)$, and $s^A(t)$ (see Table 1). With pure prediction tasks, $n_d = 0$. C has $n^C_h$ hidden units and $n_d + n_x + n_u$ output units. All of A's input and hidden units have directed connections to all of A's hidden and output units. All input units of A have directed connections to all hidden and output units of C, because A's input units serve as input units for C at certain time steps. There are $n_u$ additional input units for C that provide unique representations of the current time step; these additional input units also have directed connections to all hidden and output units of C. All hidden units of C have directed connections to all hidden and output units of C.
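To make these unit counts and the connectivity pattern concrete, here is a minimal, shape-only sketch; all numeric sizes and names (`n_x`, `W_A`, and so on) are illustrative assumptions, and no training logic is implied.

```python
import numpy as np

# Illustrative sizes (assumptions, not values from the paper).
n_x, n_d, n_u = 4, 2, 8        # `normal' input, target, time-code widths
nA_h, nC_h = 12, 12            # hidden units of automatizer A, chunker C

n_i    = n_x + n_d             # A's input i(t) = x(t) concat d(t)
nC_out = n_d + n_x + n_u       # C's outputs: d^C, x^C, u^C
nC_st  = nC_h + nC_out         # C's state: hidden plus output units
nA_out = n_d + n_x + nC_st     # A's outputs: d^A, x^A, s^A

rng = np.random.default_rng(0)

# A: every input and hidden unit of A connects to every hidden and
# output unit of A (rows: receiving units, columns: sending units).
W_A = rng.normal(0.0, 0.1, size=(nA_h + nA_out, n_i + nA_h))

# C: A's input units plus the n_u time-code input units connect to all
# of C's hidden and output units; C's hidden units also connect to all
# of C's hidden and output units, making C recurrent.
W_C = rng.normal(0.0, 0.1, size=(nC_h + nC_out, n_i + n_u + nC_h))

print(W_A.shape, W_C.shape)
```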

A will try to make $d^A(t)$ equal to $d(t)$ if $target(t) = 1$, and it will try to make $x^A(t)$ equal to $x(t)$, thus trying to predict $i(t)$. Here again the target prediction problem is defined as a special case of an input prediction problem. C will try to make $d^C(t)$ equal to the externally provided teaching vector $d(t)$ if $target(t) = 1$ and if A failed to emit $d(t)$. Furthermore, it will always try to make $x^C(t)$ and $u^C(t)$ equal to the next non-teaching input to be processed by C, i.e. to the `normal' and `time' components of C's next input. This input may be many time steps ahead. Finally, and most importantly, A will try to make $s^A(t)$ equal to $s^C(t) = h^C(t) \circ d^C(t) \circ x^C(t) \circ u^C(t)$, thus trying to predict the state of C. The activations of C's output units are considered as part of its state.
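Written as one error sum (the quadratic error measure is an assumed, conventional choice; the text only fixes the prediction targets), A's tasks at time $t$ are:

```latex
% A's three prediction tasks at time t, collected in one squared-error
% sum; target(t) gates the teacher term.  The quadratic form is an
% assumption, the three targets are those stated in the text.
\[
  E^A(t) \;=\; target(t)\,\lVert d(t) - d^A(t) \rVert^2
        \;+\; \lVert x(t) - x^A(t) \rVert^2
        \;+\; \lVert s^C(t) - s^A(t) \rVert^2 .
\]
```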

Both C and A are trained simultaneously by a conventional algorithm for recurrent networks in an on-line fashion. Both the IID algorithm and BPTT are appropriate. In particular, computationally inexpensive variants of BPTT [Williams and Peng, 1990] are interesting: there are tasks with hierarchical temporal structure where only a few iterations of `back-propagation back into time' per time step are in principle sufficient to bridge arbitrary time lags (see section 5).
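To illustrate what `computationally inexpensive' means here, the fragment below sketches one weight update of truncated BPTT in the spirit of [Williams and Peng, 1990], assuming a plain $\tanh$ recurrent layer with weight matrix `W`; the layer type, the names, and the learning rate are illustrative assumptions, not the paper's setup.

```python
import numpy as np

def truncated_bptt_step(W, states, inputs, delta_out, h_steps=2, lr=0.1):
    """One on-line update with truncated BPTT: the error delta_out
    arriving at the newest hidden state is propagated back through
    only h_steps previous time steps.  Assumes the toy dynamics
    h_k = tanh(W @ concat(x_k, h_{k-1})); states[-1] is the newest
    hidden state, states[-h_steps-1] the oldest one needed."""
    dW = np.zeros_like(W)
    delta = delta_out                             # dE/dh at newest step
    for k in range(1, h_steps + 1):
        h_prev = states[-k - 1]                   # h_{t-k}
        z = np.concatenate([inputs[-k], h_prev])  # unit's incoming vector
        delta = delta * (1.0 - states[-k] ** 2)   # back through tanh
        dW += np.outer(delta, z)                  # accumulate gradient
        delta = W[:, -h_prev.size:].T @ delta     # pass error to h_{t-k}
    W -= lr * dW                                  # gradient-descent step
    return W
```

With `h_steps` fixed, the per-step cost is constant in sequence length; the claim above is that one or two backward iterations can suffice for the hierarchical tasks of section 5.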

I now describe the (quite familiar) procedure for updating activations in a net.

Repeat for a constant number of iterations (typically one or two): every non-input unit computes its new activation by applying a differentiable activation function (e.g. the logistic function) to the sum of the activations of all units it receives directed connections from, each multiplied by the corresponding connection weight.

I now specify the input-output behavior of the chunker and the automatizer, as well as the details of error injection.
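Since the remaining specification is procedural, a compact sketch of one time step may make the interplay concrete. Everything named below is a stand-in: `A` and `C` are assumed recurrent nets with placeholder methods, `time_code(t)` is a hypothetical generator of the unique time representation $u(t)$, and the threshold `eps` is one possible way of deciding that A `failed'. This is an illustrative reading of the text above, not the paper's literal procedure.

```python
import numpy as np

def chunker_step(t, x_t, d_t, target_t, A, C, time_code, eps=0.05):
    """One on-line time step of the chunker/automatizer pair.
    A and C are stand-in recurrent nets assumed to expose:
      predict()       -> current output vectors,
      inject_error(e) -> store an error signal for the next update,
      consume(v)      -> update activations on input v,
      state()         -> hidden plus output activations (C only).
    time_code(t) is a hypothetical unique representation u(t)."""
    # A's predictions refer to time t but are computed before x(t)
    # and d(t) become visible.
    dA, xA, sA = A.predict()

    # Error injection for A: predict the current input i(t).
    A.inject_error(xA - x_t)
    if target_t:
        A.inject_error(dA - d_t)

    # C is fed (and trained) only where A failed to predict; the
    # threshold test is one possible definition of `failure'.
    a_failed = np.max(np.abs(xA - x_t)) > eps or (
        target_t and np.max(np.abs(dA - d_t)) > eps)
    if a_failed:
        dC, xC, uC = C.predict()   # emitted at C's previous update step
        # C's standing target: its next non-teaching input, which is
        # arriving only now, possibly many steps after the prediction.
        C.inject_error(xC - x_t)
        C.inject_error(uC - time_code(t))
        if target_t:               # teaching vector that A failed on
            C.inject_error(dC - d_t)
        C.consume(np.concatenate([x_t, d_t, time_code(t)]))

    # Finally, and most importantly: A tries to predict C's state.
    A.inject_error(sA - C.state())
    A.consume(np.concatenate([x_t, d_t]))
```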

