
DETAILS OF THE 2-NET CHUNKING ARCHITECTURE

The system described below is the on-line version of one representative of a number of variations on the basic principle described in section 4.1. See [Schmidhuber, 1991c] for various modifications.

Table 1 gives an overview of various time-dependent activation vectors relevant for the description of the algorithm. Additional notation: `$\circ $' is the concatenation operator; $ \delta_d(t) =1$ if the teacher provides a target vector $d(t)$ at time $t$ and $ \delta_d(t) =0$ otherwise. If $ \delta_d(t) =0$ then $d(t)$ takes on some default value, e.g. the zero vector.
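For concreteness, these two notational conventions can be stated in a few lines of code. This is a minimal NumPy sketch of my own (the function names are assumptions, not taken from the paper):

    import numpy as np

    def concat(*vectors):
        # the `$\circ$' operator: concatenation of activation vectors
        return np.concatenate(vectors)

    def observed_target(d_t, n_D):
        # returns (d(t), delta_d(t)): delta_d(t) = 1 if the teacher provides a
        # target at time t; otherwise d(t) takes a default value (the zero vector)
        if d_t is None:
            return np.zeros(n_D), 0
        return np.asarray(d_t, dtype=float), 1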



Table 1: Definitions of symbols representing time-dependent activation vectors. `$\circ $' is the concatenation operator. $h_A(t)$ and $o_A(t)$ are based on previous inputs and are computed without knowledge about $d(t)$ and $x(t)$.

vector | description (referring to time $t$) | dimension
$x(t)$ | `normal' environmental input | $n_I$
$d(t)$ | teacher-defined target | $n_D$
$i_A(t) = x(t) \circ d(t)$ | A's input | $n_I + n_D$
$h_A(t)$ | A's hidden activations | $n_{H_A}$
$d_A(t)$ | A's prediction of $d(t)$ | $n_D$
$p_A(t)$ | A's prediction of $x(t)$ | $n_I$
$time(t)$ | unique representation of $t$ | $n_{time}$
$h_C(t)$ | C's hidden activations | $n_{H_C}$
$d_C(t)$ | C's prediction of C's next target input | $n_D$
$p_C(t)$ | C's prediction of C's next `normal' input | $n_I$
$s_C(t)$ | C's prediction of C's next `time' input | $n_{time}$
$o_C(t) = d_C(t) \circ p_C(t) \circ s_C(t)$ | C's output | $n_{O_C} = n_D + n_I + n_{time}$
$q_A(t)$ | A's prediction of $h_C(t) \circ o_C(t)$ | $n_{H_C} + n_{O_C}$
$o_A(t) = d_A(t) \circ p_A(t) \circ q_A(t)$ | A's output | $n_{O_A} = n_D + n_I + n_{H_C} + n_{O_C}$


A has $n_{I}+ n_D$ input units, $n_{H_A}$ hidden units, and $n_{O_A}$ output units (see Table 1). For pure prediction tasks, $n_D=0$. C has $n_{H_C}$ hidden units and $n_{O_C}$ output units. All of A's input and hidden units have directed connections to all of A's hidden and output units. All of A's input units also have directed connections to all hidden and output units of C, because A's input units serve as input units for C at certain time steps. C has $n_{time}$ additional input units that provide a unique representation of the current time step; these units, too, have directed connections to all hidden and output units of C. Finally, all hidden units of C have directed connections to all hidden and output units of C.
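To make this connectivity concrete, the following NumPy sketch (my own illustration; the variable names and example sizes are assumptions) allocates weight matrices with exactly the directed connections listed above, using the dimensions of Table 1:

    import numpy as np

    rng = np.random.default_rng(0)

    def init(n_to, n_from):
        # one block of directed connections, fully connected, small random weights
        return rng.uniform(-0.1, 0.1, size=(n_to, n_from))

    n_I, n_D, n_time = 4, 2, 8               # example sizes (arbitrary)
    n_HA, n_HC = 10, 12                      # hidden units of A and C
    n_OC = n_D + n_I + n_time                # o_C(t) = d_C(t) o p_C(t) o s_C(t)
    n_OA = n_D + n_I + n_HC + n_OC           # o_A(t) = d_A(t) o p_A(t) o q_A(t)

    # A: all of A's input and hidden units project to A's hidden and output units
    W_A_hidden = init(n_HA, (n_I + n_D) + n_HA)
    W_A_out    = init(n_OA, (n_I + n_D) + n_HA)

    # C: A's input units and the n_time time-representation units, together with
    # C's own hidden units, project to C's hidden and output units
    W_C_hidden = init(n_HC, (n_I + n_D) + n_time + n_HC)
    W_C_out    = init(n_OC, (n_I + n_D) + n_time + n_HC)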

A will try to make $d_A(t)$ equal to $d(t)$ whenever $ \delta_d(t) =1$, and it will always try to make $p_A(t)$ equal to $x(t)$; here again the target prediction problem is treated as a special case of an input prediction problem. C will try to make $d_C(t)$ equal to the externally provided teaching vector $d(t)$ if $ \delta_d(t) =1$ and A failed to emit $d(t)$. Furthermore, C will always try to make $p_C(t) \circ s_C(t)$ equal to the next non-teaching input to be processed by C, which may lie many time steps ahead. Finally, and most importantly, A will try to make $q_A(t)$ equal to $h_C(t) \circ o_C(t)$, thus trying to predict the state of C; the activations of C's output units are considered part of C's state.
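Stated as target vectors (a sketch of my own; the helper names are assumptions), A's composite target follows the layout of $o_A(t)$ in Table 1, and C's target combines the teacher vector with C's next non-teaching input:

    import numpy as np

    def target_for_A(d_t, x_t, h_C_t, o_C_t):
        # A's output o_A(t) = d_A(t) o p_A(t) o q_A(t) should match
        # d(t) o x(t) o (h_C(t) o o_C(t)); the d(t) part only carries
        # error when delta_d(t) = 1
        return np.concatenate([d_t, x_t, h_C_t, o_C_t])

    def target_for_C(d_t, next_normal_and_time_input):
        # C's output o_C(t) = d_C(t) o p_C(t) o s_C(t) should match the teacher
        # vector d(t) (when A failed to emit it) and C's next non-teaching input,
        # which may lie many time steps ahead
        return np.concatenate([d_t, next_normal_and_time_input])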

C and A are trained simultaneously, in an on-line fashion, by a conventional learning algorithm for recurrent networks; both the IID algorithm and BPTT are appropriate. Computationally inexpensive variants of BPTT [Williams and Peng, 1990] are of particular interest: there are tasks with hierarchical temporal structure for which only a few iterations of `back-propagation back into time' per time step are in principle sufficient to bridge arbitrary time lags (see section 5).
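As an illustration of such an inexpensive variant, here is a generic truncated-BPTT sketch of my own (not the paper's implementation): the error of a simple tanh recurrent net is propagated back only over a short history buffer at each time step.

    import numpy as np

    def truncated_bptt_step(W_h, W_o, history, target_t):
        # history: list of (input, hidden_state) pairs, oldest first, newest last;
        # its length bounds the number of `back-propagation into time' iterations
        # performed for this single time step (cf. Williams and Peng, 1990)
        x_t, h_t = history[-1]
        y_t = W_o @ np.concatenate([x_t, h_t])
        err = y_t - target_t                        # gradient of 0.5*||y_t - target||^2
        dW_o = np.outer(err, np.concatenate([x_t, h_t]))
        dh = W_o[:, x_t.size:].T @ err              # error arriving at h(t)
        dW_h = np.zeros_like(W_h)
        for (x_s, h_s), (_, h_prev) in zip(reversed(history), reversed(history[:-1])):
            pre = np.concatenate([x_s, h_prev])     # activations that produced h_s = tanh(W_h @ pre)
            dz = dh * (1.0 - h_s ** 2)              # tanh derivative
            dW_h += np.outer(dz, pre)
            dh = W_h[:, x_s.size:].T @ dz           # push the error one step further back
        return dW_h, dW_o

Keeping only two or three entries in the history buffer corresponds to the few iterations per time step mentioned above.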

I now describe the (quite familiar) procedure for updating the activations of a net $N$.



Repeat for a constant number of iterations (typically one or two):



1. For each non-input unit $j$ of $N$ compute $\hat a_j = f_j \left( \sum_i w_{ij} a_i \right)$, where the sum ranges over all units $i$ with a directed connection to $j$, $a_i$ is the current activation of unit $i$, $w_{ij}$ is the weight on the connection from $i$ to $j$, and $f_j$ is the activation function of unit $j$.

2. For all non-input units $j$: set $a_j$ equal to $\hat a_j$.
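In code, this amounts to a synchronous two-pass update. Here is a minimal NumPy sketch of my own (the weight-matrix convention is an assumption):

    import numpy as np

    def update_activations(a, W, input_mask, f=np.tanh, n_iterations=1):
        # a: activations of all units of N; W[j, i]: weight of the directed
        # connection from unit i to unit j (zero where no connection exists);
        # input_mask: boolean vector marking input units, whose activations stay clamped
        for _ in range(n_iterations):               # typically one or two iterations
            a_hat = f(W @ a)                        # step 1: compute \hat a_j for every unit j
            a = np.where(input_mask, a, a_hat)      # step 2: set a_j equal to \hat a_j (non-input units only)
        return a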

I now specify the input-output behavior of the chunker and the automatizer as well as the details of error injection:



INITIALIZATION: All weights are initialized randomly. [...]

[...] compute the partial derivatives $\frac{\partial \cdot}{\partial w_{ij}}$, and update C to obtain $h_C(t+1)$ and $o_C(t+1)$.
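The following is no more than a hedged sketch of one time step of the interaction, consistent with the description in this section; the objects A and C and their methods (predictions, observe, train, state) are hypothetical stand-ins, not the interface specified in the boxed procedure:

    import numpy as np

    def chunker_automatizer_step(A, C, x_t, d_t, delta_d, time_t, tol=1e-3):
        # A's predictions d_A(t), p_A(t), q_A(t) are based on previous inputs,
        # i.e. computed without knowledge of d(t) and x(t) (see Table 1)
        d_A, p_A, q_A = A.predictions()

        failed = (delta_d and np.max(np.abs(d_A - d_t)) > tol) \
                 or np.max(np.abs(p_A - x_t)) > tol
        if failed:
            # C receives A's current input plus the unique representation of the
            # current time step; its teacher vector is d(t) whenever A failed to
            # emit d(t), and it is trained to predict its own next input
            C.observe(np.concatenate([x_t, d_t, time_t]),
                      teacher=d_t if delta_d else None)

        # A is always trained to make o_A(t) match d(t) o x(t) o (h_C(t) o o_C(t)),
        # and then observes i_A(t) = x(t) o d(t)
        A.train(target=np.concatenate([d_t, x_t, C.state()]))
        A.observe(np.concatenate([x_t, d_t]))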

