
AN EXPERIMENT

Josef Hochreiter (a student at TUM) tested a chunking system against a conventional recurrent net algorithm. See [Hochreiter, 1991] and [Schmidhuber, 1991c] for details. A prediction task with a 20-step time lag was constructed. There were 22 possible input symbols: $a, x, b_1, b_2, \ldots, b_{20}$. The learning systems observed one input symbol at a time. There were only two possible input sequences, $a b_1 \ldots b_{20}$ and $x b_1 \ldots b_{20}$, presented to the learning systems in random order. At each time step, one goal was to predict the next input (note that in general it was not possible to predict the first symbol of each sequence, due to the random occurrence of $x$ and $a$). The second (and more difficult) goal was to make the activation of a particular output unit (the `target unit') equal to 1 whenever the last 21 processed input symbols were $a, b_1, \ldots, b_{20}$, and to make it 0 whenever the last 21 processed input symbols were $x, b_1, \ldots, b_{20}$. No episode boundaries were used: input sequences were fed to the learning systems without any information about their beginnings and ends, so the learning systems faced a continuous stream of input events.
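
For concreteness, the task can be sketched as follows (an illustrative Python fragment, not part of the original experiments; the names `symbol_stream`, `SYMBOLS`, and `SYM2IDX` are invented here):

```python
import random

SYMBOLS = ['a', 'x'] + [f'b{i}' for i in range(1, 21)]  # 22 symbols
SYM2IDX = {s: i for i, s in enumerate(SYMBOLS)}

def symbol_stream(num_sequences, seed=0):
    """Yield (symbol, next_symbol, target_unit) for a continuous stream.

    Each sequence is a b1..b20 or x b1..b20, chosen at random; no
    episode boundaries are inserted. target_unit is 1.0 (resp. 0.0)
    exactly when the last 21 symbols were a,b1..b20 (resp. x,b1..b20),
    and None at all other positions.
    """
    rng = random.Random(seed)
    stream, targets = [], []
    for _ in range(num_sequences):
        first = rng.choice(['a', 'x'])
        stream += [first] + [f'b{i}' for i in range(1, 21)]
        targets += [None] * 20 + [1.0 if first == 'a' else 0.0]
    for t, sym in enumerate(stream):
        nxt = stream[t + 1] if t + 1 < len(stream) else None
        yield sym, nxt, targets[t]
```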

With the conventional algorithm, no significant performance improvement on the target unit was obtained, despite various learning rates and more than 1,000,000 training sequences. Even a similar task with time lags of only 5 steps required many hundreds of thousands of training sequences.

The chunking system, in contrast, solved the 20-step task rather quickly, using an efficient approximation of the BPTT method in which error was propagated at most 3 steps into the past (despite the 20-step time lag). No unique representations of time steps were necessary for this task. 13 out of 17 test runs required fewer than 5,000 training sequences; the remaining runs required fewer than 35,000.
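
The truncation can be illustrated as follows (a minimal NumPy sketch, assuming a standard tanh recurrent net with a linear readout and squared-error loss; the paper's actual network and update rule may differ, and all names here are invented):

```python
import numpy as np

class TruncatedBPTTNet:
    """Recurrent net trained online with truncated BPTT: the backward
    pass only runs through the stored activations of the last
    `truncation` steps (3 in the experiment), however long the true
    time lag is. Hypothetical sketch, not the paper's code."""

    def __init__(self, n_in, n_hidden, n_out, truncation=3, lr=0.1, seed=0):
        rng = np.random.default_rng(seed)
        self.W_in = rng.normal(0.0, 0.1, (n_hidden, n_in))
        self.W_rec = rng.normal(0.0, 0.1, (n_hidden, n_hidden))
        self.W_out = rng.normal(0.0, 0.1, (n_out, n_hidden))
        self.truncation, self.lr = truncation, lr
        self.h = np.zeros(n_hidden)
        self.past = []  # (input, previous hidden state) of recent steps

    def step(self, x, y_target):
        """Forward one step, then backpropagate the squared error at
        most `truncation` steps into the past; returns the prediction."""
        h_prev = self.h
        self.h = np.tanh(self.W_in @ x + self.W_rec @ h_prev)
        y = self.W_out @ self.h                   # linear readout
        self.past = (self.past + [(x, h_prev)])[-self.truncation:]
        # accumulate gradients over the truncated history
        g_in, g_rec = np.zeros_like(self.W_in), np.zeros_like(self.W_rec)
        d_y = y - y_target
        g_out = np.outer(d_y, self.h)
        d_h = (self.W_out.T @ d_y) * (1.0 - self.h ** 2)
        for x_t, h_prev_t in reversed(self.past):
            g_in += np.outer(d_h, x_t)
            g_rec += np.outer(d_h, h_prev_t)
            d_h = (self.W_rec.T @ d_h) * (1.0 - h_prev_t ** 2)
        self.W_in -= self.lr * g_in
        self.W_rec -= self.lr * g_rec
        self.W_out -= self.lr * g_out
        return y
```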

Typically, A quickly learned to predict the `easy' symbols $b_2, \ldots, b_{20}$. This led to a greatly reduced input sequence for C, which then had little trouble learning to predict the target values at the end of the sequences. After a while, A was able to mimic C's internal representations, which in turn allowed it to learn correct target predictions by itself. A's final weight matrix often looked like the one one would hope to obtain from the conventional algorithm: there were hidden units that learned to bridge the 20-step time lag by means of strong self-connections. The chunking system needed less computation per time step than the conventional method, yet required far fewer training sequences.
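
The interplay of the two nets can be sketched at a high level as follows (a hypothetical fragment reusing `symbol_stream`, `SYM2IDX`, and the `TruncatedBPTTNet` class from the sketches above; blending A's target with C's output is only a crude stand-in for the paper's mechanism of training A to mimic C's internal representations):

```python
import numpy as np

def train_chunking_system(A, C, stream, n_symbols=22):
    """Outer loop of a 2-net chunking system (hypothetical sketch).

    A (the automatizer) sees every symbol and predicts the next one
    plus the target unit. Only symbols that A mispredicted are passed
    up to the chunker C, which therefore operates on a strongly
    compressed sequence. A's knowledge of C is built up by pulling A's
    targets toward C's outputs."""
    def one_hot(i):
        v = np.zeros(n_symbols + 1)   # one extra unit: the target unit
        v[i] = 1.0
        return v
    prev_pred = None
    for sym, nxt, target in stream:
        if nxt is None:
            continue                                  # end of stream
        x, y = one_hot(SYM2IDX[sym]), one_hot(SYM2IDX[nxt])
        if target is not None:
            y[n_symbols] = target                     # target-unit goal
        surprised = (prev_pred is None or
                     np.argmax(prev_pred[:n_symbols]) != SYM2IDX[sym])
        if surprised:                                 # compressed stream
            y_C = C.step(x, y)                        # C sees only surprises
            y = 0.5 * (y + y_C)                       # A also imitates C
        prev_pred = A.step(x, y)
```

For instance, `train_chunking_system(TruncatedBPTTNet(23, 20, 23), TruncatedBPTTNet(23, 20, 23), symbol_stream(5000))` mirrors the flow of the experiment, though none of these constants come from the paper.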

