With the conventional algorithm, even across various learning rates and more than 1,000,000 training sequences, no significant improvement in the target unit's performance could be obtained. A similar task with time lags of only 5 steps required many hundreds of thousands of training sequences.
A chunking system, by contrast, solved the 20-step task rather quickly, using an efficient approximation of the BPTT method in which error was propagated at most 3 steps into the past (even though the time lag was 20 steps). No unique representations of time steps were necessary for this task. Of 17 test runs, 13 required fewer than 5,000 training sequences; the remaining runs required fewer than 35,000.
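To make the truncation concrete, the following is a minimal sketch of BPTT with a bounded look-back. It assumes a single tanh recurrent layer and a loss defined on the final hidden state only; the function names, the layer sizes, and the random data are hypothetical and are not the networks used in the experiments. With a look-back of 3 on a 20-step sequence, the error signal only ever reaches steps 20, 19, and 18.

```python
import numpy as np

def forward(W, U, xs):
    """Run a tanh recurrent layer; return hidden states h_0..h_T (h_0 = 0)."""
    hs = [np.zeros(W.shape[0])]
    for x in xs:
        hs.append(np.tanh(W @ hs[-1] + U @ x))
    return hs

def truncated_bptt(W, U, xs, dL_dh_last, k):
    """Gradients of a loss on the final state, propagated at most k steps back."""
    hs = forward(W, U, xs)
    T = len(xs)
    dW, dU = np.zeros_like(W), np.zeros_like(U)
    steps_reached = []
    delta_h = dL_dh_last
    for t in range(T, max(T - k, 0), -1):
        delta_a = delta_h * (1.0 - hs[t] ** 2)   # back through tanh
        dW += np.outer(delta_a, hs[t - 1])       # recurrent weights
        dU += np.outer(delta_a, xs[t - 1])       # input weights (xs is 0-indexed)
        delta_h = W.T @ delta_a                  # error for the previous step
        steps_reached.append(t)
    return dW, dU, steps_reached

# Hypothetical toy setup: 4 hidden units, 2 inputs, 20 time steps.
rng = np.random.default_rng(0)
W = 0.5 * rng.standard_normal((4, 4))
U = rng.standard_normal((4, 2))
xs = [rng.standard_normal(2) for _ in range(20)]
c = np.ones(4)  # assumed loss: L = sum of final hidden activations

dW3, dU3, reached = truncated_bptt(W, U, xs, c, k=3)
print(reached)  # [20, 19, 18]
```

Setting `k` to the sequence length recovers full BPTT in this sketch; the cost per weight update grows with `k`, which is why a short look-back is cheaper.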
Typically, A quickly learned to predict the `easy' symbols. This greatly reduced the input sequence for C, which then had little difficulty learning to predict the target values at the end of the sequences. After a while, A was able to mimic C's internal representations, which in turn allowed it to learn correct target predictions by itself. A's final weight matrix often looked like what one would hope to get from the conventional algorithm: there were hidden units that had learned to bridge the 20-step time lags by means of strong self-connections. The chunking system needed less computation per time step than the conventional method, and yet it also required far fewer training sequences.
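The reduction of C's input sequence can be illustrated with a small sketch. It assumes a toy sequence consisting of one unpredictable start symbol followed by 20 deterministic symbols, and it replaces the trained network A with a hypothetical lookup table; the symbol names and the helper `reduce_sequence` are illustrative assumptions only. Symbols that the low-level predictor gets right are dropped, and only the unexpected ones are passed upward.

```python
def reduce_sequence(seq, predict):
    """Keep only the symbols that the low-level predictor fails to predict."""
    reduced = []
    prev = None
    for sym in seq:
        if predict(prev) != sym:
            reduced.append(sym)   # unexpected: pass on to the higher level
        prev = sym
    return reduced

# Hypothetical `easy' symbols b1..b20, each predictable from its predecessor.
easy = [f"b{i}" for i in range(1, 21)]
table = {"a": "b1", "x": "b1"}
table.update({easy[i]: easy[i + 1] for i in range(19)})

def predict(prev):
    # Stand-in for a trained A: returns None when no prediction is available.
    return table.get(prev)

full = ["a"] + easy
reduced = reduce_sequence(full, predict)
print(len(full), reduced)  # 21 ['a']
```

In this toy setting the higher level sees a single symbol instead of 21, so the information needed for the final target is no longer separated from the target by a long time lag.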