In this subsection, LSTM solves two further tasks that cannot be solved at all by any other recurrent net learning algorithm we are aware of.
Task 2a: two relevant, widely separated symbols. The goal is to classify sequences. Elements are represented locally (binary input vectors with only one non-zero bit). The sequence starts with an $E$, ends with a $B$ (the ``trigger symbol''), and otherwise consists of randomly chosen symbols from the set $\{a,b,c,d\}$ except for two elements at positions $t_1$ and $t_2$ that are either $X$ or $Y$. The sequence length is randomly chosen between 100 and 110, $t_1$ is randomly chosen between 10 and 20, and $t_2$ is randomly chosen between 50 and 60. There are 4 sequence classes $Q, R, S, U$ (locally represented targets) which depend on the temporal order of $X$ and $Y$. The rules are: $X,X \rightarrow Q$; $X,Y \rightarrow R$; $Y,X \rightarrow S$; $Y,Y \rightarrow U$.
Task 2b: three relevant, widely separated symbols. Again, the goal is to classify sequences. Elements are represented locally. The sequence starts with an $E$, ends with a $B$ (the ``trigger symbol''), and otherwise consists of randomly chosen symbols from the set $\{a,b,c,d\}$ except for three elements at positions $t_1$, $t_2$ and $t_3$ that are either $X$ or $Y$. The sequence length is randomly chosen between 100 and 110, $t_1$ is randomly chosen between 10 and 20, $t_2$ is randomly chosen between 33 and 43, and $t_3$ is randomly chosen between 66 and 76. There are 8 (locally represented) sequence classes $Q, R, S, U, V, A, B, C$ which depend on the temporal order of the $X$s and $Y$s. The rules are: $X,X,X \rightarrow Q$; $X,X,Y \rightarrow R$; $X,Y,X \rightarrow S$; $X,Y,Y \rightarrow U$; $Y,X,X \rightarrow V$; $Y,X,Y \rightarrow A$; $Y,Y,X \rightarrow B$; $Y,Y,Y \rightarrow C$.
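For concreteness, the following Python/NumPy sketch (not part of the original experiments) generates training examples for both tasks. The ordering of symbols in the local code and the binary indexing of the classes are illustrative assumptions; the text above only prescribes local (one non-zero bit) representations and the order-based rules.

\begin{verbatim}
import numpy as np

SYMBOLS = ['E', 'B', 'a', 'b', 'c', 'd', 'X', 'Y']   # assumed one-hot order

def one_hot(sym):
    v = np.zeros(len(SYMBOLS))
    v[SYMBOLS.index(sym)] = 1.0
    return v

def make_sequence(windows, rng):
    """windows = [(10, 20), (50, 60)] for Task 2a;
       windows = [(10, 20), (33, 43), (66, 76)] for Task 2b."""
    T = int(rng.integers(100, 111))          # length in [100, 110]
    seq = ['E'] + [str(s) for s in rng.choice(['a', 'b', 'c', 'd'],
                                              size=T - 2)] + ['B']
    relevant = []
    for lo, hi in windows:
        t = int(rng.integers(lo, hi + 1))    # relevant position in [lo, hi]
        seq[t] = str(rng.choice(['X', 'Y']))
        relevant.append(seq[t])
    # Assumed class indexing: bit i is set iff the i-th relevant symbol
    # is Y, i.e. X,X -> 0; Y,X -> 1; X,Y -> 2; Y,Y -> 3 for Task 2a.
    cls = sum(2 ** i for i, s in enumerate(relevant) if s == 'Y')
    target = np.zeros(2 ** len(windows))
    target[cls] = 1.0
    return np.stack([one_hot(s) for s in seq]), target

rng = np.random.default_rng(0)
x, y = make_sequence([(10, 20), (50, 60)], rng)   # one Task 2a example
\end{verbatim}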
With both tasks, error signals occur only at the end of a sequence. A sequence is classified correctly if the final absolute error of all output units is below 0.3.
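In code form, the acceptance criterion is simply (assuming the outputs and the locally coded target are vectors of equal length):

\begin{verbatim}
def classified_correctly(outputs, target, threshold=0.3):
    # Correct iff every output unit's final absolute error is below 0.3.
    return bool((np.abs(outputs - target) < threshold).all())
\end{verbatim}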
Architecture. We use a 3-layer net with 8 input units, 2 (3) cell blocks of size 2, and 4 (8) output units for task 2a (2b). Again, all non-input units are biased, and the output layer receives connections from memory cells only. Memory cells and gate units receive inputs from input units, memory cells, and gate units (i.e., the hidden layer is fully connected; less connectivity works as well).
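A minimal sketch of this architecture in PyTorch follows. Note that torch.nn.LSTM is a modern variant (with forget gates) rather than the original cell-block design, so the sketch only approximates the layer sizes and connectivity described above.

\begin{verbatim}
import torch
import torch.nn as nn

class OrderClassifier(nn.Module):
    """Task 2a sizes: 8 inputs, 2 cell blocks of size 2 (~ hidden size 4),
    4 outputs. For Task 2b use n_hidden=6 and n_classes=8."""
    def __init__(self, n_inputs=8, n_hidden=4, n_classes=4):
        super().__init__()
        self.lstm = nn.LSTM(n_inputs, n_hidden, batch_first=True)
        self.out = nn.Linear(n_hidden, n_classes)   # reads memory cells only

    def forward(self, x):                 # x: (batch, T, 8)
        h, _ = self.lstm(x)
        return torch.sigmoid(self.out(h[:, -1]))   # classify at the trigger
\end{verbatim}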
Training / Testing. The learning rate is 0.5 (0.1) for experiment 2a (2b). Training examples are generated on-line. Training is stopped once the average training error is below 0.1 and the 2000 most recent sequences have been classified correctly. Weights are initialized in $[-0.1, 0.1]$. The first (second) input gate bias is initialized with $-2.0$ ($-4.0$) (again, precise initialization values don't matter much).
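The following sketch mirrors this training regime under the same assumptions as above (plain SGD on squared error stands in for the exact gradient computation of the paper; make_sequence and classified_correctly are the earlier illustrative helpers, and the running error average is our own choice):

\begin{verbatim}
from collections import deque

def train_task2a(model, rng, lr=0.5, window=2000):
    # Paper-style initialization: weights in [-0.1, 0.1], input gate
    # biases pushed negative so the gates start out (nearly) closed.
    for p in model.parameters():
        nn.init.uniform_(p, -0.1, 0.1)
    H = model.lstm.hidden_size
    with torch.no_grad():
        model.lstm.bias_ih_l0[:H] -= 2.0    # PyTorch gate order: i, f, g, o
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    recent = deque(maxlen=window)           # correctness of recent sequences
    avg_err = 1.0
    while avg_err >= 0.1 or len(recent) < window or not all(recent):
        x, target = make_sequence([(10, 20), (50, 60)], rng)   # on-line
        x = torch.tensor(x, dtype=torch.float32).unsqueeze(0)
        target = torch.tensor(target, dtype=torch.float32).unsqueeze(0)
        pred = model(x)
        loss = ((pred - target) ** 2).mean()           # squared error
        opt.zero_grad(); loss.backward(); opt.step()
        recent.append(classified_correctly(pred.detach().numpy()[0],
                                           target.numpy()[0]))
        avg_err = 0.99 * avg_err + 0.01 * loss.item()  # running average
    return model
\end{verbatim}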
Results. With a test set consisting of 2560 randomly chosen sequences, the average test set error was always below 0.1, and there were never more than 3 incorrectly classified sequences. The following results are means of 20 trials: for task 2a (2b), training was stopped (see the stopping criterion in the previous paragraph) after an average of 31,390 (571,100) training sequences, after which only 1 (2) of the 2560 test sequences were misclassified (see the definition above). Obviously, LSTM is able to extract information conveyed by the temporal order of widely separated inputs.
Conclusion. For non-trivial tasks (where RS is infeasible), we recommend LSTM.