In this subsection, LSTM solves two further tasks that cannot be solved at all by any other recurrent net learning algorithm we are aware of.
Task 2a: two relevant, widely separated symbols. The goal is to classify sequences. Elements are represented locally (binary input vectors with only one non-zero bit). The sequence starts with an $E$, ends with a $B$ (the ``trigger symbol''), and otherwise consists of randomly chosen symbols from the set $\{a,b,c,d\}$ except for two elements at positions $t_1$ and $t_2$ that are either $X$ or $Y$. The sequence length is randomly chosen between 100 and 110, $t_1$ is randomly chosen between 10 and 20, and $t_2$ is randomly chosen between 50 and 60. There are 4 sequence classes $Q, R, S, U$ (locally represented targets) which depend on the temporal order of $X$ and $Y$. The rules are: $X,X \rightarrow Q$; $X,Y \rightarrow R$; $Y,X \rightarrow S$; $Y,Y \rightarrow U$.
Task 2b: three relevant, widely separated symbols. Again, the goal is to classify sequences. Elements are represented locally. The sequence starts with an $E$, ends with a $B$ (the ``trigger symbol''), and otherwise consists of randomly chosen symbols from the set $\{a,b,c,d\}$ except for three elements at positions $t_1$, $t_2$ and $t_3$ that are either $X$ or $Y$. The sequence length is randomly chosen between 100 and 110, $t_1$ is randomly chosen between 10 and 20, $t_2$ is randomly chosen between 33 and 43, and $t_3$ is randomly chosen between 66 and 76. There are 8 (locally represented) sequence classes $Q, R, S, U, V, A, B, C$ which depend on the temporal order of the $X$s and $Y$s. The rules are: $X,X,X \rightarrow Q$; $X,X,Y \rightarrow R$; $X,Y,X \rightarrow S$; $X,Y,Y \rightarrow U$; $Y,X,X \rightarrow V$; $Y,X,Y \rightarrow A$; $Y,Y,X \rightarrow B$; $Y,Y,Y \rightarrow C$.
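For concreteness, the following Python/NumPy sketch (not part of the original experiments) generates training examples for both tasks. The ordering of symbols in the local code and the binary indexing of the classes are illustrative assumptions; the text above only prescribes local (one non-zero bit) representations and the order-based rules.

\begin{verbatim}
import numpy as np

SYMBOLS = ['E', 'B', 'a', 'b', 'c', 'd', 'X', 'Y']   # assumed one-hot order

def one_hot(sym):
    v = np.zeros(len(SYMBOLS))
    v[SYMBOLS.index(sym)] = 1.0
    return v

def make_sequence(windows, rng):
    """windows = [(10, 20), (50, 60)] for Task 2a;
       windows = [(10, 20), (33, 43), (66, 76)] for Task 2b."""
    T = int(rng.integers(100, 111))          # length in [100, 110]
    seq = ['E'] + [str(s) for s in rng.choice(['a', 'b', 'c', 'd'],
                                              size=T - 2)] + ['B']
    relevant = []
    for lo, hi in windows:
        t = int(rng.integers(lo, hi + 1))    # relevant position in [lo, hi]
        seq[t] = str(rng.choice(['X', 'Y']))
        relevant.append(seq[t])
    # Assumed class indexing: bit i is set iff the i-th relevant symbol
    # is Y, i.e. X,X -> 0; Y,X -> 1; X,Y -> 2; Y,Y -> 3 for Task 2a.
    cls = sum(2 ** i for i, s in enumerate(relevant) if s == 'Y')
    target = np.zeros(2 ** len(windows))
    target[cls] = 1.0
    return np.stack([one_hot(s) for s in seq]), target

rng = np.random.default_rng(0)
x, y = make_sequence([(10, 20), (50, 60)], rng)   # one Task 2a example
\end{verbatim}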
With both tasks, error signals occur only at the end of a sequence. A sequence is classified correctly if the final absolute error of all output units is below 0.3.
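In code form, the acceptance criterion is simply (assuming the outputs and the locally coded target are vectors of equal length):

\begin{verbatim}
def classified_correctly(outputs, target, threshold=0.3):
    # Correct iff every output unit's final absolute error is below 0.3.
    return bool((np.abs(outputs - target) < threshold).all())
\end{verbatim}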
Architecture. We use a 3-layer net with 8 input units, 2 (3) cell blocks of size 2, and 4 (8) output units for task 2a (2b). Again, all non-input units are biased, and the output layer receives connections from memory cells only. Memory cells and gate units receive inputs from input units, memory cells, and gate units (i.e., the hidden layer is fully connected; less connectivity works as well).
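A minimal sketch of this architecture in PyTorch follows. Note that torch.nn.LSTM is a modern variant (with forget gates) rather than the original cell-block design, so the sketch only approximates the layer sizes and connectivity described above.

\begin{verbatim}
import torch
import torch.nn as nn

class OrderClassifier(nn.Module):
    """Task 2a sizes: 8 inputs, 2 cell blocks of size 2 (~ hidden size 4),
    4 outputs. For Task 2b use n_hidden=6 and n_classes=8."""
    def __init__(self, n_inputs=8, n_hidden=4, n_classes=4):
        super().__init__()
        self.lstm = nn.LSTM(n_inputs, n_hidden, batch_first=True)
        self.out = nn.Linear(n_hidden, n_classes)   # reads memory cells only

    def forward(self, x):                 # x: (batch, T, 8)
        h, _ = self.lstm(x)
        return torch.sigmoid(self.out(h[:, -1]))   # classify at the trigger
\end{verbatim}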
Training / Testing. The learning rate is 0.5 (0.1) for experiment 2a (2b). Training examples are generated on-line. Training is stopped once the average training error is below 0.1 and the 2000 most recent sequences have been classified correctly. Weights are initialized in $[-0.1, 0.1]$. The first (second) input gate bias is initialized with $-2.0$ ($-4.0$) (again, precise initialization values don't matter much).
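The following sketch mirrors this training regime under the same assumptions as above (plain SGD on squared error stands in for the exact gradient computation of the paper; make_sequence and classified_correctly are the earlier illustrative helpers, and the running error average is our own choice):

\begin{verbatim}
from collections import deque

def train_task2a(model, rng, lr=0.5, window=2000):
    # Paper-style initialization: weights in [-0.1, 0.1], input gate
    # biases pushed negative so the gates start out (nearly) closed.
    for p in model.parameters():
        nn.init.uniform_(p, -0.1, 0.1)
    H = model.lstm.hidden_size
    with torch.no_grad():
        model.lstm.bias_ih_l0[:H] -= 2.0    # PyTorch gate order: i, f, g, o
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    recent = deque(maxlen=window)           # correctness of recent sequences
    avg_err = 1.0
    while avg_err >= 0.1 or len(recent) < window or not all(recent):
        x, target = make_sequence([(10, 20), (50, 60)], rng)   # on-line
        x = torch.tensor(x, dtype=torch.float32).unsqueeze(0)
        target = torch.tensor(target, dtype=torch.float32).unsqueeze(0)
        pred = model(x)
        loss = ((pred - target) ** 2).mean()           # squared error
        opt.zero_grad(); loss.backward(); opt.step()
        recent.append(classified_correctly(pred.detach().numpy()[0],
                                           target.numpy()[0]))
        avg_err = 0.99 * avg_err + 0.01 * loss.item()  # running average
    return model
\end{verbatim}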
Results. With a test set consisting of 2560 randomly chosen sequences, the average test set error was always below 0.1, and there were never more than 3 incorrectly classified sequences. The following results are means of 20 trials: for task 2a (2b), training was stopped (see the stopping criterion in the previous paragraph) after an average of 31,390 (571,100) training sequences, after which only 1 (2) of the 2560 test sequences were misclassified (see the definition above). Obviously, LSTM is able to extract information conveyed by the temporal order of widely separated inputs.
Conclusion. For non-trivial tasks (where RS is infeasible), we recommend LSTM.