

EXPERIMENT 2: TEMPORAL ORDER

In this subsection, LSTM solves two further tasks that, to our knowledge, cannot be solved at all by any other recurrent net learning algorithm.

Task 2a: two relevant, widely separated symbols. The goal is to classify sequences. Elements are represented locally (binary input vectors with only one non-zero bit). The sequence starts with an $E$, ends with a $B$ (the ``trigger symbol'') and otherwise consists of randomly chosen symbols from the set $\{a,b,c,d\}$ except for two elements at positions $t_1$ and $t_2$ that are either $X$ or $Y$. The sequence length is randomly chosen between $100$ and $110$, $t_1$ is randomly chosen between $10$ and $20$, and $t_2$ is randomly chosen between $50$ and $60$. There are 4 sequence classes $Q,R,S,U$ (locally represented targets) which depend on the temporal order of $X$ and $Y$. The rules are: $X,X \rightarrow Q;~~ X,Y \rightarrow R;~~
Y,X \rightarrow S;~~ Y,Y \rightarrow U$.

Task 2b: three relevant, widely separated symbols. Again, the goal is to classify sequences. Elements are represented locally. The sequence starts with an $E$, ends with a $B$ (the ``trigger symbol''), and otherwise consists of randomly chosen symbols from the set $\{a,b,c,d\}$, except for three elements at positions $t_1$, $t_2$ and $t_3$ that are either $X$ or $Y$. The sequence length is randomly chosen between $100$ and $110$, $t_1$ is randomly chosen between $10$ and $20$, $t_2$ is randomly chosen between $33$ and $43$, and $t_3$ is randomly chosen between $66$ and $76$. There are 8 (locally represented) sequence classes $Q,R,S,U,V,A,B,C$ which depend on the temporal order of the $X$s and $Y$s. The rules are: $X,X,X \rightarrow Q;~~ X,X,Y \rightarrow R;~~
X,Y,X \rightarrow S;~~ X,Y,Y \rightarrow U;~~
Y,X,X \rightarrow V;~~ Y,X,Y \rightarrow A;~~
Y,Y,X \rightarrow B;~~ Y,Y,Y \rightarrow C$.
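
The data generation for both tasks is straightforward. Below is a minimal Python sketch, assuming a one-hot encoding over the 8 input symbols $\{a,b,c,d,E,B,X,Y\}$; the function names, the encoding order, and the class-index convention (orderings enumerated with $X$ as 0 and $Y$ as 1) are illustrative choices, not taken from the paper.

import numpy as np

# Local (one-hot) encoding over the 8 input symbols.
SYMBOLS = ['a', 'b', 'c', 'd', 'E', 'B', 'X', 'Y']
SYM_IDX = {s: i for i, s in enumerate(SYMBOLS)}

def one_hot(symbol):
    v = np.zeros(len(SYMBOLS))
    v[SYM_IDX[symbol]] = 1.0
    return v

def make_sequence(rng, position_ranges):
    # position_ranges: [(10, 20), (50, 60)] for task 2a,
    #                  [(10, 20), (33, 43), (66, 76)] for task 2b.
    # Returns (inputs, class_index); class_index enumerates the X/Y
    # orderings (all X -> 0, ..., all Y -> 2**k - 1), matching the order
    # Q, R, S, U (task 2a) and Q, R, S, U, V, A, B, C (task 2b).
    length = int(rng.integers(100, 111))            # length in [100, 110]
    symbols = [str(rng.choice(['a', 'b', 'c', 'd'])) for _ in range(length)]
    symbols[0] = 'E'                                # start symbol
    symbols[-1] = 'B'                               # trigger symbol at the end
    class_index = 0
    for low, high in position_ranges:
        t = int(rng.integers(low, high + 1))        # relevant position
        relevant = str(rng.choice(['X', 'Y']))
        symbols[t] = relevant
        class_index = 2 * class_index + (relevant == 'Y')
    inputs = np.stack([one_hot(s) for s in symbols])
    return inputs, class_index

rng = np.random.default_rng(0)
x, c = make_sequence(rng, [(10, 20), (50, 60)])     # one task 2a sequence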

With both tasks, error signals occur only at the end of a sequence. A sequence is classified correctly if the final absolute error of all output units is below 0.3.
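
Stated as code, this criterion is a per-unit tolerance check at the final time step (the names final_outputs and target below are illustrative):

import numpy as np

def classified_correctly(final_outputs, target, tol=0.3):
    # Correct iff the absolute error of every output unit at the
    # final time step is below the 0.3 tolerance used in the paper.
    return bool(np.all(np.abs(final_outputs - target) < tol))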

Architecture. We use a 3-layer net with 8 input units, 2 (3) memory cell blocks of size 2, and 4 (8) output units for task 2a (2b). Again, all non-input units are biased, and the output layer receives connections from memory cells only. Memory cells and gate units receive inputs from input units, memory cells and gate units (i.e., the hidden layer is fully connected; less connectivity works as well).
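
For concreteness, here is a minimal numpy sketch of one forward step of a single memory cell block of this kind (original LSTM: input and output gates, no forget gate), assuming the paper's usual squashing functions with $g$ scaled to $[-2,2]$ and $h$ to $[-1,1]$; the weight names and the per-block split of the weight matrices are illustrative.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cell_block_step(z, s_prev, W_in, b_in, W_out, b_out, W_c, b_c):
    # z      : inputs to the block = external input units plus the previous
    #          activations of all memory cells and gate units (fully
    #          connected hidden layer).
    # s_prev : previous internal states of the cells in this block.
    y_in = sigmoid(W_in @ z + b_in)           # input gate (shared by the block)
    y_out = sigmoid(W_out @ z + b_out)        # output gate (shared by the block)
    g = 4.0 * sigmoid(W_c @ z + b_c) - 2.0    # cell input, squashed to [-2, 2]
    s = s_prev + y_in * g                     # constant error carousel: no decay
    h = 2.0 * sigmoid(s) - 1.0                # cell state, squashed to [-1, 1]
    y = y_out * h                             # gated cell outputs of the block
    return y, s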

Training / Testing. The learning rate is 0.5 (0.1) for task 2a (2b). Training examples are generated on-line. Training is stopped once the average training error falls below 0.1 and the 2000 most recent sequences have been classified correctly. Weights are initialized in $[-0.1,0.1]$. The first (second) input gate bias is initialized with $-2.0$ ($-4.0$) (again, the precise initialization values do not matter much).
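
The stopping rule can be sketched as follows; train_step and make_example are hypothetical callables (one weight update on one freshly generated sequence, and the on-line sequence generator), and computing the average training error over the same 2000-sequence window is an assumption, since the paper does not specify the averaging window.

from collections import deque

def train_until_criterion(train_step, make_example, window=2000, err_thresh=0.1):
    # Stop once the average training error is below err_thresh and the
    # `window` most recent sequences were all classified correctly.
    recent_errors = deque(maxlen=window)
    recent_correct = deque(maxlen=window)
    n_sequences = 0
    while True:
        error, correct = train_step(make_example())   # on-line training example
        n_sequences += 1
        recent_errors.append(error)
        recent_correct.append(correct)
        if (len(recent_correct) == window
                and all(recent_correct)
                and sum(recent_errors) / window < err_thresh):
            return n_sequences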

Results. With a test set consisting of 2560 randomly chosen sequences, the average test set error was always below 0.1, and there were never more than 3 incorrectly classified sequences. The following results are means of 20 trials: for task 2a (2b), training was stopped (see the stopping criterion above) after on average 31,390 (571,100) training sequences, after which on average only 1 (2) of the 2560 test sequences were not classified correctly (see the definition above). Obviously, LSTM is able to extract information conveyed by the temporal order of widely separated inputs.

Conclusion. For non-trivial tasks (where random search (RS) is infeasible), we recommend LSTM.

