EXPERIMENT 2: TEMPORAL ORDER

In this subsection, LSTM solves further tasks that, as far as we are aware, cannot be solved at all by any other recurrent net learning algorithm.

**Task 2a: two relevant, widely separated symbols.**
The goal is to classify sequences. Elements are represented locally (binary input vectors with only one non-zero bit). The sequence starts with an E, ends with a B (the "trigger symbol"), and otherwise consists of randomly chosen symbols from the set {a, b, c, d}, except for two elements at positions t1 and t2 that are either X or Y. The sequence length is randomly chosen between 100 and 110, t1 is randomly chosen between 10 and 20, and t2 is randomly chosen between 50 and 60. There are 4 sequence classes Q, R, S, U (locally represented targets) which depend on the temporal order of X and Y. The rules are: X, X → Q; X, Y → R; Y, X → S; Y, Y → U.
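To make the data generation concrete, here is a minimal Python sketch of a Task 2a generator. The alphabet {E, B, X, Y, a, b, c, d} (8 symbols, matching the 8 input units of the architecture below) and the mapping of the orders XX, XY, YX, YY to class indices 0-3 (Q, R, S, U) are our reading of the rules above, not code from the original experiments.

```python
import random

SYMBOLS = ['E', 'B', 'X', 'Y', 'a', 'b', 'c', 'd']
# Local (one-hot) representation: one non-zero bit per symbol.
ONE_HOT = {s: [1.0 if i == j else 0.0 for j in range(len(SYMBOLS))]
           for i, s in enumerate(SYMBOLS)}

def generate_task_2a():
    """Return (sequence of one-hot input vectors, class index in {0,1,2,3})."""
    length = random.randint(100, 110)   # sequence length
    t1 = random.randint(10, 20)         # first relevant position
    t2 = random.randint(50, 60)         # second relevant position
    seq = [random.choice('abcd') for _ in range(length)]
    seq[0] = 'E'                        # start symbol
    seq[-1] = 'B'                       # trigger symbol
    relevant = [random.choice('XY') for _ in range(2)]
    seq[t1], seq[t2] = relevant
    # Classes Q, R, S, U correspond to the orders XX, XY, YX, YY.
    cls = 2 * (relevant[0] == 'Y') + (relevant[1] == 'Y')
    return [ONE_HOT[s] for s in seq], cls
```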

**Task 2b: three relevant, widely separated symbols.**
Again, the goal is to classify sequences. Elements are represented locally. The sequence starts with an E, ends with a B (the "trigger symbol"), and otherwise consists of randomly chosen symbols from the set {a, b, c, d}, except for three elements at positions t1, t2, and t3 that are either X or Y. The sequence length is randomly chosen between 100 and 110, t1 is randomly chosen between 10 and 20, t2 is randomly chosen between 33 and 43, and t3 is randomly chosen between 66 and 76. There are 8 sequence classes Q, R, S, U, V, A, B, C (locally represented targets) which depend on the temporal order of the Xs and Ys. The rules are: X, X, X → Q; X, X, Y → R; X, Y, X → S; X, Y, Y → U; Y, X, X → V; Y, X, Y → A; Y, Y, X → B; Y, Y, Y → C.
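The rules of both tasks simply enumerate all 2^k possible orders of the k relevant symbols; reading X as 0 and Y as 1 turns an order into a binary class index. A small helper capturing this observation (the index-to-letter correspondence follows the order in which the classes are listed above):

```python
def order_to_class(relevant):
    """Map a tuple of 'X'/'Y' symbols to a class index in [0, 2**k).

    Reading X as 0 and Y as 1 enumerates the classes in the order listed
    above (e.g. for Task 2b: XXX -> Q = 0, XXY -> R = 1, ..., YYY -> C = 7).
    """
    idx = 0
    for sym in relevant:
        idx = 2 * idx + (sym == 'Y')
    return idx
```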

With both tasks, error signals occur only at the end of a sequence. A sequence is classified correctly if the final absolute error of all output units is below 0.3.
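In code, this criterion inspects only the outputs at the final time step. A minimal sketch, assuming outputs and targets are plain lists of floats:

```python
def classified_correctly(final_outputs, target):
    """True iff the final absolute error of every output unit is below 0.3."""
    return all(abs(o - t) < 0.3 for o, t in zip(final_outputs, target))
```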

**Architecture.** We use a 3-layer net with 8 input units, 2 (3) cell blocks of size 2, and 4 (8) output units for Task 2a (2b). Again, all non-input units are biased, and the output layer receives connections from memory cells only. Memory cells and gate units receive inputs from input units, memory cells, and gate units (the hidden layer is fully connected; less connectivity works as well).
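For illustration, here is a rough PyTorch approximation of the Task 2a network. Note that torch.nn.LSTM is the modern variant (it adds forget gates and backpropagates through the whole unrolled net), not the original architecture described here, so this is a sketch of the setup rather than a reproduction; the class and parameter names are ours.

```python
import torch
import torch.nn as nn

class TemporalOrderNet(nn.Module):
    def __init__(self, n_inputs=8, n_cells=4, n_classes=4):
        super().__init__()
        # n_cells=4 mimics 2 cell blocks of size 2 for Task 2a.
        self.lstm = nn.LSTM(n_inputs, n_cells, batch_first=True)
        # The readout sees only the LSTM state (the gated memory cells),
        # mimicking "the output layer receives connections from memory
        # cells only".
        self.out = nn.Linear(n_cells, n_classes)

    def forward(self, x):  # x: (batch, time, n_inputs)
        h, _ = self.lstm(x)
        # Error signals occur only at the end of the sequence, so we read
        # the output at the final time step (the trigger symbol B).
        return torch.sigmoid(self.out(h[:, -1]))
```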

**Training / Testing.**
The learning rate is 0.5 (0.1) for experiment 2a (2b).
Training examples are generated on-line.
Training is stopped once the average training error is below 0.1 and the 2000 most recent sequences have been classified correctly. Weights are initialized in [-0.1, 0.1]. The first (second) input gate bias is initialized with -2.0 (-4.0) (again, precise initialization values don't matter much).
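Putting the pieces together, here is a sketch of the on-line training loop with this stopping criterion, reusing generate_task_2a and classified_correctly from above. Plain SGD and a mean squared error on the final outputs are our assumptions; the text only fixes the learning rate and the stopping rule.

```python
from collections import deque
import torch

def train(net, lr=0.5):
    opt = torch.optim.SGD(net.parameters(), lr=lr)
    correct = deque(maxlen=2000)  # correctness of the 2000 most recent sequences
    errors = deque(maxlen=2000)   # recent per-sequence training errors
    n = 0
    while True:
        seq, cls = generate_task_2a()             # on-line example generation
        x = torch.tensor([seq])                   # shape (1, time, 8)
        target = torch.zeros(4)
        target[cls] = 1.0                         # locally represented target
        y = net(x)[0]
        loss = ((y - target) ** 2).mean()         # error signal at sequence end only
        opt.zero_grad()
        loss.backward()
        opt.step()
        correct.append(classified_correctly(y.detach().tolist(), target.tolist()))
        errors.append(loss.item())
        n += 1
        # Stop once the average training error is below 0.1 and the 2000
        # most recent sequences were all classified correctly.
        if n >= 2000 and all(correct) and sum(errors) / len(errors) < 0.1:
            return n
```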

**Results.**
With a test set consisting of 2560 randomly chosen sequences,
the average test set error was always below 0.1, and
there were never more than 3 incorrectly classified sequences.
The following results are means of 20 trials: for Task 2a (2b), training was stopped (see the stopping criterion in the previous paragraph) after 31,390 (571,100) training sequences on average, and then only 1 (2) of the 2560 test sequences were not classified correctly (see the criterion above).
Obviously, LSTM is able to extract information
conveyed by the temporal order of widely separated inputs.
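A matching evaluation sketch, which generates a fresh test set of 2560 sequences and counts those misclassified under the criterion above (again an illustration, not the original evaluation code):

```python
def count_test_errors(net, n_test=2560):
    """Count misclassified sequences on a freshly generated test set."""
    wrong = 0
    with torch.no_grad():
        for _ in range(n_test):
            seq, cls = generate_task_2a()
            target = [1.0 if i == cls else 0.0 for i in range(4)]
            y = net(torch.tensor([seq]))[0].tolist()
            if not classified_correctly(y, target):
                wrong += 1
    return wrong
```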

**Conclusion.**
For non-trivial tasks (where random search is infeasible), we recommend LSTM.
