The experiment will show that LSTM can solve non-trivial, complex long time lag problems involving distributed, high-precision, continuous-valued representations.
Task. Each element of each input sequence is a pair of components. The first component is a real value randomly chosen from the interval $[-1, 1]$. The second component is either 1.0, 0.0, or -1.0 and serves as a marker: at the end of each sequence, the task is to output the sum of the first components of those pairs whose second component equals 1.0. The value $T$ determines the average sequence length: each sequence length is a randomly chosen integer between $T$ and $T + T/10$. Within a given sequence, exactly two pairs are marked as follows. We first randomly select and mark one of the first ten pairs (whose first component is called $X_1$). Then we randomly select and mark one of the first $T/2 - 1$ still unmarked pairs (whose first component is called $X_2$). The second components of the remaining pairs are zero, except for the first and final pair, whose second components are -1.0 ($X_1$ is set to zero in the rare case where the first pair of the sequence got marked). An error signal is generated only at the sequence end: the target is $0.5 + (X_1 + X_2)/4.0$ (the sum $X_1 + X_2$ scaled to the interval $[0, 1]$). A sequence is processed correctly if the absolute error at the sequence end is below 0.04.
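The data generation described above can be sketched as follows. This is a minimal illustration, not the authors' code. The function name `make_adding_sequence` is hypothetical, and the numeric ranges (values in $[-1, 1]$, lengths between $T$ and $T + T/10$, second mark among the first $T/2 - 1$ unmarked pairs, target $0.5 + (X_1 + X_2)/4$) are treated as assumptions of the sketch. It also assumes that a marked value falling on the first pair contributes zero to the target, whichever of the two marks it is.

```python
import random


def make_adding_sequence(T, rng=random):
    """Return (sequence, target) for one example of the adding task.

    sequence: list of (value, marker) pairs.
    target:   marked sum scaled to [0, 1] (assumed scaling).
    """
    if T < 20:
        raise ValueError("this sketch assumes T >= 20")

    length = rng.randint(T, T + T // 10)  # assumed length range, inclusive
    values = [rng.uniform(-1.0, 1.0) for _ in range(length)]
    markers = [0.0] * length

    # Mark one of the first ten pairs.
    first = rng.randrange(10)
    markers[first] = 1.0

    # Mark one of the first T/2 - 1 pairs that are still unmarked.
    unmarked = [i for i in range(length) if markers[i] == 0.0]
    second = rng.choice(unmarked[: T // 2 - 1])
    markers[second] = 1.0

    # Assumption: a mark on the very first pair is overridden by the
    # start marker, so its value counts as zero in the target.
    x1 = 0.0 if first == 0 else values[first]
    x2 = 0.0 if second == 0 else values[second]

    # First and final pairs always carry the marker -1.
    markers[0] = -1.0
    markers[-1] = -1.0

    target = 0.5 + (x1 + x2) / 4.0
    return list(zip(values, markers)), target


if __name__ == "__main__":
    seq, target = make_adding_sequence(100, random.Random(0))
    print(len(seq), target)
```

Passing a seeded `random.Random` instance keeps generation reproducible, which helps when comparing runs for different values of $T$.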
Architecture. We use a 3-layer net with 2 input units, 1 output unit, and 2 memory cell blocks of size 2 (a cell block size of 1 works well, too). The output layer receives connections only from memory cells. Memory cells and gate units receive inputs from memory cells and gate units (fully connected hidden layer).
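To make the connectivity concrete, here is a rough forward-pass sketch of a net with this shape: 2 inputs, 2 blocks of 2 cells, one input gate and one output gate per block, and 1 output unit fed only by the cells. Several details are assumptions rather than statements from the text: that hidden units also see the external inputs and a constant bias input, the cell input squashing range $(-2, 2)$, the cell output squashing range $(-1, 1)$, the logistic output unit, and the weight scale. All function names are hypothetical.

```python
import math
import random

N_IN, N_BLOCKS, BLOCK_SIZE = 2, 2, 2
N_CELLS = N_BLOCKS * BLOCK_SIZE
# Each hidden unit sees: inputs, previous cell outputs, previous input-gate
# and output-gate activations, and a constant 1.0 acting as bias input.
N_SRC = N_IN + N_CELLS + 2 * N_BLOCKS + 1


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def init_weights(rng, scale=0.1):
    """Random weights in [-scale, scale]; the scale is an assumption."""
    def rows(n_rows, n_cols):
        return [[rng.uniform(-scale, scale) for _ in range(n_cols)]
                for _ in range(n_rows)]
    return {
        "in_gate": rows(N_BLOCKS, N_SRC),
        "out_gate": rows(N_BLOCKS, N_SRC),
        "cell": rows(N_CELLS, N_SRC),
        "output": rows(1, N_CELLS + 1),  # output unit sees cells + bias only
    }


def initial_state():
    return {"s": [0.0] * N_CELLS, "y_cell": [0.0] * N_CELLS,
            "y_in": [0.0] * N_BLOCKS, "y_out": [0.0] * N_BLOCKS}


def step(weights, state, x):
    """Advance the net by one time step; returns (output, new_state)."""
    src = list(x) + state["y_cell"] + state["y_in"] + state["y_out"] + [1.0]
    y_in = [logistic(dot(w, src)) for w in weights["in_gate"]]
    y_out = [logistic(dot(w, src)) for w in weights["out_gate"]]

    s, y_cell = [], []
    for c in range(N_CELLS):
        b = c // BLOCK_SIZE  # cells in one block share that block's gates
        g = 4.0 * logistic(dot(weights["cell"][c], src)) - 2.0  # in (-2, 2)
        s_c = state["s"][c] + y_in[b] * g  # additive internal state update
        h = 2.0 * logistic(s_c) - 1.0      # in (-1, 1)
        s.append(s_c)
        y_cell.append(y_out[b] * h)

    output = logistic(dot(weights["output"][0], y_cell + [1.0]))
    new_state = {"s": s, "y_cell": y_cell, "y_in": y_in, "y_out": y_out}
    return output, new_state
```

The internal state is updated purely additively, so whatever the input gate lets through accumulates in `s`; this is why near-closed input gates matter for protecting stored values.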
State drift versus initial bias. Note that the task requires storing the precise values of real numbers for long durations -- the system must learn to protect memory cell contents against even minor ``internal state drifts''. Our simple but highly effective way of solving drift problems at the beginning of learning is to initially bias the input gate towards zero. There is no need to fine-tune the initial bias: with logistic sigmoid activation functions, the precise initial bias hardly matters because vastly different initial bias values produce almost the same near-zero activations. In fact, the system itself learns to generate the most appropriate input gate bias. To study the significance of the drift problem, we bias all non-input units, thus artificially inducing internal state drifts. Weights (including bias weights) are randomly initialized in the range $[-0.1, 0.1]$. The first (second) input gate bias is initialized with $-3.0$ ($-6.0$) (recall that the precise initialization values hardly matter, as confirmed by additional experiments).
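The claim that very different negative biases yield almost the same near-zero gate activations is easy to check numerically. The bias values in this small sketch are illustrative choices; the net input to the gate is assumed to be dominated by the bias at the start of learning, when the other weights are small.

```python
import math


def logistic(x):
    """Standard logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))


if __name__ == "__main__":
    # Illustrative negative biases: the resulting gate activations are all
    # close to zero even though the biases differ by a large factor.
    for bias in (-3.0, -6.0, -10.0):
        print(f"bias={bias:6.1f}  activation={logistic(bias):.6f}")
```

A bias of -3 gives an activation of roughly 0.047 and a bias of -6 roughly 0.0025, so in both cases the input gate starts almost closed.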
Training / Testing. The learning rate is 0.5. Training examples are generated on-line. Training is stopped if the average training error is below 0.01, and the 2000 most recent sequences were processed correctly (see definition above).
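One possible way to implement this stopping rule is sketched below. The text does not say how the average training error is computed, so the sketch assumes a mean absolute error over the same window of the 2000 most recent sequences. The class name `StoppingCriterion` and its parameters are hypothetical.

```python
from collections import deque


class StoppingCriterion:
    """Tracks recent sequence-end errors and decides when to stop training."""

    def __init__(self, window=2000, error_threshold=0.01, tolerance=0.04):
        self.errors = deque(maxlen=window)      # keeps only the latest errors
        self.error_threshold = error_threshold  # bound on the average error
        self.tolerance = tolerance              # "processed correctly" bound

    def update(self, output, target):
        """Record one sequence-end result; return True if training may stop."""
        self.errors.append(abs(output - target))
        return self.should_stop()

    def should_stop(self):
        if len(self.errors) < self.errors.maxlen:
            return False  # not enough sequences seen yet
        mean_error = sum(self.errors) / len(self.errors)
        all_correct = all(e < self.tolerance for e in self.errors)
        return mean_error < self.error_threshold and all_correct
```

In an on-line training loop, `update` would be called once per generated sequence, after the net's output at the final time step has been compared with the target.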
Results. With a test set of 2560 randomly chosen sequences, the average test set error was always below 0.01, and there were never more than 3 incorrectly processed sequences. The following results are means of 10 trials: for $T = 100$ ($T = 500$, $T = 1000$), training was stopped after 74,000 (209,000; 853,000) training sequences, and then only 1 (0, 1) of the test sequences was not processed correctly. For $T = 1000$, the number of required training examples varied between 370,000 and 2,020,000, exceeding 700,000 in only 3 cases.
The experiment demonstrates that, even for very long time lags, (1) LSTM is able to work well with distributed representations, and (2) LSTM is able to perform calculations involving high-precision, continuous values. Such tasks are impossible to solve within reasonable time by other algorithms: the main problem of gradient-based approaches (including TDNN and pseudo-Newton methods) is their inability to deal with very long minimal time lags (vanishing gradient). A main problem of ``global'' and ``discrete'' approaches (RS, Bengio's and Frasconi's EM-approach, discrete error propagation) is their inability to deal with high-precision, continuous values.