The experiment will show that LSTM can solve non-trivial, complex long time lag problems involving distributed, high-precision, continuous-valued representations.

Task. Each element of each input sequence is a pair of two components. The first component is a real value randomly chosen from the interval $[-1,1]$. The second component is 1.0, 0.0, or -1.0 and serves as a marker: at the end of each sequence, the task is to output the sum of the first components of those pairs whose second component equals 1.0. The value $T$ determines the minimal sequence length: the length of each sequence is a randomly chosen integer between $T$ and $T+\frac{T}{10}$. In a given sequence, exactly two pairs are marked as follows: we first randomly select and mark one of the first ten pairs (whose first component is called $X_1$). Then we randomly select and mark one of the first $\frac{T}{2}-1$ still unmarked pairs (whose first component is called $X_2$). The second components of the remaining pairs are zero, except for the first and final pair, whose second components are -1 ($X_1$ is set to zero in the rare case where the first pair of the sequence was marked). An error signal is generated only at the sequence end: the target is $0.5 + \frac{X_1 + X_2}{4.0}$ (the sum $X_1+X_2$ scaled to the interval $[0,1]$). A sequence is processed correctly if the absolute error at the sequence end is below 0.04.
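The task description above can be sketched as a small sequence generator. This is a minimal illustration, not the original experimental code: the names `make_sequence` and `is_correct` are hypothetical, and $T \ge 22$ is assumed so that the first ten pairs and the first $\frac{T}{2}-1$ pairs both exist. One detail is an assumption: the text only says that $X_1$ is set to zero when the first pair is marked, and the sketch applies the same rule to $X_2$, so that the target always equals the scaled sum of the pairs that visibly carry the marker 1.0.

```python
import random

def make_sequence(T, rng):
    """Return (pairs, target) for one sequence with minimal length T."""
    length = rng.randint(T, T + T // 10)           # integer in [T, T + T/10]
    values = [rng.uniform(-1.0, 1.0) for _ in range(length)]
    markers = [0.0] * length

    i1 = rng.randrange(10)                         # one of the first ten pairs
    # One of the first T/2 - 1 pairs that is still unmarked.
    i2 = rng.choice([i for i in range(T // 2 - 1) if i != i1])
    markers[i1] = 1.0
    markers[i2] = 1.0

    # First and final pair always carry the marker -1.
    markers[0] = -1.0
    markers[-1] = -1.0

    # Assumption: a mark that landed on the first pair contributes zero,
    # for X_2 as well as for X_1 (the text states this only for X_1).
    x1 = 0.0 if i1 == 0 else values[i1]
    x2 = 0.0 if i2 == 0 else values[i2]
    target = 0.5 + (x1 + x2) / 4.0                 # scaled to [0, 1]
    return list(zip(values, markers)), target

def is_correct(output, target, tolerance=0.04):
    """A sequence counts as correct if the final absolute error is below 0.04."""
    return abs(output - target) < tolerance
```

Calling `make_sequence(100, random.Random(0))` yields a list of between 100 and 110 `(value, marker)` pairs together with the target for the final time step.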

Architecture. We use a 3-layer net with 2 input units, 1 output unit, and 2 memory cell blocks of size 2 (a cell block size of 1 works well, too). The output layer receives connections only from memory cells. Memory cells and gate units receive inputs from memory cells and gate units (fully connected hidden layer).
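To make the size of this architecture concrete, the following sketch counts trainable weights. It rests on several assumptions that are not spelled out in this section: each memory cell block has one input gate and one output gate, every hidden unit (cell or gate) also receives connections from both input units, and every non-input unit has a bias weight. The function name `count_weights` is hypothetical.

```python
def count_weights(n_inputs=2, n_outputs=1, n_blocks=2, block_size=2,
                  gates_per_block=2):
    """Count weights under the stated assumptions (one input gate and one
    output gate per block, biases on all non-input units)."""
    cells = n_blocks * block_size
    gates = n_blocks * gates_per_block
    hidden = cells + gates
    # Each hidden unit sees all inputs, all hidden units, and a bias.
    hidden_weights = hidden * (n_inputs + hidden + 1)
    # The output unit sees only the memory cells, plus a bias.
    output_weights = n_outputs * (cells + 1)
    return hidden_weights + output_weights

print(count_weights())               # 2 blocks of size 2
print(count_weights(block_size=1))   # 2 blocks of size 1
```

Under these assumptions the network with block size 2 has 93 weights, and the variant with block size 1 has 57.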

State drift versus initial bias. Note that the task requires storing the precise values of real numbers for long durations -- the system must learn to protect memory cell contents against even minor ``internal state drifts''. Our simple but highly effective way of solving drift problems at the beginning of learning is to initially bias the input gate $in_j$ towards zero. There is no need to fine-tune the initial bias: with logistic sigmoid activation functions, the precise initial bias hardly matters because vastly different initial bias values produce almost the same near-zero activations. In fact, the system itself learns to generate the most appropriate input gate bias. To study the significance of the drift problem, we bias all non-input units, thus artificially inducing internal state drifts. Weights (including bias weights) are randomly initialized in the range $[-0.1,0.1]$. The first (second) input gate bias is initialized with $-3.0$ ($-6.0$) (recall that the precise initialization values hardly matter, as confirmed by additional experiments).
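The claim that the precise initial bias hardly matters can be checked numerically. The sketch below evaluates the logistic sigmoid at the two initial input gate biases, assuming for illustration that the remaining net input to the gate is negligible (the other weights start in $[-0.1,0.1]$).

```python
import math

def logistic(x):
    """Logistic sigmoid, with range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

for bias in (-3.0, -6.0):
    print(f"bias {bias:+.1f} -> gate activation {logistic(bias):.4f}")
```

This gives activations of roughly 0.047 and 0.0025: both gates start almost closed, although the two bias values differ by a factor of two.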

Training / Testing. The learning rate is 0.5. Training examples are generated on-line. Training is stopped if the average training error is below 0.01, and the 2000 most recent sequences were processed correctly (see definition above).
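The stopping rule can be sketched as a sliding-window check. The class name `StopCriterion` is hypothetical, and one detail is an assumption: the text does not say over which sequences the average training error is computed, so the sketch averages over the same window of most recent sequences.

```python
from collections import deque

class StopCriterion:
    """Signal a stop once the last `window` sequences were all processed
    correctly and their mean absolute error is below `max_mean_error`."""

    def __init__(self, window=2000, max_mean_error=0.01, tolerance=0.04):
        self.window = window
        self.max_mean_error = max_mean_error
        self.tolerance = tolerance
        self.errors = deque(maxlen=window)   # oldest errors drop out automatically

    def update(self, output, target):
        """Record one sequence-end error; return True if training may stop."""
        self.errors.append(abs(output - target))
        if len(self.errors) < self.window:
            return False
        all_correct = all(e < self.tolerance for e in self.errors)
        # Assumption: the average is taken over the same recent window.
        mean_error = sum(self.errors) / len(self.errors)
        return all_correct and mean_error < self.max_mean_error
```

In an on-line training loop, `update` would be called once per generated sequence with the network output at the final time step.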

Results. With a test set consisting of 2560 randomly chosen sequences, the average test set error was always below 0.01, and there were never more than 3 incorrectly processed sequences. The following results are means of 10 trials: For $T=100$ ($T=500$, $T=1000$), training was stopped after 74,000 (209,000; 853,000) training sequences, and then only 1 (0, 1) of the test sequences was not processed correctly. For $T=1000$, the number of required training examples varied between 370,000 and 2,020,000, exceeding 700,000 in only 3 cases.

The experiment demonstrates that, even for very long time lags, (1) LSTM is able to work well with distributed representations, and (2) LSTM is able to perform calculations involving high-precision, continuous values. Such tasks are impossible to solve within reasonable time by other algorithms: the main problem of gradient-based approaches (including TDNN, pseudo Newton) is their inability to deal with very long minimal time lags (vanishing gradient). A main problem of ``global'' and ``discrete'' approaches (RS, Bengio and Frasconi's EM approach, discrete error propagation) is their inability to deal with high-precision, continuous values.

Juergen Schmidhuber 2003-02-25
