Traditional recurrent nets fail in case of long minimal time lags between input signals and corresponding error signals [7,3]. Many recent papers propose alternative methods, e.g., [16,12,1,6,9]. For instance, Bengio et al. investigate methods such as simulated annealing, multi-grid random search, time-weighted pseudo-Newton optimization, and discrete error propagation . They also propose an EM approach . Quite a few papers use variants of the ``latch problem'' (and ``2-sequence problem'') to show the proposed algorithms's superiority, e.g. [3,1,6,9]. Some papers also use the ``parity problem'', e.g., [3,1]. Some of Tomita's  grammars are also often used as benchmark problems for recurrent nets [2,19,14,11].
Trivial versus non-trivial tasks. By our definition, a ``trivial'' task is one that can be solved quickly by random search (RS) in weight space. RS works as follows: REPEAT randomly initialize the weights and test the resulting net on a training set UNTIL solution found.
Random search (RS) details. In all our RS experiments, we randomly initialize weights in [-100.0,100.0]. Binary inputs are -1.0 (for 0) and 1.0 (for 1). Targets are either 1.0 or 0.0. All activation functions are logistic sigmoid in [0.0,1.0]. We use two architectures (A1, A2) suitable for many widely used ``benchmark'' problems: A1 is a fully connected net with 1 input, 1 output, and biased hidden units. A2 is like A1 with , but less densely connected: each hidden unit sees the input unit, the output unit, and itself; the output unit sees all other units; all units are biased. All activations are set to 0 at each sequence begin. We will indicate where we also use different architectures of other authors. All sequence lengths are randomly chosen between 500 and 600 (most other authors facilitate their problems by using much shorter training/test sequences). The ``benchmark'' problems always require to classify two types of sequences. Our training set consists of 100 sequences, 50 from class 1 (target 0) and 50 from class 2 (target 1). Correct sequence classification is defined as ``absolute error at sequence end below 0.1''. We stop the search once a random weight matrix correctly classifies all training sequences. Then we test on the test set (100 sequences). All results below are averages of 10 trials. In all our simulations below, RS finally classified all test set sequences correctly; average final absolute test set errors were always below 0.001 -- in most cases below 0.0001.
``2-sequence problem'' (and ``latch problem'') [3,1,9]. The task is to observe and classify input sequences. There are two classes. There is only one input unit or input line. Only the first N real-valued sequence elements convey relevant information about the class. Sequence elements at positions (we use ) are generated by a Gaussian with mean zero and variance 0.2. The first sequence element is 1.0 (-1.0) for class 1 (2). Target at sequence end is 1.0 (0.0) for class 1 (2) (the latch problem is a simple version of the 2-sequence problem that allows for input tuning instead of weight tuning).
Bengio et al.'s results. For the 2-sequence problem, the best method among the six tested by Bengio et al.  was multigrid random search (sequence lengths 50 -- 100; and stopping criterion undefined), which solved the problem after 6,400 sequence presentations, with final classification error 0.06. In more recent work, Bengio and Frasconi were able to improve their results: an EM-approach  was reported to solve the problem within 2,900 trials.
RS results. RS with architecture A2 (A1, ) solves the problem within only 718 (1247) trials on average. Using Bengio et al.'s architecture for the latch problem  (only 3 parameters), the problem was solved within only 22 trials on average, due to tiny parameter space. According to our definition above, the problem is trivial. RS outperforms Bengio et al.'s methods in every respect: (1) many fewer trials required, (2) much less computation time per trial. Also, in most cases (3) the solution quality is better (less error).
It should be mentioned, however, that different input representations and different types of noise may lead to worse RS performance (Yoshua Bengio, personal communication, 1996).
``Parity problem''. The parity task [3,1] requires to classify sequences with several 100 elements (only 1's or -1's) according to whether the number of 1's is even or odd. The target at sequence end is 1.0 for odd and 0.0 for even.
Bengio et al.'s results. For sequences with only 25-50 steps, among the six methods tested in  only simulated annealing was reported to achieve final classification error of 0.000 (within about 810,000 trials -- the authors did not mention the precise stopping criterion). A method called ``discrete error BP'' took about 54,000 trials to achieve final classification error 0.05. In more recent work , for sequences with 250-500 steps, their EM-approach took about 3,400 trials to achieve final classification error 0.12.
RS results. RS with A1 () solves the problem within only 2906 trials on average. RS with A2 solves it within 2797 trials. We also ran another experiment with architecture A2, but without self-connections for hidden units. RS solved the problem within 250 trials on average.
Again it should be mentioned that different input representations and noise types may lead to worse RS performance (Yoshua Bengio, personal communication, 1996).
Tomita grammars. Many authors also use Tomita's grammars  to test their algorithms. See, e.g., [2,19,14,11,10]. Since we already tested parity problems above, we now focus on a few ``parity-free'' Tomita grammars (nr.s #1, #2, #4). Previous work facilitated the problems by restricting sequence length. E.g., in , maximal test (training) sequence length is 15 (10). Reference  reports the number of sequences required for convergence (for various first and second order nets with 3 to 9 units): Tomita #1: 23,000 - 46,000; Tomita #2: 77,000 - 200,000; Tomita #4: 46,000 - 210,000. RS, however, clearly outperforms the methods in . The average results are: Tomita #1: 182 (A1, ) and 288 (A2), Tomita #2: 1,511 (A1, ) and 17,953 (A2), Tomita #4: 13,833 (A1, ) and 35,610 (A2).
Non-trivial tasks / Outline of remainder. The experiments in the remainder of this paper deal with non-trivial tasks whose solutions are sparse in weight space. They require either many free parameters (e.g., input weights) or high weight precision, such that RS becomes completely infeasible. All experiments involve long minimal time lags -- there are no short time lag training exemplars facilitating learning. To solve these tasks, however, we need a novel method called ``Long Short-Term Memory'', or LSTM for short . Section 3 will briefly review LSTM. Section 4 will present new results on tasks that cannot be solved at all by any other recurrent net learning algorithm we are aware of. LSTM can solve rather complex long time lag problems involving distributed, high-precision, continuous-valued representations, and is able to extract information conveyed by the temporal order of widely separated inputs.