

The task is to observe and classify input sequences. There are two classes and a single input unit (input line). See, e.g., Bengio et al. (1994); Bengio and Frasconi (1994); Lin et al. (1995). The latch problem was designed to demonstrate how gradient descent fails on tasks requiring long-term storage. We tested two variants, both with 3 free parameters.

Latch variant I. The recurrent net itself has a single free parameter: the recurrent weight of a single tanh unit. The remaining 2 free parameters are the initial input values, one for each sequence class. The output targets at sequence end are +0.8 and -0.8. Gaussian noise with mean zero and variance 0.2 is added to each sequence element except the first. Hence a large positive recurrent weight is necessary for long-term storage of the bit of information identifying the class, which is determined by the initial input. The initial input's absolute value must be large enough to latch the recurrent unit.
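The dynamics described above can be sketched as follows. This is an illustrative reconstruction, not the paper's exact code: the function name is hypothetical, and we assume sequence elements after the first are pure noise around zero.

```python
import math
import random

def run_latch_variant_I(w, x_class1, x_class2, seq_len=100, seed=0):
    """Latch variant I sketch: a single tanh unit whose only free
    network parameter is its recurrent weight w; the two initial
    input values (one per class) are the other two free parameters.
    Elements after the first are assumed to be Gaussian noise with
    mean 0 and variance 0.2.  Returns the final activation per class."""
    rng = random.Random(seed)
    finals = []
    for x0 in (x_class1, x_class2):
        a = math.tanh(x0)  # first element: no noise added
        for _ in range(seq_len - 1):
            noise = rng.gauss(0.0, 0.2 ** 0.5)  # std = sqrt(0.2)
            a = math.tanh(w * a + noise)
        finals.append(a)
    return finals
```

With a large positive recurrent weight and large-magnitude initial inputs of opposite sign, the unit saturates near +1 or -1 and the noise cannot dislodge it, so both targets are met.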

Results. RG solves the task within only 6 trials on average (mean of 40 simulations). This compares favorably with the 1600 trials reported by Bengio et al. (1994) for several methods. RG's success is due to the small number of parameters and the fact that in the search space $[-100,100]^3$ it almost suffices to get the parameter signs right (there are only 8 sign combinations).
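The random guessing procedure itself can be sketched as below. The acceptance test and function name are illustrative assumptions; only the uniform sampling from $[-100,100]^3$ and the $\pm 0.8$ targets come from the text.

```python
import math
import random

def random_guess_latch(max_trials=100000, seq_len=50, seed=0):
    """Random guessing (RG) sketch for latch variant I: repeatedly
    sample the 3 free parameters uniformly from [-100, 100]^3 and
    accept a guess once both classes end up on the correct side of
    the +0.8 / -0.8 targets.  Returns (trial_count, parameters)."""
    rng = random.Random(seed)
    for trial in range(1, max_trials + 1):
        w, x1, x2 = (rng.uniform(-100.0, 100.0) for _ in range(3))
        ok = True
        for x0, target in ((x1, 0.8), (x2, -0.8)):
            a = math.tanh(x0)
            for _ in range(seq_len - 1):
                a = math.tanh(w * a + rng.gauss(0.0, 0.2 ** 0.5))
            if (target > 0 and a < target) or (target < 0 and a > target):
                ok = False
                break
        if ok:
            return trial, (w, x1, x2)
    return None  # no solution found within max_trials
```

Because roughly one sign combination in eight is correct and most uniformly drawn magnitudes are large enough to latch, a solution is typically found within a handful of trials.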

Latch variant II. Sequences from class 1 start with 1.0; others with -1.0. The targets at sequence end are 1.0 and -1.0. The recurrent net has a single unit with tanh activation function. There are 3 incoming connections: one from itself, one from the input, and one bias connection (the inputs are not free parameters). Gaussian noise is added as in variant I.
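A minimal sketch of the variant II architecture, under the same assumptions as before (the function name is hypothetical, and later sequence elements are taken to be noise around zero):

```python
import math
import random

def latch_variant_II(w_rec, w_in, bias, seq_len=100, seed=0):
    """Latch variant II sketch: one tanh unit with three free
    parameters: a recurrent weight, an input weight, and a bias.
    Class-1 sequences start with 1.0, class-2 sequences with -1.0;
    all later elements are Gaussian noise (mean 0, variance 0.2),
    as in variant I.  Returns the final activation for each class."""
    rng = random.Random(seed)
    finals = []
    for x0 in (1.0, -1.0):
        a, x = 0.0, x0
        for _ in range(seq_len):
            a = math.tanh(w_rec * a + w_in * x + bias)
            x = rng.gauss(0.0, 0.2 ** 0.5)  # subsequent inputs: noise
        finals.append(a)
    return finals
```

Unlike variant I, the class-defining inputs are fixed at $\pm 1.0$, so all three weights must be guessed; this accounts for the higher trial count reported below.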

Results. RG solves variant II within 22 trials on average. This is better than the 1600 trials reported by Bengio et al. (1994) with several methods on the simpler variant I.

2-sequence problem. Only the first $N$ real-valued sequence elements convey relevant information about the class (the inputs are not free parameters). Sequence elements at positions $t > N$ are generated by a Gaussian with mean zero and variance 0.2. We consider two cases: $N=1$, and $N=3$ (used in Bengio et al. 1994). In the first case ($N=1$) we set the first sequence element to 1.0 for class 1, and -1.0 for class 2. The output neuron has a sigmoid activation function; the target at sequence end is 1.0 for class 1 and 0.0 for class 2. For $N=3$, two 3-vectors in $[-1,1]^3$ are randomly chosen in advance -- they represent the initial subsequences determining the sequence class. $N>1$ was chosen because the probability of obtaining random initial patterns with significant differences increases with $N$.
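The data generation described above can be sketched as follows; the function name and sequence length are illustrative assumptions.

```python
import random

def make_2seq_example(class_vectors, seq_len=20, rng=None):
    """2-sequence problem sketch: the first N elements carry the
    class information; all elements at positions t > N are Gaussian
    noise with mean 0 and variance 0.2.  `class_vectors` holds the
    two class-defining prefixes, e.g. ([1.0], [-1.0]) for N=1, or
    two fixed random vectors from [-1,1]^3 for N=3.
    Returns (sequence, class_index)."""
    rng = rng or random.Random()
    cls = rng.randrange(2)  # pick the class at random
    prefix = list(class_vectors[cls])
    tail = [rng.gauss(0.0, 0.2 ** 0.5) for _ in range(seq_len - len(prefix))]
    return prefix + tail, cls
```

For $N=3$ the two prefixes are drawn once, before training, and then reused for every generated sequence of the corresponding class.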

Previous Results. The best method among the six tested by Bengio et al. (1994) solved the 2-sequence problem ($N=3$) after 6400 sequence presentations, with a final classification error rate of 0.06. In subsequent work, Bengio and Frasconi (1994) improved on this: their EM approach was reported to solve the problem within 2900 trials.

Results with RG. RG with architecture A2 and $N=1$ solves the problem within 718 trials on average. RG with A1 ($n=1$ in the architecture) requires 1247 trials, and reaches zero classification error on all trials.

With $N=3$, however, RG requires around 68000 trials on average to find a solution (mean of 40 simulations). If we ignore the fact that RG trials require much less computation time than EM trials, EM appears faster. RG's difficulties stem from the continuous nature of the search space. With $N>1$, two tasks must be learned simultaneously: recognizing a multi-dimensional pattern (class 1 or 2), and latching it (storing one bit for the long term). The pattern recognition task is the harder one for RG.

Juergen Schmidhuber 2003-02-19