**Architectural issues.**
RG will not work well for complex architectures
with many free parameters, unless solutions are
rather dense in weight space.
In fact, RG tests are useful for discovering whether this
is the case. If it is, the problem will not be a
convincing benchmark.
Our experiments with A2 (which has comparatively many
hidden units and free parameters) indeed provide examples
of such cases.

A problem solvable by RG on a certain architecture, however, might still be a useful benchmark for a different architecture.
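For concreteness, here is a minimal sketch of the RG procedure discussed here: draw all weights at random from a fixed interval, test the resulting network, and repeat until success. The uniform sampling, the `evaluate` callback, and the trial budget below are illustrative assumptions, not the exact setup of our experiments.

```python
import random

def random_guess(n_weights, evaluate, bound=1.0, max_trials=100000):
    """Random guessing (RG): draw all weights i.i.d. from
    [-bound, bound] until `evaluate` accepts them or the trial
    budget runs out. Returns (weights, trials) or (None, trials)."""
    for trial in range(1, max_trials + 1):
        weights = [random.uniform(-bound, bound) for _ in range(n_weights)]
        if evaluate(weights):  # task-specific success criterion
            return weights, trial
    return None, max_trials
```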

**Flat minima.**
Successful RG typically hits ``flat minima''
of the error function
[Hochreiter and Schmidhuber, 1997b],
because flat minima correspond to fat maxima
of the posterior.
More precisely, they correspond to large regions
in weight space where solutions are dense
in the mathematical sense. In other words,
RG works well on problems where the precise
weight values do not matter -- and flat minima
correspond exactly to low-precision weights.
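This can be made quantitative with a simple volume argument (a standard observation, stated here under the idealized assumption of uniform sampling): if the solutions occupy a fraction $p$ of the volume of the sampled weight hypercube, then the number of guesses $T$ until the first hit is geometrically distributed,

```latex
\Pr(T = t) = (1 - p)^{\,t-1}\, p, \qquad \mathbb{E}[T] = \frac{1}{p}.
```

Flat minima are exactly the solution regions of large volume (large $p$), i.e., regions where low-precision weights suffice.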

**Initialization.**
RG's success depends on
suitable weight initialization intervals.
For instance, given a particular architecture,
the intervals [-0.1,0.1] and [-5,5]
may lead to quite different search results.
(Of course, the success of algorithms other than RG
also heavily depends on their parameter initialization
intervals, and on the length of the time intervals
between re-initializations.)
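A toy illustration (our own construction, not one of the benchmark tasks): suppose success requires a single weight to land within 0.1 of the value 2.5. Guessing from [-0.1, 0.1] can then never succeed, while guessing from [-5, 5] succeeds after about 50 trials on average, since the solution region covers a fraction 0.2/10 = 0.02 of the interval.

```python
import random

def guess_until(bound, success, max_trials=10000):
    """Guess a single weight uniformly from [-bound, bound] until
    `success(w)` holds; return the number of trials needed."""
    for trial in range(1, max_trials + 1):
        if success(random.uniform(-bound, bound)):
            return trial
    return max_trials  # budget exhausted

hit = lambda w: abs(w - 2.5) < 0.1  # hypothetical solution region
print(guess_until(0.1, hit))  # never succeeds: 2.5 lies outside [-0.1, 0.1]
print(guess_until(5.0, hit))  # ~50 trials on average (p = 0.02)
```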

**Feedforward nets.**
It should also be mentioned
that solutions to many well-known, simple, *nontemporal*
tasks such as XOR can be guessed in fewer than
100 trials on numerous standard feedforward architectures.
Compare, for instance, Gallant (1990).
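As a minimal sketch of this point, the following code guesses the nine weights of a 2-2-1 feedforward net with hard-threshold units from [-5, 5] until the net computes XOR; the architecture and interval are our illustrative choices, not taken from Gallant (1990). One would expect such a run to succeed within on the order of a hundred trials, consistent with the figure quoted above.

```python
import random

def xor_net(w, x1, x2):
    """2-2-1 feedforward net with hard-threshold units; w has 9 entries:
    6 input-to-hidden weights/biases, 3 hidden-to-output weights/bias."""
    h1 = 1 if w[0] * x1 + w[1] * x2 + w[2] > 0 else 0
    h2 = 1 if w[3] * x1 + w[4] * x2 + w[5] > 0 else 0
    return 1 if w[6] * h1 + w[7] * h2 + w[8] > 0 else 0

def solves_xor(w):
    return all(xor_net(w, a, b) == (a ^ b)
               for a in (0, 1) for b in (0, 1))

trials = 0
while True:
    trials += 1
    w = [random.uniform(-5, 5) for _ in range(9)]
    if solves_xor(w):
        break
print(f"XOR solved by guessing after {trials} trials")
```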

**Limitations of RG.**
Of course, we do not intend to say that RG is a good
algorithm -- it is just a reasonable first step towards
benchmark evaluation. We would never use RG in realistic applications.
Realistic tasks require either
many free parameters (e.g., input weights)
or high weight precision (e.g., for continuous-valued parameters),
so that RG becomes completely infeasible.
For example, Schmidhuber's task (1992) requires about
100 different input units and therefore too many input
weights to be guessed within reasonable time.
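The infeasibility follows from the volume argument above (a back-of-the-envelope calculation with an assumed per-weight tolerance): if each of $n$ relevant weights must independently fall into a fraction $q$ of its initialization interval, then

```latex
p \approx q^{\,n}, \qquad \mathbb{E}[T] \approx q^{-n},
```

so with $n = 100$ input weights and even a generous per-weight tolerance of $q = 0.1$, the expected number of guesses is already on the order of $10^{100}$.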

**Non-neural approaches.**
Some of the tasks mentioned above
may be easily solvable by non-neural methods
such as Fu and Booth's (1975) or
Lang's (1992),
perhaps even much faster than by RG.
The comparisons
in this paper, however, are limited to various
recent long time lag algorithms for neural nets.

**Improving neural approaches.**
We are aware of two neural methods that
have been successfully applied to long time lag tasks
that RG cannot solve in reasonable time due to too many free
parameters: the *sequence chunker* [Schmidhuber, 1992]
and *Long Short-Term Memory*
(LSTM -- Hochreiter and Schmidhuber 1997a, 1997c).
The chunker's applicability is limited to compressible sequences;
LSTM's is not. LSTM eliminates some of
the problems of gradient-based approaches and can
solve complex long time lag tasks involving
distributed, high-precision, continuous-valued representations.
Such tasks cannot be quickly solved by RG
nor by any other method we are aware of.
Other interesting long time lag approaches have been presented in papers by
El Hihi and Bengio (1995) and Lin et al. (1995). They
suggest mechanisms such as long time delays
in recurrent connections that allow for handling proportionally longer
temporal dependencies. Some ideas in these papers are also related
to handling long-term dependencies by introducing
multiple time scales or a hierarchy of state
variables [Mozer, 1992; Schmidhuber, 1992; Saul and Jordan, 1996; Jordan et al., 1997].
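As a rough sketch of the time-delay idea (our paraphrase of the mechanism, not the exact formulation of those papers): a recurrent connection with delay $d$ feeds the state from $d$ steps back into the current update,

```latex
h_t = f\bigl(W x_t + U h_{t-d}\bigr),
```

so a dependency spanning $k$ time steps is bridged by only about $k/d$ applications of the recurrence, shortening the paths along which gradients must propagate by a factor of roughly $d$.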