Architectural issues. RG will not work well in case of complex architectures with many free parameters except when solutions are rather dense in weight space. In fact, RG-tests are useful for discovering whether this is the case or not. If so, then the problem will not be a convincing benchmark. Our experiments with A2 (which has comparatively many hidden units and free parameters) indeed provide examples of such cases.
A problem solvable by RG on a certain architecture, however, might still be a useful benchmark for a different architecture.
Flat minima. Successful RG typically hits ``flat minima'' of the error function [Hochreiter and Schmidhuber, 1997b], because flat minima correspond to fat maxima of the posterior. Actually they correspond to large regions in weight space where solutions are dense in the mathematical sense. In other words: RG works well on problems where the precise weight values do not matter -- and flat minima precisely correspond to low-precision weights.
Initialization. RG's success depends on sufficient weight initialization intervals. For instance, given a particular architecture, the intervals [-0.1,0.1] and [-5,5] may lead to quite different search results. (Of course, the success of algorithms other than RG also heavily depends on their parameter initialization intervals, and on the length of the time intervals between re-initializations.)
Feedforward nets. It should also be mentioned that solutions to many well-known, simple, nontemporal tasks such as XOR can be guessed within less than 100 trials on numerous standard feedforward architectures. Compare, for instance, Gallant (1990).
Limitations of RG. Of course, we do not intend to say that RG is a good algorithm -- it is just a reasonable first step towards benchmark evaluation. We would never use RG in realistic applications. Realistic tasks require either many free parameters (e.g., input weights) or high weight precision (e.g., for continuous-valued parameters), such that RG becomes completely infeasible. For example, Schmidhuber's task (1992) requires about 100 different input units and therefore too many input weights to be guessed within reasonable time.
Non-neural approaches. Some of the tasks mentioned above may be easily solvable by non-neural methods such as Fu's and Booth's (1975) or Lang's (1992), perhaps even much faster than by RG. The comparisons in this paper, however, are limited to various recent long time lag algorithms for neural nets.
Improving neural approaches. We are aware of two neural methods that have been successfully applied to long time lag tasks that RG cannot solve in reasonable time due to too many free parameters: the sequence chunker [Schmidhuber, 1992] and Long Short-Term Memory (LSTM -- Hochreiter and Schmidhuber 1997a, 1997c). The chunker's applicability is limited to compressible sequences, LSTM's is not. LSTM eliminates some of gradient-based approaches' problems and can solve complex long time lag tasks involving distributed, high-precision, continuous-valued representations. Such tasks cannot be quickly solved by RG nor by any other method we are aware of. Other interesting long time lag approaches have been presented in papers by El Hihi and Bengio (1995) and Lin et al. (1995). They suggest mechanisms such as long time delays in recurrent connections to allow for handling proportionally longer temporal dependencies. Some ideas in these papers are also related to the ideas of handling long-term dependencies by introducing multiple time scales or a hierarchy of state variables [Mozer, 1992,Schmidhuber, 1992,Saul and Jordan, 1996,Jordan et al., 1997].