

Introduction

Recurrent networks (see Chapter 12) can, in principle, use their feedback connections to store representations of recent input events in the form of activations. The most widely used algorithms for learning what to put in short-term memory, however, take too much time to be feasible or do not work well at all, especially when the minimal time lags between inputs and corresponding teacher signals are long. Although theoretically fascinating, they do not provide clear practical advantages over, say, backprop in feedforward networks with limited time windows (see Chapters 11 and 12). With conventional ``algorithms based on the computation of the complete gradient'', such as ``Back-Propagation Through Time'' (BPTT, e.g., [22,27,26]) or ``Real-Time Recurrent Learning'' (RTRL, e.g., [21]), error signals ``flowing backwards in time'' tend to either (1) blow up or (2) vanish: the temporal evolution of the backpropagated error depends exponentially on the size of the weights [11,6]. Case (1) may lead to oscillating weights, while in case (2) learning to bridge long time lags takes a prohibitive amount of time, or does not work at all.

In what follows, we give a theoretical analysis of this problem by studying the asymptotic behavior of error gradients as a function of time lags. In Section 2, we consider the case of standard RNNs and derive the main result using the approach first proposed in [11]. In Section 3, we consider the more general case of adaptive dynamical systems, which include, besides standard RNNs, other recurrent architectures based on different connectivities and choices of the activation function (e.g., RBF or second-order connections). Using the analysis reported in [6], we show that one of the following two undesirable situations necessarily arises: either the system is unable to robustly store past information about its inputs, or gradients vanish exponentially. Finally, in Section 4 we briefly review alternative optimization methods and architectures that have been suggested to improve learning in the presence of long-term dependencies.
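Before this formal treatment, a minimal sketch (in our own simplified single-unit notation, not the derivations of Sections 2 and 3) illustrates why the backpropagated error scales exponentially with the weights. For a single recurrent unit with activation $a_t = f(w\,a_{t-1})$, an error signal propagated back over $q$ time steps is multiplied by

\begin{displaymath}
\frac{\partial a_t}{\partial a_{t-q}} \;=\; \prod_{\tau=1}^{q} w\, f'(w\, a_{t-\tau}) ,
\end{displaymath}

so if $|w\, f'(\cdot)|$ stays below a constant smaller than 1.0 along the trajectory, the error vanishes exponentially in $q$, while if it stays above a constant larger than 1.0, it blows up exponentially.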