Next: Exponential error decay
Up: Gradient Flow in Recurrent
Previous: Gradient Flow in Recurrent
Introduction
Recurrent networks (see Chapter 12) can, in principle, use
their feedback connections to store representations of recent
input events in the form of activations.
The most widely used algorithms for learning
what to put in short-term memory, however, take too
much time to be feasible or do not work well at all, especially when
the minimal time lags between inputs and
corresponding teacher signals are long.
Although theoretically fascinating, these algorithms do not provide
clear practical advantages over, say, backprop in feedforward
networks with limited time windows (see Chapters 11 and 12).
With conventional
"algorithms based on the computation of
the complete gradient", such as
"Back-Propagation Through Time" (BPTT, e.g.,
[22,27,26]) or
"Real-Time Recurrent Learning"
(RTRL, e.g., [21]),
error signals "flowing backwards in time"
tend to either (1) blow up or (2) vanish:
the temporal evolution of
the backpropagated error depends exponentially
on the size of the weights [11,6].
Case (1) may lead to oscillating weights, while in case (2)
learning to bridge long time lags takes a prohibitive amount
of time, or does not work at all.
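The two cases can be illustrated with a minimal sketch, which is not taken from the analysis in the following sections. It assumes a single self-connected unit with recurrent weight w and no external input. The function name `backprop_error_scale` and its parameters are hypothetical and chosen only for this illustration. For such a unit, each step backwards in time multiplies the error signal by w times the derivative of the activation function, so the error scales like a product of such factors.

```python
import math

def backprop_error_scale(w, steps, linear=False, y0=0.5):
    """Magnitude of an error signal propagated backwards through a
    single self-connected unit (illustrative toy model, an assumption).

    Forward dynamics: y(t) = f(w * y(t-1)), with f = identity if
    `linear` is True, else f = tanh.
    Returns [|delta(T)|, |delta(T-1)|, ..., |delta(0)|], delta(T) = 1.
    """
    # Forward pass: record the unit's activations.
    ys = [y0]
    for _ in range(steps):
        net = w * ys[-1]
        ys.append(net if linear else math.tanh(net))

    # Backward pass: delta(t-1) = w * f'(net(t)) * delta(t).
    # For tanh, f'(net(t)) = 1 - y(t)^2; for the identity, f' = 1.
    delta = 1.0
    scales = [abs(delta)]
    for t in range(steps, 0, -1):
        slope = 1.0 if linear else 1.0 - ys[t] ** 2
        delta *= w * slope
        scales.append(abs(delta))
    return scales

if __name__ == "__main__":
    for w, linear in [(0.5, True), (1.5, True), (0.9, False)]:
        final = backprop_error_scale(w, 20, linear=linear)[-1]
        kind = "linear" if linear else "tanh"
        print(f"w={w}, {kind}: error scale after 20 steps = {final:.3e}")
```

In this toy setting, a linear unit with |w| < 1 shrinks the error by a factor of |w| per step, while |w| > 1 makes it grow by the same factor. With tanh the slope never exceeds 1, so the per-step factor is at most |w|.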
In what follows, we give a theoretical analysis of this problem by
studying the asymptotic behavior of error gradients as a function of
time lags. In Section 2, we consider the case of
standard RNNs and derive the main result using the approach
first proposed in [11]. In Section 3, we
consider the more general case of adaptive dynamical systems, which
include, besides standard RNNs, other recurrent architectures based on
different connectivities and choices of the activation function (e.g.,
RBF or second order connections). Using
the analysis reported in [6], we show that one of the
following two undesirable situations necessarily arises: either the
system is unable to robustly store past information about its inputs,
or gradients vanish exponentially. Finally, in Section
4 we briefly review alternative optimization methods
and architectures that have been suggested to improve
learning in the presence of long-term dependencies.
Juergen Schmidhuber
2003-02-19