Next: Dilemma: Avoiding gradient decay
Up: Exponential error decay
Previous: Intuitive explanation of equation
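The following, slightly extended vanishing error analysis
also takes $n$, the number of units, into account.
For $q > 1$, formula (2) can be rewritten as

$$(W_{u^T})^T \, F'(t-1) \prod_{m=2}^{q-1} \left( W F'(t-m) \right) W_v \, f'_v(net_v(t-q)),$$

where the weight matrix $W$ is defined by
$[W]_{ij} := w_{ij}$, $v$'s outgoing weight vector $W_v$ is defined by
$[W_v]_i := [W]_{iv} = w_{iv}$,
$u$'s incoming weight vector $W_{u^T}$ is defined by
$[W_{u^T}]_i := [W]_{ui} = w_{ui}$, and for $m = 1, \ldots, q$,
$F'(t-m)$ is the diagonal matrix of first order
derivatives defined as:
$[F'(t-m)]_{ij} := 0$ if $i \neq j$, and
$[F'(t-m)]_{ij} := f'_i(net_i(t-m))$ otherwise.
Here $T$ is the transposition operator,
$[A]_{ij}$ is the element in the $j$-th column and $i$-th row of
matrix $A$, and $[x]_i$ is the $i$-th component of vector $x$.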
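
The equivalence between the path-sum form and the matrix form can be checked numerically. The sketch below is illustrative only: it assumes that formula (2) sums, over all paths $l_0 = u, l_1, \ldots, l_{q-1}, l_q = v$, the product of the derivatives and weights along the path, with weights indexed as in the matrix form, that is, $w_{ij}$ stored at `W[i][j]`. The function names and the numeric values of `W` and `derivs` are hypothetical and not taken from the original.

```python
from itertools import product


def matrix_form(W, derivs, u, v):
    """Evaluate (W_{u^T})^T F'(t-1) prod_{m=2}^{q-1} (W F'(t-m)) W_v f'_v(net_v(t-q)).

    W[i][j] holds w_ij; derivs[m-1][i] holds f'_i(net_i(t-m)) for m = 1..q.
    """
    n, q = len(W), len(derivs)
    if q < 2:
        raise ValueError("the matrix form is stated for q > 1")
    # Row vector (W_{u^T})^T F'(t-1): component i is w_ui * f'_i(net_i(t-1)).
    row = [W[u][i] * derivs[0][i] for i in range(n)]
    # Right-multiply by W F'(t-m) for m = 2, ..., q-1.
    for m in range(2, q):
        row = [sum(row[i] * W[i][j] for i in range(n)) * derivs[m - 1][j]
               for j in range(n)]
    # Close with W_v f'_v(net_v(t-q)).
    return sum(row[i] * W[i][v] for i in range(n)) * derivs[q - 1][v]


def path_sum(W, derivs, u, v):
    """Brute-force sum over all paths from u back to v (assumed reading of formula (2))."""
    n, q = len(W), len(derivs)
    total = 0.0
    for inner in product(range(n), repeat=q - 1):
        path = (u,) + inner + (v,)
        term = 1.0
        for m in range(1, q + 1):
            # Derivative of unit l_m at time t-m times the weight linking l_{m-1} and l_m.
            term *= derivs[m - 1][path[m]] * W[path[m - 1]][path[m]]
        total += term
    return total


# Hypothetical example: n = 3 units, q = 4 time steps.
W = [[0.5, -1.2, 0.3],
     [0.8, 0.1, -0.7],
     [-0.4, 0.9, 0.2]]
derivs = [[0.25, 0.20, 0.10],
          [0.15, 0.25, 0.05],
          [0.22, 0.12, 0.18],
          [0.08, 0.24, 0.20]]

difference = abs(matrix_form(W, derivs, 0, 2) - path_sum(W, derivs, 0, 2))
print(difference < 1e-12)
```

The brute-force sum visits $n^{q-1}$ paths, whereas the matrix form needs only $q-2$ vector-matrix products, which is why the matrix form is the more convenient starting point for the norm-based bound.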
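Using a matrix norm $\| \cdot \|_A$
compatible with vector norm $\| \cdot \|_x$,
we define

$$f'_{max} := \max_{m=1,\ldots,q} \{ \| F'(t-m) \|_A \}.$$

For $\max_{i=1,\ldots,n} \{ |x_i| \} \leq \| x \|_x$
we get

$$|x^T y| \leq n \, \| x \|_x \, \| y \|_x.$$

Since

$$|f'_v(net_v(t-q))| \leq \| F'(t-q) \|_A \leq f'_{max},$$

we obtain the following inequality:

$$\left| \frac{\partial \vartheta_v(t-q)}{\partial \vartheta_u(t)} \right| \leq n \, (f'_{max})^q \, \| W_v \|_x \, \| W_{u^T} \|_x \, \| W \|_A^{q-2} \leq n \left( f'_{max} \, \| W \|_A \right)^q.$$

This inequality results from

$$\| W_v \|_x = \| W e_v \|_x \leq \| W \|_A \, \| e_v \|_x \leq \| W \|_A$$

and

$$\| W_{u^T} \|_x = \| e_u^T W \|_x \leq \| W \|_A \, \| e_u \|_x \leq \| W \|_A,$$

where $e_k$ is the unit vector whose components are 0 except
for the $k$-th component, which is 1.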
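
The step from the matrix form to the first inequality can be spelled out as follows. This is a sketch that relies only on the compatibility property $\| A z \|_x \leq \| A \|_A \| z \|_x$ and on the condition $\max_i |x_i| \leq \| x \|_x$; the helper symbol $y$ is introduced here for illustration and is not part of the original notation.

```latex
% y collects everything to the right of (W_{u^T})^T in the matrix form.
y := F'(t-1) \prod_{m=2}^{q-1} \bigl( W F'(t-m) \bigr) W_v \, f'_v(net_v(t-q))

% Scalar product bound, using \max_i |x_i| \le \|x\|_x:
|x^T y| \le \sum_{i=1}^{n} |x_i| \, |y_i|
        \le n \, \max_i |x_i| \, \max_i |y_i|
        \le n \, \|x\|_x \, \|y\|_x

% Repeated use of compatibility, \|A z\|_x \le \|A\|_A \|z\|_x:
\|y\|_x \le \|F'(t-1)\|_A \prod_{m=2}^{q-1} \bigl( \|W\|_A \, \|F'(t-m)\|_A \bigr)
            \|W_v\|_x \, |f'_v(net_v(t-q))|
        \le (f'_{max})^{q} \, \|W\|_A^{q-2} \, \|W_v\|_x

% Setting x = W_{u^T} combines both steps:
\left| \frac{\partial \vartheta_v(t-q)}{\partial \vartheta_u(t)} \right|
    = |(W_{u^T})^T y|
    \le n \, (f'_{max})^{q} \, \|W_v\|_x \, \|W_{u^T}\|_x \, \|W\|_A^{q-2}
```

The exponent $q$ on $f'_{max}$ counts the $q-1$ diagonal matrices $F'(t-1), \ldots, F'(t-q+1)$ plus the scalar factor $f'_v(net_v(t-q))$.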
Note that this is a weak, extreme-case upper bound: it will
be reached only if
all $\| F'(t-m) \|_A$
take on maximal values,
and if the contributions of all paths across which error flows back from
unit $u$
to unit $v$ have the same sign.
Large $\| W \|_A$, however, typically result in
small values of
$\| F'(t-m) \|_A$, as confirmed by
experiments (see, e.g., [11]).
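
The tendency of large weights to shrink the derivatives can be illustrated for a single logistic unit. The sketch below is not a reproduction of the experiments in [11]; it assumes a hypothetical unit whose net input is `w * y` for a fixed presynaptic activation `y`, and the chosen values of `y` and `weights` are arbitrary.

```python
import math


def logistic(x):
    """Logistic sigmoid 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def logistic_derivative(net):
    """First derivative of the logistic sigmoid; its maximum is 0.25 at net = 0."""
    s = logistic(net)
    return s * (1.0 - s)


# Hypothetical single unit: net input = w * y with a fixed activation y.
y = 0.9
weights = [1.0, 2.0, 4.0, 8.0, 16.0]

# Derivative f'(w * y) for growing weight magnitude.
derivatives = [logistic_derivative(w * y) for w in weights]

# Per-step scaling factor |w| * f'(w * y) that enters the backward error flow.
scaled = [abs(w) * d for w, d in zip(weights, derivatives)]

for w, d, s in zip(weights, derivatives, scaled):
    print(f"w = {w:5.1f}   f' = {d:.3e}   |w| f' = {s:.3e}")
```

As the weight grows, the unit saturates and its derivative shrinks faster than the weight increases, so the per-step scaling factor eventually falls as well.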
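For example, with
norms
$\| W \|_A := \max_r \sum_s |w_{rs}|$
and
$\| x \|_x := \max_r |x_r|$,
we have $f'_{max} = 0.25$ for the logistic sigmoid.
We observe that
if
$|w_{ij}| \leq w_{max} < \frac{4.0}{n}$ for all $i, j$,
then
$\| W \|_A \leq n \, w_{max} < 4.0$
will result in exponential decay; by setting
$\tau := \frac{n \, w_{max}}{4.0} < 1.0$,
we obtain

$$\left| \frac{\partial \vartheta_v(t-q)}{\partial \vartheta_u(t)} \right| \leq n \, \tau^q.$$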
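
The decay rate can be made concrete with a small numeric sketch. It assumes the logistic sigmoid (maximal derivative 0.25) together with the max-row-sum matrix norm and the max vector norm from the example; the function names and the values `n = 5`, `w_max = 0.6` are hypothetical. The helper `worst_case_flow` backpropagates a unit error through a network in which every weight equals `w_max` and every derivative equals 0.25, so that all path contributions share the same sign.

```python
def decay_rate(n, w_max):
    """Return tau = n * w_max / 4.0; requires w_max < 4.0 / n so that tau < 1."""
    if not 0.0 <= w_max < 4.0 / n:
        raise ValueError("need 0 <= w_max < 4.0 / n for guaranteed decay")
    return n * w_max / 4.0


def decay_bound(n, w_max, q):
    """Upper bound n * tau**q on the error flowing back q time steps."""
    return n * decay_rate(n, w_max) ** q


def worst_case_flow(n, w_max, q):
    """Error reaching any unit after q backward steps when all weights equal
    w_max and all derivatives equal 0.25 (the logistic maximum)."""
    error = [0.0] * n
    error[0] = 1.0  # unit error injected at unit u = 0
    for _step in range(q):
        incoming = sum(w_max * e for e in error)
        error = [0.25 * incoming for _unit in range(n)]
    return error[0]  # identical for every target unit v in this symmetric setup


n, w_max = 5, 0.6          # satisfies w_max < 4.0 / n = 0.8
tau = decay_rate(n, w_max)
steps = (1, 5, 10, 50)
bounds = [decay_bound(n, w_max, q) for q in steps]

for q, b in zip(steps, bounds):
    print(f"q = {q:2d}   bound = {b:.3e}   worst case = {worst_case_flow(n, w_max, q):.3e}")
```

Even in this same-sign extreme case the backpropagated error stays below the bound, and both quantities shrink geometrically in $q$.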
We refer to Hochreiter's thesis [11] for more details.
Juergen Schmidhuber
2003-02-19