
### Weak upper bound for scaling factor

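The following, slightly extended vanishing error analysis also takes $n$, the number of units, into account. For $q > 1$, formula (2) can be rewritten as

$$(W_{u^T})^T \, F'(t-1) \prod_{m=2}^{q-1} \bigl( W F'(t-m) \bigr) \, W_v \, f'_v(net_v(t-q)),$$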

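where the weight matrix $W$ is defined by $[W]_{ij} := w_{ij}$, $v$'s outgoing weight vector $W_v$ is defined by $[W_v]_i := [W]_{iv} = w_{iv}$, $u$'s incoming weight vector $W_{u^T}$ is defined by $[W_{u^T}]_i := [W]_{ui} = w_{ui}$, and $F'(t-m)$ is the diagonal matrix of first order derivatives defined as: $[F'(t-m)]_{ij} := 0$ if $i \neq j$, and $[F'(t-m)]_{ij} := f'_i(net_i(t-m))$ otherwise. Here $T$ is the transposition operator, $[A]_{ij}$ is the element in the $i$-th column and $j$-th row of matrix $A$, and $[x]_i$ is the $i$-th component of vector $x$. Using a matrix norm $\|\cdot\|_A$ compatible with vector norm $\|\cdot\|_x$, we define

$$f'_{max} := \max_{m=1,\ldots,q} \bigl\{ \|F'(t-m)\|_A \bigr\}.$$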

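For $\max_{i=1,\ldots,n} \{ |x_i| \} \le \|x\|_x$ we get $|x^T y| \le n \, \|x\|_x \, \|y\|_x$. Since

$$|f'_v(net_v(t-q))| \le \|F'(t-q)\|_A \le f'_{max},$$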

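we obtain the following inequality:

$$\left| \frac{\partial \vartheta_v(t-q)}{\partial \vartheta_u(t)} \right| \le n \, (f'_{max})^q \, \|W_v\|_x \, \|W_{u^T}\|_x \, \|W\|_A^{q-2} \le n \, \bigl( f'_{max} \, \|W\|_A \bigr)^q.$$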

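This inequality results from

$$\|W_v\|_x = \|W e_v\|_x \le \|W\|_A \, \|e_v\|_x \le \|W\|_A$$

and

$$\|W_{u^T}\|_x = \|e_u W\|_x \le \|W\|_A \, \|e_u\|_x \le \|W\|_A,$$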

where $e_k$ is the unit vector whose components are 0 except for the $k$-th component, which is 1. Note that this is a weak, extreme case upper bound: it will be reached only if all $\|F'(t-m)\|_A$ take on maximal values, and if the contributions of all paths across which error flows back from unit $u$ to unit $v$ have the same sign. Large $\|W\|_A$, however, typically results in small values of $\|F'(t-m)\|_A$, as confirmed by experiments (see, e.g., [11]). For example, with norms

$$\|W\|_A := \max_r \sum_s |w_{rs}|$$

and

$$\|x\|_x := \max_r |x_r|,$$

we have $f'_{max} = 0.25$ for the logistic sigmoid. We observe that if

$$|w_{ij}| \le w_{max} < \frac{4.0}{n} \quad \forall i, j,$$

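then $\|W\|_A \le n \, w_{max} < 4.0$ will result in exponential decay; by setting $\tau := \frac{n \, w_{max}}{4.0} < 1.0$, we obtain

$$\left| \frac{\partial \vartheta_v(t-q)}{\partial \vartheta_u(t)} \right| \le n \, \tau^q.$$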

We refer to Hochreiter's thesis [11] for more details.
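
The bound can also be checked numerically. The following is a minimal sketch, not part of the original analysis. It assumes a fully recurrent net of logistic units whose error flows back one step as $\vartheta_j(t-1) = f'_j(net_j(t-1)) \sum_i w_{ij} \vartheta_i(t)$, draws weights with $|w_{ij}| \le w_{max} < 4.0/n$, and compares the largest scaling factor over all unit pairs with $n (n \, w_{max} / 4.0)^q$. The function names, the random net inputs, and the values `n = 10`, `q = 20`, `w_max = 0.3` are illustrative assumptions.

```python
import math
import random


def logistic_prime(net):
    """Derivative of the logistic sigmoid; its maximum is 0.25 at net = 0."""
    s = 1.0 / (1.0 + math.exp(-net))
    return s * (1.0 - s)


def error_flow_jacobian(W, nets):
    """Return J with J[v][u] = d theta_v(t-q) / d theta_u(t).

    W[i][j] is the weight w_ij on the connection from unit j to unit i.
    nets[m-1] holds the (here arbitrary) net inputs at time t-m, m = 1..q.
    """
    n = len(W)
    J = [[float(i == j) for j in range(n)] for i in range(n)]
    for net in nets:  # one backward step per m = 1, ..., q
        # theta_j(t-m) = f'_j(net_j(t-m)) * sum_i w_ij * theta_i(t-m+1)
        J = [[logistic_prime(net[j]) * sum(W[i][j] * J[i][u] for i in range(n))
              for u in range(n)]
             for j in range(n)]
    return J


def weak_upper_bound(n, w_max, q):
    """n * tau**q with tau = n * w_max / 4.0 (logistic units, max-norms)."""
    tau = n * w_max / 4.0
    return n * tau ** q


rng = random.Random(0)          # fixed seed, illustrative only
n, q, w_max = 10, 20, 0.3       # w_max < 4.0 / n = 0.4, so tau = 0.75
W = [[rng.uniform(-w_max, w_max) for _ in range(n)] for _ in range(n)]
nets = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(q)]

J = error_flow_jacobian(W, nets)
largest = max(abs(x) for row in J for x in row)
bound = weak_upper_bound(n, w_max, q)
print(largest, bound, largest <= bound)
```

With these values the bound is $10 \cdot 0.75^{20} \approx 0.032$. The observed maximum is expected to lie far below it, since random weights and net inputs produce neither maximal derivatives nor path contributions of equal sign.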

Juergen Schmidhuber 2003-02-19