
### Weak upper bound for scaling factor

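The following, slightly extended vanishing error analysis also takes $n$, the number of units, into account. For $q > 1$, formula (2) can be rewritten as

$$(W_{u^T})^T \, F'(t-1) \prod_{m=2}^{q-1} \bigl( W F'(t-m) \bigr) \, W_v \, f'_v(net_v(t-q)),$$

where the weight matrix $W$ is defined by $[W]_{ij} := w_{ij}$, $v$'s outgoing weight vector $W_v$ is defined by $[W_v]_i := [W]_{iv} = w_{iv}$, $u$'s incoming weight vector $W_{u^T}$ is defined by $[W_{u^T}]_i := [W]_{ui} = w_{ui}$, and, for $m = 1, \ldots, q$, $F'(t-m)$ is the diagonal matrix of first order derivatives defined as: $[F'(t-m)]_{ij} := 0$ if $i \neq j$, and $[F'(t-m)]_{ij} := f'_i(net_i(t-m))$ otherwise. Here $T$ is the transposition operator, $[A]_{ij}$ is the element in the $i$-th column and $j$-th row of matrix $A$, and $[x]_i$ is the $i$-th component of vector $x$.

Using a matrix norm $\| \cdot \|_A$ compatible with vector norm $\| \cdot \|_x$, we define

$$f'_{max} := \max_{m=1,\ldots,q} \{ \| F'(t-m) \|_A \}.$$

For $\max_{i=1,\ldots,n} \{ |x_i| \} \leq \| x \|_x$ we get $|x^T y| \leq n \, \| x \|_x \, \| y \|_x$. Since

$$|f'_v(net_v(t-q))| \leq \| F'(t-q) \|_A \leq f'_{max},$$

we obtain the following inequality:

$$\left| \frac{\partial \vartheta_v(t-q)}{\partial \vartheta_u(t)} \right| \leq n \, (f'_{max})^q \, \| W_v \|_x \, \| W_{u^T} \|_x \, \| W \|_A^{q-2} \leq n \, \bigl( f'_{max} \, \| W \|_A \bigr)^q.$$

This inequality results from

$$\| W_v \|_x = \| W e_v \|_x \leq \| W \|_A \, \| e_v \|_x \leq \| W \|_A$$

and

$$\| W_{u^T} \|_x = \| e_u W \|_x \leq \| W \|_A \, \| e_u \|_x \leq \| W \|_A,$$

where $e_k$ is the unit vector whose components are 0 except for the $k$-th component, which is 1. Note that this is a weak, extreme case upper bound: it will be reached only if all $\| F'(t-m) \|_A$ take on maximal values, and if the contributions of all paths across which error flows back from unit $u$ to unit $v$ have the same sign. Large $\| W \|_A$, however, typically result in small values of $\| F'(t-m) \|_A$, as confirmed by experiments.

For example, with norms

$$\| W \|_A := \max_r \sum_s |w_{rs}|$$

and

$$\| x \|_x := \max_r |x_r|,$$

we have $f'_{max} = 0.25$ for the logistic sigmoid. We observe that if

$$|w_{ij}| \leq w_{max} < \frac{4.0}{n} \quad \forall i, j,$$

then $\| W \|_A \leq n \, w_{max} < 4.0$ will result in exponential decay; by setting $\tau := \frac{n \, w_{max}}{4.0} < 1.0$, we obtain

$$\left| \frac{\partial \vartheta_v(t-q)}{\partial \vartheta_u(t)} \right| \leq n \, \tau^q.$$

We refer to Hochreiter's thesis for more details.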
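The bound can be checked numerically. The following is a minimal sketch, not code from the paper: it assumes a fully recurrent net with $n$ logistic units, draws weights uniformly with $|w_{ij}| \leq w_{max} < 4.0/n$, evaluates the matrix form of the scaling factor for error flowing back $q$ steps from unit $u$ to unit $v$, and compares it with both $n (f'_{max} \| W \|_A)^q$ (maximum row sum norm, $f'_{max} = 0.25$) and $n \tau^q$ where $\tau = n w_{max} / 4.0$. The function names, the random net inputs, and the default values ($n = 5$, $q = 10$, $w_{max} = 0.7$) are illustrative assumptions.

```python
import math
import random


def logistic_prime(x):
    """Derivative of the logistic sigmoid; it never exceeds 0.25."""
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)


def error_scaling(W, nets, u, v):
    """Scaling factor for error flowing back q steps from unit u to unit v.

    W[i][j] holds the weight w_ij; nets[m - 1][i] holds net_i(t - m)
    for m = 1..q. Requires q > 1, as in the matrix form of the text.
    """
    n, q = len(W), len(nets)
    # Start with u's incoming weight vector (w_u1, ..., w_un) as a row vector.
    row = [W[u][i] for i in range(n)]
    for m in range(1, q):
        # Right-multiply by the diagonal derivative matrix F'(t - m).
        row = [row[i] * logistic_prime(nets[m - 1][i]) for i in range(n)]
        if m < q - 1:
            # Right-multiply by W (this happens q - 2 times in total).
            row = [sum(row[i] * W[i][j] for i in range(n)) for j in range(n)]
    # Close with v's outgoing weight vector and f'_v(net_v(t - q)).
    dot = sum(row[i] * W[i][v] for i in range(n))
    return dot * logistic_prime(nets[q - 1][v])


def row_sum_norm(W):
    """Maximum absolute row sum, the matrix norm used in the example."""
    return max(sum(abs(w) for w in row) for row in W)


def demo(n=5, q=10, w_max=0.7, seed=0):
    """Return (scaling factor, norm-based bound, tau-based bound)."""
    assert q > 1 and w_max < 4.0 / n  # preconditions of the bound
    rng = random.Random(seed)
    # Hypothetical random weights and net inputs, for illustration only.
    W = [[rng.uniform(-w_max, w_max) for _ in range(n)] for _ in range(n)]
    nets = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(q)]
    factor = error_scaling(W, nets, u=0, v=1)
    norm_bound = n * (0.25 * row_sum_norm(W)) ** q
    tau = n * w_max / 4.0
    tau_bound = n * tau ** q
    return factor, norm_bound, tau_bound


if __name__ == "__main__":
    factor, norm_bound, tau_bound = demo()
    print("|scaling factor|     :", abs(factor))
    print("n (f'_max ||W||_A)^q :", norm_bound)
    print("n tau^q              :", tau_bound)
```

In runs of this kind the measured factor typically lies far below both bounds, which matches the remark that the bound is a weak, extreme case estimate: random weights have mixed signs and the derivatives rarely sit at their maximum of 0.25.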
Juergen Schmidhuber 2003-02-19