Weak upper bound for scaling factor

Next: Dilemma: Avoiding gradient decay Up: Exponential error decay Previous: Intuitive explanation of equation

Weak upper bound for scaling factor

The following, slightly extended vanishing error analysis also takes

, the number of units, into account. For

, formula (2) can be rewritten as

$\begin{displaymath} \left(W^T_k\right)^T \ F'(t-1) \ \left( \prod_{\tau=t-2}^{s+1} \ W \ F'(\tau)\right) \ \ W_v \ f'_v\left(net_v(s)\right), \end{displaymath}$

where the weight matrix

is defined by $[W]_{ij}:=w_{ij}$ ,

's outgoing weight vector

is defined by $[W_v]_i:=[W]_{iv}=w_{iv}$ ,

's incoming weight vector

is defined by $[W^T_k]_i:=[W]_{ki}=w_{ki}$ , and

is the diagonal matrix of first order derivatives defined as: $[F'(t)]_{ij}:= 0$ if $i\not=j$ , and $[F'(t)]_{ij}:= f'_i(net_i(t))$ otherwise. Here

is the transposition operator, $[A]_{ij}$ is the element in the

-th column and

-th row of matrix

, and

is the

-th component of vector

. Using a matrix norm $\Vert . \Vert _A$ compatible with vector norm $\Vert . \Vert _x$ , we define

$\begin{displaymath}f'_{max}:= \mbox{max}_{\tau=t-1,\ldots,s} \{\Vert F'(\tau) \Vert _A \}.\end{displaymath}$

For $\mbox{max}_{i=1,\ldots,n} \{\left\vert x_i\right\vert\} \leq \ \Vert x \Vert _x$ we get $\left\vert x^T y\right\vert \ \leq \ n \ \Vert x \Vert _x \ \Vert y \Vert _x.$ Since

$\begin{displaymath}\left\vert f'_v(net_v(s))\right\vert \leq \ \Vert F'(s) \Vert _A \ \leq f'_{max},\end{displaymath}$

we obtain the following inequality:

$\begin{displaymath} \left\vert \frac{\partial \delta_{v} (s)} {\partial \delta... ... \leq \ n \ \left(f'_{max} \ \Vert W \Vert _A \right)^{t-s}. \end{displaymath}$

This inequality results from

$\begin{displaymath}\Vert W_v \Vert _x \ = \ \Vert W e_v \Vert _x \ \leq \ \Vert W \Vert _A \ \Vert e_v \Vert _x \ \leq \ \Vert W \Vert _A\end{displaymath}$

and

$\begin{displaymath}\Vert W^T_k \Vert _x \ = \ \Vert e_k W \Vert _x \ \leq \ \Vert W \Vert _A \ \Vert e_k \Vert _x \ \leq \ \Vert W \Vert _A,\end{displaymath}$

where

is the unit vector whose components are 0 except for the

-th component, which is 1. Note that this is a weak, extreme case upper bound -- it will be reached only if all $\Vert F'(\tau)\Vert _A$ take on maximal values, and if the contributions of all paths across which error flows back from unit

to unit

have the same sign. Large $\Vert W \Vert _A$ , however, typically result in small values of $\Vert F'(\tau)\Vert _A$ , as confirmed by experiments (see, e.g., [11]). For example, with norms

$\begin{displaymath}\Vert W \Vert _A \ := \mbox{max}_{r} \sum_{s} \left\vert w_{rs} \right\vert\end{displaymath}$

and

$\begin{displaymath}\Vert x \Vert _x := \mbox{max}_{r} \left\vert x_{r}\right\vert,\end{displaymath}$

we have $f'_{max}=0.25$ for the logistic sigmoid. We observe that if

$\begin{displaymath}\left\vert w_{ij}\right\vert \leq w_{max} < \frac{4.0}{n} \ \ \forall i,j,\end{displaymath}$

then $\Vert W \Vert _A \ \leq n \ \ w_{max} <4.0$ will result in exponential decay; by setting $\lambda :=\left( \frac{n w_{max}}{4.0} \right) < 1.0$ , we obtain

$\begin{displaymath}\left\vert \frac{\partial \delta_{v} (s)} {\partial \delta_{k} (t)} \right\vert \ \leq \ n \ \lambda^{t-s}.\end{displaymath}$

We refer to Hochreiter's thesis [11] for more details.

Next: Dilemma: Avoiding gradient decay Up: Exponential error decay Previous: Intuitive explanation of equation

Juergen Schmidhuber 2003-02-19