next up previous
Next: Dilemma: Avoiding gradient decay Up: Exponential error decay Previous: Intuitive explanation of equation

Weak upper bound for scaling factor

The following, slightly extended vanishing error analysis also takes $n$, the number of units, into account. For $t-s>1$, formula (2) can be rewritten as

\begin{displaymath}
\left(W^T_k\right)^T \ F'(t-1) \ \left( \prod_{\tau=t-2}^{s+1} \ W \ F'(\tau)\right) \ \
W_v \ f'_v\left(net_v(s)\right),
\end{displaymath}

where the weight matrix $W$ is defined by $[W]_{ij}:=w_{ij}$, $v$'s outgoing weight vector $W_v$ is defined by $[W_v]_i:=[W]_{iv}=w_{iv}$, $k$'s incoming weight vector $W^T_k$ is defined by $[W^T_k]_i:=[W]_{ki}=w_{ki}$, and $F'(t)$ is the diagonal matrix of first order derivatives defined as: $[F'(t)]_{ij}:= 0$ if $i\not=j$, and $[F'(t)]_{ij}:= f'_i(net_i(t))$ otherwise. Here $T$ is the transposition operator, $[A]_{ij}$ is the element in the $i$-th column and $j$-th row of matrix $A$, and $[x]_i$ is the $i$-th component of vector $x$. Using a matrix norm $\Vert . \Vert _A$ compatible with vector norm $\Vert . \Vert _x$, we define

\begin{displaymath}f'_{max}:= \mbox{max}_{\tau=t-1,\ldots,s}
\{\Vert F'(\tau) \Vert _A \}.\end{displaymath}

For $ \mbox{max}_{i=1,\ldots,n} \{\left\vert x_i\right\vert\} \leq \
\Vert x \Vert _x$ we get $\left\vert x^T y\right\vert
\ \leq \ n \ \Vert x \Vert _x \ \Vert y \Vert _x.$ Since

\begin{displaymath}\left\vert f'_v(net_v(s))\right\vert \leq \ \Vert F'(s) \Vert _A \ \leq f'_{max},\end{displaymath}

we obtain the following inequality:

\begin{displaymath}
\left\vert \frac{\partial \delta_{v} (s)}
{\partial \delta...
... \leq \
n \ \left(f'_{max} \ \Vert W \Vert _A \right)^{t-s}.
\end{displaymath}

This inequality results from

\begin{displaymath}\Vert W_v \Vert _x \ = \
\Vert W e_v \Vert _x \ \leq \
\Vert W \Vert _A \ \Vert e_v \Vert _x \ \leq \
\Vert W \Vert _A\end{displaymath}

and

\begin{displaymath}\Vert W^T_k \Vert _x \ = \
\Vert e_k W \Vert _x \ \leq \
\Vert W \Vert _A \ \Vert e_k \Vert _x \ \leq \
\Vert W \Vert _A,\end{displaymath}

where $e_k$ is the unit vector whose components are 0 except for the $k$-th component, which is 1. Note that this is a weak, extreme case upper bound -- it will be reached only if all $\Vert F'(\tau)\Vert _A$ take on maximal values, and if the contributions of all paths across which error flows back from unit $k$ to unit $v$ have the same sign. Large $\Vert W \Vert _A$, however, typically result in small values of $\Vert F'(\tau)\Vert _A$, as confirmed by experiments (see, e.g., [11]). For example, with norms

\begin{displaymath}\Vert W \Vert _A \ :=
\mbox{max}_{r} \sum_{s} \left\vert w_{rs} \right\vert\end{displaymath}

and

\begin{displaymath}\Vert x \Vert _x := \mbox{max}_{r} \left\vert x_{r}\right\vert,\end{displaymath}

we have $f'_{max}=0.25$ for the logistic sigmoid. We observe that if

\begin{displaymath}\left\vert w_{ij}\right\vert \leq w_{max} < \frac{4.0}{n} \ \ \forall i,j,\end{displaymath}

then $\Vert W \Vert _A \ \leq n \ \ w_{max} <4.0$ will result in exponential decay; by setting $\lambda :=\left( \frac{n w_{max}}{4.0} \right) < 1.0$, we obtain

\begin{displaymath}\left\vert \frac{\partial \delta_{v} (s)}
{\partial \delta_{k} (t)} \right\vert \ \leq \ n
\ \lambda^{t-s}.\end{displaymath}

We refer to Hochreiter's thesis [11] for more details.
next up previous
Next: Dilemma: Avoiding gradient decay Up: Exponential error decay Previous: Intuitive explanation of equation
Juergen Schmidhuber 2003-02-19