
Intuitive explanation of equation (2)

If

\begin{displaymath}\left\vert f_{l_{\tau}}'(net_{l_{\tau}} (\tau)) \ \ w_{l_{\tau}l_{\tau-1}}\right\vert \ > \ 1.0\end{displaymath}

for all $\tau$, then the largest product increases exponentially with $t-s-1$. That is, the error blows up, and conflicting error signals arriving at unit $v$ can lead to oscillating weights and unstable learning (for error blow-ups or bifurcations, see also [19,2,8]). On the other hand, if

\begin{displaymath}\left\vert f_{l_{\tau}}'(net_{l_{\tau}} (\tau)) \ \ w_{l_{\tau}l_{\tau-1}}\right\vert \ < \ 1.0\end{displaymath}

for all $\tau$, then the largest product decreases exponentially with $t-s-1$. That is, the error vanishes, and nothing can be learned in acceptable time. If $f_{l_{\tau}}$ is the logistic sigmoid function, then the maximal value of $f_{l_{\tau}}'$ is 0.25. If $a_{l_{\tau-1}}$ is constant and not equal to zero, then the magnitude of the factor $\left\vert f_{l_{\tau}}'(net_{l_{\tau}})w_{l_{\tau}l_{\tau-1}}\right\vert$ takes on its maximal values where

\begin{displaymath}w_{l_{\tau}l_{\tau-1}}=\frac{1}{a_{l_{\tau-1}}} \coth\left(\frac{1}{2} net_{l_{\tau}}\right),\end{displaymath}

goes to zero for $\left\vert w_{l_{\tau}l_{\tau-1}}\right\vert \to \infty$, and is less than $1.0$ whenever $\left\vert w_{l_{\tau}l_{\tau-1}}\right\vert<4.0$ (since $f_{l_{\tau}}' \leq 0.25$, the factor is at most $0.25\left\vert w_{l_{\tau}l_{\tau-1}}\right\vert$; this holds, e.g., if the absolute maximal weight value $w_{max}$ is smaller than 4.0). Hence with conventional logistic sigmoid transfer functions the error flow tends to vanish as long as the weights have absolute values below 4.0, especially in the beginning of the training phase.

In general, the use of larger initial weights does not help: as seen above, for $\left\vert w_{l_{\tau}l_{\tau-1}}\right\vert \to \infty$ the relevant derivative goes to zero ``faster'' than the absolute weight can grow (also, some weights may have to change their signs by crossing zero). Likewise, increasing the learning rate does not help: it does not change the ratio of long-range error flow to short-range error flow. BPTT is too sensitive to recent distractions. Note that since the summation terms in equation (2) may have different signs, increasing the number of units $n$ does not necessarily increase the error flow.
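The following Python sketch (not part of the original text; the per-step factors 1.2 and 0.8, the 50 steps, and the constant activation $a=1.0$ are arbitrary choices for illustration) numerically checks both observations: the product of per-step factors grows or shrinks exponentially depending on whether each factor lies above or below 1.0, and with the logistic sigmoid the factor $\left\vert f_{l_{\tau}}'(net_{l_{\tau}})w_{l_{\tau}l_{\tau-1}}\right\vert$ stays below 1.0 for weight magnitudes under 4.0 and tends to zero for large weights.

\begin{verbatim}
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def logistic_prime(x):
    s = logistic(x)
    return s * (1.0 - s)          # maximal value 0.25, attained at x = 0

# (1) The product of per-step factors grows or shrinks exponentially with
#     the number of steps, depending on whether each factor exceeds 1.0.
for factor in (1.2, 0.8):         # illustrative per-step values of |f'(net) * w|
    print(f"per-step factor {factor}: product over 50 steps = {factor**50:.3e}")

# (2) For the logistic sigmoid and a constant, nonzero activation a, the
#     factor |f'(w*a) * w| stays below 1.0 for |w| < 4.0 (since f' <= 0.25)
#     and goes to zero as |w| grows.
a = 1.0                           # assumed constant activation a_{l_{tau-1}}
w = np.linspace(-20.0, 20.0, 20001)
g = np.abs(logistic_prime(w * a) * w)

print("max |f'(w*a)*w| over |w| < 4.0:", g[np.abs(w) < 4.0].max())   # about 0.22
print("|f'(w*a)*w| at w = 20.0      :", g[-1])                       # about 4e-8
# The maxima lie where w = (1/a) * coth(net/2) with net = w*a, as stated above.
\end{verbatim}

With $a=1.0$ the maximum is roughly 0.22, consistent with the bound $0.25\left\vert w_{l_{\tau}l_{\tau-1}}\right\vert < 1.0$ for $\left\vert w_{l_{\tau}l_{\tau-1}}\right\vert < 4.0$.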