
A.4. PSEUDO CODE OF THE ALGORITHM

Below is the algorithm in pseudo code (using fast multiplication as in appendix A.3.2). Comments are marked by ``**''. Note: the pseudo code was omitted from the version for Neural Computation. We recommend not reimplementing the algorithm blindly from the pseudo code, but making a serious effort to understand it by also consulting the body of the paper. We believe that this will greatly facilitate proper, problem-specific use of the algorithm.

Notation. In what follows, the integer variables $i$ and $j$ index units.
Variables $k,m,k_1$ are reserved for output units only.
$g$ is the number of layers, where the $g$th layer is the output layer and the $1$st layer is the input layer.
The current pattern consists of input vector $x$ and target vector $t$ (see section 2).
$x[j]$ is the component of the input vector corresponding to input unit $j$.
$t[k]$ is the component of the output vector corresponding to output unit $k$.
$w[i][j]$ is the real-valued weight on the connection from unit $j$ to unit $i$ (see $w_{ij}$ in section 2).
$s[j]$ is the net input of unit $j$ (see equation (34) and text thereafter).
$f_j$ is the activation function of unit $j$, $f_j'$ is the first derivative, $f_j''$ is the second derivative of $f_j$ (see appendix A.2).
$y[j]$ is the activation of the $j$-th unit (see appendix A.2).
$error[k]$ is the error of the $k$-th output unit.
$ky[k][j]$ is $\frac{\partial y[k]}{\partial y[j]}$ (see equation (43)).
$ks[k][j]$ is $\frac{\partial y[k]}{\partial s[j]}$ (see equation (44)).
$ykw[k][i][j]$ is $\frac{\partial y[k]}{\partial w[i][j]}$ (see equation (45)).
$yw[i][j]$ is $\frac{\partial E}{\partial w[i][j]}$, i.e., the corresponding component of $\nabla_w E$, the gradient of the quadratic error.
abs$(x)$ denotes the absolute value of real $x$.
kron$(m=k)$ returns $1$ if $m=k$ and $0$ otherwise.
sign$(x)$ returns the sign of real $x$.
$t1[][], t2[], t3, t4$ are variables (used to compute the right hand side of equation (40)).
$t1[i][j] = \sum_{k  \mbox{\footnotesize is output unit}} \left(\frac{\partial y[k]}{\partial w[i][j]}\right)^2 = \sum_{k  \mbox{\footnotesize is output unit}} (ykw[k][i][j])^2$.
$t2[k] = \sum_{w[i][j]} \frac{\mbox{\footnotesize abs}\left(\frac{\partial y[k]}{\partial w[i][j]}\right)}{\sqrt{\sum_{k_1  \mbox{\footnotesize is output unit}} \left(\frac{\partial y[k_1]}{\partial w[i][j]}\right)^2}} = \sum_{w[i][j]} \frac{\mbox{\footnotesize abs}\left( ykw[k][i][j]\right)}{\sqrt{t1[i][j]}}$ (see the sums over $l,r$ in (40)).
$t3 = \sum_{k  \mbox{\footnotesize is output unit}} \left( \sum_{w[i][j]} \frac{\mbox{\footnotesize abs}\left(\frac{\partial y[k]}{\partial w[i][j]}\right)}{\sqrt{\sum_{k_1  \mbox{\footnotesize is output unit}} \left(\frac{\partial y[k_1]}{\partial w[i][j]}\right)^2}}\right)^2 = \sum_{k  \mbox{\footnotesize is output unit}} (t2[k])^2$ (see the denominator in the second line of equation (40)).
$t4 = \sum_{m  \mbox{\footnotesize is output unit}}  \  \sum_{w[l][r]} \frac{\mbox{\footnotesize abs}\left( ykw[m][l][r]\right)}{\sqrt{t1[l][r]}} \  \mbox{\footnotesize sign}\left( ykw[m][i][j]\right)  \  \frac{\mbox{\footnotesize kron}(m=k)  t1[i][j] - ykw[m][i][j]  ykw[k][i][j]}{ (t1[i][j])^{\frac{3}{2}} }$ (see the numerator in the second line of equation (40); a small numerical sketch of $t1, t2, t3, t4$ follows this notation list).

$weights$ stands for the number of weights that make a significant contribution to the computation of the current pattern.
$\delta w[i][j]$ is an approximation of $w[i][j]$'s precision (approximation because $\epsilon$ is unknown).
$insignificant[i][j]$ marks whether or not $w[i][j]$ provides a significant contribution to the computation of the current pattern.
$b[k][i][j]$ is $\frac{\partial B}{\partial
\left( \frac{\partial y[k]}{\partial w[i][j]} \right)}$ (see equation (40)).
$rs[k][i]$ is $R\{s[i]\}$ (see equation (46)).
$ry[k][i]$ is $R\{y[i]\}$ (see equation (47)).
$rdks[i]$ is $R\{\frac{\partial y[k]}{\partial s[i]}\}$ (see equation (49)).
$rdky[i]$ is $R\{\frac{\partial y[k]}{\partial y[i]}\}$ (see equation (48)).
$rdw[i][j]$ is $R\{\frac{\partial y[k]}{\partial w[i][j]}\}=\nabla_w B =
\sum_{k} H^{k} (\nabla_{\frac{\partial o^{k}}{\partial w_{ij}}} B)$, the gradient of the additional error term $B$ (see equation (50)).
$E$ is the current pattern's quadratic error (see section 2 and appendix A.1).
$\alpha$ is the learning rate for the quadratic error.
$\lambda$ is the learning rate for the additional error term (the following quantities are used to update $\lambda$ according to Weigend et al., 1991; see also section 5.6).
$\Delta \lambda$ is a parameter needed for updating $\lambda$.
$\sigma$ is a parameter needed for updating $\lambda$, typical value is 0.5.
$E_o$ is the previous epoch's average error.
$E_n$ is the current epoch's average error.
$E_a$ is the exponentially weighted average epoch error.
$\gamma$ is the parameter for the exponentially weighted error -- a typical value is 0.9 or 0.99, depending on training set size.
$E_{tol}$ is the tolerable error level - it depends on the task to be solved (see section 2 and appendix A.1, equation (13)).
$exemplars$ is the number of exemplars observed during training.
$epochlength$ is the length of an epoch, measured in number of presented training patterns.
$epochs = exemplars  \%  epochlength$ is the number of epochs completed so far, where $\%$ denotes integer division.
$lyw, lrdw, scale$ are variables required for normalizing $B$'s gradient to the length of $E$'s gradient.
$wasalive[i][j]$ is TRUE if $w[i][j]$ was alive for at least one pattern presented during the previous epoch, and FALSE otherwise.
$alive[i][j]$ is TRUE if $wasalive[i][j]$ is TRUE and if $alive[i][j]$ was always TRUE during all epochs since $w[i][j]$ was set alive for the last time (otherwise $alive[i][j]$ is FALSE).
$ \leftarrow $ denotes the assignment operator.
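
For concreteness, the following small NumPy sketch (not part of the original pseudo code; the names, shapes, and toy data are assumptions made only for illustration) computes $t1, t2, t3, t4$ for a given array $ykw$:

import numpy as np

# toy setting: 2 output units, a 3x3 weight array
K_out, n_i, n_j = 2, 3, 3
rng = np.random.default_rng(0)
ykw = rng.normal(size=(K_out, n_i, n_j))             # stands for ykw[k][i][j]

t1 = np.sum(ykw ** 2, axis=0)                        # t1[i][j] = sum_k ykw[k][i][j]^2
t2 = np.sum(np.abs(ykw) / np.sqrt(t1), axis=(1, 2))  # t2[k]
t3 = np.sum(t2 ** 2)                                 # t3 = sum_k t2[k]^2

# t4 depends on a fixed (k, i, j); here k = 0, i = 0, j = 0 as an example
k, i, j = 0, 0, 0
kron = (np.arange(K_out) == k).astype(float)         # kron(m = k)
t4 = np.sum(t2 * np.sign(ykw[:, i, j])
            * (kron * t1[i, j] - ykw[:, i, j] * ykw[k, i, j])
            / t1[i, j] ** 1.5)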

Additional comments.

Speeding up the algorithm. It makes sense to separate the algorithm into two phases. Phase 1 is conventional backprop, phase 2 is FMS. The backprop phase consists of the forward pass, the first backward pass, and the weight update with $\lambda =0$ (marked in the algorithm). Start with phase 1. Switch to phase 2 if $E_a < 0.9  E_{tol}$. Switch back to phase 1 if $E_a > 1.1  E_{tol}$ (the values $0.9$ and $1.1$ can be changed).

Two-phase learning does not depend sensitively on $\lambda$ (but avoid $\lambda$ values that remain too small). Two-phase learning is justified because weights with large $\delta w[i][j]$ (and small $\frac{\partial E}{\partial w[i][j]}$) hardly influence $E$'s gradient -- it makes sense to let FMS focus on the low-precision weights, and let backprop focus on the others.
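
A minimal sketch of this two-phase schedule (the function name and its arguments are illustrative, not from the paper; the thresholds are the values mentioned above):

# phase 1 = plain backprop (lambda = 0), phase 2 = full FMS update
def next_phase(phase, E_a, E_tol, low=0.9, high=1.1):
    # switch to FMS once the averaged error is safely below tolerance,
    # fall back to plain backprop if it grows too large again
    if phase == 1 and E_a < low * E_tol:
        return 2
    if phase == 2 and E_a > high * E_tol:
        return 1
    return phase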




ALGORITHM.


Initialization. Set $K= 10^2$ or $K = 10^3$ (the exponent is the difference between the numbers of significant digits required for maximal and minimal precision).
Set $\epsilon$ to an arbitrary small value.
Set $exemplars$ and $epochs$ to 0.
Initialize $w[i][j]$ for all $i,j$.
Initialize $\lambda,\alpha$ (typically, $\lambda =0$), and provide a criterion for stopping the learning procedure.
Set $\gamma,\sigma,\Delta \lambda$, for instance, $\gamma = 0.9$ or $0.99, \sigma = 0.5, \Delta \lambda = 0.01  \alpha$ or $\Delta \lambda = 0.001  \alpha$.
Set $E_{tol}$ to some desired error value after learning.
Set $E, E_a, E_o, E_n$ to 0.
Set $epochlength$.
Set $alive[i][j] =$ TRUE for all $i,j$.
Set $wasalive[i][j] =$ FALSE for all $i,j$.
Set $\delta_{min}$ to a large value.
Set $alive[i][j] =$ FALSE for non-existing $w[i][j]$.




While training not stopped do
begin(while)

select pattern pair $(x, t)$.
** The following variables may instead be reset right after their last use in the previous iteration of the loop. **
set all components of $s[], yw[][], ky[][], t1[][], t2[],
rs[][], rdw[][], rdky[]$ to 0
set all components of $insignificant[][]$ to FALSE
set $t3, t4, E$ to 0

** Forward pass. **

for all input units $j$ do
begin
$\mid$         $y[j] \leftarrow x[j]$
end

for $u=2$ to $g$ do
begin
$\mid$        for all units $i$ in layer $u$ do
$\mid$        begin
$\mid$        $\mid$        for all units $j$ in layer $u-1$ do
$\mid$        $\mid$        begin
$\mid$        $\mid$        $\mid$        if ($alive[i][j]$) do
$\mid$        $\mid$        $\mid$        begin
$\mid$        $\mid$        $\mid$        $\mid$         $s[i] \leftarrow s[i]+ w[i][j]  y[j]$
$\mid$        $\mid$        $\mid$        end
$\mid$        $\mid$        end
$\mid$        $\mid$         $y[i] \leftarrow f_i(s[i])$
$\mid$        end
end
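
As an illustration (not part of the pseudo code), a minimal NumPy sketch of this masked forward pass for one fully connected layer, assuming tanh activation functions and a boolean alive mask of the same shape as the weight matrix:

import numpy as np

def forward_layer(y_prev, w, alive, f=np.tanh):
    # dead connections are excluded from the net input s[i]
    s = (w * alive) @ y_prev       # s[i] = sum_j alive[i][j] * w[i][j] * y[j]
    return s, f(s)                 # y[i] = f_i(s[i])

Applying such a step layer by layer reproduces the loop over $u = 2, \dots, g$ above.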

** Compute the error. **

for all output units $k$ do
begin
$\mid$         $error[k] \leftarrow t[k] - y[k]$
$\mid$         $E \leftarrow E + (error[k])^2$
end

$E_n \leftarrow E_n + E$

** Compute the error gradient **

** 1. backward pass. **

for all output units $k$ do
begin
$\mid$         $ks[k][k] \leftarrow f_k'(s[k])$
$\mid$        for $u=1$ to $g-1$ do
$\mid$        begin
$\mid$        $\mid$        for all units $j$ in layer $g-u$ do
$\mid$        $\mid$        begin
$\mid$        $\mid$        $\mid$        (IF $u\not=1$ THEN: for all units $i$ in layer $g-u+1$ ELSE: $i=k$) do
$\mid$        $\mid$        $\mid$        begin
$\mid$        $\mid$        $\mid$        $\mid$        if ($alive[i][j]$) do
$\mid$        $\mid$        $\mid$        $\mid$        begin
$\mid$        $\mid$        $\mid$        $\mid$        $\mid$         $ykw[k][i][j] \leftarrow y[j]  ks[k][i]$
$\mid$        $\mid$        $\mid$        $\mid$        $\mid$        enforce abs$(ykw[k][i][j]) >$ 1E-5 ** clamp small values to avoid overflow in later divisions **
$\mid$        $\mid$        $\mid$        $\mid$        $\mid$         $yw[i][j] \leftarrow yw[i][j] + ykw[k][i][j]  error[k]$
$\mid$        $\mid$        $\mid$        $\mid$        $\mid$         $ky[k][j] \leftarrow ky[k][j] + w[i][j]  ks[k][i]$
$\mid$        $\mid$        $\mid$        $\mid$        $\mid$         $t1[i][j] \leftarrow t1[i][j] + (ykw[k][i][j])^2$
$\mid$        $\mid$        $\mid$        $\mid$        end
$\mid$        $\mid$        $\mid$        end
$\mid$        $\mid$        $\mid$         $ks[k][j] \leftarrow f_j'(s[j]) ky[k][j]$
$\mid$        $\mid$        end
$\mid$        end
end
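
For concreteness, a hedged NumPy sketch of this first backward pass for a network with a single hidden layer ($g=3$) and tanh units; the alive mask and the clamping of small abs$(ykw)$ values are omitted for brevity, and all names are assumptions made for this example:

import numpy as np

def backward_pass_1(x, s_hid, y_hid, s_out, W1, W2, error):
    # W1: hidden-from-input weights, W2: output-from-hidden weights
    dtanh = lambda s: 1.0 - np.tanh(s) ** 2
    n_out, n_hid = W2.shape
    ykw2 = np.zeros((n_out, n_out, n_hid))        # dy[k]/dW2[i][j]
    ykw1 = np.zeros((n_out, *W1.shape))           # dy[k]/dW1[i][j]
    ks_out = dtanh(s_out)                         # ks[k][k] = f_k'(s[k])
    ky_hid = np.zeros((n_out, n_hid))
    ks_hid = np.zeros((n_out, n_hid))
    for k in range(n_out):
        ykw2[k, k] = y_hid * ks_out[k]            # output-layer weights of unit k
        ky_hid[k] = W2[k] * ks_out[k]             # ky[k][j] for hidden units j
        ks_hid[k] = dtanh(s_hid) * ky_hid[k]      # ks[k][j]
        ykw1[k] = np.outer(ks_hid[k], x)          # hidden-layer weights
    # yw[i][j] = sum_k ykw[k][i][j] * error[k];  t1[i][j] = sum_k ykw[k][i][j]^2
    yw1 = np.einsum('kij,k->ij', ykw1, error)
    yw2 = np.einsum('kij,k->ij', ykw2, error)
    t1_1 = np.sum(ykw1 ** 2, axis=0)
    t1_2 = np.sum(ykw2 ** 2, axis=0)
    return ykw1, ykw2, yw1, yw2, t1_1, t1_2, ks_out, ky_hid, ks_hid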

** End of conventional backprop (phase 1). **

** compute $b[k][i][j] = \frac{\partial B}{\partial
\left( \frac{\partial o^k}{\partial
w_{ij}} \right)}$ **

** we recommend introducing additional local variables for the inner loops, such as $h1= \sqrt{t1[i][j]}$ and $h2=ykw[k][i][j]  /  t1[i][j]$ **

for all output units $k$ do
begin
$\mid$        for all $i,j$, such that ($alive[i][j]$) do
$\mid$        begin
$\mid$        $\mid$         $t2[k] \leftarrow t2[k] + \mbox{abs}(ykw[k][i][j])  /  \sqrt{t1[i][j]}$
$\mid$        end
$\mid$         $t3 \leftarrow t3 + (t2[k])^2$
end

** some weights are insignificant to compute the current pattern **

for all $i,j$, such that ($alive[i][j]$) do
begin
$\mid$         $\delta w[i][j] \leftarrow \sqrt{\epsilon}  /  (\sqrt{t1[i][j]} \sqrt{t3})$
$\mid$        if ( $\delta w[i][j] < \delta_{min}$ ) do
$\mid$        begin
$\mid$        $\mid$         $\delta_{min} \leftarrow \delta w[i][j]$
$\mid$        end
end

$weights \leftarrow 0$

for all $i,j$, such that ($alive[i][j]$) do
begin
$\mid$        if ( $\delta w[i][j] > K  \delta_{min}$ ) do
$\mid$        begin
$\mid$        $\mid$         $insignificant[i][j] \leftarrow$ TRUE
$\mid$        $\mid$        for all output units $k$ do
$\mid$        $\mid$        begin
$\mid$        $\mid$        $\mid$         $t1[i][j] \leftarrow t1[i][j] - (ykw[k][i][j])^2$
$\mid$        $\mid$        end
$\mid$        end
$\mid$        else do
$\mid$        begin
$\mid$        $\mid$         $weights \leftarrow weights + 1$
$\mid$        $\mid$         $wasalive[i][j] \leftarrow $ TRUE
$\mid$        end
end
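
A corresponding sketch of the precision estimate $\delta w$ and of the marking of insignificant weights (again illustrative only; note that $\delta_{min}$ is a running minimum kept across patterns, so the caller has to carry it along):

import numpy as np

def mark_insignificant(ykw, t1, t3, delta_min, eps=1e-6, K=1e3):
    # delta_w approximates each weight's precision (eps plays the role of epsilon)
    delta_w = np.sqrt(eps) / (np.sqrt(t1) * np.sqrt(t3))
    delta_min = min(delta_min, delta_w.min())       # running minimum across patterns
    insignificant = delta_w > K * delta_min         # boolean mask per weight
    # remove the contributions of insignificant weights from t1
    t1 = np.where(insignificant, t1 - np.sum(ykw ** 2, axis=0), t1)
    weights = np.count_nonzero(~insignificant)      # number of significant weights
    return delta_w, delta_min, insignificant, t1, weights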

** update variables after having marked the current pattern's insignificant weights **
$t3 \leftarrow 0$
for all output units $k$ do
begin
$\mid$         $t2[k] \leftarrow 0$
$\mid$        for all $i,j$, such that ($alive[i][j]$ AND NOT $insignificant[i][j]$ ) do
$\mid$        begin
$\mid$        $\mid$         $t2[k] \leftarrow t2[k] + \mbox{abs}(ykw[k][i][j])  /  \sqrt{t1[i][j]}$
$\mid$        end
$\mid$         $t3 \leftarrow t3 + (t2[k])^2$
end

for all output units $k$ do
begin
$\mid$        for all $i,j$, such that ($alive[i][j]$ AND NOT $insignificant[i][j]$ ) do
$\mid$        begin
$\mid$        $\mid$         $t4 \leftarrow 0$
$\mid$        $\mid$        for all output units $m$ do
$\mid$        $\mid$        begin
$\mid$        $\mid$        $\mid$         $t4 \leftarrow t4  +  t2[m]   \mbox{sign}(ykw[m][i][j])$
$\mid$        $\mid$        $\mid$                                 $(  \mbox{kron}(m=k)  t1[i][j] - ykw[m][i][j]   ykw[k][i][j]  )   /   (t1[i][j])^{\frac{3}{2}}$
$\mid$        $\mid$        end
$\mid$        $\mid$         $b[k][i][j] \leftarrow ykw[k][i][j]  /  t1[i][j] + weights  t4  /  t3$
$\mid$        end
end
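
An illustrative sketch of the $b[k][i][j]$ computation, including the recomputation of $t2$ and $t3$ over the significant weights only; here significant stands for alive AND NOT insignificant, and all names are assumptions:

import numpy as np

def compute_b(ykw, t1, significant, weights):
    n_out = ykw.shape[0]
    # recompute t2 and t3 over the significant weights only
    t2 = np.array([np.sum(np.abs(ykw[k])[significant] / np.sqrt(t1[significant]))
                   for k in range(n_out)])
    t3 = np.sum(t2 ** 2)
    b = np.zeros_like(ykw)
    kron = np.eye(n_out)                            # kron[m, k] = kron(m = k)
    for k in range(n_out):
        for (i, j) in zip(*np.nonzero(significant)):
            t4 = np.sum(t2 * np.sign(ykw[:, i, j])
                        * (kron[:, k] * t1[i, j] - ykw[:, i, j] * ykw[k, i, j])
                        / t1[i, j] ** 1.5)
            b[k, i, j] = ykw[k, i, j] / t1[i, j] + weights * t4 / t3
    return b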

** Forward pass. **

for all output units $k$ do
begin
$\mid$        for all input units $j$ do
$\mid$        begin
$\mid$        $\mid$         $ry[k][j] \leftarrow 0$
$\mid$        end
$\mid$        for $u=2$ to $g$ do
$\mid$        begin
$\mid$        $\mid$        (IF $u\not=g$ THEN: for all units $i$ in layer $u$ ELSE: $i=k$) do
$\mid$        $\mid$        begin
$\mid$        $\mid$        $\mid$        for all units $j$ in layer $u-1$ do
$\mid$        $\mid$        $\mid$        begin
$\mid$        $\mid$        $\mid$        $\mid$        if ($alive[i][j]$ AND NOT $insignificant[i][j]$ ) do
$\mid$        $\mid$        $\mid$        $\mid$        begin
$\mid$        $\mid$        $\mid$        $\mid$        $\mid$         $rs[k][i] \leftarrow rs[k][i] + w[i][j]  ry[k][j] + b[k][i][j]  y[j]$
$\mid$        $\mid$        $\mid$        $\mid$        end
$\mid$        $\mid$        $\mid$        end
$\mid$        $\mid$        $\mid$         $ry[k][i] \leftarrow rs[k][i]  f_i'(s[i])$
$\mid$        $\mid$        end
$\mid$        end
end
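
A sketch of this R-forward pass for the one-hidden-layer example used above (tanh units assumed, alive/insignificant masks omitted); b1 and b2 denote the $b$ arrays for the input-to-hidden and hidden-to-output weights:

import numpy as np

def r_forward(x, y_hid, s_hid, W2, b1, b2):
    dtanh = lambda s: 1.0 - np.tanh(s) ** 2
    n_out, n_hid = W2.shape
    rs_hid = np.zeros((n_out, n_hid))
    ry_hid = np.zeros((n_out, n_hid))
    rs_out = np.zeros(n_out)
    for k in range(n_out):
        # hidden layer: ry[k][j] of the input units is 0, so only the b*y term remains
        rs_hid[k] = b1[k] @ x
        ry_hid[k] = rs_hid[k] * dtanh(s_hid)
        # output layer: for output unit k only unit i = k is visited
        rs_out[k] = W2[k] @ ry_hid[k] + b2[k, k] @ y_hid
    return rs_hid, ry_hid, rs_out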

** 2. backward pass. **

for all output units $k$ do
begin
$\mid$         $rdks[k] \leftarrow rs[k][k]  f_k''(s[k])$
$\mid$        for $u=1$ to $g-1$ do
$\mid$        begin
$\mid$        $\mid$        for all units $j$ in layer $g-u$ do
$\mid$        $\mid$        begin
$\mid$        $\mid$        $\mid$         $rdky[j] \leftarrow 0$
$\mid$        $\mid$        $\mid$        (IF $u\not=1$ THEN: for all units $i$ in layer $g-u+1$ ELSE: $i=k$) do
$\mid$        $\mid$        $\mid$        begin
$\mid$        $\mid$        $\mid$        $\mid$        if ($alive[i][j]$ AND NOT $insignificant[i][j]$ ) do
$\mid$        $\mid$        $\mid$        $\mid$        begin
$\mid$        $\mid$        $\mid$        $\mid$        $\mid$         $rdky[j] \leftarrow rdky[j] + w[i][j]  rdks[i] + b[k][i][j]  ks[k][i]$
$\mid$        $\mid$        $\mid$        $\mid$        $\mid$         $rdw[i][j] \leftarrow rdw[i][j] +  y[j]  rdks[i] + ry[k][j]  ks[k][i]$
$\mid$        $\mid$        $\mid$        $\mid$        end
$\mid$        $\mid$        $\mid$        end
$\mid$        $\mid$        $\mid$         $rdks[j] \leftarrow f_j'(s[j])  rdky[j] + rs[k][j]  f_j''(s[j])  ky[k][j]$
$\mid$        $\mid$        end
$\mid$        end
end
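
A sketch of this second backward pass for the same one-hidden-layer example; ks_out and ky_hid stem from the first backward pass, rs_hid, ry_hid, rs_out from the R-forward pass, and the masks are again omitted. The returned arrays correspond to $rdw$ for the two weight matrices:

import numpy as np

def backward_pass_2(x, y_hid, s_hid, s_out, W2, b2,
                    ks_out, ky_hid, rs_hid, ry_hid, rs_out):
    dtanh = lambda s: 1.0 - np.tanh(s) ** 2
    ddtanh = lambda s: -2.0 * np.tanh(s) * dtanh(s)   # second derivative of tanh
    n_out = W2.shape[0]
    rdw1 = np.zeros((n_out, s_hid.size, x.size))      # per output unit, summed below
    rdw2 = np.zeros((n_out, *W2.shape))
    for k in range(n_out):
        rdks_k = rs_out[k] * ddtanh(s_out[k])         # rdks at output unit k
        # hidden-layer weights (only i = k contributes for output unit k)
        rdky_hid = W2[k] * rdks_k + b2[k, k] * ks_out[k]
        rdw2[k, k] = y_hid * rdks_k + ry_hid[k] * ks_out[k]
        rdks_hid = dtanh(s_hid) * rdky_hid + rs_hid[k] * ddtanh(s_hid) * ky_hid[k]
        # input-layer weights (ry of the input units is 0, so only the y*rdks term)
        rdw1[k] = np.outer(rdks_hid, x)
    # rdw = gradient of B: contributions are summed over the output units
    return rdw1.sum(axis=0), rdw2.sum(axis=0)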

** Normalize $B$'s gradient to the length of $E$'s gradient. **

$lyw \leftarrow 0$
$lrdw \leftarrow 0$

for all $i,j$, such that ($alive[i][j]$ AND NOT $insignificant[i][j]$ ) do
begin
$\mid$         $lyw \leftarrow lyw + (yw[i][j])^2 $
$\mid$         $lrdw \leftarrow lrdw + (rdw[i][j])^2 $
end

$scale \leftarrow \sqrt{lyw} / \sqrt{lrdw}$ ** $scale  =  \parallel yw \parallel  /  \parallel rdw \parallel$ **

** End of $B$'s gradient computation (phase 2). **

** Weight update. **

for all $i,j$, such that ($alive[i][j]$ AND NOT $insignificant[i][j]$ ) do
begin
$\mid$         $w[i][j] \leftarrow w[i][j] + \alpha  yw[i][j] - \lambda  scale  rdw[i][j]$
end
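
A compact sketch of the normalization and the combined weight update (mask stands for alive AND NOT insignificant; names are illustrative):

import numpy as np

def weight_update(w, yw, rdw, alpha, lam, mask):
    # normalize B's gradient to the length of E's gradient: scale = ||yw|| / ||rdw||
    scale = np.sqrt(np.sum((yw * mask) ** 2)) / np.sqrt(np.sum((rdw * mask) ** 2))
    # combined update, applied only to the masked weights
    return np.where(mask, w + alpha * yw - lam * scale * rdw, w)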

** Update learning parameters. **

$exemplars \leftarrow exemplars + 1$

if ($exemplars$ mod $epochlength = 0$) do ** ``mod'' is the modulo function **
begin
$\mid$         $epochs \leftarrow epochs + 1$
$\mid$         $E_n \leftarrow E_n  /  epochlength$
$\mid$         $E_a \leftarrow \gamma  E_a + (1-\gamma)  E_n$
$\mid$        ** $\lambda$ update according to Weigend et al.(1991). **
$\mid$        if ( $E_n \leq E_{tol}$ OR $E_n \leq E_o$) do
$\mid$        begin
$\mid$        $\mid$         $\lambda \leftarrow \lambda + \Delta \lambda$
$\mid$        end
$\mid$        else do
$\mid$        begin
$\mid$        $\mid$        if ($E_n \leq E_a$) do
$\mid$        $\mid$        begin
$\mid$        $\mid$        $\mid$         $\lambda \leftarrow \lambda - \Delta \lambda$
$\mid$        $\mid$        $\mid$        if ($\lambda < 0$) do
$\mid$        $\mid$        $\mid$        begin
$\mid$        $\mid$        $\mid$        $\mid$         $\lambda \leftarrow 0$
$\mid$        $\mid$        $\mid$        end
$\mid$        $\mid$        end
$\mid$        $\mid$        else do
$\mid$        $\mid$        begin
$\mid$        $\mid$        $\mid$         $\lambda \leftarrow \sigma  \lambda$
$\mid$        $\mid$        end
$\mid$        end
$\mid$         $E_o \leftarrow E_n$
$\mid$         $E_n \leftarrow 0$
$\mid$        ** update which weights are alive: **
$\mid$        ** a weight stays alive only if it was alive (marked by $wasalive[][]$) **
$\mid$        ** for at least one pattern presented during the previous epoch. **
$\mid$        for all $i,j$, such that ($alive[i][j]$) do
$\mid$        begin
$\mid$        $\mid$        if ( $wasalive[i][j] =$ FALSE) do
$\mid$        $\mid$        begin
$\mid$        $\mid$        $\mid$        $alive[i][j] \leftarrow$ FALSE
$\mid$        $\mid$        end
$\mid$        $\mid$         $wasalive[i][j] \leftarrow$ FALSE
$\mid$        end
$\mid$        if ($epochs$ mod $100 = 0$ OR $E_a$ $>$ 2.0 $E_{tol}$) do
$\mid$        ** weights are re-animated if the average error is too large; weights are also **
$\mid$        ** re-animated every 100-th epoch, to enable faster reduction of quadratic error **
$\mid$        ** (due to weight changes, some previously dead weight may turn out to deserve **
$\mid$        ** to live again); one may use values other than 100 and 2.0 **
$\mid$        begin
$\mid$        $\mid$        for all $i,j$ such that $w[i][j]$ exists do
$\mid$        $\mid$        begin
$\mid$        $\mid$        $\mid$        $alive[i][j] \leftarrow$ TRUE
$\mid$        $\mid$        end
$\mid$        end
end
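
The epoch-end $\lambda$ schedule above can be sketched as follows (a direct transcription of that branch; the function and argument names are assumptions):

def update_lambda(lam, delta_lam, sigma, E_n, E_o, E_a, E_tol):
    # lambda schedule after Weigend et al. (1991), applied at the end of each epoch
    if E_n <= E_tol or E_n <= E_o:
        return lam + delta_lam             # error tolerable or still falling: increase lambda
    if E_n <= E_a:
        return max(0.0, lam - delta_lam)   # mild setback: decrease lambda, never below 0
    return sigma * lam                     # strong setback: cut lambda by factor sigma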

decide whether to stop learning or not

end(while)


