Below is the algorithm in pseudo code
(using fast multiplication, as in appendix A.3.2).
Comments are marked by ``**''. *Note:* the pseudo code
was omitted from the version for *Neural Computation*.
*We recommend not reimplementing the algorithm
blindly from the pseudo code, but making a serious effort to
understand it by consulting the body of the paper as well.
We believe that this will greatly facilitate proper,
problem-specific use of the algorithm.*

**Notation.**
In what follows, the integer variables stand for units;
certain of them are reserved for output units only.

One constant denotes the number of layers; the topmost layer
is the output layer and the 1st layer is the input layer.

The current pattern consists of an input vector
and a target vector (see section 2).

Each input unit has a corresponding component of the input vector,
and each output unit has a corresponding component of the output vector.

A real-valued weight sits on the connection from
one unit to another (see section 2).

Each unit has a net input (see equation (34) and the text thereafter).

Each unit has an activation function together with its first
and second derivatives (see appendix A.2).

Each unit has an activation (see appendix A.2).

Each output unit has an error.

Three auxiliary quantities are defined by
equations (43), (44), and (45), respectively.

One variable holds the gradient of the quadratic error.

abs denotes the absolute value of a real number.

kron is the Kronecker delta: it returns 1 if its two arguments are equal and 0 otherwise.

sign returns the sign of a real number.

Several auxiliary variables are used to compute the right-hand side
of equation (40): one for the sums over output units in (40),
one for the denominator in the second line of equation (40),
and one for the numerator in the second line of equation (40).

One variable stands for the number of weights that make a
significant contribution to the computation of the current pattern.

One variable is an approximation of a weight's precision
(an approximation, because the exact value is unknown).

One flag marks whether or not a weight provides
a significant contribution to the computation of the current pattern.

One variable is given by equation (40); further variables are given
by equations (46), (47), (49), and (48).

One variable holds the gradient
of the additional error term (see equation (50)).

One variable is the current pattern's quadratic error (see section 2 and appendix A.1).

One variable is the learning rate for the quadratic error;
another is the learning rate for the additional
error term (the following values are used
to make updates according to Weigend et al., 1991;
see also section 5.6).

Two parameters are needed for updating the latter learning rate;
a typical value for the second one is 0.5.

One variable is the most recent epoch's average error,
another the current epoch's average error,
and a third the exponentially weighted average epoch error.

The parameter for the exponentially weighted error has
a typical value of 0.9 or 0.99, depending on training set size.
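The exponentially weighted average epoch error is maintained by a one-line recurrence; a minimal sketch (the names `avg_err`, `epoch_err`, and `gamma` are illustrative, not the paper's symbols):

```python
def update_weighted_error(avg_err, epoch_err, gamma=0.9):
    """Exponentially weighted average of epoch errors.

    gamma close to 1 gives a longer memory; 0.9 or 0.99 are the
    typical values mentioned above, depending on training set size.
    """
    return gamma * avg_err + (1.0 - gamma) * epoch_err
```

With gamma = 0.9, each new epoch error contributes one tenth of its value to the running average.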

One variable is the tolerable error level - it depends on the task
to be solved (see section 2 and appendix A.1, equation (13)).

One variable counts the number of exemplars observed during training.

One variable is the length of an epoch, measured in the number of
presented training patterns.

One variable is the number of
epochs so far, obtained by integer division.

Further variables are required for normalizing
the additional error term's gradient to the length of the
quadratic error's gradient.

One flag is TRUE if
the corresponding weight was alive for at least one pattern
presented during
the previous epoch, and FALSE otherwise.

A second flag is TRUE if
the first flag is TRUE
*and* has remained
TRUE during all epochs since
the weight was last set alive
(otherwise it is FALSE).

An arrow denotes the assignment operator.

**Additional comments.**

- For simplicity, the description of the algorithm
neglects bias weights and ``true units''.
- Targets should be scaled
to bound the first-order derivatives of output units - see the text
after equation (36) in A.2 (e.g.,
for sigmoids, scale targets to the sigmoid's active range).
- Removing more than one weight
at a time may cause the error to increase.
Removing only one weight at a time
leads to smoother performance improvement.
- To prevent accidental weight removal in the case of
small training sets,
we recommend not using too many near-zero inputs --
weights fed by such inputs may be evaluated
as insignificant despite being significant.
- Likewise,
the random weight initialization at the beginning of
the learning phase may
cause accidental weight removal due to small, random, initial
derivatives. This can be prevented by
keeping all weights alive for a certain
initial time interval.
- Initially,
one of the parameters does not yet have a sensible interpretation.
One may start with a large value
and decrease it as the error decreases.
- For each pattern, there is a minimal required precision;
it represents the
weight precision required for significant weights.

*Speeding up the algorithm.*
It makes sense to
separate the algorithm into two phases.
Phase 1 is conventional backprop;
phase 2 is FMS. The backprop phase consists of the
forward pass, the first backward pass, and the weight update
based on the quadratic error's gradient (marked in the algorithm).
Start with phase 1. Switch to phase 2 once the average error
falls below a first threshold; switch back to phase 1 once it
exceeds a second threshold (both threshold values can be
changed).

Two-phase learning does not sensitively depend on these thresholds (but avoid values that are always too small). Two-phase learning is justified because weights with large tolerable perturbations (and small derivatives) hardly influence the quadratic error's gradient -- it makes sense to let FMS focus on low-precision weights, and let backprop focus on the others.
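The two-phase schedule can be sketched as a small state machine. The threshold factors below stand in for the two switch values mentioned above (their exact values were not preserved in this copy), and the hysteresis between them prevents rapid oscillation between phases:

```python
def choose_phase(phase, epoch_err, tol_err, enter_fms=1.3, leave_fms=1.7):
    """Switch between conventional backprop (1) and FMS (2).

    enter_fms and leave_fms are illustrative placeholder factors;
    leave_fms > enter_fms gives hysteresis, so a small error
    fluctuation does not flip the phase back and forth.
    """
    if phase == 1 and epoch_err < enter_fms * tol_err:
        return 2  # error close to the tolerable level: start flattening (FMS)
    if phase == 2 and epoch_err > leave_fms * tol_err:
        return 1  # error grew too much: return to plain backprop
    return phase
```

Called once per epoch with the current average error and the tolerable error level.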

**ALGORITHM.**

**Initialization.**
Set the precision-range constant (the exponent is the difference
between the numbers of
significant digits required for maximal and minimal precision).

Set the second learning rate to an arbitrary small value.

Set the pattern counter and the epoch counter to 0.

Initialize all weights.

Initialize the averaging parameter
(typically 0.9 or 0.99),
and provide a criterion for stopping the learning procedure.

Set the first learning rate to one of the usual small values.

Set the tolerable error to some desired error value after learning.

Set the epoch error accumulators to 0.

Set the alive flags to TRUE for all existing connections.

Set the dead flags to FALSE for all weights.

Set the minimal required precision to a large value.

Set the alive flag to FALSE for non-existing connections.
While training not stopped do

begin(while)

select a pattern pair.

** The following variables can be reset here, after their last
use in the previous loop iteration. **

set all components of
to 0

set all components of
to FALSE

set to 0

** **Forward pass.** **

for all input units do

begin

end

for to do

begin

for all units in layer do

begin

for all units in layer do

begin

if () do

begin

end

end

end

end
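As a plain-Python sketch of the forward pass above (the list-of-weight-matrices layout and the logistic sigmoid are illustrative assumptions of this sketch; dead weights can simply be stored as zero, which reproduces the "skip if not alive" test):

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def forward_pass(x, layers):
    """Layer-by-layer forward pass.

    `layers` is a list of weight matrices; layers[l][i][j] is the
    weight on the connection from unit j in layer l to unit i in
    layer l+1.
    """
    y = list(x)  # activations of the input layer
    for w in layers:
        # net input of each unit, then squash through the activation
        y = [sigmoid(sum(wij * yj for wij, yj in zip(row, y))) for row in w]
    return y
```

For a 2-3-1 network, `layers` would hold a 3x2 matrix followed by a 1x3 matrix.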

** **Compute the error.** **

for all output units do

begin

end

** Compute the error gradient **

** **First backward pass.** **

for all output units do

begin

for to do

begin

for all units in layer do

begin

(IF THEN: for all units in layer ELSE: ) do

begin

if () do

begin

bound the absolute value from below by
1E-5 ** to avoid division overflow **

end

end

end

end

end

** **End of conventional backprop (phase 1).** **
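The conventional-backprop phase (forward pass, first backward pass, gradient of the quadratic error) can be sketched as follows. The sigmoid activation and the names `delta` and `grads` are assumptions of this sketch, since the paper's symbols do not survive in this copy:

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def backprop_gradient(x, target, layers):
    """Gradient of the quadratic error E = 0.5 * sum((t - o)**2)
    for one pattern, by conventional backprop (phase 1)."""
    # forward pass, caching every layer's activations
    ys = [list(x)]
    for w in layers:
        ys.append([sigmoid(sum(wij * yj for wij, yj in zip(row, ys[-1])))
                   for row in w])
    # output deltas: dE/ds = (o - t) * f'(s), with f' = y * (1 - y)
    out = ys[-1]
    delta = [(o - t) * o * (1.0 - o) for o, t in zip(out, target)]
    grads = []
    for l in range(len(layers) - 1, -1, -1):
        # gradient for this layer's weights: dE/dw_ij = delta_i * y_j
        grads.insert(0, [[d * yj for yj in ys[l]] for d in delta])
        if l > 0:
            # propagate the deltas to the previous layer
            delta = [sum(layers[l][i][j] * delta[i] for i in range(len(delta)))
                     * ys[l][j] * (1.0 - ys[l][j])
                     for j in range(len(ys[l]))]
    return grads
```

`grads` has the same list-of-matrices shape as `layers`, so the phase-1 update is an elementwise step along it.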

** **Compute the auxiliary quantities.** **

** we recommend introducing additional local variables for
the inner loops **

for all output units do

begin

for all , such that () do

begin

end

end

** some weights are insignificant for computing the current pattern **

for all , such that () do

begin

if (
) do

begin

end

end

for all , such that () do

begin

if (
) do

begin

TRUE

for all output units do

begin

end

end

else do

begin

TRUE

end

end

** update variables after having marked the current
pattern's insignificant weights **

for all output units do

begin

for all , such that ( AND NOT
) do

begin

end

end

for all output units do

begin

for all , such that ( AND NOT
) do

begin

for all output units do

begin

end

end

end

** **Forward pass.** **

for all output units do

begin

for all input units do

begin

end

for to do

begin

(IF THEN: for all units in layer ELSE: ) do

begin

for all units in layer do

begin

if ( AND NOT
) do

begin

end

end

end

end

end

** **Second backward pass.** **

for all output units do

begin

for to do

begin

for all units in layer do

begin

(IF THEN: for all units in layer ELSE: ) do

begin

if ( AND NOT
) do

begin

end

end

end

end

end
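In essence, the second pair of passes delivers the derivative of every output unit with respect to every weight (the quantities entering equation (40)). A sketch of that computation, running one backward pass per output unit; as before, the sigmoid activation and the data layout are assumptions of this sketch:

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def output_jacobian(x, layers):
    """Derivatives of each output unit w.r.t. every weight.

    FMS needs the derivative of every output unit separately (not
    just the summed error gradient), hence one backward pass per
    output unit.
    """
    ys = [list(x)]
    for w in layers:
        ys.append([sigmoid(sum(wij * yj for wij, yj in zip(row, ys[-1])))
                   for row in w])
    out = ys[-1]
    jac = []
    for k in range(len(out)):
        # seed: derivative of output k w.r.t. its own net input is
        # f'(s_k); it is 0 for every other output unit (cf. kron)
        delta = [o * (1.0 - o) if i == k else 0.0 for i, o in enumerate(out)]
        grads = []
        for l in range(len(layers) - 1, -1, -1):
            grads.insert(0, [[d * yj for yj in ys[l]] for d in delta])
            if l > 0:
                delta = [sum(layers[l][i][j] * delta[i]
                             for i in range(len(delta)))
                         * ys[l][j] * (1.0 - ys[l][j])
                         for j in range(len(ys[l]))]
        jac.append(grads)
    return jac
```

`jac[k]` mirrors the shape of `layers`, one copy per output unit.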

** **Normalize the additional error term's gradient to the
length of the quadratic error's gradient.** **

for all , such that ( AND NOT
) do

begin

end


** **End of the additional error term's gradient computation (phase 2).** **

** **Weight update.** **

for all , such that ( AND NOT
) do

begin

end
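A minimal sketch of the weight update step, assuming plain gradient descent on both error terms, with the additional term's gradient rescaled to the length of the quadratic error's gradient as in the normalization step above (flat weight lists for brevity; `alpha` and `lam` are the two learning rates):

```python
import math

def fms_weight_update(w, grad_E, grad_Etilde, alpha, lam):
    """One update of all alive weights.

    grad_E is the quadratic error's gradient, grad_Etilde the
    additional error term's gradient; the latter is first rescaled
    to the length of the former.
    """
    len_E = math.sqrt(sum(g * g for g in grad_E))
    len_T = math.sqrt(sum(g * g for g in grad_Etilde))
    scale = len_E / len_T if len_T > 0.0 else 0.0  # avoid division by zero
    return [wi - alpha * gE - lam * scale * gT
            for wi, gE, gT in zip(w, grad_E, grad_Etilde)]
```

Dead weights are simply excluded from (or zeroed in) both gradient lists.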

** **Update learning parameters.** **

if ( mod
) do ** ``mod'' is the modulo function **

begin

** update the second learning rate according to Weigend et al. (1991). **

if (
OR ) do

begin

end

else do

begin

if () do

begin

if () do

begin

end

end

else do

begin

end

end
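The update of the second learning rate follows Weigend et al. (1991). Since the exact conditions did not survive in this copy, the following is a hedged reconstruction of that schedule; `delta_lam` is the small increment parameter and `cut` the scale-down factor (typical value 0.5, as noted in the notation above):

```python
def update_lambda(lam, err, prev_err, avg_err, tol_err,
                  delta_lam=1e-4, cut=0.5):
    """Epoch-wise schedule for the additional error term's
    learning rate, after Weigend et al. (1991)."""
    if err < tol_err or err < prev_err:
        return lam + delta_lam  # doing well: weigh the extra term more
    if err < avg_err:
        return lam - delta_lam  # mild setback: back off a little
    return cut * lam            # serious setback: cut lambda hard
```

`err` is the current epoch's average error, `prev_err` the previous epoch's, `avg_err` the exponentially weighted average, and `tol_err` the tolerable error level.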

** update weights that are alive, **

** a weight is alive if it was alive (marked by ) **

** for at least one pattern presented during the previous epoch. **

for all , such that () do

begin

if (
FALSE) do

begin

FALSE

end

FALSE

end

if ( mod OR 2.0 ) do

** weights are re-animated if the average error is too large; weights are also **

** re-animated every 100-th epoch, to enable faster reduction of quadratic error **

** (due to weight changes, some previously dead weight may turn out to deserve **

** to live again); one may use values other than 100 and 2.0 **

begin

for all such that exists do

begin

TRUE

end

end

end
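The epoch-end bookkeeping for alive and dead weights can be sketched as follows; the flag arrays and the parameters `revive_every` and `revive_factor` mirror the 100 and 2.0 of the comments above, but the exact flag semantics are an assumption of this sketch:

```python
def end_of_epoch_update(alive_last_epoch, dead, epoch, avg_err, tol_err,
                        revive_every=100, revive_factor=2.0):
    """Mark weights dead that were never alive during the past
    epoch, then re-animate all weights every `revive_every` epochs
    or when the average error exceeds revive_factor * tol_err."""
    for i, was_alive in enumerate(alive_last_epoch):
        if not was_alive:
            dead[i] = True
        alive_last_epoch[i] = False  # reset the flag for the next epoch
    if epoch % revive_every == 0 or avg_err > revive_factor * tol_err:
        for i in range(len(dead)):
            dead[i] = False          # re-animate every weight
    return dead
```

Per-pattern processing sets `alive_last_epoch[i]` to True whenever weight i makes a significant contribution.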

decide whether to stop learning or not

end(while)
